Ahead of next week’s Hot Chips 34 presentation, Nvidia has released new details about its Grace CPU Superchip, revealing that the chip will be manufactured on TSMC’s 4N process. Nvidia also shared more information about the chip’s architecture and data fabric, along with performance and efficiency benchmarks. Nvidia hasn’t made its formal presentation at Hot Chips yet (we’ll add more details after the session), but the information shared today paints a fuller picture ahead of the first Grace chips and servers coming to market in the first half of 2023.
As a refresher, Nvidia’s Grace CPU Superchip is the company’s first CPU-only Arm chip designed for the data center, pairing two chips on a single board for a total of 144 cores. The Grace Hopper Superchip, in contrast, combines a Hopper GPU and a Grace CPU on the same board.
Among the most important disclosures, Nvidia has finally officially confirmed that its Grace CPUs use the TSMC 4N process. TSMC lists the ‘N4’ 4nm process under its 5nm node family, describing it as an enhanced version of the 5nm node. Nvidia uses a special variant of this node, called ‘4N’, that is optimized specifically for its GPUs and CPUs.
These types of specialized nodes are becoming more common as Moore’s Law weakens and transistors become harder and more costly to shrink with each new node. To enable custom process nodes like Nvidia’s 4N, chip designers and foundries use design technology co-optimization (DTCO), working together to tune the process’s characteristics for custom power, performance, and area (PPA) targets for specific products.
Nvidia has previously revealed that it uses off-the-shelf Arm Neoverse cores for its Grace CPUs, but it has yet to identify the specific version. However, Nvidia has confirmed that Grace uses Arm v9 cores with SVE2 support, and the Neoverse N2 platform is Arm’s first IP to support Arm v9 and extensions like SVE2. The N2 ‘Perseus’ platform comes as a 5nm design (remember, 4N belongs to TSMC’s 5nm family), supports PCIe 5.0, DDR5, HBM3, CCIX 2.0, and CXL 2.0, and is optimized for performance per watt and performance per area. Arm has said that its next-gen Poseidon cores won’t come to market until 2024, so given Grace’s early-2023 launch date, those cores are unlikely candidates.
Nvidia Grace Hopper CPU Architecture
Nvidia’s new Nvidia Scalable Coherency Fabric (SCF) is a mesh interconnect that appears very similar to the standard CMN-700 Coherent Mesh Network used in Arm Neoverse designs.
The Nvidia SCF provides 3.2 TB/s of bisection bandwidth between the various units of the Grace chip, such as the CPU cores, memory, and I/O, not to mention the NVLink-C2C interface that connects the chip to the other unit on the board, whether that’s another Grace CPU or a Hopper GPU.
The mesh supports 72-plus cores, and each CPU has a total of 117 MB of L3 cache. Nvidia states that the first block diagram above is “a possible topology for illustrative purposes,” and its arrangement doesn’t exactly match the second diagram.
This diagram shows a chip with eight SCF Cache partitions (SCCs), which look like L3 cache slices (more on that in the presentation), and eight CPU units that appear to be clusters of cores. The SCCs and cores are connected in pairs to Cache Switch Nodes (CSNs), which reside in the SCF mesh fabric and provide the interface between the CPU cores, memory, and the rest of the chip. The SCF also supports coherency across up to four sockets via Coherent NVLink.
Nvidia also shared this diagram, which shows that each Grace CPU supports up to 68 PCIe lanes and up to four PCIe 5.0 x16 connections. Each x16 connection supports up to 128 GB/s of bi-directional throughput (an x16 link can be split into two x8 links). We can also see 16 dual-channel LPDDR5X memory controllers (MC).
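As a sanity check on that per-link figure, PCIe 5.0’s standard signaling rate and line encoding recover roughly the quoted number. This is our own back-of-the-envelope arithmetic, not an Nvidia-provided breakdown:

```python
# Rough PCIe 5.0 x16 bandwidth math (our estimate, not an Nvidia figure).
GT_PER_S = 32            # PCIe 5.0 raw signaling rate per lane (GT/s)
ENCODING = 128 / 130     # 128b/130b line encoding overhead
LANES = 16

per_lane = GT_PER_S * ENCODING / 8   # GB/s per lane, one direction (~3.9)
per_dir = per_lane * LANES           # ~63 GB/s per direction
bidir = per_dir * 2                  # ~126 GB/s, marketed as "up to 128 GB/s"

print(f"x16 one-way: {per_dir:.0f} GB/s, bidirectional: {bidir:.0f} GB/s")
```

The marketed 128 GB/s rounds up from the ~126 GB/s the encoding math yields.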
However, this diagram differs from the first: here the L3 cache appears as two contiguous blocks attached to quad-core CPU clusters, with a total of up to 72 cores in the chip. That makes much more sense than the previous figure, but this diagram doesn’t show the separate SCF partitions or the CSN nodes, which is a bit confusing. We’ll learn more during the presentation and update as necessary.
Nvidia says its Scalable Coherency Fabric (SCF) is a proprietary design, but Arm allows partners to customize its CMN-700 mesh by tuning core counts and cache sizes and by choosing among different memory types, such as DDR5 and HBM, and different interfaces, such as PCIe 5.0, CXL, and CCIX. That means Nvidia could be using a heavily customized CMN-700 implementation for its on-die fabric.
Nvidia Grace Hopper Enhanced GPU Memory
GPUs thrive on memory throughput, so Nvidia naturally looked to improve memory throughput not only within the chip but also between the CPU and GPU. The Grace CPU has 16 dual-channel LPDDR5X memory controllers, with the 32 channels supporting up to 512 GB of memory and up to 546 GB/s of throughput. Nvidia says it chose LPDDR5X over HBM2e due to multiple factors, such as capacity and cost. Meanwhile, LPDDR5X offers 53% more bandwidth and one-eighth the power per GB compared to standard DDR5 memory, making it the better overall choice.
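The 546 GB/s aggregate is consistent with typical LPDDR5X channel specs. The 8533 MT/s data rate and 16-bit channel width below are our assumptions, not numbers Nvidia confirmed:

```python
# Checking the 546 GB/s figure against typical LPDDR5X channel specs.
# The 8533 MT/s rate and 16-bit channel width are our assumptions,
# not values Nvidia has stated.
CHANNELS = 32
MTS = 8533               # LPDDR5X transfers per second (millions)
WIDTH_BYTES = 16 / 8     # 16-bit channel = 2 bytes per transfer

per_channel = MTS * WIDTH_BYTES / 1000    # GB/s per channel (~17.1)
total = per_channel * CHANNELS            # ~546 GB/s aggregate

print(f"{total:.0f} GB/s aggregate")
```

The math lands almost exactly on the quoted total, which suggests Grace’s memory runs at the top LPDDR5X bin.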
Nvidia is also introducing Enhanced GPU Memory (EGM), which allows any Hopper GPU on the NVLink network to access the LPDDR5X memory of any Grace CPU on the network while preserving native NVLink performance.
Nvidia’s goal is to provide a unified pool of memory that can be shared between the CPU and GPU, enabling higher performance while simplifying the programming model. Grace Hopper CPU+GPU chips support unified memory with shared page tables, meaning the chips share an address space and page tables, so CUDA applications can allocate GPU-accessible memory with the system allocator. Native atomics between the CPU and GPU are also supported.
While the CPU core is the computing engine, the interconnect is the battleground that defines the future of computing. Moving data consumes more power than actually computing on it, so moving data faster and more efficiently, or avoiding transfers entirely, are important goals.
Nvidia’s Grace CPU Superchip, which consists of two CPUs on one board, and the Grace Hopper Superchip, which pairs one Grace CPU with one Hopper GPU on the same board, are designed to maximize data transfer between units via the proprietary NVLink Chip-to-Chip (C2C) interconnect, with memory coherency reducing or eliminating redundant data transfers.
| Interconnect | Picojoules per bit (pJ/b) |
| --- | --- |
| UCIe | 0.5 to 0.25 pJ/b |
| Infinity Fabric | ~1.5 pJ/b |
| TSMC CoWoS | 0.56 pJ/b |
| Bunch of Wires (BoW) | 0.7 to 0.5 pJ/b |
Nvidia has shared new details about its NVLink-C2C interconnect, a die-to-die and chip-to-chip interconnect that supports memory coherency and provides up to 900 GB/s of throughput (seven times the bandwidth of a PCIe 5.0 x16 link). The interface uses the NVLink protocol, and Nvidia created it with its SERDES and LINK design technologies, with a focus on energy and area efficiency. NVLink-C2C also supports industry-standard protocols such as CXL and Arm’s AMBA Coherent Hub Interface (CHI, the key protocol in the Neoverse CMN-700 mesh), and it supports many types of connections, from PCB-based interconnects to silicon interposers and wafer-scale implementations.
Power efficiency is a key metric for any data fabric, and today Nvidia shared that the link consumes 1.3 picojoules per bit (pJ/b) of data transferred. That’s five times more efficient than the PCIe 5.0 interface but more than twice the power of the forthcoming UCIe interconnect (0.5 to 0.25 pJ/b). Package types vary, and NVLink-C2C gives Nvidia a solid blend of performance and efficiency for its use cases, but as the table above shows, the more advanced options provide higher levels of power efficiency.
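For a sense of scale, those figures put the link’s power draw in the single-digit watts at full throughput. This is our own arithmetic from the quoted numbers, not an official Nvidia figure:

```python
# Our own arithmetic from Nvidia's quoted figures, not official numbers.
PJ_PER_BIT = 1.3               # NVLink-C2C energy per bit transferred
THROUGHPUT_GBPS = 900          # GB/s, total link throughput
PCIE5_X16_GBPS = 128           # GB/s, bi-directional PCIe 5.0 x16

bits_per_s = THROUGHPUT_GBPS * 1e9 * 8
link_watts = bits_per_s * PJ_PER_BIT * 1e-12   # pJ -> J gives watts
bandwidth_ratio = THROUGHPUT_GBPS / PCIE5_X16_GBPS

print(f"~{link_watts:.1f} W at 900 GB/s, ~{bandwidth_ratio:.0f}x a PCIe 5.0 x16 link")
```

Roughly 9.4 W to saturate a 900 GB/s link is modest next to the hundreds of watts the attached CPU and GPU consume, which is the point of an efficient fabric.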
Nvidia Grace CPU Benchmark
Nvidia has shared more performance benchmarks, but like all vendor-provided performance data, these numbers should be approached carefully. These benchmarks come with the added caveat that they were run pre-silicon: they’re emulated projections that haven’t been verified on real silicon and are “subject to change.” Take them with an extra grain of salt.
Nvidia’s new benchmark here is a score of 370 for a single Grace CPU in the SpecIntRate 2017 benchmark. That puts the chip right in the range we’d expect: Nvidia had already shared a multi-CPU benchmark claiming a score of 740 for two Grace CPUs in SpecIntRate 2017. Taken together, the figures suggest linear scaling across the two chips.
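The scaling claim falls straight out of the two quoted scores, which is worth noting because perfectly linear two-socket scaling is unusual and hints these are projections rather than measurements:

```python
# Two-socket scaling implied by Nvidia's quoted SpecIntRate 2017 scores.
single = 370   # one Grace CPU
dual = 740     # two Grace CPUs on one Superchip

scaling = dual / single          # 2.0 = perfectly linear
efficiency = scaling / 2 * 100   # percent of ideal two-chip scaling

print(f"{scaling:.2f}x across two chips ({efficiency:.0f}% scaling efficiency)")
```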
AMD’s current-gen EPYC Milan chips, the current performance leaders in the data center, post SPEC results ranging from 382 to 424. However, Nvidia’s solution has many other advantages, such as power efficiency and a GPU-friendly design.
Nvidia also shared memory throughput benchmarks, showing the Grace CPU delivering up to 500 GB/s in a CPU memory throughput test. Nvidia also claims the chip can push up to 506 GB/s of total read/write throughput to an attached Hopper GPU, with the CPU-to-GPU bandwidth measured at 429 GB/s in read throughput tests and 407 GB/s for writes.
Grace Hopper is Arm SystemReady Compatible
Nvidia also announced that its Grace CPU Superchip complies with the requirements for Arm SystemReady certification, which means the Arm chips “just work” with operating systems and software, simplifying deployment. Grace also supports virtualization extensions, including nested virtualization and S-EL2 support. Nvidia also lists support for:
- RAS v1.1
- Generic Interrupt Controller (GIC) v4.1
- Memory Partitioning and Monitoring (MPAM)
- System Memory Management Unit (SMMU) v3.1
- Arm Server Base System Architecture (SBSA), which enables standards-compliant hardware and software interfaces. Additionally, Grace CPUs are designed to support the Arm Server Base Boot Requirements (SBBR) to enable standard boot flows on Grace CPU-based systems.
- For cache and bandwidth partitioning and bandwidth monitoring, Grace CPUs also support Arm Memory Partitioning and Monitoring (MPAM). Grace CPUs also include Arm Performance Monitoring Units that monitor the performance of the CPU cores and other subsystems within the system-on-chip (SoC), so standard tools such as Linux perf can be used for performance investigations.
Nvidia’s Grace CPUs and Grace Hopper Superchips remain on track for release in early 2023, with the Hopper-equipped variants designed for AI training, inference, and HPC, and the dual-CPU Grace systems designed for HPC and cloud computing workloads.