Intel Fires Up Xeon Max CPUs, GPUs To Rival AMD, Nvidia
A few days before Supercomputing 22 began, Intel introduced its next-generation Xeon Max CPUs (previously codenamed Sapphire Rapids HBM) and Data Center GPU Max series compute GPUs (previously known as Ponte Vecchio). The new products address a wide variety of high-performance computing workloads on their own, or work together to solve the most complex supercomputing tasks.
Xeon Max CPUs: Sapphire Rapids Gets 64 GB HBM2E
General-purpose x86 processors have been used for decades for virtually every kind of technical computation, and they support a vast number of applications. However, while the performance of general-purpose CPU cores has increased rapidly over the years, today's processors still impose two significant limitations on artificial intelligence and HPC workloads: parallelism and memory bandwidth. Intel's Xeon Max 'Sapphire Rapids HBM' processors promise to ease both bottlenecks.
Intel's Xeon Max processors feature up to 56 high-performance Golden Cove cores (spread across four chiplets interconnected using Intel's EMIB technology), further enhanced with multiple accelerator engines for AI and HPC workloads and 64 GB of on-package HBM2E memory. Like other Sapphire Rapids CPUs, the Xeon Max supports eight channels of DDR5 memory and PCIe Gen 5 interfaces with the CXL 1.1 protocol on top, so CXL-capable accelerators can be attached if desired.
In addition to supporting AVX-512 vector and Deep Learning Boost (AVX512_VNNI and AVX512_BF16) instructions, the new cores also offer the Advanced Matrix Extensions (AMX) tiled matrix multiplication accelerator: essentially a grid of fused multiply-add units supporting BF16 and INT8 input types that is programmed using only 12 instructions and can perform up to 1024 TMUL BF16 or 2048 TMUL INT8 operations per cycle per core. The new CPUs also support the Data Streaming Accelerator (DSA), which offloads data copy and transformation work from the CPU cores.
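To put those per-core figures in perspective, here is a rough back-of-the-envelope sketch of what 56 AMX-enabled cores could deliver at peak. The 2.0 GHz all-core clock is an assumption for illustration only; Intel has not stated the sustained AMX frequency.

```python
# Back-of-the-envelope AMX peak throughput for a 56-core Xeon Max.
# The 2.0 GHz all-core clock is an assumed figure, not an Intel spec.
cores = 56
clock_hz = 2.0e9                 # assumed sustained all-core frequency
bf16_ops_per_cycle = 1024        # TMUL BF16 ops per cycle per core (per Intel)
int8_ops_per_cycle = 2048        # TMUL INT8 ops per cycle per core (per Intel)

peak_bf16_tflops = cores * clock_hz * bf16_ops_per_cycle / 1e12
peak_int8_tops = cores * clock_hz * int8_ops_per_cycle / 1e12

print(f"Peak AMX BF16: {peak_bf16_tflops:.0f} TFLOPS")   # ~115 TFLOPS at 2 GHz
print(f"Peak AMX INT8: {peak_int8_tops:.0f} TOPS")       # ~229 TOPS at 2 GHz
```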
64 GB of on-package HBM2E memory (four 16 GB stacks) provides a peak bandwidth of around 1 TB/s, which works out to roughly 18.28 GB/s of bandwidth and ~1.14 GB of HBM2E capacity per core. To put those numbers into context, a 56-core Sapphire Rapids processor with eight DDR5-4800 modules gets a maximum of 307.2 GB/s, or about 5.485 GB/s per core. Xeon Max can use its HBM2E memory in three ways: as system memory, requiring no code changes; as a high-performance cache for the DDR5 memory subsystem, again with no code modifications; or as part of a unified memory pool with DDR5 (HBM flat mode), which requires software optimizations.
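The per-core figures above come from simple division; here is the arithmetic spelled out, using 1,024 GB/s as the HBM2E peak (which is what the 18.28 GB/s per-core figure implies):

```python
# Per-core memory bandwidth: on-package HBM2E vs. eight channels of DDR5-4800.
cores = 56

hbm_bw_gbs = 1024.0                        # ~1 TB/s peak HBM2E bandwidth
hbm_capacity_gb = 64.0                     # four 16 GB stacks

ddr5_channels = 8
ddr5_bw_per_channel_gbs = 4800 * 8 / 1000  # 4800 MT/s x 8 bytes = 38.4 GB/s
ddr5_bw_gbs = ddr5_channels * ddr5_bw_per_channel_gbs  # 307.2 GB/s total

print(f"HBM2E: {hbm_bw_gbs / cores:.2f} GB/s and {hbm_capacity_gb / cores:.2f} GB per core")
print(f"DDR5:  {ddr5_bw_gbs / cores:.3f} GB/s per core")
# HBM2E: 18.29 GB/s and 1.14 GB per core
# DDR5:  5.486 GB/s per core
```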
Depending on the workload, Intel's AMX-enabled Xeon Max processors can deliver a 3x to 5.3x performance improvement over the currently available Xeon Scalable 8380 processors running the same workloads with traditional FP32 processing. Meanwhile, for applications such as molecular dynamics model development, the new HBM2E-equipped CPU is up to 2.8x faster than AMD's EPYC 7773X with 3D V-Cache.
But HBM2E has another important implication for Intel: it reduces some of the overhead of moving data between CPUs and GPUs, which matters for various HPC workloads. That brings us to the second of today's announcements, the Data Center GPU Max series compute GPUs.
The Data Center GPU Max: The Pinnacle of Intel’s Data Center Innovation
Intel's Data Center GPU Max compute GPU series is based on the design codenamed Ponte Vecchio, which the company first introduced in 2019 and detailed over 2020 and 2021. Ponte Vecchio is the most complex processor Intel has ever created, comprising 47 or more tiles (including eight HBM2E memory tiles) and over 100 billion transistors. The product also makes extensive use of Intel's advanced packaging technologies (such as EMIB), because the different tiles are manufactured using different process technologies, some of them by other manufacturers.
Intel's Data Center GPU Max compute GPUs rely on the company's Xe-HPC architecture, which is explicitly tuned for AI and HPC workloads and therefore supports the appropriate data formats and instructions as well as 512-bit vector and 4096-bit matrix (tensor) engines.
| | Data Center GPU Max 1100 | Data Center GPU Max 1350 | Data Center GPU Max 1550 | AMD Instinct MI250X | Nvidia H100 (SXM) | Nvidia H100 (PCIe) | Rialto Bridge |
|---|---|---|---|---|---|---|---|
| Form Factor | PCIe | OAM | OAM | OAM | SXM | PCIe | OAM |
| Tiles + Memory Tiles | ? | ? | 39 + 8 | 2 + 8 | 1 + 6 | 1 + 6 | many |
| Transistors | ? | ? | 100 billion | 58 billion | 80 billion | 80 billion | loads of them |
| Xe-HPC Cores / Compute Units | 56 | 112 | 128 | 220 | 132 | 114 | 160 enhanced Xe-HPC cores |
| RT Cores | 56 | 112 | 128 | – | – | – | ? |
| 512-bit Vector Engines | 448 | 896 | 1024 | ? | ? | ? | ? |
| 4096-bit Matrix Engines | 448 | 896 | 1024 | ? | ? | ? | ? |
| L1 Cache | ? | ? | 64 MB at 105 TB/s | ? | ? | ? | ? |
| L2 Rambo Cache | ? | ? | 408 MB at 13 TB/s | ? | 50 MB | 50 MB | ? |
| HBM2E | 48 GB | 96 GB | 128 GB at 3.2 TB/s | 128 GB at 3.2 TB/s | 80 GB at 3.35 TB/s | 80 GB at 2 TB/s | ? |
| Multi-GPU IO Links | 8 | 16 | 16 | 8 | 8 | 8 | ? |
| Power | 300W | 450W | 600W | 560W | 700W | 350W | 800W |
Compared to Xe-HPG, Xe-HPC has a much more sophisticated memory and caching subsystem and a different configuration of Xe cores (each Xe-HPG core has 16 256-bit vector engines and 16 1024-bit matrix engines, whereas each Xe-HPC core has eight 512-bit vector engines and eight 4096-bit matrix engines). Additionally, Xe-HPC GPUs do not have texturing units or render backends, so they cannot render graphics using traditional methods. On the other hand, Xe-HPC surprisingly retains ray tracing support for supercomputer visualization.
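The vector and matrix engine counts in the table above follow directly from that per-core layout; a small sanity check using the Xe-HPC core counts Intel lists for each SKU:

```python
# Each Xe-HPC core carries eight 512-bit vector engines and eight 4096-bit XMX engines,
# so the per-SKU engine counts are simply core count x 8.
engines_per_core = 8

for name, xe_cores in {"Max 1100": 56, "Max 1350": 112, "Max 1550": 128}.items():
    vector_engines = xe_cores * engines_per_core   # 512-bit vector engines
    matrix_engines = xe_cores * engines_per_core   # 4096-bit matrix (XMX) engines
    print(f"{name}: {vector_engines} vector engines, {matrix_engines} matrix engines")
# Max 1100: 448, Max 1350: 896, Max 1550: 1024 -- matching the table above
```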
One of the most important elements of Xe-HPC is Intel's Xe Matrix Extensions (XMX), which give the Data Center GPU Max 1550 some pretty formidable tensor/matrix performance (see the table below, and the quick check that follows it): up to 419 TF32 TFLOPS and up to 1678 INT8 TOPS, according to Intel. Of course, the peak performance numbers provided by compute GPU developers matter, but they may not reflect the performance achievable on real supercomputers running real-world applications. Still, we can't help but notice that, for the most part, Intel's range-topping Ponte Vecchio lags far behind Nvidia's H100 and, in all cases except FP32 tensor (TF32), fails to provide a visible advantage over AMD's Instinct MI250X.
| | Data Center GPU Max 1550 | AMD Instinct MI250X | Nvidia H100 (SXM) | Nvidia H100 (PCIe) |
|---|---|---|---|---|
| Form Factor | OAM | OAM | SXM | PCIe |
| HBM2E | 128 GB at 3.2 TB/s | 128 GB at 3.2 TB/s | 80 GB at 3.35 TB/s | 80 GB at 2 TB/s |
| Power | 600W | 560W | 700W | 350W |
| Peak INT8 Vector | ? | 383 TOPS | 133.8 TOPS | 102.4 TOPS |
| Peak FP16 Vector | 104 TFLOPS | 383 TFLOPS | 134 TFLOPS | 102.4 TFLOPS |
| Peak BF16 Vector | ? | 383 TFLOPS | 133.8 TFLOPS | 102.4 TFLOPS |
| Peak FP32 Vector | 52 TFLOPS | 47.9 TFLOPS | 67 TFLOPS | 51 TFLOPS |
| Peak FP64 Vector | 52 TFLOPS | 47.9 TFLOPS | 34 TFLOPS | 26 TFLOPS |
| Peak INT8 Tensor | 1678 TOPS | ? | 1979 TOPS (3958 TOPS*) | 1513 TOPS (3026 TOPS*) |
| Peak FP16 Tensor | 839 TFLOPS | ? | 989 TFLOPS (1979 TFLOPS*) | 756 TFLOPS (1513 TFLOPS*) |
| Peak BF16 Tensor | 839 TFLOPS | ? | 989 TFLOPS (1979 TFLOPS*) | 756 TFLOPS (1513 TFLOPS*) |
| Peak FP32 Tensor | 419 TFLOPS | 95.7 TFLOPS | 989 TFLOPS | 756 TFLOPS |
| Peak FP64 Tensor | – | 95.7 TFLOPS | 67 TFLOPS | 51 TFLOPS |

*With sparsity.
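The Max 1550's peak tensor numbers scale the way you would expect from halving the data width at each step; a quick check against the figures Intel quotes in the table above:

```python
# Peak XMX (tensor) throughput for the Data Center GPU Max 1550, per Intel's figures.
peak_tf32_tflops = 419   # FP32 tensor (TF32)
peak_fp16_tflops = 839   # FP16 / BF16 tensor
peak_int8_tops = 1678    # INT8 tensor

# Each halving of the data width roughly doubles peak throughput.
print(f"FP16 vs TF32: {peak_fp16_tflops / peak_tf32_tflops:.2f}x")  # ~2.00x
print(f"INT8 vs FP16: {peak_int8_tops / peak_fp16_tflops:.2f}x")    # 2.00x
```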
Intel, meanwhile, says its Data Center GPU Max 1550 is 2.4x faster than Nvidia's A100 in Riskfuel credit option pricing and offers a 1.5x performance boost over the A100 in NekRS virtual reactor simulations.
Intel will offer three Ponte Vecchio products: the top-of-the-line Data Center GPU Max 1550 in an OAM form factor with 128 Xe-HPC cores, 128 GB of HBM2E memory, and a thermal design power rating of up to 600W; the cut-down Data Center GPU Max 1350 in an OAM form factor with 112 Xe-HPC cores, 96 GB of memory, and a 450W TDP; and the entry-level Data Center GPU Max 1100, which comes in a dual-width FLFH form factor powered by a processor with 56 Xe-HPC cores, 48 GB of HBM2E memory, and a 300W TDP rating.
Meanwhile, Intel will offer supercomputer clients Max Series subsystems with four OAM modules on a carrier board, rated at 1,800W and 2,400W TDP.
Intel’s Rialto Bridge: Increased Max
Intel not only officially announced its Data Center GPU Max compute GPUs today, but also quietly revealed its next-generation data center GPU, codenamed Rialto Bridge, coming in 2024. This AI and HPC compute GPU will use enhanced Xe-HPC cores, perhaps with a slightly different architecture, but will remain compatible with applications developed for Ponte Vecchio. Unfortunately, the added complexity will push the TDP of the next-generation flagship compute GPU up to 800W, though simpler, less power-hungry versions are also expected.
Availability
One of the first customers to get both Xeon Max and Data Center GPU Max products is Argonne National Laboratory, whose Aurora supercomputer uses blades built from Xeon Max CPUs and Data Center GPU Max devices (two CPUs and six GPUs per blade). In addition, Intel and Argonne have completed Sunspot, Aurora's test and development system consisting of 128 production blades, which will be available to interested parties in the second half of 2022. The Aurora supercomputer itself is scheduled to come online in 2023.
Intel's server partners will launch machines based on the Xeon Max CPUs and Data Center GPU Max devices in January 2023.