AMD Instinct MI300 Details Emerge, Debuts in 2 Exaflop El Capitan Supercomputer
AMD’s Instinct MI300 is shaping up to be an incredible chip that integrates CPU and GPU cores with massive slabs of fast memory in the same package, but many details have yet to be revealed. Here we’ve gleaned some new details from an International Supercomputing Conference (ISC) 2023 presentation outlining the upcoming 2-exaflop El Capitan supercomputer, which is powered by the Instinct MI300. Other details were revealed during AMD Chief Technology Officer Mark Papermaster’s keynote at ITF World 2023, a conference hosted by research giant imec (you can read our interview with Papermaster here).
The El Capitan supercomputer is set to be the fastest in the world when it powers up in late 2023, displacing the AMD-powered Frontier as the leader. The machine is powered by AMD’s mighty Instinct MI300, and the new details include a topology map of an MI300 installation, photos from AMD’s Austin MI300 lab, and photos of the blades that will be used in El Capitan. We also cover some other new developments in the El Capitan deployment.
As a reminder, the Instinct MI300 blends a total of 13 chiplets, many of them 3D-stacked, to create a single chip package that fuses 24 Zen 4 CPU cores with a CDNA 3 graphics engine and eight stacks of HBM3 memory totaling 128 GB. All in all, the chip weighs in at 146 billion transistors, making it the largest chip AMD has put into production. Nine 5nm compute dies housing the CPU and GPU cores are 3D-stacked atop four 6nm base dies, which are active interposers that handle functions such as memory and I/O traffic.
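As a quick sanity check on those figures (trivial arithmetic, not AMD code), the reported die counts do add up to the 13-chiplet total:

```python
# Chiplet counts as reported for the Instinct MI300.
compute_dies_5nm = 9   # 3D-stacked dies housing the Zen 4 CPU and CDNA 3 GPU cores
base_dies_6nm = 4      # active-interposer base dies handling memory and I/O traffic

total_chiplets = compute_dies_5nm + base_dies_6nm
print(total_chiplets)  # 13, matching AMD's stated chiplet total
```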
Papermaster’s ITF World keynote focused on AMD’s “30x25” goal of a 30x improvement in compute-node power efficiency by 2025, and on how computing is increasingly constrained by power efficiency as Moore’s Law slows. Key to that effort is the Instinct MI300, many of whose benefits derive from the simplified system topology shown above.
As you can see in the first slide, Instinct MI250-powered nodes have separate CPUs and GPUs, with a single EPYC CPU in the middle to coordinate workloads.
In contrast, the Instinct MI300 incorporates a 24-core 4th Gen EPYC Genoa processor inside the package, eliminating the need for a standalone CPU. The overall topology otherwise remains the same, but it now allows a fully connected all-to-all arrangement between the four elements. This type of connection reduces latency and variability by letting all the processors communicate directly with one another, rather than using another CPU or GPU as an intermediary to relay data, which is a potential bottleneck in the MI250 topology. The MI300 topology map also shows that each chip has three connections, similar to what we saw with the MI250. Papermaster’s slides also refer to the active interposer that forms the base die as the “4th Generation Infinity Fabric Base Die.”
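To see why the all-to-all arrangement matters, here’s a minimal sketch of the link math. The four-APU count and three-links-per-chip figure come from the slides; the hop-count comparison is our own illustration, not AMD’s model:

```python
from itertools import combinations

apus = 4  # four MI300 elements per node, per the topology map

# Fully connected all-to-all: every pair of APUs gets a direct link,
# so any APU reaches any other in one hop with no CPU relaying data.
direct_links = len(list(combinations(range(apus), 2)))  # n*(n-1)/2 = 6
links_per_chip = apus - 1                               # 3, matching the slides
print(direct_links, links_per_chip)  # 6 3

# Hub-and-spoke (MI250-style, CPU in the middle): peer GPU traffic
# relays through the central CPU, so it takes two hops instead of one.
hops_all_to_all, hops_via_hub = 1, 2
assert hops_via_hub > hops_all_to_all  # the extra hop adds latency and variability
```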
As the rest of these slides show, the MI300 puts AMD on a clear path to surpassing its 30x25 efficiency target while also beating the industry power-efficiency trend. We’ve previously posted photos of Instinct MI300 silicon we’ve seen in person, but below you can see what the MI300 looks like inside the actual blades that will be installed in El Capitan.
AMD Instinct MI300 in El Capitan
At ISC 2023, Bronis R. de Supinski, CTO of Lawrence Livermore National Laboratory (LLNL), spoke about the integration of the Instinct MI300 APU into the El Capitan supercomputer. The National Nuclear Security Administration (NNSA) will use El Capitan in its mission of managing the US nuclear stockpile.
As you can see in the first image in the album above, de Supinski showed a single blade for the El Capitan system. The blade comes from system vendor HPE and features four liquid-cooled Instinct MI300 cards in a slim 1U chassis. De Supinski also showed a photo from AMD’s Austin labs with MI300 silicon up and running, indicating that the chip is real and already in testing, an important point considering the recent delays of some Intel-powered systems.
De Supinski consistently referred to the MI300 as the “MI300A,” though it isn’t clear whether that designates a custom El Capitan model or a more official part number.
De Supinski said the chip has an Infinity Cache, but didn’t reveal its capacity. He also repeatedly stressed the importance of a single memory tier: the unified memory space reduces the complexity of moving data between different types of compute and different memory pools, which in turn simplifies programming.
De Supinski said the MI300 can run in several different modes, but the primary mode consists of a single memory domain and a single NUMA domain, providing uniform memory access for all the CPU and GPU cores. Importantly, the cache-coherent memory reduces data movement between the CPU and GPU, which often consumes more power than the computation itself, thus reducing latency and improving performance and power efficiency. De Supinski also said porting code from the Sierra supercomputer to El Capitan was relatively straightforward.
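The energy argument can be made concrete with a toy cost model. The picojoule figures below are invented for illustration, not AMD or LLNL data; the point is only that moving bytes over an off-package link can dominate the energy of the computation itself, which is exactly what a unified, cache-coherent memory space avoids:

```python
# Hypothetical energy costs (illustrative assumptions, not measured values).
PJ_PER_FLOP = 10           # assumed energy per floating-point operation
PJ_PER_BYTE_OFF_CHIP = 60  # assumed energy to move one byte across an off-package link

def discrete_node_energy(flops, bytes_touched):
    # Discrete CPU + GPU: inputs and results each cross the CPU<->GPU link.
    return flops * PJ_PER_FLOP + 2 * bytes_touched * PJ_PER_BYTE_OFF_CHIP

def unified_apu_energy(flops, bytes_touched):
    # Unified, cache-coherent memory: no explicit copies over a link.
    return flops * PJ_PER_FLOP

flops, data_bytes = 1_000_000, 1_000_000
ratio = discrete_node_energy(flops, data_bytes) / unified_apu_energy(flops, data_bytes)
print(ratio)  # 13.0 -- with these assumed costs, data movement dominates
```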
The rest of Supinski’s slides contain information already published by AMD, including performance projections of 8x the AI performance of the MI250X and 5x the performance per watt.
HPE is building El Capitan on its Shasta architecture with the Slingshot-11 network interconnect. This is the same platform that powers the DOE’s other exascale supercomputers: Frontier, the world’s fastest supercomputer, and the long-delayed, Intel-powered Aurora.
NNSA needed to build out more infrastructure to operate the Sierra and El Capitan supercomputers simultaneously. That work included increasing the dedicated computing power supply from 45 MW to 85 MW. An additional 15 MW is available for the cooling system, which has been upgraded to 28,000 tons of capacity with the addition of a new 18,000-ton cooling tower. That gives the site 100 MW of power in total, while El Capitan’s power consumption is expected to come in under 40 MW, with the actual figure likely closer to 30 MW. We won’t know the final numbers until the system is deployed.
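The facility figures above are easy to cross-check (trivial arithmetic; all inputs come from the presentation):

```python
# Site power figures from the El Capitan facility upgrade.
compute_mw_before, compute_mw_after = 45, 85  # dedicated computing supply upgrade
cooling_mw = 15                               # additional power for the cooling system
total_site_mw = compute_mw_after + cooling_mw
print(total_site_mw)  # 100 MW total for the site

# Cooling capacity: the new tower accounts for most of the upgraded total.
total_cooling_tons, new_tower_tons = 28_000, 18_000
print(total_cooling_tons - new_tower_tons)  # 10,000 tons of pre-existing capacity

# Headroom: even at its <40 MW ceiling, El Capitan fits well within the supply.
el_capitan_max_mw = 40
assert el_capitan_max_mw < compute_mw_after
```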
El Capitan will be the first Advanced Technology System (ATS) to use NNSA’s custom Tri-lab Operating System Software (TOSS), a complete software stack built on RHEL.
Rabbit program for El Capitan storage
LLNL is using a smaller ‘EAS3’ system to shake out the software that will be deployed on El Capitan when it goes live later this year. LLNL is also already testing the new Rabbit modules, which host numerous SSDs as near-node local storage. Note that the block diagrams of these nodes, shown above, don’t include MI300 accelerators; instead, the nodes employ standard EPYC server processors for storage orchestration and data-analytics tasks. These fast nodes act as burst buffers that quickly absorb large amounts of incoming data before it is shuffled off to slower mass-storage systems.
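The burst-buffer pattern the Rabbit nodes implement can be sketched in a few lines. This is a toy model with hypothetical capacities, not LLNL code; it only illustrates the absorb-then-drain flow described above:

```python
class BurstBuffer:
    """Toy model of a Rabbit-style near-node burst buffer: fast SSDs
    absorb a checkpoint burst, then drain it to slower bulk storage."""

    def __init__(self, capacity_gb, drain_gb_per_tick):
        self.capacity_gb = capacity_gb          # hypothetical SSD capacity
        self.drain_gb_per_tick = drain_gb_per_tick
        self.used_gb = 0
        self.archived_gb = 0

    def absorb(self, burst_gb):
        # Accept as much of the incoming burst as fits in the SSD tier.
        accepted = min(burst_gb, self.capacity_gb - self.used_gb)
        self.used_gb += accepted
        return accepted

    def drain(self):
        # Shuffle buffered data off to the slower mass-storage system.
        moved = min(self.drain_gb_per_tick, self.used_gb)
        self.used_gb -= moved
        self.archived_gb += moved

bb = BurstBuffer(capacity_gb=100, drain_gb_per_tick=10)
assert bb.absorb(80) == 80     # compute nodes dump a checkpoint quickly
for _ in range(8):
    bb.drain()                 # draining happens between bursts
print(bb.used_gb, bb.archived_gb)  # 0 80
```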
AMD Instinct MI300 Timeline
With development continuing at a predictable pace, El Capitan is clearly on track to go live later this year. The MI300 will pave the way for AMD’s high-performance computing offerings, but AMD says these halo MI300 chips will be expensive and relatively rare. They aren’t mass-produced products, so they won’t be deployed as widely as AMD’s EPYC Genoa data center CPUs. However, the technology is expected to be pared down into multiple variants in different form factors.
The chip will also compete with Nvidia’s Grace Hopper Superchip, which combines a Hopper GPU and a Grace CPU on the same board; those chips are expected to arrive later this year. The Neoverse-based Grace CPU supports the Arm v9 instruction set, and systems pair two of the chips fused together with Nvidia’s newly branded NVLink-C2C interconnect. In contrast, AMD’s approach is designed to deliver superior throughput and energy efficiency: combining these devices into a single package typically enables better throughput between the units than connecting two separate chips, as Grace Hopper does.
The MI300 was also supposed to compete with Intel’s Falcon Shores, a chip originally designed to feature x86 cores, GPU cores, and memory in varying numbers of compute tiles and configurations. However, Intel recently delayed Falcon Shores to 2025 and redefined it as a GPU- and AI-only architecture, dropping the CPU cores entirely. That effectively leaves Intel with no direct competitor to the Instinct MI300.
Given that El Capitan’s power-up date is fast approaching, and given AMD’s track record of delivering supercomputers on time, we can expect AMD to begin sharing more information about its Instinct MI300 APUs soon. AMD will host a livestream event on its next-generation data center and AI technology on June 13, where we hope to learn more. We’ll be sure to bring you updates as they arrive.