World’s Fastest Supercomputer Can’t Run a Day Without Failure
Building a supercomputer is always challenging, but building the industry’s first exascale-class system means confronting the unexpected and doing a great deal of hardware and software work. Apparently, this is exactly what is happening with the Frontier supercomputer at Oak Ridge National Laboratory (ORNL): the machine can barely last a day without numerous hardware failures.
ORNL’s Frontier is the industry’s first system designed to deliver peak performance of up to 1.685 FP64 ExaFLOPS, using AMD’s 64-core EPYC Trento processors, Instinct MI250X compute GPUs, and HPE’s Slingshot interconnect at 21 MW of power. HPE built the system on its Cray EX architecture, which is designed for scale-out applications, primarily ultrafast supercomputers.
On paper, the Frontier supercomputer looks very good: its hardware has been delivered, and the system has even come online, running at about 1 FP64 ExaFLOPS. Hardware problems, however, appear to be preventing it from becoming available to the researchers who need its performance.
As Justin Whitt, program director of the Oak Ridge Leadership Computing Facility (OLCF), said in an interview with InsideHPC: “At this scale, it will fail. The average time to failure for a system of this size is hours, not days.”
Rumors of potential Frontier hardware failures have been circulating for quite some time. According to another InsideHPC story, the system has been experiencing problems with the Slingshot interconnect. Additionally, some have said that AMD’s Instinct MI250X compute GPUs have not been as reliable as expected this year. Note that the X version, with more stream processors and higher clocks, is only available to select customers.
Whitt didn’t confirm that the Instinct GPUs or the Slingshot interconnect had any specific issues, but he did stress that the machine had a number of hardware issues.
“Many challenges are focused on [GPUs], but that’s not the bulk of the challenges we’re seeing,” said the OLCF chief. “I don’t think we have a lot of concerns about AMD products at this point.”
Oak Ridge National Lab’s Frontier supercomputer isn’t the only system to use HPE’s Cray EX architecture with Slingshot interconnects, AMD’s EPYC CPUs, and AMD’s Instinct compute GPUs. For example, Finland’s Lumi supercomputer (Cray EX, EPYC Milan, Instinct MI250X compute GPUs) offers a peak performance of 550 PetaFLOPS and is officially ranked as the third most powerful supercomputer in the world. Perhaps the problem really is the scale of the machine, which has a total of 60 million parts.
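To see why sheer part count alone could produce a time between failures measured in hours, consider a rough back-of-the-envelope sketch. It assumes that every part fails independently at a constant rate and that any single failure counts as a system failure, both of which are simplifications. The 60 million part count comes from the article; the per-part reliability figure and the function names are purely hypothetical, chosen for illustration, and are not numbers reported by ORNL, HPE, or AMD.

```python
import math

HOURS_PER_YEAR = 24 * 365


def system_mtbf_hours(part_mtbf_hours: float, n_parts: int) -> float:
    """MTBF of a system in which any one part failure is a system failure.

    Assumes identical, independent parts with constant failure rates
    (exponential lifetimes), so the failure rates simply add up.
    """
    if part_mtbf_hours <= 0 or n_parts <= 0:
        raise ValueError("part_mtbf_hours and n_parts must be positive")
    return part_mtbf_hours / n_parts


def prob_no_failure(duration_hours: float, mtbf_hours: float) -> float:
    """Probability of running for duration_hours with no failure at all."""
    return math.exp(-duration_hours / mtbf_hours)


if __name__ == "__main__":
    n_parts = 60_000_000              # part count cited in the article
    assumed_part_mtbf_years = 50_000  # HYPOTHETICAL per-part MTBF, for illustration only

    mtbf = system_mtbf_hours(assumed_part_mtbf_years * HOURS_PER_YEAR, n_parts)
    print(f"System MTBF: {mtbf:.1f} hours")
    print(f"Chance of a failure-free day: {prob_no_failure(24, mtbf):.1%}")
```

Under these assumed numbers, even parts that individually fail once in 50,000 years on average add up to a system-level failure roughly every seven hours, and the odds of a full day without any failure fall below 4%. Real machines tolerate many part failures without going down, so this is an illustration of the scaling effect, not a model of Frontier itself.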
Given that the Frontier supercomputer, which was originally promised to come online in 2022, has yet to be officially deployed, it remains to be seen whether it will be available to researchers in 2023.