Tesla has added thousands of Nvidia A100 GPUs to power its in-house AI supercomputer. About a year ago the system comprised 5,760 A100 GPUs, but that number has since grown to 7,360, an additional 1,600 GPUs, or an increase of about 28%.
According to Tim Zaman, engineering manager at Tesla, the upgrade will make the company’s AI system one of the world’s top seven supercomputers by number of GPUs.
The Nvidia A100 is a powerful Ampere-architecture GPU for the data center. It shares the same underlying architecture as the GeForce RTX 30-series, some of the best graphics cards available today. However, with 80 GB of HBM2e memory offering up to 2 TB/s of bandwidth and a power draw of up to 400 W, the A100 is far removed from consumer hardware. Its architecture is tuned to accelerate tasks common in AI, data analytics, and high-performance computing (HPC) applications.
The first system Nvidia showed using the A100 was the Nvidia DGX A100, which links 8 A100 GPUs via 6 NVSwitches with 4.8 TB/s of bi-directional bandwidth. Per node, it delivers up to 10 PetaOPS of INT8 performance, 5 PFLOPS for FP16, 2.5 PFLOPS for TF32, and 156 TFLOPS for FP64.
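As a sanity check, the node-level figures above follow from multiplying Nvidia's published per-GPU A100 peak throughput by the 8 GPUs in a DGX A100. A minimal sketch, assuming the datasheet tensor-core figures (the INT8, FP16, and TF32 numbers use 2:4 structured sparsity):

```python
# Per-GPU A100 peak throughput, from Nvidia's public A100 datasheet.
# Sparse figures assume 2:4 structured sparsity on the tensor cores.
a100_per_gpu = {
    "INT8 (sparse)": 1248,    # TOPS
    "FP16 (sparse)": 624,     # TFLOPS
    "TF32 (sparse)": 312,     # TFLOPS
    "FP64 (tensor)": 19.5,    # TFLOPS
}

GPUS_PER_NODE = 8  # one DGX A100

for metric, per_gpu in a100_per_gpu.items():
    node_total = per_gpu * GPUS_PER_NODE
    print(f"{metric}: {node_total:,.0f} per DGX A100 node")
# INT8  -> 9,984  TOPS   (~10 PetaOPS)
# FP16  -> 4,992  TFLOPS (~5 PFLOPS)
# TF32  -> 2,496  TFLOPS (~2.5 PFLOPS)
# FP64  ->   156  TFLOPS
```

The marketing round-numbers (10 PetaOPS, 5 PFLOPS, 2.5 PFLOPS) are simply these 8-GPU aggregates rounded up.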
That is just 8 A100 GPUs; Tesla’s AI supercomputer currently has 7,360. Tesla hasn’t published benchmarks for the machine, but the similarly equipped, GPU-based NERSC Perlmutter, powered by 6,144 Nvidia A100 GPUs, has achieved 70.87 Linpack petaflops. Using this and data from other A100 GPU supercomputers as reference points, HPCwire estimates that the Tesla AI supercomputer could achieve about 100 Linpack petaflops.
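For intuition, a naive linear scaling from Perlmutter’s published Linpack result gives a lower bound in the same ballpark; HPCwire’s ~100-petaflops figure presumably draws on additional A100 systems and assumptions beyond this back-of-envelope sketch:

```python
# Back-of-envelope estimate: scale Perlmutter's Linpack result
# linearly by GPU count. This ignores interconnect, node config,
# and tuning differences, so treat it as a rough floor.
perlmutter_pflops = 70.87   # measured Linpack petaflops
perlmutter_gpus = 6144
tesla_gpus = 7360

per_gpu_pflops = perlmutter_pflops / perlmutter_gpus
tesla_estimate = per_gpu_pflops * tesla_gpus
print(f"~{tesla_estimate:.1f} Linpack petaflops")  # ~84.9
```

The gap between this ~85-petaflops floor and the ~100-petaflops estimate illustrates how much such projections depend on the chosen reference systems.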
Tesla has no long-term intention of continuing down the Nvidia GPU architecture path for its AI supercomputer; a top-seven machine by GPU count is just a harbinger of what’s next. Dojo is the supercomputer first announced by Elon Musk, dating back to 2020. A year ago we saw the Tesla D1 Dojo chip, designed to replace Nvidia’s GPUs and deliver “maximum performance, throughput, and bandwidth at every granularity.”
The Tesla Dojo D1 is a custom ASIC (application-specific integrated circuit) designed for AI training and one of the first ASICs in the field. The current D1 chip is manufactured on TSMC’s N7 process and has about 50 billion transistors.
More details on the Dojo D1 chip and the Dojo system may be revealed at next week’s Hot Chips symposium. Next Tuesday, Tesla has three presentations scheduled, covering the Dojo D1 chip architecture, Dojo and ML training, and enabling AI through system integration.