Intel Ponte Vecchio Seemingly Offers 2.5x Higher Performance Than Nvidia’s A100
Intel detailed its Ponte Vecchio Xe-HPC GPUs at Hot Chips 34. In the benchmarks provided, the chipmaker claims that Ponte Vecchio offers up to 2.5x more performance than Nvidia’s A100. As always, however, take vendor-supplied benchmarks with a grain of salt.
Ponte Vecchio significantly outperformed the A100 in several Intel-selected benchmarks. The Intel heavyweight also boasted a 2x lead in miniBUDE and a 1.5x lead in ExaSMR. It’s an interesting comparison, considering that Ponte Vecchio hasn’t launched yet and the A100 (Ampere) has been on the market since 2020. Also, don’t forget that AMD’s Instinct MI250X (Aldebaran) is reported to be 3 times faster than the A100. So Intel should be worried about next-generation HPC products from AMD and Nvidia.
If Intel’s numbers are accurate, Ponte Vecchio could be a competitor to Nvidia’s next-generation H100 (Hopper). Based on the specs so far, the H100 should be at least twice as fast as the A100. AMD’s Instinct MI300 looks even more menacing, fusing both a Zen 4 CPU and a CDNA 3 GPU chiplet into one offering. Billed as the world’s first data center APU, the Instinct MI300 will deliver an 8x improvement in AI training performance compared to the Instinct MI250X, according to AMD.
Ponte Vecchio comes in three flavors: OAM, x4 Subsystem with Xe Link, and x4 Subsystem with Xe Link on the dual-socket Sapphire Rapids platform. Unfortunately, Sapphire Rapids has hit so many delays that it’s no longer funny. Barring further setbacks, some Sapphire Rapids products could finally debut in October. Nonetheless, a large number of chips may not arrive until February 2023.
Ponte Vecchio supports both 4-GPU and 8-GPU platforms in its OAM form factor. A two-stack Ponte Vecchio configuration delivers 52 TFLOPs of FP32 and FP64 performance. For comparison, a single H100 SXM5 module peaks at 60 TFLOPs of FP32 and 30 TFLOPs of FP64 performance.
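To put those figures side by side, here is a minimal Python sketch that divides the quoted Ponte Vecchio numbers by the quoted H100 numbers. It uses only the values cited above; the dictionary and the `peak_ratio` helper are names made up for this illustration, and the result says nothing about real-world performance.

```python
# Throughput figures quoted in the article, in TFLOPs.
# These are vendor-stated numbers, not measured results.
PEAK_TFLOPS = {
    "Ponte Vecchio (two-stack)": {"FP32": 52.0, "FP64": 52.0},
    "H100 SXM5": {"FP32": 60.0, "FP64": 30.0},
}


def peak_ratio(numerator, denominator, precision):
    """Return numerator's quoted throughput divided by denominator's (hypothetical helper)."""
    return PEAK_TFLOPS[numerator][precision] / PEAK_TFLOPS[denominator][precision]


if __name__ == "__main__":
    for precision in ("FP32", "FP64"):
        ratio = peak_ratio("Ponte Vecchio (two-stack)", "H100 SXM5", precision)
        print(f"{precision}: Ponte Vecchio / H100 = {ratio:.2f}x")
```

On paper, that works out to roughly 0.87x of the H100 in FP32 and about 1.73x in FP64.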
Ponte Vecchio features a 64MB register file with up to 419 TBps of bandwidth. The L1 and L2 caches are 64MB and 408MB, respectively. Ponte Vecchio’s large L2 cache is useful for certain workloads, such as 2D-FFT and DNN cases. In its presentation, Intel’s results show a significant performance improvement when going from 80MB to 408MB in both scenarios.