Nvidia’s Chinese A800 GPU’s Performance Revealed

An article about China’s overwhelming demand for Nvidia’s high-performance computing hardware has revealed the performance of Nvidia’s mysterious A800 compute GPU, made for the Chinese market. According to MyDrivers, the A800 runs at 70% of the speed of the A100 GPU and complies with the strict U.S. export rules that limit the processing power Nvidia can sell.
The three-year-old Nvidia A100 still performs quite well: it delivers 9.7 FP64 / 19.5 FP64 Tensor TFLOPS for HPC and up to 624 BF16/FP16 Tensor TFLOPS for AI workloads (with sparsity). Even with a roughly 30% reduction, these numbers still look formidable.
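As a back-of-envelope check, the A800's throughput can be sketched by scaling the A100's published peak figures by the roughly 70% factor MyDrivers cites. Treating that factor as a uniform cap across precisions is an assumption made here for illustration:

```python
# Back-of-envelope estimate: scale Nvidia's published A100 peak-throughput
# figures by the ~70% factor MyDrivers reports for the A800.
# Assumption (ours, not the article's): the cap applies uniformly
# across all precisions.
A100_PEAK_TFLOPS = {
    "FP64": 9.7,
    "FP64 Tensor": 19.5,
    "BF16/FP16 Tensor (sparse)": 624.0,
}

SCALE = 0.70  # A800 runs at ~70% of A100 speed, per MyDrivers

a800_estimate = {name: peak * SCALE for name, peak in A100_PEAK_TFLOPS.items()}

for name, tflops in a800_estimate.items():
    print(f"{name}: ~{tflops:.1f} TFLOPS")
```

This scaling is consistent with the table below, which lists 13.7 FP32 TFLOPS for the A800 against 19.5 for the A100.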
Despite the performance cap (which MyDrivers calls a “castration”), Nvidia’s A800 remains a serious competitor in computing power to the full-fledged BR104 and BR100 compute GPUs from China-based Biren. Meanwhile, Nvidia’s compute GPUs and its CUDA architecture are widely supported by the applications customers run, whereas Biren’s processors have yet to see broad adoption. And due to the latest regulations, even Biren cannot ship its serious compute GPUs in China.
|  | Biren BR104 | Nvidia A800 | Nvidia A100 | Nvidia H100 |
| --- | --- | --- | --- | --- |
| Form factor | FHFL card | FHFL card (?) | SXM4 | SXM5 |
| Transistor count | ? | 54.2 billion | 54.2 billion | 80 billion |
| Node | N7 | N7 | N7 | 4N |
| Power | 300W | ? | 400W | 700W |
| FP32 TFLOPS | 128 | 13.7 (?) | 19.5 | 60 |
| TF32+ TFLOPS | 256 | ? | ? | ? |
| TF32 TFLOPS | ? | 109/218* (?) | 156/312* | 500/1000* |
| FP16 TFLOPS | ? | 56 (?) | 78 | 120 |
| FP16 Tensor TFLOPS | ? | 218/437* | 312/624* | 1000/2000* |
| BF16 TFLOPS | 512 | 27 | 39 | 120 |
| BF16 Tensor TFLOPS | ? | 218/437* | 312/624* | 1000/2000* |
| INT8 TOPS | 1024 | ? | ? | ? |
| INT8 Tensor TOPS | ? | 437/874* | 624/1248* | 2000/4000* |

*With sparsity.
Export restrictions imposed by the United States in October 2022 ban the export to China of technology needed to build supercomputers that exceed 100 FP64 PetaFLOPS or 200 FP32 PetaFLOPS of performance within 41,600 cubic feet (1,178 cubic meters) of space or less. The restrictions do not directly cap the performance of each individual compute GPU sold to China-based entities, but they do limit throughput and scalability.
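For a sense of scale, here is a rough, hypothetical calculation of how many A100-class GPUs it would take to reach the 100 FP64 PetaFLOPS threshold, using the 19.5 FP64 Tensor TFLOPS peak figure quoted earlier. This is a simplification: it assumes peak rates and ignores the volume constraint and real-world efficiency:

```python
import math

# Hypothetical back-of-envelope: GPUs needed to reach the export rule's
# 100 FP64 PetaFLOPS threshold at peak rates. Real systems deliver well
# below peak, so actual counts would be higher.
THRESHOLD_PFLOPS = 100.0          # FP64 threshold from the export rules
A100_FP64_TENSOR_TFLOPS = 19.5    # per-GPU peak figure from the article

threshold_tflops = THRESHOLD_PFLOPS * 1000.0  # 1 PFLOPS = 1000 TFLOPS
gpus_needed = math.ceil(threshold_tflops / A100_FP64_TENSOR_TFLOPS)

print(f"~{gpus_needed} A100-class GPUs at peak FP64 Tensor rate")
```

In other words, the threshold only comes into play at multi-thousand-GPU cluster scale, which is why the rules bite on interconnect and scalability rather than on any single chip.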
After the new rules went into effect, Nvidia lost the ability to sell its ultra-high-end A100 and H100 compute GPUs to China-based customers without an export license, which is difficult to obtain. To meet the performance demands of Chinese hyperscalers, the company introduced a scaled-down version of the A100 called the A800. Until now, this GPU’s capabilities were not publicly known.
With consumers and businesses alike increasingly using artificial intelligence, demand for high-performance hardware capable of handling such workloads is skyrocketing. Nvidia is one of the main beneficiaries of the AI megatrend, which is why the company’s GPUs are in such high demand that even the scaled-down A800 is sold out in China.
Biren’s BR100 comes in an OAM form factor and consumes up to 550W of power. The chip supports the company’s proprietary 8-way BLink technology, which allows up to eight BR100 GPUs to be installed per system. In contrast, the 300W BR104 ships in an FHFL dual-width PCIe card form factor and supports up to 3-way multi-GPU configurations. Both chips use a PCIe 5.0 x16 interface with the CXL protocol for accelerators on top, according to EE Trends (via VideoCardz).
Biren has said that both chips are manufactured using TSMC’s 7nm-class process technology (the company did not elaborate on whether it uses N7, N7+, or N7P). The larger BR100 packs 77 billion transistors, beating the 54.2 billion of Nvidia’s A100, which is also made on one of TSMC’s N7 nodes. The company also said it had to use a chiplet design and the foundry’s CoWoS 2.5D packaging technology to overcome the limits imposed by TSMC’s reticle size. That is perfectly logical: Nvidia’s A100 is already close to reticle size, and the BR100, with its higher transistor count, is presumably even larger.
Given the specifications, we can assume the BR100 essentially uses two BR104s, although the developer has not officially confirmed this.
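That two-BR104 hypothesis can be sanity-checked by doubling the BR104’s published figures from the table above. The doubled numbers are only what the hypothesis implies, not Biren’s official BR100 specifications:

```python
# Sanity check of the "BR100 = two BR104s" hypothesis: double the BR104's
# published peak figures from the table. These doubled values are what the
# hypothesis implies, not Biren's official BR100 specifications.
BR104_PEAKS = {
    "FP32 TFLOPS": 128,
    "TF32+ TFLOPS": 256,
    "BF16 TFLOPS": 512,
    "INT8 TOPS": 1024,
}

implied_br100 = {metric: 2 * value for metric, value in BR104_PEAKS.items()}

for metric, value in implied_br100.items():
    print(f"Implied BR100 {metric}: {value}")
```

Note that the power budget does not double in the same way: the BR100 is rated at up to 550W versus 2 × 300W for two BR104 cards, suggesting the combined chip is clocked or binned more conservatively.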
To commercialize the BR100 OAM accelerator, Biren collaborated with Inspur on an 8-way AI server that started sampling in Q4 2022. Baidu and China Mobile are among the first customers to use Biren’s compute GPUs.