NVIDIA Announces H100 NVL – Max Memory Server Card for Large Language Models

NVIDIA didn’t introduce any new GPUs or GPU architectures at this year’s Spring GTC event, but the company is rolling out new products based on the Hopper and Ada Lovelace GPUs it introduced last year. At the high end of the market, it today announced the H100 NVL, a variant of the H100 accelerator aimed specifically at users of large language models.

The H100 NVL is an interesting variant of NVIDIA’s H100 PCIe card that, in a sign of the times and of NVIDIA’s widespread success in the AI space, is aimed at a single market: large language model (LLM) deployment. A few things set this card apart from NVIDIA’s usual server fare, not least of which is that it consists of two H100 PCIe boards that come already bridged together, but the bigger takeaway is the memory capacity. The combined dual-GPU card offers 188GB of HBM3 memory, 94GB per card, which is more memory per GPU than any previous NVIDIA part, even within the H100 family.

NVIDIA H100 accelerator spec comparison
                      | H100 NVL                     | H100 PCIe            | H100 SXM
FP32 CUDA cores       | 2 x 16896?                   | 14592                | 16896
Tensor cores          | 2 x 528?                     | 456                  | 528
Boost clock           | 1.98GHz?                     | 1.75GHz              | 1.98GHz
Memory clock          | ~5.1Gbps HBM3                | 3.2Gbps HBM2e        | 5.23Gbps HBM3
Memory bus width      | 6144-bit                     | 5120-bit             | 5120-bit
Memory bandwidth      | 2 x 3.9TB/s                  | 2TB/s                | 3.35TB/s
VRAM                  | 2 x 94GB (188GB)             | 80GB                 | 80GB
FP32 vector           | 2 x 67 TFLOPS?               | 51 TFLOPS            | 67 TFLOPS
FP64 vector           | 2 x 34 TFLOPS?               | 26 TFLOPS            | 34 TFLOPS
INT8 tensor           | 2 x 1980 TOPS                | 1513 TOPS            | 1980 TOPS
FP16 tensor           | 2 x 990 TFLOPS               | 756 TFLOPS           | 990 TFLOPS
TF32 tensor           | 2 x 495 TFLOPS               | 378 TFLOPS           | 495 TFLOPS
FP64 tensor           | 2 x 67 TFLOPS?               | 51 TFLOPS            | 67 TFLOPS
Interconnect          | NVLink 4, 18 links (900GB/s) | NVLink 4 (600GB/s)   | NVLink 4, 18 links (900GB/s)
GPU                   | 2 x GH100 (814mm2)           | GH100 (814mm2)       | GH100 (814mm2)
Transistor count      | 2 x 80B                      | 80B                  | 80B
TDP                   | 700W                         | 350W                 | 700-800W
Manufacturing process | TSMC 4N                      | TSMC 4N              | TSMC 4N
Interface             | 2 x PCIe 5.0 (quad slot)     | PCIe 5.0 (dual slot) | SXM5
Architecture          | Hopper                       | Hopper               | Hopper

Driving this SKU is a specific niche: memory capacity. Large language models like the GPT family are in many ways memory-capacity bound, as they will quickly fill up even an H100 accelerator just to hold all of their parameters (175B in the case of the largest GPT-3 model). As a result, NVIDIA has put together a new H100 SKU that offers a bit more memory per GPU than its regular H100 parts, which top out at 80GB per GPU.
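To put that capacity pressure in rough numbers, here is a back-of-the-envelope sketch of the weights-only footprint of a 175B-parameter model. The figures are illustrative assumptions (FP16 weights, no activations, no KV cache, no runtime overhead), not NVIDIA’s numbers:

```python
# Rough, weights-only estimate of how many GPUs a 175B-parameter model needs.
# Assumptions (illustrative): FP16/BF16 weights at 2 bytes per parameter,
# ignoring activations, KV cache, and framework overhead.
import math

params = 175e9                  # GPT-3-class parameter count
bytes_per_param = 2             # FP16/BF16
weights_gb = params * bytes_per_param / 1e9

for name, capacity_gb in [("H100 80GB", 80), ("H100 NVL (per GPU)", 94)]:
    gpus = math.ceil(weights_gb / capacity_gb)
    print(f"{name}: ~{weights_gb:.0f}GB of weights -> at least {gpus} GPUs")
```

Even before activations and the KV cache are counted, the weights alone spill well past a single card, which is why every extra gigabyte per GPU matters for this market.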

Under the hood, what we’re looking at is essentially a special bin of the GH100 GPU placed on a PCIe card. All GH100 GPUs come with six stacks of HBM memory (either HBM2e or HBM3) with a capacity of 16GB per stack. However, for yield reasons, NVIDIA only ships its regular H100 parts with five of the six HBM stacks enabled. So while each GPU nominally has 96GB of VRAM on board, only 80GB is available on the regular SKUs.

The H100 NVL, in turn, is a fully enabled SKU with all six stacks turned on. Turning on the sixth HBM stack gives NVIDIA access to the additional memory and memory bandwidth it provides. It will have some material impact on yields – just how much is a closely guarded NVIDIA secret – but the LLM market is apparently big enough, and willing to pay a high enough premium for a nearly perfect GH100 package, to make it worth NVIDIA’s while.

Note that customers still don’t get access to quite all of the 96GB per card. Rather, with a total capacity of 188GB, they effectively get 94GB per card. NVIDIA didn’t elaborate on this design quirk in its pre-briefing ahead of today’s keynote, but we suspect this, too, is for yield reasons, giving NVIDIA some leeway to disable bad cells (or layers) within the HBM3 memory stacks. The end result is that the new SKU offers 14GB more memory per GH100 GPU, a 17.5% increase. Meanwhile, the total memory bandwidth of the card is 7.8TB/second, which works out to 3.9TB/second per individual board.
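Those headline figures fall out of straightforward HBM arithmetic. A quick sanity check, assuming the roughly 5.1Gbps-per-pin data rate and 6144-bit bus from the table above (approximate figures, not final specifications):

```python
# Sanity check on the quoted H100 NVL memory figures.
# Assumes six enabled HBM3 stacks (6 x 1024-bit) at ~5.1Gbps per pin.
BUS_WIDTH_BITS = 6 * 1024          # 6144-bit bus per GPU
DATA_RATE_GBPS = 5.1               # approximate per-pin data rate

per_gpu_tb_s = BUS_WIDTH_BITS * DATA_RATE_GBPS / 8 / 1000
print(f"Per-GPU bandwidth: ~{per_gpu_tb_s:.1f} TB/s")       # ~3.9 TB/s
print(f"Dual-card total:   ~{2 * per_gpu_tb_s:.1f} TB/s")   # ~7.8 TB/s

capacity_increase = (94 - 80) / 80
print(f"Capacity increase: {capacity_increase:.1%}")        # 17.5%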

Beyond the increased memory capacity, in many ways the individual cards within the larger dual-GPU/dual-card H100 NVL look a lot like the SXM5 version of the H100 placed on a PCIe card. While the regular H100 PCIe is held back somewhat by its use of slower HBM2e memory, fewer active SMs/tensor cores, and lower clock speeds, the tensor core performance figures NVIDIA is quoting for the H100 NVL are all on par with the H100 SXM5, indicating that this card is not further cut down like the regular PCIe card. We’re still waiting on the final, complete specifications for the product, but assuming everything here is as presented, the GH100s going into the H100 NVL would represent the best-binned GH100s currently available.

And the plural should be emphasized here. As noted above, the H100 NVL is a dual-GPU/dual-card part, not a single GPU, and it presents itself to the host system as such. The hardware itself is based on two PCIe form-factor H100s bonded together with three NVLink 4 bridges. Physically, this is virtually identical to NVIDIA’s existing H100 PCIe design (which can already be paired via NVLink bridges), so the difference isn’t in the construction of the two-board/four-slot behemoth, but rather in the quality of the silicon inside. Put another way, you can bridge regular H100 PCIe cards together today, but they won’t match the memory bandwidth, memory capacity, or tensor throughput of the H100 NVL.
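Because the NVL presents itself to the host as two devices rather than one, software sees it much like any other pair of bridged H100 cards. As a minimal illustration (using PyTorch purely for convenience; nothing here is NVL-specific or confirmed by NVIDIA), a host-side enumeration would simply list each ~94GB GPU separately:

```python
# Minimal sketch: enumerate CUDA devices as the host would see an H100 NVL.
# A dual-GPU/dual-card part shows up as two separate devices, each with its
# own ~94GB of HBM3; the pairing happens over the NVLink bridges.
import torch

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, "
          f"{props.total_memory / 1024**3:.0f} GiB, "
          f"{props.multi_processor_count} SMs")
```

Multi-GPU frameworks would then treat the pair like any other two-GPU, NVLink-connected topology.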

Surprisingly, despite the flagship-like specifications, the TDP is almost unchanged. The H100 NVL is a 700W to 800W part, which breaks down to 350W to 400W per board, with the lower bound being the same TDP as the regular H100 PCIe. In this case NVIDIA appears to be prioritizing compatibility over peak performance: few server chassis can handle PCIe cards above 350W (and fewer still above 400W), so the TDP needs to stay put. Still, given the higher performance figures and memory bandwidth, it’s not entirely clear how NVIDIA is affording the extra performance. Power binning can go a long way here, but it may also be a case of NVIDIA giving the card a higher-than-usual boost clock, since the target market is primarily concerned with tensor performance and won’t be lighting up the entire GPU at once.

Otherwise, NVIDIA’s decision to release what is essentially its best H100 bin in this form is an unusual choice given the company’s general preference for SXM parts, but it’s one that makes sense in the context of what LLM customers need. Large SXM-based H100 clusters can easily scale up to 8 GPUs, but the amount of NVLink bandwidth available between any two of them is constrained by the need to go through NVSwitches. For a two-GPU configuration, pairing a set of PCIe cards is much more direct, with the fixed link guaranteeing 600GB/second of bandwidth between the cards.

But perhaps more important than that is simply the ability to deploy the H100 NVL quickly into existing infrastructure. Rather than having to install specially built H100 HGX carrier boards to pair up GPUs, LLM customers can just drop H100 NVLs into new server builds, or use them as a relatively quick upgrade for existing ones. NVIDIA is going after a very specific market here, after all, so the usual advantages of SXM (and NVIDIA’s ability to throw its collective weight around) may not apply.

Overall, NVIDIA touts the H100 NVL as offering 12x the GPT3-175B inference throughput of the previous-generation HGX A100 (8 H100 NVLs vs. 8 A100s). That will certainly appeal to customers looking to deploy and scale up systems for LLM workloads as quickly as possible. Notably, the H100 NVL brings nothing new in terms of architectural features; much of the performance uplift here comes from the Hopper architecture’s new transformer engine. But the H100 NVL serves a specific niche as the fastest PCIe H100 option, and the option with the largest GPU memory pool.

According to NVIDIA, H100 NVL cards will start shipping later this year. The company hasn’t put a price on them, but for what is essentially the top GH100 bin, we’d expect them to fetch a top price, especially given that the LLM boom is turning into a gold rush.
