Nvidia Reveals Ada Lovelace GPU Secrets: Extreme Transistor Counts at High Clocks
When Nvidia unveiled its Ada Lovelace family of graphics processing units earlier this week, it focused mostly on its top-end AD102 GPU and the flagship GeForce RTX 4090 graphics card. Few details were released about the AD103 and AD104 graphics chips. Luckily, Nvidia has now published an Ada Lovelace whitepaper that contains a ton of data about its new GPUs and fills in many of the gaps. We’ve updated our RTX 40-series GPU coverage with the new details, but here’s a quick rundown of the new and interesting information.
Big GPU for big games
We already know that Nvidia’s top-of-the-line AD102 is a 608 mm^2 GPU containing 76.3 billion transistors, 18,432 CUDA cores and 96 MB of L2 cache. We also know that the AD103 is a 378.6 mm^2 graphics processor with 45.9 billion transistors, 10,240 CUDA cores and 64 MB of L2 cache. As for the AD104, it has a die size of 294.5 mm^2, 35.8 billion transistors, 7,680 CUDA cores and 48 MB of L2 cache.
Graphics card | Full AD102 | RTX 4090 | RTX 4080 16GB | RTX 4080 12GB | RTX 3090 Ti |
---|---|---|---|---|---|
GPU | AD102 | AD102 | AD103 | AD104 | GA102 |
Process technology | TSMC 4N | TSMC 4N | TSMC 4N | TSMC 4N | Samsung 8LPP |
Transistors (billion) | 76.3 | 76.3 | 45.9 | 35.8 | 28.3 |
Die size (mm^2) | 608 | 608 | 378.6 | 294.5 | 628.4 |
Streaming multiprocessors | 144 | 128 | 76 | 60 | 84 |
GPU cores (shaders) | 18432 | 16384 | 9728 | 7680 | 10752 |
Tensor cores | 576 | 512 | 304 | 240 | 336 |
Ray tracing cores | 144 | 128 | 76 | 60 | 84 |
TMUs | 576 | 512 | 304? | 240 | 336 |
ROPs | 192 | 176 | 112 | 80 | 112 |
L2 cache (MB) | 96 | 72 | 64 | 48 | 6 |
Boost clock (MHz) | ? | 2520 | 2505 | 2610 | 1860 |
TFLOPS FP32 (Boost) | ? | 82.6 | 48.7 | 40.1 | 40.0 |
TFLOPS FP16 (FP8) | ? | 661 (1321) | 390 (780) | 319 (639) | 320 (none) |
TFLOPS Ray Tracing | ? | 191 | 113 | 82 | 78.1 |
Memory interface (bits) | 384 | 384 | 256 | 192 | 384 |
Memory speed (GT/s) | ? | 21 | 22.4 | 21 | 21 |
Bandwidth (GB/s) | ? | 1008 | 716.8 | 504 | 1008 |
TDP (Watts) | ? | 450 | 320 | 285 | 450 |
Release date | ? | October 12, 2022 | November 2022? | November 2022? | March 2022 |
Launch price | ? | $1,599 | $1,199 | $899 | $1,999 |
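
The bandwidth row can be derived from the memory interface and memory speed rows: peak bandwidth is the bus width multiplied by the per-pin data rate, divided by eight bits per byte. Here is a minimal Python sketch of that arithmetic using the interface widths and memory speeds listed in the table. The function name and dictionary layout are illustrative assumptions, not something taken from Nvidia’s whitepaper.

```python
def memory_bandwidth_gbs(bus_width_bits: int, data_rate_gtps: float) -> float:
    """Theoretical peak bandwidth in GB/s: bits per transfer x GT/s / 8 bits per byte."""
    return bus_width_bits * data_rate_gtps / 8


# (bus width in bits, memory speed in GT/s), as listed in the table
cards = {
    "RTX 4090": (384, 21.0),
    "RTX 4080 16GB": (256, 22.4),
    "RTX 4080 12GB": (192, 21.0),
    "RTX 3090 Ti": (384, 21.0),
}

bandwidth = {name: memory_bandwidth_gbs(*spec) for name, spec in cards.items()}

for name, gbs in bandwidth.items():
    print(f"{name}: {gbs:.1f} GB/s")
```

These are theoretical peaks; the large L2 cache on Ada parts is meant to reduce how often the GPU has to go out to memory in the first place.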
One of the interesting things Nvidia mentions in its whitepaper is that Ada Lovelace GPUs use high-speed transistors in critical paths to boost maximum clock speeds. As a result, a fully enabled AD102 GPU with 18,432 CUDA cores “can be clocked above 2.5 GHz while maintaining the same 450W TGP.” With this in mind, it should come as no surprise that the company talks about the GeForce RTX 4090 (which has 16,384 CUDA cores) reaching 3.0 GHz in its labs. A card like that would certainly sit at the top of the graphics card rankings.
In addition to high clock speeds, Nvidia’s Ada Lovelace GPUs also feature a large L2 cache to boost performance in compute-intensive workloads (ray tracing, path tracing, simulations, etc.) and reduce memory bandwidth requirements. Basically, Nvidia’s Ada GPUs take a page from AMD’s RDNA 2 Infinity Cache book here, though we believe the general targets for the new architecture were set well before AMD’s Radeon RX 6000-series products debuted in 2020.
Speaking of workloads like simulations, it should be noted that in the supercomputer world they are run on double-precision floating-point (FP64) numbers to improve the accuracy of results. FP64 is more expensive than FP32 in terms of both performance and hardware complexity, which is why computer graphics uses the FP32 format and many simulations of non-critical assets also run at FP32 precision. The AD102 GPU, accordingly, contains just 288 FP64 cores (two per streaming multiprocessor), enough to ensure that programs containing FP64 code, including FP64 Tensor Core code, work correctly.
Still, the AD102’s FP64 rate is just 1/64th of its FP32 TFLOPS rate (which is in line with the Ampere architecture). Nvidia does not show the FP64 cores in its streaming multiprocessor (SM) block diagram, nor does it disclose the number of such cores in the AD103 and AD104 GPUs. The low FP64 rate of Ada graphics processors underscores that these parts are primarily geared toward gaming.
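
The 1/64 figure falls straight out of the per-SM core counts: two FP64 cores sit next to 128 FP32 CUDA cores in each SM. The short Python sketch below reproduces that ratio and applies it to the RTX 4090’s quoted FP32 rate. The FP64 estimate for the RTX 4090 is an assumption on our part (it presumes the card keeps the same per-SM ratio as the full AD102), and the variable names are purely illustrative.

```python
from fractions import Fraction

FULL_AD102_SMS = 144
FP32_CORES_PER_SM = 18_432 // FULL_AD102_SMS  # 128 CUDA cores per SM
FP64_CORES_PER_SM = 2                         # per the whitepaper figure quoted above

# Total FP64 cores on a full AD102
fp64_cores = FULL_AD102_SMS * FP64_CORES_PER_SM

# Ratio of FP64 to FP32 throughput, assuming equal work per core per clock
fp64_to_fp32 = Fraction(FP64_CORES_PER_SM, FP32_CORES_PER_SM)

# Assumption: the RTX 4090 keeps the same per-SM ratio as the full chip
rtx_4090_fp32_tflops = 82.6
rtx_4090_fp64_tflops = rtx_4090_fp32_tflops * float(fp64_to_fp32)

print(f"FP64 cores on full AD102: {fp64_cores}")
print(f"FP64:FP32 ratio: {fp64_to_fp32}")
print(f"Estimated RTX 4090 FP64 rate: {rtx_4090_fp64_tflops:.2f} TFLOPS")
```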
More Transistors = More Performance
The complexity and die sizes of Nvidia’s Ada Lovelace graphics processors relative to the company’s Ampere GPUs shouldn’t come as a surprise. The new Ada GPUs are made using TSMC’s 4N (5nm-class) manufacturing technology, whereas Ampere is manufactured on Samsung Foundry’s 8LPP process (a 10nm-class node with a 10% optical shrink). The added complexity (transistor count) enables significant performance gains in areas such as ray tracing, as well as quality improvements with DLSS 3.0.
Graphics card | Full AD102 | RTX 4090 | RTX 4080 16GB | RTX 4080 12GB | RTX 3090 Ti |
---|---|---|---|---|---|
GPU | AD102 | AD102 | AD103 | AD104 | GA102 |
TFLOPS FP32 (Boost) | ? | 82.6 | 48.7 | 40.1 | 40.0 |
TFLOPS FP16 (FP8) | ? | 661 (1321) | 390 (780) | 319 (639) | 320 (none) |
TFLOPS Ray Tracing | ? | 191 | 113 | 82 | 78.1 |
Another thing to note is that Nvidia’s AD102 GPU has a higher transistor density than its smaller siblings. On the one hand, the roughly 3.5% higher density (going by the transistor counts and die sizes above) helps the AD102 pack significantly more execution units than its smaller siblings. On the other hand, the more relaxed transistor density of the AD103 and AD104 often allows for better yields (assuming the node’s defect density is not atypically high) and higher clocks.
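
The density comparison can be checked from the transistor counts and die sizes quoted earlier. The Python sketch below divides one by the other to get millions of transistors per square millimeter and then expresses the AD102’s advantage as a percentage. Because the published die sizes and transistor counts are rounded, treat the result as approximate; the names used here are illustrative only.

```python
# (transistors in billions, die size in mm^2), as quoted in the article
chips = {
    "AD102": (76.3, 608.0),
    "AD103": (45.9, 378.6),
    "AD104": (35.8, 294.5),
}

# Density in millions of transistors per mm^2
density = {
    name: transistors_bn * 1000 / area_mm2
    for name, (transistors_bn, area_mm2) in chips.items()
}

# AD102's density advantage over each smaller chip, in percent
advantage_pct = {
    name: (density["AD102"] / density[name] - 1) * 100
    for name in ("AD103", "AD104")
}

for name, value in density.items():
    print(f"{name}: {value:.1f} MTr/mm^2")
for name, pct in advantage_pct.items():
    print(f"AD102 vs {name}: +{pct:.1f}%")
```

With these inputs, the AD102 comes out a little over 3% denser than either smaller chip.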
It is difficult to make predictions about the frequency potential of the AD103 and AD104 without access to actual hardware and knowledge of actual yields. However, if the AD102 can operate at 2.5 GHz to 3.0 GHz, the AD103 and AD104 may have even higher potential. We also know that the RTX 4080 12GB uses a fully enabled AD104 chip running at 2610 MHz, the RTX 4080 16GB uses 95% of the AD103 chip (76 out of 80 SMs) running at 2505 MHz, and the RTX 4090 uses only 89% of the AD102 (128 out of 144 SMs) running at 2520 MHz, with 25% of the L2 cache disabled.
Having so many execution units running at high clocks should result in significant performance gains. Nvidia’s GeForce RTX 4090 offers more than double the peak FP32 compute rate (~82.6 TFLOPS) of the GeForce RTX 3090 Ti (~40 TFLOPS).
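
The enabled-silicon percentages are simple ratios of active SMs to the SMs physically present on each die. Here is a small Python sketch of that calculation, which also derives the CUDA core counts from the 128 cores per SM implied by the full AD102 (18,432 cores across 144 SMs). The helper name and data layout are illustrative assumptions.

```python
CUDA_CORES_PER_SM = 18_432 // 144  # 128, from the full AD102 configuration

# (enabled SMs, SMs on the full die)
cards = {
    "RTX 4090 (AD102)": (128, 144),
    "RTX 4080 16GB (AD103)": (76, 80),
    "RTX 4080 12GB (AD104)": (60, 60),
}


def enabled_share_pct(enabled_sms: int, total_sms: int) -> float:
    """Percentage of the die's SMs that are active on a given card."""
    return 100 * enabled_sms / total_sms


summary = {
    name: (enabled * CUDA_CORES_PER_SM, round(enabled_share_pct(enabled, total), 1))
    for name, (enabled, total) in cards.items()
}

# L2 cache left on the RTX 4090 with 25% of the full 96 MB disabled
rtx_4090_l2_mb = 96 * (1 - 0.25)

for name, (cores, pct) in summary.items():
    print(f"{name}: {cores} CUDA cores, {pct}% of SMs enabled")
print(f"RTX 4090 L2 cache: {rtx_4090_l2_mb:.0f} MB")
```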
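
The FP32 figures follow from the usual peak-throughput formula: CUDA cores times two floating-point operations per clock (one fused multiply-add) times the boost clock. The Python sketch below applies it to the core counts and boost clocks quoted in this article. The two-operations-per-clock factor is the standard convention for these marketing numbers, and the function name is our own.

```python
def fp32_tflops(cuda_cores: int, boost_mhz: float) -> float:
    """Peak FP32 TFLOPS: cores x 2 ops per clock (FMA) x clock in MHz / 1e6."""
    return cuda_cores * 2 * boost_mhz / 1e6


# (CUDA cores, boost clock in MHz)
cards = {
    "RTX 4090": (16384, 2520),
    "RTX 4080 16GB": (9728, 2505),
    "RTX 4080 12GB": (7680, 2610),
    "RTX 3090 Ti": (10752, 1860),
}

tflops = {name: fp32_tflops(*spec) for name, spec in cards.items()}
speedup = tflops["RTX 4090"] / tflops["RTX 3090 Ti"]

for name, value in tflops.items():
    print(f"{name}: {value:.1f} TFLOPS")
print(f"RTX 4090 vs RTX 3090 Ti: {speedup:.2f}x")
```

This is a theoretical ceiling; real-world gains depend on how well a given game or application keeps the shaders fed.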
Meanwhile, Nvidia’s current lineup of Ada GPUs for demanding gamers shows that the company is back on track with its three-chip approach to the high-end gaming market. Nvidia usually releases a flagship gaming GPU, followed by a chip with around 66%-75% of the flagship’s resources (e.g. CUDA cores), and then a graphics processor with around 50% of the flagship’s units. With the Ampere family, that strategy was adjusted somewhat: Nvidia’s GA103 chip was designed primarily with laptops in mind and was rarely used for desktops (and it was late to the party, too). But in the Ada generation, Nvidia again has three chips.
More SKUs in stock
One interesting point is the difference between the maximum configuration offered by the AD102 GPU and that of the GeForce RTX 4090 graphics card. The AD102 has 18,432 CUDA cores, while the GeForce RTX 4090 has 16,384 CUDA cores enabled. This approach gives Nvidia some flexibility in terms of yields and future graphics card introductions, so there is plenty of room for an RTX 4090 Ti, an RTX 4080 Ti, RTX 5500/5000 Ada Generation cards for the ProViz market, and more.
The GeForce RTX 4080 16GB and RTX 4080 12GB, on the other hand, use a nearly complete AD103 and a fully enabled AD104 GPU, respectively. We don’t know what the future holds, but we expect to see cut-down versions of the AD103 and AD104 GPUs eventually. We can speculate about a GeForce RTX 4070 Ti and/or RTX 4070 based on cut-down bins of the AD104 chip. We can also speculate about the possibility of an ultra-high-end laptop graphics solution based on the AD103 graphics processor, though any specs for those parts would be pure guesswork at this point.
Some thoughts
Nvidia’s Ada Lovelace architecture is a qualitative and quantitative leap over the Ampere architecture. Nvidia has not only significantly improved the performance of its ray tracing cores, tensor cores and other units at the architectural level, but has also increased their numbers and raised clocks. A key enhancement here is the significant increase in L2 cache for Ada GPUs compared to Ampere GPUs.
These leaps are made possible in large part by TSMC’s 4N process technology, which is optimized for Nvidia’s GPUs. Additionally, the company used high-speed transistors to increase the frequency of its new graphics processors, further boosting performance.
However, the cutting-edge production node and the large die sizes of Nvidia’s new GPUs make the parts significantly more expensive to manufacture. As such, the GeForce RTX 4080 and 4090 graphics cards are priced significantly higher than their direct predecessors.
Nvidia has introduced only five Ada Lovelace-based products so far: the GeForce RTX 4080 12GB, RTX 4080 16GB, and RTX 4090 graphics cards for desktops, plus the RTX 6000 Ada Generation for workstations and data centers, and the L40 (Lovelace 40) board for high-end workstations and virtualized workstation environments.
Given that the company can offer the full-fat AD102 as well as cut-down versions of the AD102, AD103, and AD104 GPUs, we can expect a number of new GeForce RTX 40-series cards for client machines and Ada RTX-series solutions for data centers. Meanwhile, Nvidia is likely preparing some smaller GPUs (AD106, AD107), so it looks like the Ada Lovelace product family will be at least as broad as the Ampere lineup.