
As HPC Chip Sizes Grow, So Does the Need For 1kW+ Chip Cooling

One trend in the high performance computing (HPC) space that is becoming increasingly clear is that power consumption per chip and per rack unit is not going to stop at the limits of air cooling. As supercomputers and other high-performance systems have already hit, and in some cases exceeded, those limits, power requirements and power densities continue to grow. And based on news from TSMC’s recent annual technology symposium, we should expect this trend to continue as TSMC lays the foundation for even denser chip configurations.

The problem at hand is not new: the power consumption of a transistor is not shrinking as fast as the size of the transistor. And because chip makers don’t want to leave performance on the table (or fail to deliver the generational gains their customers expect), power per chip is growing rapidly in the HPC space. As an additional wrinkle, chiplets are paving the way for building products with even more silicon than traditional reticle limits allow, which is good for performance and latency, but even more problematic for cooling.
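As a rough, back-of-the-envelope illustration of that scaling gap, the sketch below models relative power density over a few hypothetical process generations; the per-generation area and power factors are illustrative assumptions, not TSMC figures.

```python
# Illustrative model only (assumed numbers, not TSMC data): if transistor area
# shrinks faster each generation than per-transistor power does, the power
# packed into a fixed slice of silicon keeps climbing.
area_scale = 0.6    # assumed transistor area shrink per generation
power_scale = 0.8   # assumed per-transistor power reduction per generation

density = 1.0       # relative power density (W per unit area) at generation 0
for gen in range(1, 5):
    # At a constant die size, power density scales with power/area per transistor.
    density *= power_scale / area_scale
    print(f"Generation {gen}: relative power density ~{density:.2f}x")
```

With these assumed factors, power density climbs by roughly a third every generation, which is the dynamic pushing HPC chips past what air cooling can handle.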

It is technologies like TSMC’s CoWoS and InFO that enable this kind of silicon and power growth, allowing chipmakers to build a system-in-package (SiP) with roughly double the silicon that TSMC’s reticle limits would otherwise allow. By 2024, advances in TSMC’s CoWoS packaging technology are expected to allow the construction of even larger multi-chiplet SiPs, with TSMC anticipating stitching together upwards of four reticle-sized chiplets. For TSMC and its partners, that promises a significant step up in performance, but of course at the expense of formidable power consumption and heat generation.

Already, flagship products like NVIDIA’s H100 accelerator module require upwards of 700W of power for peak performance. So the prospect of a single product carrying multiple GH100-sized chiplets raises both eyebrows and power budgets. TSMC anticipates that multi-chiplet SiPs with power consumption of around 1000W or more will appear in the next few years, and they will bring cooling challenges with them.
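To put that 1000W figure in perspective, here is a minimal power-budget sketch. The chiplet count, per-die power, HBM figures, and overhead are assumptions chosen for illustration (loosely anchored to the ~700W H100 module mentioned above), not disclosed specifications of any real product.

```python
# Hypothetical multi-chiplet SiP power budget (all values are illustrative assumptions).
compute_chiplets = 2        # assumed number of reticle-sized compute dies
power_per_chiplet_w = 450   # assumed watts per compute die
hbm_stacks = 8              # assumed number of HBM stacks
power_per_hbm_w = 20        # assumed watts per HBM stack
delivery_overhead = 0.10    # assumed 10% lost to power delivery and conversion

silicon_w = compute_chiplets * power_per_chiplet_w + hbm_stacks * power_per_hbm_w
total_w = silicon_w * (1 + delivery_overhead)
print(f"Estimated SiP power: ~{total_w:.0f} W")   # ~1166 W, comfortably past 1 kW
```

Even with conservative per-die numbers, packaging just two large compute chiplets with a full complement of HBM pushes the package past the 1kW mark that TSMC is flagging.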

At 700W, the H100 already requires liquid cooling, and the story is much the same for Intel’s chiplet-based Ponte Vecchio and AMD’s Instinct MI250X. But even traditional liquid cooling has its limits. By the time chips reach a cumulative 1kW, TSMC expects data centers to need immersion liquid cooling systems for such extreme AI and HPC processors. Immersion liquid cooling, in turn, requires a redesign of the data center itself, which is a major design change and a major challenge for continuity.

Short-term challenges aside, once a data center is set up for immersion liquid cooling, it will be able to handle even hotter chips, as immersion cooling offers plenty of headroom for larger cooling loads. This is one reason why Intel is making a large investment in this technology in an effort to make it more mainstream.

In addition to immersion liquid cooling, there is another technique that can be used to cool ultra-hot chips: on-chip water cooling. Last year, TSMC revealed that it had experimented with on-chip water cooling and said the technology could be used to cool SiPs of up to 2.6kW. But of course, on-chip water cooling is a very expensive technology in its own right, which would push the cost of these extreme AI and HPC solutions to unprecedented levels.

Still, while the future is not set in stone, it seems to already be cast in silicon. TSMC has chip manufacturing clients, think hyperscale cloud data center operators, that are willing to pay top dollar for these ultra-high-performance solutions, even when they come with high costs and technical complexity. That is why TSMC developed its CoWoS and InFO packaging processes in the first place: it has enthusiastic customers ready to push past the reticle limit through chiplet technology. Some of this can already be seen in products such as Cerebras’ massive Wafer Scale Engine processor, and through large chiplets, TSMC is preparing to make smaller (but still reticle-breaking) designs accessible to a wider customer base.

These extreme requirements for performance, packaging, and cooling not only push semiconductor, server, and cooling system makers to their limits, but also call for changes to cloud data centers themselves. If large SiPs for AI and HPC workloads do indeed become widespread, cloud data centers will look quite different in the coming years.
