The impact of AI on power and cooling in the data center

AI data center operations will mean big changes for power delivery and thermal management at a scale not seen before. Does this mean engineers are being asked to accommodate known and unknown unknowns? In some ways, the answer is yes, but a picture is emerging of what needs to be accommodated at a component level.

Smart design engineers tasked with producing power chain topologies and cooling systems for rapidly growing power draws are embracing the challenges of higher rack densities and liquid cooling. Top of the list is how the power design accommodates high-performance server GPU chipsets running at 1,200W and beyond, and how to find headroom to power and cool networking kit running at 400Gbps today (and 800Gbps and beyond soon) that could result in temperature jumps of 20°C.

Of microprocessors and networks

Let’s start with the chips. All processors (GPU, CPU and TPU) are measured on performance and total cost of ownership (TCO). For AI GPU architectures, the key metric is TFLOPS or PFLOPS per GPU watt. In watts-per-GPU terms, Nvidia’s Hopper H100, H200, and Blackwell B100 chips draw 700W. Nvidia’s latest Blackwell parts go further: the B200 draws 1,000W, and the GPUs in the rack-scale GB200 NVL72 system run at 1,200W each. Nvidia says its NVLink switching technology delivers 1.8TB/s of bidirectional throughput per GPU and can connect up to 576 GPUs for the most complex LLMs.
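As a rough illustration of how that metric works, the short sketch below simply divides throughput by board power for two hypothetical parts. The TFLOPS and wattage figures are placeholders chosen for illustration, not vendor specifications.

```python
# Illustrative performance-per-watt comparison for AI accelerators.
# The TFLOPS and wattage figures are placeholders, not vendor specifications;
# substitute measured numbers for a real comparison.

parts = {
    "hopper_class_gpu": {"tflops": 1000.0, "watts": 700.0},     # hypothetical
    "blackwell_class_gpu": {"tflops": 2200.0, "watts": 1200.0}, # hypothetical
}

for name, spec in parts.items():
    perf_per_watt = spec["tflops"] / spec["watts"]
    print(f"{name}: {perf_per_watt:.2f} TFLOPS per watt")
```

Whatever the absolute numbers, it is this ratio, rather than raw throughput alone, that drives both TCO and the power and cooling budget.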

When scaled out to thousands or tens of thousands of chips deployed in racked clusters with their associated switchgear, this gives a sense of the power density and scale required. Which brings us to cooling: the GPU roadmap from Nvidia certainly points to there being no alternative to liquid. Jensen Huang, CEO of Nvidia, was quoted as saying: “Coolant enters the rack at 25°C at two liters per second and exits 20 degrees warmer.” What happens to that heat?
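Taken at face value, those figures imply a substantial heat load per rack. The back-of-the-envelope sketch below applies Q = ṁ·cp·ΔT to the quoted flow rate and temperature rise, assuming a water-like coolant (density of roughly 1kg per liter and specific heat of roughly 4.2kJ/kg·K, which are assumptions rather than figures from the quote).

```python
# Back-of-the-envelope heat load implied by the quoted coolant figures.
# Assumption (not from the quote): the coolant behaves like water,
# i.e. density ~1 kg/L and specific heat ~4186 J/(kg*K).

flow_l_per_s = 2.0       # quoted flow rate, litres per second
delta_t_k = 20.0         # quoted temperature rise, kelvin
density_kg_per_l = 1.0   # assumed water-like coolant density
cp_j_per_kg_k = 4186.0   # assumed specific heat of water

mass_flow_kg_per_s = flow_l_per_s * density_kg_per_l
heat_w = mass_flow_kg_per_s * cp_j_per_kg_k * delta_t_k  # Q = m_dot * cp * dT

print(f"Heat rejected per rack: {heat_w / 1000:.0f} kW")  # roughly 167 kW
```

On those assumptions, each rack is rejecting something in the region of 170kW of heat, all of which has to be moved out of the white space by the facility's heat-rejection systems.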

Another chip designer to consider is Google. As well as being an Nvidia customer, Google designs its own TPU chips. As a cloud provider, Google deploys its TPU v5p technology in pods within its data centers and sells services off the back of it. It has also announced the general availability of TPU v5p for training large AI models.

On the networking side, the pods that Google is deploying use switching technology known as Jupiter. This is built on an optical circuit switching (OCS) architecture, which Google says uses around 40% less power than its previous network designs. But it brings serious design challenges of its own.

Google uses liquid cooling in its data centers. Exactly how much heat is being generated is unknown. But in a paper detailing the design, researchers at Google wrote: “[Two] primary challenges for hyperscale data centers remained. First, data center networks need to be deployed at the scale of an entire building – perhaps 40MW or more of infrastructure. The data center network needs to evolve dynamically to keep pace with the new elements connecting to it.”

Conclusion

There is no longer (if there ever was) a one-size-fits-all approach to data center power and cooling. The examples above show just some of the considerations for power and cooling design in the rapidly evolving world of AI infrastructure.

New AI workloads are set to push systems to their limits because of the concentrated heat generated by GPUs and TPUs during intensive AI computation. This is challenging the effectiveness of existing cooling infrastructure, potentially leading to overheating and reduced hardware lifespan. In response, data centers are deploying a mix of cooling techniques: air cooling, liquid cooling, or a combination of the two.

But the accelerating pace of change is reshaping how commercial data center operators gear up for AI. That means designing mechanical systems to accommodate some form of liquid cooling, such as rear-door heat exchangers (RDHX) or direct-to-chip (DTC) liquid cooling.

To a large degree, what is needed will depend on the GPU model being deployed and in what configuration. Building for the latest and future GPU models that require 1,200W per unit certainly means that some form of direct-to-chip liquid cooling is a must.
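To see why, a rough rack-level power budget helps. The sketch below uses illustrative assumptions (eight GPUs per server, four servers per rack, a 30% allowance for CPUs, memory, networking and fans, and an air-cooling ceiling of around 40kW per rack, none of which are figures from this article) to show how quickly 1,200W GPUs push a rack past what air alone can reasonably handle.

```python
# Rough rack power budget for a dense GPU deployment.
# All configuration figures below are illustrative assumptions, not article data.

gpu_power_w = 1200         # per-GPU power for the latest parts (from the article)
gpus_per_server = 8        # assumed server configuration
servers_per_rack = 4       # assumed rack configuration
overhead_factor = 1.3      # assumed allowance for CPUs, memory, network, fans
air_cooling_limit_kw = 40  # assumed practical ceiling for air-only cooling

rack_power_kw = (gpu_power_w * gpus_per_server * servers_per_rack
                 * overhead_factor) / 1000

print(f"Estimated rack power: {rack_power_kw:.0f} kW")                  # ~50 kW
print(f"Exceeds air-cooling ceiling: {rack_power_kw > air_cooling_limit_kw}")
```

Even with these fairly conservative assumptions, the rack lands at roughly 50kW, beyond what conventional air cooling is designed to remove, which is why some form of direct-to-chip liquid cooling becomes a practical necessity.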

For a data center designer, the question becomes: how do you design power topologies and cooling schemes that can flex to accommodate today’s known GPU, networking and storage architectures and the configurations most likely to be deployed in the white space? And how do you prepare for future architectures that haven’t been developed yet?

***

This piece was first featured in Issue 7 of W.Media’s Cloud and Data Center Magazine.

Author Info:
Ed Ansett
Global Director of Technology & Innovation, Ramboll