What the arrival of Nvidia’s Blackwell means for data centre operations

Nvidia’s Blackwell platform will officially launch in 2025, replacing the current Hopper platform as the dominant solution for the company’s high-end GPUs and accounting for nearly 83 percent of all its high-end products, according to market analysts TrendForce. High-performance AI server models like the B200 and GB200 are designed for maximum efficiency, with individual GPUs consuming over 1,000W. HGX models will house eight GPUs each, while NVL models will support 36 or 72 GPUs per rack, significantly boosting the growth of the liquid cooling supply chain for AI servers.

TrendForce highlights the increasing thermal design power (TDP) of server chips, with the B200
chip’s TDP reaching 1,000W, making traditional air cooling solutions inadequate. The TDP of the
GB200 NVL36 and NVL72 complete rack systems is projected to reach 70kW and nearly 140kW,
respectively, necessitating advanced liquid cooling solutions for effective heat management.
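To put those figures in context, a back-of-the-envelope sketch (in Python, using textbook air properties and an assumed 15 K supply-to-exhaust temperature rise, neither of which comes from the article) shows the sheer airflow an air-only approach would need:

```python
# Back-of-the-envelope check: airflow needed to remove rack heat with air alone.
# Assumed values (not from the article): standard air properties and a typical
# 15 K supply-to-exhaust temperature rise across the rack.

AIR_DENSITY = 1.2          # kg/m^3 at roughly sea level, 20 C
AIR_SPECIFIC_HEAT = 1005   # J/(kg*K)
DELTA_T = 15.0             # K, assumed temperature rise

def airflow_m3_per_s(heat_watts: float) -> float:
    # Q = rho * V * cp * dT, solved for volumetric flow V
    return heat_watts / (AIR_DENSITY * AIR_SPECIFIC_HEAT * DELTA_T)

for label, watts in [("B200 GPU", 1_000),
                     ("GB200 NVL36 rack", 70_000),
                     ("GB200 NVL72 rack", 140_000)]:
    v = airflow_m3_per_s(watts)
    print(f"{label}: {watts / 1000:.0f} kW -> {v:.2f} m^3/s (~{v * 2118.88:,.0f} CFM)")
```

Roughly 16,000 CFM for a single NVL72 rack is an order of magnitude above typical per-rack airflow, which is why TrendForce calls traditional air cooling inadequate at these densities.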

For data centre operators, the new way of working will be to engage customers during the DC design phase to map out what future cooling will look like, hall by hall. Macquarie Data Centres is already heading down this path, and the operator’s VP of sales, Gavin Dudley, explained why hybrid halls are where the industry is heading: although AI is driving a liquid future, there are still plenty of options, from cold plates, coolant distribution units (CDUs), manifolds, quick disconnects (QDs), and rear door heat exchangers (RDHx) to immersion, and air cooling still has a role.

“Whether we use hybrid immersion cooling, a blend of direct-to-chip cold plates, or air cooling, it’s
likely that a combination of all these methods will be used,” he said. “I don’t think we’ll ever see a
fully immersion-cooled data centre. While a hall could potentially be 100% immersion-cooled, it
would still require nearby air-chilled racks or immersion tanks with a ‘dry zone’ for less dense
components like storage and routing, which don’t need as much cooling.”

“It’s rare to have a completely immersion-cooled area. Therefore, it’s crucial to integrate air cooling in the design. Even with immersion cooling, a combination of liquid and air-cooled elements is typically used,” he said.

Introducing liquid cooling to a DC

Liquid cooling means piping chilled water directly into the data halls, so traditional data centre operators have new issues to deal with. Tanks of liquid are heavy, so if you are running raised floors, you may exceed your floor loading capacity. Operators also need to consider the new tools required to support liquid cooling.

“You’ll need trays to catch any drips when removing servers from the racks, and additional
equipment to manage any potential spills. The different form factors of these components must be considered when allocating space,” he said.

Cooling is generally designed to be effective in specific zones of a hall rather than uniformly across every hall. “With immersion cooling, most of the cooling is provided by the liquid directly surrounding the components. This reduces the need for traditional air cooling in areas with immersion tanks, which changes how air cooling is managed around these tanks,” said Dudley. He added that high-density loads require significant power distribution to the racks.

He added that most data centres can take a few racks of immersion cooling, but modern facilities like Macquarie Data Centres’ are built to support true hybrid cooling with a mix of air and liquid. Bigger customers are now demanding more than 100% cooling capacity in a hall: rather than a fixed split of, say, 70% air and 30% liquid, they want the flexibility to ramp up liquid cooling over a month. Macquarie Data Centres, for example, delivers more than 100% of nominal capacity to meet this new demand for flexibility, but does so hall by hall.
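As a purely hypothetical illustration (the capacity figures and splits below are invented for the example, not Macquarie’s), over-provisioning a hall is what lets the air/liquid mix shift without new plant:

```python
# Hypothetical illustration of why halls are now built with more than 100%
# cooling capacity. All figures are invented for the example.

HALL_IT_LOAD_KW = 1_000   # contracted IT load for the hall (assumed)
AIR_BUILT_KW = 700        # air cooling built out (assumed)
LIQUID_BUILT_KW = 600     # liquid cooling built out (assumed)
# Build-out totals 1,300 kW for a 1,000 kW hall, i.e. 130% of nominal,
# so the load can migrate between media without a rebuild.

def deliverable(air_kw: float, liquid_kw: float) -> bool:
    # A requested split works only if each medium stays within its build-out.
    return air_kw <= AIR_BUILT_KW and liquid_kw <= LIQUID_BUILT_KW

# Day one: 70/30 air/liquid. After the ramp: 40/60 of the same 1,000 kW.
for label, (air, liquid) in [("day one", (700, 300)), ("after ramp", (400, 600))]:
    status = "OK" if deliverable(air, liquid) else "needs new plant"
    print(f"{label}: {air} kW air + {liquid} kW liquid -> {status}")
```

With only a fixed 70/30 build (1,000 kW total), the ramped split would fail; the extra capacity above 100% is what buys the flexibility.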

This hybrid approach will persist in Australia but Dudley was keen to stress several less-discussed
advantages of immersion cooling that must be factored into DC design. “Direct-to-chip and cold
plate cooling will be standout performers. A significant portion of the industry will adopt these
methods,” he said. “However, immersion cooling offers benefits, including a 15% to 20% increase
in chip performance due to its stable temperature. It’s like being able to drive faster on a smoother road, leading to more efficient processing.”

“There’s fewer hot and cold points on the boards in the servers, when immersed in liquid, so they’re getting fewer warranty claims,” he added. Dudley said that immersion tanks can also act as a Faraday cage, offering some protection against electromagnetic pulses, whether from the sun or from state actors.

New SLAs will develop

The demands mean new operational changes. For starters, Dudley points out, plumbers will be kings given how complicated some of the manifolds inside the racks are. More importantly, there are going to be changing points of demarcation. “You don’t normally talk to clients about the temperature of the chilled water, for example; you talk to them about the temperature of the room in an air-cooled room,” he said. “Whereas as you get into immersion and liquid cooling, you need to be thinking more in terms of both the temperature, and the volume or flow rate, of chilled water to and from their appliances.”
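The physics behind that shift is straightforward: the heat a water loop carries is flow rate times specific heat times temperature rise (Q = ṁ·cp·ΔT). A minimal sketch, using textbook water properties and an assumed 10 K supply/return split (the figures are ours, not Dudley’s):

```python
# Relating the new SLA quantities: heat load, water temperature rise, flow rate.
# Q = m_dot * c_p * dT, rearranged to find the flow a loop must sustain.
# Water properties are textbook values; the 10 K split is an assumption.

WATER_SPECIFIC_HEAT = 4186   # J/(kg*K)
WATER_DENSITY = 1.0          # kg/L

def required_flow_lpm(heat_kw: float, delta_t_k: float) -> float:
    # kg/s of water needed, converted to litres per minute
    kg_per_s = heat_kw * 1000 / (WATER_SPECIFIC_HEAT * delta_t_k)
    return kg_per_s / WATER_DENSITY * 60

# Example: a 140 kW NVL72-class rack with a 10 K rise across the loop.
print(f"{required_flow_lpm(140, 10):.0f} L/min")   # ~201 L/min
```

Both the supply temperature and that flow rate become contractual quantities, which is exactly the demarcation shift Dudley describes.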

“So there’s a different SLA metric that people are looking at,” he said. “Also as we move into – not
so much immersion – but other liquid forms where we’ve got a CDU out on the floor, that is driving
a liquid-to-liquid cooling platform that the customer would use. We will find that we’re now having points of demarcation well inside the data hall when someone owns the whole data hall.”

A customer’s data hall may currently have mesh walls with the chillers outside in the service corridors, keeping the DC engineers physically separated from the servers. “With liquid cooling, you absolutely will have points of demarcation inside the customer hall,” he said. This will mean new SLAs, which could even be specific down to an individual data hall or rack row.

Dudley added that for new AI/ML workloads, resilience may trump even metrics like uptime, given that failed jobs need to be reprocessed (potentially losing hours or days of work). “Data centre operators psychologically need to change the way they engage with customers,” he said, noting that they can no longer function merely as “hosting hotels”. The design process is going to be much more collaborative and demanding, he said. “We’ll need to be more involved, asking questions like ‘What cooler do you want? Where should it be placed? How should we design this?’ We’ll work much closer with customers than ever before,” he added.

Blackwell’s arrival will accelerate these SLAs

Nvidia’s GB200 shipments are expected by TrendForce to reach 60,000 units in 2025, making Blackwell the mainstream platform and accounting for over 80% of Nvidia’s high-end GPUs. Dudley points out there are going to be many more flavours of Blackwell than previous chipsets, and this is going to make design-led conversations with customers even more important. DC operators that have chiller units in their halls and can distribute liquid cooling through the floor will at least be able to cover off most requirements.

In addition, because AI/ML workloads process as a pool of resources connected to the same plane, components 50 metres apart become too far from a latency perspective. This favours immersion cooling and direct-to-chip cooling because they allow people to pack so much compute together. Dudley points out that latency and throughput are also critical advantages of the liquid cooling methods; it isn’t just about efficiency. “So you get these super latency sensitive workloads inside the data centre but between data centres, much less so,” he said.
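A rough propagation-delay calculation shows why 50 metres matters (the ~5 ns per metre figure for light in optical fibre is standard; treating the hop as a bare round trip, ignoring switching, is our simplification):

```python
# Rough wire-delay arithmetic for GPUs spread across a hall.
# Light in optical fibre travels at about two-thirds of c, i.e. ~5 ns per metre.

NS_PER_METRE_FIBRE = 5.0

def round_trip_ns(distance_m: float) -> float:
    # One request/response hop, ignoring switch and serialisation delay
    return 2 * distance_m * NS_PER_METRE_FIBRE

print(f"{round_trip_ns(50):.0f} ns")   # 500 ns of propagation alone at 50 m
```

Half a microsecond per hop, before any switching, adds up quickly when thousands of GPUs synchronise on every step, hence the incentive to pack racks densely, which in turn demands liquid cooling.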

Gavin Dudley will deliver a Keynote: ‘Australia’s AI Dilemma. Can we go from lagging to leading
the global innovation race?’ at W.Media’s Sydney Cloud and Datacenter Convention 2024 at the
Sydney International Convention Centre on 12 September 2024. As Sydney maintains its status as
a leading cloud and datacenter hub in the Asia Pacific, our event will spotlight the latest
advancements in digital infrastructure and their impact on IT, business, and society.

Building on the success of the 2023 convention which welcomed over 700 attendees, the 2024 edition will feature thought leaders, industry experts, and dynamic speakers who will share insights, case studies, and engage in lively debates. Attendees can look forward to keynote presentations, panel discussions, tech demonstrations, and ample networking opportunities. Join us for a day of innovation, learning, and connection in the heart of Sydney.
https://clouddatacenter.events/events/sydney-cloud-datacenter-convention-2024/

[Author: Simon Dux]
