With AI driving the adoption of liquid cooling, have we reached the point where we’ve introduced a new single point of failure into our facilities…the fluid? Speaking at the Melbourne Cloud and Datacenter Convention, BP’s Asia Pacific data centre market development director Mark Roberts cautioned that with the rise of direct liquid cooling (DLC) and immersion, data centre operators who are not monitoring the health of their fluids are risking downtime.
“We’ve spent years and years actually designing single points of failure out of the data centre. We just put one back in,” he said.
To explain why this is the case, he first detailed why liquid cooling is now inevitable, with governments and communities increasingly focused on how much energy data centres are using. He pointed out that BP research showed data centres represent around 4% of total energy consumption, a figure set to rise to 10% by 2030.
“Air is reaching its critical point; we can’t cool some of these GPUs with the TDPs that we’re now starting to see,” he said, adding that one-kilowatt GPUs are going to be normal, rising all the way through to two-kilowatt GPUs. “The current generations of the NVL72 racks going out with the H200s, they’re about 120–130 kilowatts. We’re obviously starting to see announcements around the 600 kilowatt further down the path,” he said.
He emphasised that air cooling has not gone away and will in fact grow “quite significantly”, but liquid cooling is moving out of specialist HPC and crypto environments into the mainstream.
Two liquid cooling approaches
Roberts outlined the two main types of liquid cooling technology gaining traction in data centres: direct-to-chip (also known as cold plate cooling) and immersion cooling. Direct-to-chip cooling uses water-based coolant circulated through cold plates attached to CPUs and GPUs, removing 70–80% of the heat. It is already supporting high-density racks ranging from 50 to 150 kilowatts, with megawatt-level racks being discussed among hyperscalers. He noted this method’s popularity stems from its compatibility with existing data centre setups and familiar maintenance routines, though it still relies on supplemental air cooling and involves complex infrastructure, including pumps, valves and coolant distribution units (CDUs).
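To illustrate why supplemental air cooling remains necessary, here is a minimal back-of-the-envelope sketch in Python (not from the keynote) that splits a rack’s heat load between the liquid and air paths, using the 70–80% capture rates and the 130-kilowatt NVL72-class rack figure quoted above; the exact rack power is an assumption for illustration.

```python
# Illustrative only: split a rack's heat load between the cold-plate loop
# and the residual air path, using the capture fraction quoted in the talk.

def split_heat_load(rack_kw: float, liquid_capture: float) -> tuple[float, float]:
    """Return (kW removed by the liquid loop, kW left for air cooling)."""
    liquid_kw = rack_kw * liquid_capture
    return liquid_kw, rack_kw - liquid_kw

# A 130 kW NVL72-class rack (figure cited in the keynote), at the 70% and 80%
# capture rates quoted for direct-to-chip cooling.
for capture in (0.70, 0.80):
    liquid_kw, air_kw = split_heat_load(130, capture)
    print(f"{capture:.0%} capture: {liquid_kw:.0f} kW to liquid, {air_kw:.0f} kW still to air")
```

Even at 80% capture, a 130-kilowatt rack still rejects roughly 26 kilowatts to the room air, which is consistent with Roberts’ point that air cooling is not going away.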
Immersion cooling, on the other hand, submerges hardware in dielectric fluid, eliminating the need for air cooling entirely and achieving densities up to 375 kilowatts per tank. However, it faces challenges, particularly around server warranties, as many OEMs do not currently support their products being used in immersion setups. Only a few firms, like AI factory company Reset Data, are self-warranting their deployments, although he added that specialised immersion hardware is slowly entering the market.
Roberts also touched on the fluid types used in these systems. Single-phase coolants remain liquid throughout operation and are preferred because of environmental concerns around two-phase systems, which offer superior heat rejection by cycling between liquid and gas states but rely on refrigerants.
A new single point of failure?
Data centre operators face new challenges with liquid cooling around warranties, unfamiliar maintenance regimes, standards, material compatibility and so on, but the elephant in the room is that after years spent designing single points of failure out of the data centre, adding fluid back in reintroduces one.
“When we’re talking about fluid in that degraded state, you’ve potentially got early equipment failure, you’ve got potential corrosion issues, you’re going to end up reducing your cooling capacity,” he said. “And this isn’t going to happen all at once. So it’s really important that you’re monitoring that fluid, whether it’s a condition-based monitoring system or your quarterly preventative maintenance.”
“When we’re talking about racks worth $5 million, it’s best to actually make sure that the fluid and everything that’s going through these, you can mitigate the risks,” he explained, highlighting how Computational Fluid Dynamics (CFD), once reserved for air cooling, is now being applied to liquid environments to model heat exchange and stress loads in advance.
Roberts outlined how water-based glycol coolants – common in direct-to-chip systems – have a higher viscosity than pure water, which can reduce flow rates and create zones of stagnation where problems like corrosion and biofouling occur. As part of his keynote he showed a picture of a cold plate with five-micron channels suffering from biofouling. “So you can imagine what is actually happening in terms of the cooling capacity on that particular chip. Obviously not good.”
He emphasised the importance of understanding the chemistry of the cooling fluid. “The only certainty is change; your fluid chemistry will change,” he said, citing a recent study by a US water treatment firm that found “79% [of samples] were out of scope” among 105 samples tested. Poor chemical compatibility can cause material degradation, such as “leaching of plastics,” a problem Roberts suggested was already affecting one social media platform operator.
Liquid volume and pressure
The volume and pressure involved in these new liquid systems also present major challenges.
“In immersion…a 100-kilowatt-plus tank is about 1,300 litres. So it’s about 13,000 litres of oil in a mega facility. That’s quite a lot. So you need to mop it up. You need a lot of things there to look after it.”
While DLC systems use less fluid, they operate at much higher pressures and velocities. “The guys use around one and a half litres per minute per kilowatt,” Roberts said. “So…a 10 kilowatt server [needs] 15 litres per minute, to two and a half litres per second going through the manifold in the 100 kilowatt rack. Now… all of a sudden we’re talking 15 litres per second through a manifold. Nothing will ever go wrong, will it?”
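To make the arithmetic behind those figures explicit, the short Python sketch below (not from the keynote) applies the rule of thumb of roughly 1.5 litres per minute per kilowatt; the 600-kilowatt rack in the last line is an assumption, chosen because it is the density that lines up with the 15 litres per second Roberts mentions and with the 600-kilowatt announcements cited earlier.

```python
# Illustrative only: coolant flow implied by the ~1.5 L/min per kW rule of thumb
# quoted in the keynote for direct-to-chip systems.

LITRES_PER_MIN_PER_KW = 1.5  # rule of thumb from the talk

def flow_for(load_kw: float) -> tuple[float, float]:
    """Return (litres per minute, litres per second) for a given heat load."""
    lpm = load_kw * LITRES_PER_MIN_PER_KW
    return lpm, lpm / 60

# 10 kW server, 100 kW rack, and an assumed 600 kW-class rack (the density
# that matches the 15 L/s figure).
for load in (10, 100, 600):
    lpm, lps = flow_for(load)
    print(f"{load:>4} kW -> {lpm:>5.0f} L/min  ({lps:.1f} L/s)")
```

On that rule of thumb, a 10-kilowatt server needs 15 litres per minute, a 100-kilowatt rack about 2.5 litres per second, and a 600-kilowatt-class rack roughly 15 litres per second through the manifold.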
With that volume and speed, handling spills and ensuring operational continuity become critical.
“How we’re going to get this fluid in, how we’re going to dispose of this fluid, how we’re going to handle things once spills happen…they will at some stage,” he warned. He also likened the need for spare fluids to traditional hardware spares: “We need to treat fluid as a spare part.”
Another major issue DC operators will need to deal with is the rapid response needed during GPU power spikes. “We’re seeing…GPU in-rush… about 1.7–1.8 [times peak power] for up to 50 milliseconds,” Roberts said. “The actual time for a GPU to burn out or start throttling back is only a matter of milliseconds if we don’t provide it the right flow rate and the right temperature.”
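As a rough illustration of the transient he describes, the sketch below works out the instantaneous power and extra heat energy a cooling loop has to absorb during such a spike. The 1.7–1.8x factor and 50-millisecond duration come from the quote; the one-kilowatt steady-state GPU and 72-GPU rack size are assumptions for illustration.

```python
# Illustrative only: size of the GPU in-rush transient described in the talk.
# The 1 kW steady-state GPU and 72-GPU rack are assumptions; the 1.7-1.8x factor
# and 50 ms duration come from the quote.

STEADY_STATE_KW = 1.0   # assumed per-GPU power
INRUSH_FACTOR = 1.8     # upper end of the 1.7-1.8x range quoted
DURATION_S = 0.050      # up to 50 milliseconds

peak_kw = STEADY_STATE_KW * INRUSH_FACTOR
extra_kw = peak_kw - STEADY_STATE_KW
extra_joules_per_gpu = extra_kw * 1000 * DURATION_S  # extra heat energy per spike

print(f"Peak power per GPU: {peak_kw:.1f} kW for {DURATION_S * 1000:.0f} ms")
print(f"Extra heat per GPU per spike: {extra_joules_per_gpu:.0f} J")
print(f"Across an assumed 72-GPU rack: {extra_joules_per_gpu * 72 / 1000:.1f} kJ")
```

The energy per spike is modest; the point Roberts makes is about response time, since the cooling loop has only milliseconds to deliver the right flow rate and temperature before the GPU throttles or fails.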
With AI workloads fluctuating from “30 megawatts through to 90 megawatts,” systems must be engineered for resilience. Roberts said some DC operators are now upsizing pipework – “10-inch pipe work instead of six”– and integrating “buffer tanks on the actual technical circuit.”
He rounded off the keynote by highlighting progress on fluid interoperability, using the example of BP’s own coolant, Castrol ON Direct Liquid Cooling PG 25, a propylene glycol-based cooling fluid designed explicitly for direct-to-chip cooling applications. As customers ask whether factory-loaded fluids can be mixed with others, BP has conducted “a lot of independent test work” with hyperscalers. Based on a “95% ratio with Castrol fluid to the other fluid,” Roberts confirmed, “this is possible,” with “no performance degradation” found in tests of chemical stability, corrosion, or additive dropout.
Mark Roberts presented the keynote “Critical considerations for liquid cooling deployments” at the Melbourne Cloud and Datacenter Convention in April 2025.
The Sydney Convention 2025: “Cloud & Datacenter in Transition” takes place at the Sydney International Convention Centre on 21 August 2025. To find out more and to register go to:
https://clouddatacenter.events/events/sydney-cloud-datacenter-convention-2025/
[Author: Simon Dux]