It took a decade for data centre and hyperscaler industry revenue to hit $2.6bn, but the prediction is that by 2025 – or in eight months – spending on AI will reach $3.6bn. That’s just four years after AI really gained traction. The industry recently convened at the Melbourne Cloud and Data Centre Convention to map out what this might look like, and one thing is certain, to paraphrase the immortal words of Police Chief Brody from Jaws: “You’re gonna need a bigger [capacity] data centre.”
“Make no mistake, there’s a war on,” said Macquarie Data Centres VP of sales Gavin Dudley, who delivered the Jaws analogy. “There’s a war around who’s going to create the fastest, best chip, the fastest AI platform. AI is clearly not just about chips, it’s about a solution [and] the software associated with it. There are more requirements coming.”
“It’s going to be more complex for our industry,” he said. “There will be data centres that will become obsolete; that can’t deal with the density of requirements coming our way.”
Data centre builders already factor in obsolescence, given the billions of dollars of capital deployed in building a data centre. Planning flexibility and scalability into new facilities – to accommodate something the industry frankly doesn’t yet know the eventual size or parameters of – is what keeps operators awake at night.
“Historically data centre providers have had a [more] passive role in helping the clients into their data centre: ‘here is a hall, here is some power, go for your life’,” said Dudley. “We’re going to have to [take] a far more active role, [have] a far more collaborative conversation with our clients about how we design this all into our data centres.”
AI will make data centres denser
Schneider Electric GM data centres, Mark Deguara, spelt out the coming densification of data centres looking to run AI. “An eight RU GPU setup is approximately 12kW…we’ve been talking about 100-150-200kW deployments in a rack,” he said, comparing that to a domestic oven in a similar footprint, which at 20 amps uses around 4.8kW. “That’s a lot of heat but also, how do you get power to it?”
This also means that with power and cooling, 8RU racks start pushing 1500-2000kg – a nightmare for raised floors. In a traditional data centre 1MW may be spread over 30-40 racks, but with AI this is impractical – eight racks could potentially be 1MW. The densification is necessary because exceeding as little as 900mm between compute, GPU and storage increases latency, and then you start to lose synchronisation between the processes in a calculation.
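As a rough sanity check on those figures, the densification arithmetic looks something like the sketch below. It uses only the numbers quoted above; the resulting servers-per-rack count is an illustrative by-product, not anything the speakers specified.

```python
# Back-of-the-envelope densification maths using the figures quoted above.
# Illustrative only; not a vendor specification.

KW_PER_8RU_GPU_SERVER = 12       # "an eight RU GPU setup is approximately 12kW"
OVEN_KW = 20 * 240 / 1000        # a 20 amp domestic oven on 240V ≈ 4.8kW
AI_RACK_KW = 125                 # implied by "8 racks could potentially be 1MW"
HALL_LOAD_KW = 1000              # a 1MW data hall

traditional_kw_per_rack = HALL_LOAD_KW / 35                    # "spread over 30-40 racks" ≈ 25-33kW
ai_racks_for_1mw = HALL_LOAD_KW / AI_RACK_KW                   # 8 racks at 125kW = 1MW
gpu_servers_per_ai_rack = AI_RACK_KW / KW_PER_8RU_GPU_SERVER   # roughly ten 8RU GPU servers

print(f"Traditional: ~{traditional_kw_per_rack:.0f}kW per rack across ~35 racks per MW")
print(f"AI: {ai_racks_for_1mw:.0f} racks of {AI_RACK_KW}kW each, holding roughly "
      f"{gpu_servers_per_ai_rack:.0f} x 8RU GPU servers – about "
      f"{AI_RACK_KW / OVEN_KW:.0f} domestic ovens' worth of heat in one rack footprint")
```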
“Last week I was standing in a data hall that could accommodate 8MW. It was roughly 1000 sqm and could accommodate somewhere between 500 and 1000 racks depending on how you deployed it. All of a sudden, I have 64 racks. So 64 racks would now [fill] that data hall,” said Deguara.
“From a sustainability perspective, that’s not very good as I’ve got all this concrete and steel sitting there doing nothing,” he said. “But I need the 8MW of power still. So all the infrastructure that’s sitting outside that facility would still be required. So if I’m building a new facility, all of a sudden my data hall, instead of 1000 sqm, is 200-300 sqm, but the infrastructure is still there.”
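Deguara’s footprint point can be checked with the same figures: 8MW across a 1000 sqm hall versus 8MW in 64 high-density racks. In the sketch below, the floor area allowed per rack is an assumed value for illustration only.

```python
# Sanity check on Deguara's example: the same 8MW of IT load in a fraction of the floor space.
# The ~3 sqm allowance per rack (rack plus aisle/containment) is an illustrative assumption.

HALL_POWER_KW = 8000
HALL_AREA_SQM = 1000
AI_RACK_KW = 125                 # 8000kW / 64 racks, as per the quote
SQM_PER_RACK = 3                 # assumed rack + aisle allowance

ai_racks = HALL_POWER_KW / AI_RACK_KW        # 64 racks
ai_floor_sqm = ai_racks * SQM_PER_RACK       # ~192 sqm

print(f"{ai_racks:.0f} AI racks draw the hall's full {HALL_POWER_KW / 1000:.0f}MW")
print(f"They occupy roughly {ai_floor_sqm:.0f} sqm of a {HALL_AREA_SQM} sqm hall "
      f"({ai_floor_sqm / HALL_AREA_SQM:.0%}) – consistent with new halls shrinking to 200-300 sqm")
```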
Deguara emphasised the power conundrum created when running 100+kW racks – essentially shifting to directly connected bus bars. “We kind of have to move to 415V as standard so everything is going to be in a three phase connection, meaning you might not have three phase directly to the piece of kit, you can actually have all three phases go into the unit,” he said.
“Small power distribution blocks won’t work; you’re going to have to increase your block size, rack PDUs…So your traditional back-of-the-rack PDUs at 32 amp, even 63 amp, are not going to be functional. You’re going to have to start moving to things like 100 kilowatt distribution, 200 kilowatt distribution,” he added.
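The reason 32 amp and 63 amp rack PDUs run out of headroom is simple three-phase arithmetic. A minimal sketch follows, assuming the 415V standard Deguara mentions and an illustrative power factor of 0.95.

```python
import math

# Why 32A and 63A rack PDUs can't feed 100kW+ racks: three-phase power arithmetic.
# The 0.95 power factor is an assumed illustrative value, not a figure from the article.

V_LINE_TO_LINE = 415     # the 415V standard referred to above
POWER_FACTOR = 0.95

def three_phase_kw(amps: float) -> float:
    """Approximate real power (kW) deliverable by a three-phase feed at the given current."""
    return math.sqrt(3) * V_LINE_TO_LINE * amps * POWER_FACTOR / 1000

for amps in (32, 63, 160):
    print(f"{amps}A three-phase at {V_LINE_TO_LINE}V ≈ {three_phase_kw(amps):.0f}kW")

# 32A -> ~22kW and 63A -> ~43kW: fine for traditional racks, nowhere near 100-200kW,
# hence the move to much larger distribution blocks and bus bars feeding the rack directly.
```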
AI will drive liquid-based cooling
Air cooling is good to about 40-50 kilowatts a rack, and there are other forms of inline cooling that operators can also leverage, but above this, physics starts to get in the way. In the past operators could spread kit out, installing it in more and more racks across a larger footprint. But with the latency-sensitive elements at the core of an AI platform and the cost of real estate – data centres cost about USD 10-15 million per megawatt to build – operators need to find a way to cool these high-density workloads.
“Above 35 kilowatts, we’re actually getting into 63 amp three phase PDUs to get 40 kilowatts of power, which means [two] as a minimum and then N plus one,” said Oper8 Global EVP global business Mike Andrea. “And we then get into a situation where the air velocity – depending on the type of design, be it fan walls, be it in-row cooling units – starts to create major problems with respect to how you run, maintain and operate these things.”
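The airflow behind that velocity problem can be estimated from first principles: the air needed to carry away a rack’s heat scales linearly with its power draw. A minimal sketch follows, assuming a 12°C rack inlet-to-outlet temperature rise – an assumed design value, not a figure from the panel.

```python
# Illustrative airflow arithmetic: why air cooling gets awkward above roughly 40-50kW per rack.
# The 12°C delta-T and air properties are assumed values for illustration only.

AIR_DENSITY = 1.2          # kg/m^3 at typical data hall conditions
AIR_SPECIFIC_HEAT = 1005   # J/(kg·K)
DELTA_T = 12               # K, assumed rack inlet-to-outlet temperature rise

def airflow_m3_per_s(rack_kw: float) -> float:
    """Volume of air per second needed to remove rack_kw of heat at the assumed delta-T."""
    return rack_kw * 1000 / (AIR_DENSITY * AIR_SPECIFIC_HEAT * DELTA_T)

for rack_kw in (10, 40, 100):
    flow = airflow_m3_per_s(rack_kw)
    print(f"{rack_kw:>3}kW rack: ~{flow:.1f} m^3/s (~{flow * 2119:.0f} CFM)")

# Airflow scales linearly with load, so a 100kW rack needs ten times the air of a 10kW rack
# pushed through the same rack face – which is where velocity, noise and maintainability break down.
```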
Oper8 Global has been delivering AI-enabled data centres in Europe for the past four years, including for Formula One customers focused on HPC environments. Andrea said the DC design process was changing. “We actually see new impacts and part of that is a very big impact on availability and redundancy, but also resilience and real-time uptime on your cooling systems,” he said. “When we’re talking about some of these new Nvidia platforms, now one second is enough to fry a box. When you’ve spent 100 grand on it, it’s probably not a good idea to fry it.”
Andrea said Australia was potentially four years behind Europe in its thinking on AI. “So part of this whole process is where to start…[it’s] really about taking your dream, simulating it, creating it, testing it, running it, simulating it again, retesting it and running through the process,” he said. “In the Australian marketplace we’re losing the dream.”
He explained this further, saying local thinking is that AI is only for the big companies and big cloud providers. “The reality is when we get down to some of these service models, and what you can do with AI, it’s actually for everyone,” he said. “We can actually bring it back into the enterprise, we can deploy it locally, we can deploy it at the edge.”
“If we can’t dream it, we can’t think about it,” he said. “We don’t know what we’re going to do with it. So ‘what is possible?’ is really the best question to ask yourself about how to use AI. What is actually becoming new in the Australian marketplace is the opportunity for Australia to actually catch up to [Europe].”
Oper8 has already deployed 125 kilowatt per rack data centres with direct-to-chip cooling in Europe, and Andrea said it goes back to realising the dream of what you want your facility to do. “[What] we’ve had to deal with is massive amounts of power into those racks – 125 kilowatt racks…not easy to do. Imagine having eight of them side by side at 1.1MW?” he said. “We’ve deployed that; we’ve also cooled them.”
Not drowning, cooling
Macquarie’s Dudley said that GPU suppliers had told him immersion cooling can deliver up to 20% more GPU performance because the whole board is cooled, and they have far fewer faults. However, he pointed out immersion is also costly and doesn’t really fit into today’s DC footprints. “In addition, it is difficult dealing with mineral oil, and fibre optics are impacted by the oil’s refractive index,” he said.
Andrea said immersion cooling also hits DC workflows. “[Moving] the equipment in and out? How do I let it drain, how do I move it to the next rack, how do I do asset management and tracking, how do I do tagging?” he said. “What comms network am I going to interconnect to it?”
“Some of the things that we have seen over time is IT equipment is getting deeper,” he said. “The biggest challenge we’re seeing is some servers now, some storage arrays, are 1-1.1 metres deep. So trying to put that into an immersion tank and then find out in three years that the next one is 1.2 metres deep…Let’s think about it. Let’s really forward plan what that actually means.”
Deguara said there was a lot of discussion around dielectric fluids instead of water. While they may be more efficient, the long-term immersion impacts are not well known – CFCs were a great refrigerant at the time, until they weren’t.
The alternative, direct-to-chip cooling, has its own drawbacks. It is more complex than air: the plumbing is more complex, the demarcation handoff points around where the cooling sits are more complex, and it is clearly not as effective at heat dissipation as immersion cooling.
“I think we’ll have data centres that are segregated a bit like how we have hot and cold aisle containment,” said Dudley. “We might end up in a scenario where we have AI containment so that part of the data centre does AI and that part of the data centre does something else to try to containerise where that deep impact is on our data centres.”
He added it was “a bit disappointing” that the GPU manufacturers have not really set a standard around how they accept liquid cooling. He acknowledged Nvidia’s announcement that its Blackwell chips will be direct-to-chip cooled but said more needed to be done.
Andrea pointed out that Oper8 Global does offer both approaches depending on customer need. In addition to recently signing on with an Australian immersion tank provider, the company is now offering a two-phase cooling solution with ZutaCore, which recently announced the ability to do direct-to-chip cooling for Nvidia’s chips.
AI will be deployed at the network edge
While all the focus is on AI factories in 100+MW facilities, the reality is that new AI and DC models will emerge, driven by compute workloads. “There might be millions of businesses using the cloud but there’s actually millions of locations with no cloud infrastructure,” said Andrea. “And when we look at AI and HPC, it sort of goes double. When we think about locations that don’t have hyperscale data centre capacity, your ability to infill with HPC at the edge to augment what’s happening in your core or colo is an opportunity that actually creates the ability to take the systems to the edge.”
He pointed out that the sheer amount of bandwidth going between users and the actual HPC AI platforms demonstrates that the future of AI is distributed. “How much bandwidth would the 15 largest cloud locations actually need to facilitate the world’s AI platforms?”
“AI, machine learning from HPC at the edge; it’s really about high compute workloads. It’s got a GPU with a CPU, it’s got RAM, it’s got a disk and a scratch disk,” said Andrea. “But when we start talking about high speed data comms within the data centre, it actually does get down to [no] more than a couple of metres in some cases [for] HPC platforms, and a 900 millimetre maximum distance between compute, GPU and storage.”
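The 900 millimetre figure is ultimately about propagation delay and synchronisation. A rough sketch of the scale involved follows, assuming signals travel at roughly 70% of the speed of light in cable – an approximation, not a figure from the panel.

```python
# Rough propagation-delay estimate behind the "900 millimetre maximum" rule of thumb.
# The ~0.7c signal speed in copper/fibre is an approximation for illustration.

C = 3.0e8                  # speed of light in a vacuum, m/s
SIGNAL_SPEED = 0.7 * C     # approximate propagation speed in cable

def one_way_delay_ns(distance_m: float) -> float:
    """One-way signal delay in nanoseconds over a cable of the given length."""
    return distance_m / SIGNAL_SPEED * 1e9

for distance in (0.9, 2.0, 10.0):
    print(f"{distance:>4.1f} m of cable ≈ {one_way_delay_ns(distance):.1f} ns one way")

# A few nanoseconds per link sounds trivial, but AI training synchronises every GPU on
# every step, so skew accumulates and the slowest path sets the pace – hence keeping
# compute, GPU and storage tightly packed together.
```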
Andrea warned that over-centralising AI will also have implications for Australia’s power grid. “I think that we’re going to run out of power in some of these cities just with the distribution network, the transmission network, the state interconnectors, the dependency on those interconnectors to keep the cities working,” he said.
“Part of the opportunities we’ve seen is that by distributing some of the load through HPC at the edge you have the infill capability for clients who actually want to have that ML capability with AI locally, then take advantage of centralised cores with the HPC platform sitting in the hyperscale facilities. We’re actually distributing the load across the grid,” he added.
[Author: Simon Dux]