Building the data center for tomorrow’s AI

By Paul Mah, Executive Editor, W.Media
xAI Colossus Data Center Compute Hall | Image credit: ServeTheHome

AI is the new space race, driving advances in technology and data center infrastructure. Not long ago, high-density data centers seemed like an elusive dream that never quite came to life. But as AI models become more powerful, the need for specialised, high-density data centers is rising rapidly.

Suddenly, there is urgent demand for facilities that can supply the immense processing power AI workloads require, and do so efficiently. What will the AI data center of the future look like?

The golden age of compute

In a groundbreaking paper titled “Scaling Laws for Neural Language Models” published in 2020, researchers showed that the performance of language models improves predictably, following power laws, as model size, data volume, and compute resources increase. In a nutshell, the paper demonstrated that larger AI models will continue to yield significant returns in capabilities. Over the last few years, this became one of the anchors for building ever-larger data centers and packing them with powerful AI hardware. The objective? To establish truly gigantic clusters of GPUs and push the boundaries of what AI can achieve.
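
For readers who want the numbers, the paper's headline result can be written as a set of power laws; the exponents below are the approximate values reported in the 2020 paper, quoted here for illustration:

```latex
% Approximate scaling laws from "Scaling Laws for Neural Language Models".
% L = test loss; N = model parameters; D = dataset size in tokens;
% C = training compute. Lower loss means a more capable model.
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad \alpha_N \approx 0.076
\qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad \alpha_D \approx 0.095
\qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}, \quad \alpha_C \approx 0.050
```

In plain terms, every multiplicative increase in parameters, data, or compute buys a predictable reduction in loss, which is why "just build a bigger cluster" became a rational strategy.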

As improvements from what is now known as “pre-training” start to wane, mainly due to a lack of adequate fresh data, leading AI experts have focused on “post-training” techniques such as supervised fine-tuning and reinforcement learning. These techniques refine an AI model’s capabilities in specific areas so it eventually becomes better at solving problems and performing complex tasks. On this front, DeepSeek, which shot to global fame early this year, leaned heavily on reinforcement learning with automated feedback, eschewing human feedback to dramatically reduce the cost and training time required.
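
To make “automated feedback” concrete, here is a minimal Python sketch of a rule-based reward: a program, rather than a human, scores each model response. (The `Answer:` convention and the reward values are invented for illustration; this is not DeepSeek's actual pipeline.)

```python
import re

def rule_based_reward(model_output: str, ground_truth: str) -> float:
    """Score a model response automatically, with no human in the loop.

    Reward 1.0 if the final answer matches the known solution, else 0.0.
    Real pipelines add rewards for formatting, reasoning structure, and
    more; this is a toy verifier.
    """
    # Hypothetical convention: the model ends its output with "Answer: <value>".
    match = re.search(r"Answer:\s*(\S+)", model_output)
    if match is None:
        return 0.0  # unparseable output earns no reward
    return 1.0 if match.group(1) == ground_truth else 0.0

# Example: automated scoring of two candidate responses.
print(rule_based_reward("Working through it... Answer: 42", "42"))  # 1.0
print(rule_based_reward("Working through it... Answer: 41", "42"))  # 0.0
```

Because the scoring is programmatic, millions of training samples can be graded cheaply and quickly, which is precisely where the cost and time savings come from.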

Another approach that has emerged is test-time compute, in which AI models dynamically allocate additional resources during inference. In effect, the model applies strategies at inference time to improve performance: generating multiple candidate outputs, evaluating them, and selecting the best answer. This technique is heavily used by the latest reasoning models, which leverage such adaptive strategies to handle complex queries effectively.
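
The simplest version of this idea is “best-of-N” sampling with a vote over the candidates, sketched below in Python. (`generate_candidates` stands in for real model calls; production reasoning models use learned verifiers and more sophisticated selection than a simple majority vote.)

```python
import random

def generate_candidates(prompt: str, n: int) -> list[str]:
    """Stand-in for sampling n diverse completions from a model."""
    return [f"candidate answer {random.randint(0, 3)}" for _ in range(n)]

def best_of_n(prompt: str, n: int = 8) -> str:
    """Spend extra inference-time compute to pick a better answer.

    Generate n candidates, then select by majority vote
    (self-consistency). Larger n means more test-time compute and
    typically better answers, at a higher serving cost.
    """
    candidates = generate_candidates(prompt, n)
    return max(set(candidates), key=candidates.count)

print(best_of_n("What is 17 * 23?"))
```

Note the trade-off this implies for infrastructure: every user query can now consume many model invocations instead of one, multiplying inference-side GPU demand.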

Whether through pre-training, post-training, or test-time compute, the tech giants remain fixated on improving AI performance. The cost? These techniques demand more computational resources than ever, further driving demand for massive data centers crammed with powerful GPU infrastructure.

Rack density now matters

But does AI require a different kind of data center? Historically, AI infrastructure, despite its substantial energy demands, was easily accommodated within traditional data center environments. Nvidia guidelines for the H200 GPUs – still the mainstay today – call for infrastructure capable of delivering more than 40kW per rack. A typical strategy to support these GPUs, however, has been to spread the GPU servers across more racks and redistribute power in the data hall.

According to a 2023 study by the Uptime Institute, the average global rack density is less than 6kW per rack. So, expect a lot of almost-empty racks when deploying H200 GPUs in a low-density data center. This approach worked, albeit in a limited way, as long as there was enough floor space and cooling capacity. But it is becoming increasingly untenable as AI workloads continue to grow and new, higher-power GPUs such as Nvidia’s Blackwell appear on the market.
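
Some back-of-envelope arithmetic shows the gap. (The ~10kW per-server draw below is an assumed round figure for an 8-GPU H200-class system, used purely for illustration, not a vendor specification.)

```python
# Back-of-envelope rack planning with illustrative figures.
SERVER_KW = 10.0        # assumed draw of one 8-GPU H200-class server
LEGACY_RACK_KW = 6.0    # Uptime Institute's ~6 kW/rack global average
AI_RACK_KW = 40.0       # Nvidia's >40 kW/rack guidance for H200 racks

servers = 128  # a hypothetical small training cluster

# In a legacy hall, a single ~10 kW server already exceeds the 6 kW
# rack budget, so each server effectively monopolises its own rack,
# with power redistributed from neighbouring, near-empty positions.
legacy_racks = servers  # one server per rack, most rack space wasted

# In an AI-ready hall, several servers share one high-density rack.
ai_racks = -(-servers // int(AI_RACK_KW // SERVER_KW))  # ceiling division

print(f"Legacy hall: ~{legacy_racks} racks; AI hall: ~{ai_racks} racks")
# Legacy hall: ~128 racks; AI hall: ~32 racks
```

The same cluster needs roughly a quarter of the racks in a high-density hall, which matters even more once networking enters the picture.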

Power aside, there is another consideration: GPUs are specialised chips optimised for parallel processing. As the number of GPUs required to train the next generation of AI models increases, the challenge lies not just in raw computational power, but in the intricate dance of inter-node communication. AI training clusters need ultra-low latency and colossal bandwidth between compute nodes, so physical proximity is an important consideration. High-speed local networking is also constrained by distance, which means racks must be kept full to maintain the shortest possible connection paths and minimise latency, as the quick calculation below shows.
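
The distance penalty is easy to quantify: light in optical fibre travels at roughly two-thirds the speed of light, or about 5 nanoseconds per metre. A rough calculation (propagation delay only; real links add switching and serialisation time on top):

```python
# Signal propagation in optical fibre covers roughly 0.2 m per nanosecond.
NS_PER_METRE = 1 / 0.2  # ~5 ns of delay for every metre of cable

for metres in (3, 30, 300):
    one_way_ns = metres * NS_PER_METRE
    print(f"{metres:>4} m of fibre ≈ {one_way_ns:,.0f} ns one-way")

#    3 m of fibre ≈ 15 ns one-way
#   30 m of fibre ≈ 150 ns one-way
#  300 m of fibre ≈ 1,500 ns one-way
```

Across tens of thousands of GPUs synchronising gradients at every training step, those nanoseconds compound – hence the push to pack racks tightly.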

The result? Rack densities for AI data centers have surged from traditional single digits into double or even triple digits. Among the newest AI data centers is the xAI Colossus cluster, currently the world’s largest AI supercomputer, with 100,000 GPUs across four data halls and plans to scale further.

The xAI Colossus Cluster

Tech giants are highly secretive about the internal workings of their technologies and data centers, and even more so for their AI deployments. However, Supermicro somehow managed to secure the approval of Elon Musk to publish a detailed walkthrough of Colossus, giving us a never-before-seen glimpse into the architecture and design of a groundbreaking AI deployment billed as the largest liquid-cooled AI cluster in the world.

Instead of the more than three years that traditional supercomputers take, on average, to plan and deploy, Colossus took a mere 122 days, or around four months, to be retrofitted and operationalised. AI training then started just 19 days after the first rack was installed – a record for large-scale data center deployments. So what are some notable aspects of a cutting-edge AI data center?

For a start, the facility leans heavily on both direct-to-chip and active rear door cooling. At the rack level, facility water feeds in-rack CDUs with redundant pumps and redundant power supplies. Multiple 1U manifold units distribute the liquid from the front of the rack to the various systems inside it. Within each GPU server, four Broadcom PCIe switches integrated into the motherboard are cooled by a custom liquid cooling block.
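
A first-order heat balance, Q = ṁ·c_p·ΔT, gives a feel for the plumbing involved. (The 40kW heat load and 10°C coolant temperature rise below are assumed round numbers, not Colossus’s actual specifications.)

```python
# Estimate the coolant flow needed to carry away a rack's heat load,
# using Q = m_dot * c_p * delta_T with illustrative numbers.
RACK_HEAT_KW = 40.0   # assumed heat load per liquid-cooled rack
CP_WATER = 4186.0     # specific heat of water, J/(kg*K)
DELTA_T = 10.0        # assumed coolant temperature rise, K
DENSITY = 997.0       # density of water, kg/m^3

mass_flow = RACK_HEAT_KW * 1000 / (CP_WATER * DELTA_T)  # kg/s
litres_per_min = mass_flow / DENSITY * 1000 * 60

print(f"~{mass_flow:.2f} kg/s, i.e. ~{litres_per_min:.0f} L/min per rack")
# ~0.96 kg/s, i.e. ~57 L/min per rack
```

Roughly a bathtub of water per rack every few minutes, continuously – which is why the CDUs, pumps, and manifolds get redundant everything.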

Chassis fans in each server ensure that components such as DIMMs, NICs, and management controllers stay cool. Paired with liquid-cooled active rear door heat exchangers, this eliminates the need for traditional hot aisle containment while also cooling the storage servers, CPU compute clusters, and 400Gb Ethernet networking equipment used for both GPU and non-GPU clusters. And yes, with up to nine networking ports per server, the Colossus data halls are packed with an incredible amount of fibre optic cabling.

Interestingly, containers packed with Tesla Megapacks are used to buffer millisecond-scale spikes and dips in power consumption during AI training runs, as workloads are moved to the GPUs, results collated, and new jobs dispatched. This adds greater reliability to the AI data center deployment, according to Patrick Kennedy of ServeTheHome.
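
Conceptually, the batteries act as a shock absorber: the grid supplies a smooth average while the Megapacks soak up the swings. A toy sketch (the figures are invented, and this is not Tesla’s or xAI’s actual control scheme):

```python
# Toy illustration of battery buffering: the grid sees the average
# draw while the battery absorbs the fast swings of training steps.
gpu_load_mw = [90, 150, 60, 150, 80, 150, 70]         # spiky per-interval draw
grid_supply_mw = sum(gpu_load_mw) / len(gpu_load_mw)  # smoothed grid target

for load in gpu_load_mw:
    battery_mw = load - grid_supply_mw  # positive: discharging; negative: charging
    action = "discharges" if battery_mw > 0 else "charges"
    print(f"load {load:>3} MW -> grid {grid_supply_mw:.0f} MW, "
          f"battery {action} {abs(battery_mw):.0f} MW")
```

The grid never sees the sawtooth, only a steady draw – easier on utility contracts and on upstream electrical equipment alike.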

The push to go bigger with AI

The race to build AI data centers is only just picking up. In January 2025, OpenAI, SoftBank, and Oracle announced the Stargate Project, a joint venture to dramatically scale up AI infrastructure in the United States. The idea is to build the physical and virtual infrastructure needed to power the next generation of AI and keep the United States at the forefront of the AI race. Though it remains to be seen whether the touted US$500 billion in investments will materialise, the announcement marks a significant commitment to advancing AI capabilities.

Other tech firms plan to invest massively in AI infrastructure, too, as they seek to build facilities that can handle unheard-of computational demands. For instance, Amazon has indicated it will spend some US$100 billion on technology infrastructure, with the bulk expected to go towards AI systems and data centers. For its part, Google plans to allocate US$75 billion to technical infrastructure, laying bare its ambitions to expand AI initiatives such as its Gemini models and the AI services on Google Cloud.

Even Meta plans to invest up to US$65 billion to build huge AI data centers to train the next generations of its Llama foundation models, while Microsoft has earmarked a staggering US$80 billion for AI in its current fiscal year. And we have merely looked at a small handful of technology firms in the United States. Whichever perspective you take, AI investment in 2025 is reaching unprecedented levels and shows no signs of abating as companies vie for dominance in the AI sector.

The United States currently holds a significant lead in data center capacity. The construction of new AI data centers is expected to solidify this advantage, offering the necessary infrastructure to support rapid advancements in AI technologies.

The road ahead

Assuming an unlimited budget and unrestricted access to data center infrastructure, including GPUs, what would be the main obstacle for the AI data centers of the future? The answer is power. As AI models expand, the energy needed to train and maintain these models has become a significant challenge.

This is why tech giants are doubling down on fossil fuels while also turning to other energy sources to meet the demands of AI data centers. Often, they are striking large, multi-year agreements with power providers to secure more electricity, including from nuclear power plants. There are plans for renewables as well, and one hopes that these will eventually contribute a significant share of the power for these energy-hungry AI data centers.

As AI data centers continue to grow, challenges related to energy efficiency and thermal management are being tackled at every level, accelerating the development of innovative cooling solutions and sustainable energy practices for the data center industry at large.

Ultimately, the AI data center of the future will require new innovations in power, cooling, and infrastructure. As the race for AI supremacy speeds up, success will not only depend on computational power but also on our ability to sustainably power and cool these digital giants, ensuring they operate efficiently.


*This article first appeared in the 7th issue of W.Media’s Cloud & Datacenters magazine.
