Data Center Outages: Expensive. Disruptive. Preventable?

On October 14, 2023, a data center outage brought digital banking and financial services to a grinding halt in Singapore. Customers of at least two major banks were left in the lurch as credit card transactions were declined, internet banking and POS terminals stopped working, ATMs froze, and people lost access to their own hard-earned money.

According to a submission made before Parliament by Minister of State for Trade and Industry Alvin Tan, nearly 2.5 million payments and ATM transactions could not be completed due to the outage, and customers made as many as 810,000 failed attempts to access digital banking.

A subsequent investigation revealed that the outage at Equinix’s SG1 data center occurred when a third-party contractor, working on a planned system upgrade, incorrectly sent a signal to close the valves from the chilled water buffer tanks, which caused temperatures to rise.

But this is neither the first, nor the most recent instance of a data center outage. For the purposes of this article, let us restrict ourselves to some of the major outages that took place in 2023 alone.

In February, a power surge at a utility data center caused cooling units at a Microsoft data center to go offline, leading to an outage. The outage affected customers hosted in the South East Asia cloud region and disrupted digital services provided by Singapore’s Central Provident Fund Board. It also impacted the websites of Esplanade and Nanyang Technological University (NTU).

And it’s not just Asia. In May, a data center outage in the United States forced the Ascension Via Christi health facility in Wichita, Kansas, to pause most surgeries and procedures for several hours.

In November, Cloudflare’s control plane and analytics services experienced disruption when, as per an official statement, one of their “core data center providers failed catastrophically.” This facility, called PDX-DC04, was one of their three data centers in Hillsboro, Oregon.

“A data outage has many dire consequences – from operational downtime to financial losses and reputational damage. One of the greatest concerns lies in compromised data security or privacy,” says Kumar Mitra, MD & Regional GM – CAP, Lenovo ISG. “Large-scale data centers, especially those supporting critical infrastructure and managing extensive data, often face more significant losses during outages. Industries with strict uptime demands, such as finance and healthcare, may experience higher financial and operational repercussions from even brief downtimes compared to other sectors,” he says.

So, while outages are not uncommon, it is rather unfortunate that the world only takes notice when something goes wrong. What are we missing? What are we failing to see? Why can’t we seem to prevent these expensive and disruptive outages? To answer these questions, we have to first go to the root of what causes outages.

Why do outages occur and how can we prevent them?

According to Vyacheslav Chvoro, Head of Research Department at UnaFinancial, “The main cause of significant outages at DCs are power problems. Other causes include cooling failures, IT software or system errors and network issues.”

“Many data centers overlook the importance of both prevention and disaster recovery. It’s essential to adopt a mindset that acknowledges the possibility of outages affecting anyone,” says Mitra.

Sanjay Motwani (Vice President – APAC, Business Head – Legrand Data Center Solutions, India), says, “If there is a challenge, it will emerge from the rack.” He advocates for investment in better data center infrastructure instead of having a “we’ll manage somehow” attitude. “It requires as much attention as the network and compute sides. Because if the infrastructure fails, or even if a part of it fails, then everything you have built on it, fails.”

But data centers are already investing in Uninterruptible Power Supply (UPS) systems. So, what more can be done? Chvoro says, “It is important to constantly assess facility resilience with the help of monitoring solutions that deliver real-time information about the data center environment and the likelihood of problems. Constant monitoring of air conditioning, heating, and water might help reduce the risk of outages.”

He also suggests, “It is necessary to keep updating software and applying patches on a regular basis. To ensure regular patching of updates, AI can be used to run scans for vulnerabilities and proactively identify issues related to data center equipment or application performance or security. Network-related outages can be prevented by using a combination of proactive network monitoring and minimizing the possibility of human errors using automation. It is also advisable to have network redundancy, which means that if one network fails, an alternative network with a different service provider is available.”

Motwani also bats for greater automation, saying, “Automation will have to come in to minimise human error, which is the second biggest reason behind outages.”

Mitra points to yet another benefit of AI, saying, “AI enhances threat detection, turning real-time data into actionable insights for effective cyber threat mitigation.”

Industry standards and guidelines

It is important to note here that when it comes to guidelines, some markets have a detailed framework encompassing tools to prevent, detect and manage outages.

For example, Chvoro points to data center certification by the Uptime Institute. “The certification process is based on assessment of fault tolerance of the center and protection from potential failures. Upon the assessment, the data center is assigned a level of reliability, which is called a Tier.”

He further says, “In the Asia-Pacific region, countries are paying particular attention to regulations of cybersecurity to counter growing cyber threats targeting data centers, for example, the Cybersecurity Act in Singapore or the Privacy Act in Australia.” Chvoro adds, “Since Singapore has a reputation for being a major data center hub and holds 60% of the Southeast Asian data market, it has formulated various laws. Complying with them, data centers can improve the quality of service for end users and protect their data.”

Some key laws and guidelines in APAC:

  • The Personal Data Protection Act of 2012 (PDPA), which regulates the collection, use and disclosure of personal data by private organizations.
  • Multi-Tier Cloud Security (MTCS), also known as Singapore Standard 584, which is the world’s first standard for Cloud security.
  • Technical Reference (TR) 62 for Cloud outage incident response (COIR), which is a set of recommendations for restoring service quickly when cloud operations fail.
  • Singapore Standard (SS) ISO/IEC 21878:2019, which addresses the security of increasingly virtualized data center infrastructure.

But having regulations is one thing, and following them is quite another. So, will we see fewer outages in 2024? Have we learnt the lessons we needed to learn? Watch this space for more…

*** This article first appeared in the latest issue of W.Media’s Cloud & Datacenters magazine.

Author Info:
Deborah Grey