10 Reasons Data Centers Fail

{image 1}

An "unplanned data center outage" is a polite way to say that a data center failed. Whether the root cause is a hardware failure, software bug, or human error, most failures can -- and should -- be prevented. With the high level of redundancy built into today's data center architectures, prevention is very much possible.

The interesting thing is that data center failures still happen all the time. Considering the cost of every minute lost during a full outage, you'd think they would be far rarer. If data center managers simply focused on fixing the main reasons failures commonly occur, they would significantly reduce the risk of a catastrophic outage.

The problem is that so many data center operators are heavily focused on growth instead of the care and feeding of what's already in place. If you watch administrators in many public and private data centers these days, you'll find that they are focused largely on increasing capacity, boosting server density, and retrofitting aging server rooms into more modern facilities with more efficient cooling systems. While all this is fantastic and shows the incredible growth in the data center industry, it also highlights why we commonly see outages.

In the sections that follow, we're going to get back to data center basics and present common reasons why data centers fail. As you read, think about how these outages might one day surface in your data center. While not every failure scenario will match your data center architecture, we're confident that at least a few of the topics we mention will hit home and make you think about what you can do to shore up your facility.

And if you have any additional thoughts, tips, or stories that may help your fellow administrators avoid an outage, please share them in the comments below.

(Image: 123Net / Wikimedia Commons)
Improper System Authorization

Very few administrators, if any, should have full and unrestricted authorization to access all systems in a data center. Instead, access should be tightly regulated. If it isn't, you may end up with an outage similar to the one Joyent experienced in 2014, when an admin unwittingly rebooted all virtual machines in the company's eastern data center with a few clicks of a mouse.
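One way to regulate access is to cap how much of the fleet a single person can touch in one action. The following is a minimal sketch of that idea, not a description of how any particular operator handles it. The names `BulkRequest` and `is_allowed`, and the 5 percent threshold, are assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class BulkRequest:
    operator: str                    # person issuing the action
    action: str                      # e.g. "reboot"
    targets: FrozenSet[str]          # hosts the action would touch
    approver: Optional[str] = None   # second person who signed off, if any


def is_allowed(request: BulkRequest, fleet_size: int,
               max_fraction: float = 0.05) -> bool:
    """Allow small actions; require a distinct second approver for large ones.

    max_fraction is a hypothetical policy value, not an industry standard.
    """
    if fleet_size <= 0:
        raise ValueError("fleet_size must be positive")
    fraction = len(request.targets) / fleet_size
    if fraction <= max_fraction:
        return True
    # Large blast radius: the approver must exist and differ from the operator.
    return request.approver is not None and request.approver != request.operator
```

With a rule like this, rebooting one host goes straight through, while a request that targets the whole fleet is refused until a second administrator approves it.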

(Image: Grassetto/iStockphoto)
Poor Fallback Procedures

When planning for maintenance windows, the step that is most often neglected is the fallback procedure. Usually, the documented process has not been thoroughly vetted and fails to revert every change to its original state.
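A fallback procedure is easier to vet when every change step is written down together with the step that undoes it. Here is a minimal sketch under that assumption. The function name `run_with_fallback` and the step format are hypothetical, and a real procedure would also need to handle a revert step that itself fails.

```python
from typing import Callable, List, Tuple

# Each step is (name, apply, revert). All three are supplied by the planner.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_with_fallback(steps: List[Step]) -> List[str]:
    """Apply steps in order; on failure, revert completed steps in reverse.

    Returns a log of what happened so the window can be reviewed afterward.
    """
    log: List[str] = []
    done: List[Tuple[str, Callable[[], None]]] = []
    for name, apply, revert in steps:
        try:
            apply()
        except Exception as exc:
            log.append(f"failed: {name} ({exc})")
            # Undo in reverse order so later changes are removed first.
            for done_name, done_revert in reversed(done):
                done_revert()
                log.append(f"reverted: {done_name}")
            return log
        done.append((name, revert))
        log.append(f"applied: {name}")
    return log
```

Note that the step that failed is not reverted here, so a change that fails halfway can still leave residue behind. That is exactly the kind of gap a fallback rehearsal should surface before the real maintenance window.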

(Image: ClkerFreeVectorImages / Pixabay)
Making Too Many Changes

During maintenance windows, it can be problematic if administrators attempt to make too many changes at once. First, administrators become rushed, as they have to complete a large number of tasks in a small period of time; this often leads to mistakes. Second, because so many changes are occurring within the same timeframe, it makes troubleshooting post-change problems a far more difficult task.

(Image: mtreasure/iStockphoto)
Insufficient, Old, Or Misconfigured Backup Power

The most common reason a data center goes down is a power failure. Power outages happen all the time. Because of this, data centers are designed with redundant power sources in case their primary source goes away. Battery and/or generator power is commonly used as a backup source. The problem is that batteries aren't replaced in a timely manner, generators aren't tested, and power failure tests aren't performed. All of these oversights mean your redundant power may not be available when you need it the most.
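Oversights like these are simple to catch if the last completion date of each task is recorded and checked on a schedule. The sketch below shows one way to do that. The task names and the intervals in `MAX_AGE` are placeholder assumptions, and real values should come from the equipment vendor and site policy.

```python
from datetime import date, timedelta
from typing import Dict, List

# Hypothetical intervals, for illustration only.
MAX_AGE = {
    "battery_replacement": timedelta(days=4 * 365),
    "generator_test": timedelta(days=30),
    "power_failure_drill": timedelta(days=180),
}


def overdue_items(last_done: Dict[str, date], today: date) -> List[str]:
    """Return tasks never recorded, or done longer ago than allowed, sorted."""
    overdue = []
    for task, max_age in MAX_AGE.items():
        done_on = last_done.get(task)
        if done_on is None or today - done_on > max_age:
            overdue.append(task)
    return sorted(overdue)
```

A task with no recorded date counts as overdue on purpose: an untracked generator test is treated the same as one that never happened.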

(Image: HebiFot / Pixabay)
Cooling Failures

It's mind-boggling how much heat data centers generate. That's why cooling is so critical. A facility that could feel like an icebox one minute can become a sweltering furnace the next; it really does happen that fast. And even when you have temperature sensor readings and alerts sent to admins, you have to make sure you have sufficient time to implement your backup cooling procedures before everything melts down.
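To judge whether an alert leaves enough time to act, you can compare the time until a critical temperature is reached with the time your backup cooling procedure takes. The sketch below assumes the temperature rises at a constant rate, which is a simplification, since a real room will not heat up linearly. The function names and the example numbers are hypothetical.

```python
def minutes_until_limit(current_c: float, limit_c: float,
                        rise_c_per_min: float) -> float:
    """Minutes until limit_c is reached, assuming a constant rate of rise."""
    if current_c >= limit_c:
        return 0.0
    if rise_c_per_min <= 0:
        return float("inf")  # not warming, so the limit is never reached
    return (limit_c - current_c) / rise_c_per_min


def enough_time(current_c: float, limit_c: float, rise_c_per_min: float,
                response_min: float) -> bool:
    """True if the backup cooling procedure can finish before the limit."""
    return minutes_until_limit(current_c, limit_c, rise_c_per_min) > response_min
```

For example, a room at 24 degrees C that warms by 2 degrees per minute reaches a 35-degree limit in 5.5 minutes, so a backup procedure that takes 10 minutes to carry out would be too slow.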

(Image: Hans / Pixabay)
Malfunctioning Automated Failover Procedures

Most service providers and enterprise organizations have a backup data center that mirrors the production data center. In the event of a major outage at the primary site, automated procedures kick in and move all traffic to the backup facility. If done properly, the process should be seamless to end users. Unfortunately, automated failovers often don't work as expected. The usual cause for the failure is lack of regular testing. Even minor changes within the production infrastructure can have major impacts on automated failover processes. So when infrastructure changes are made, automated failover procedures should be tested to make sure nothing broke.

(Image: 742680 / Pixabay)
Changes Outside Maintenance Windows

If you've ever worked in a data center, you've likely been in the situation where a request comes in to make a minor change to a server or piece of network equipment. And while data center protocol technically requires you to run this through the change-control committee, you feel it can easily be made outside of a formal change-control process and maintenance window. And 99 out of 100 times, you're absolutely correct. But every once in a while, a minor change has unexpected consequences. The end result is an unexpected outage and a data center administrator who is in hot water.
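One way to remove the temptation is to have tooling refuse changes that lack an approved ticket or fall outside the window. The following is a minimal sketch of such a gate. The weekly Sunday 01:00 to 05:00 window and the function names are assumptions chosen for illustration.

```python
from datetime import datetime, time

# Hypothetical window: Sundays from 01:00 up to, but not including, 05:00.
WINDOW_WEEKDAY = 6          # datetime.weekday(): Monday is 0, Sunday is 6
WINDOW_START = time(1, 0)
WINDOW_END = time(5, 0)


def in_maintenance_window(when: datetime) -> bool:
    """True if the timestamp falls inside the weekly maintenance window."""
    return (when.weekday() == WINDOW_WEEKDAY
            and WINDOW_START <= when.time() < WINDOW_END)


def may_apply(when: datetime, has_approved_ticket: bool) -> bool:
    """A change needs both change-control approval and an open window."""
    return has_approved_ticket and in_maintenance_window(when)
```

A gate like this does not stop a determined administrator, but it turns a quiet shortcut into an explicit override that someone has to justify.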

(Image: StevePB / Pixabay)
Hanging Onto Legacy Hardware

All hardware is going to fail at some point. And the longer you keep hardware, the more likely it is to fail. Everyone knows this, yet very often a critical application goes down because it was running on 10-year-old hardware. These problems commonly arise due to a lack of a comprehensive migration plan onto a new hardware or software platform -- or lack of budget. If it's a money problem, there's not much you can do. But if you're simply dragging your feet, get to work removing as much legacy hardware as possible before it dies.

(Image: Ed Costello / Flickr)
Wet Fire-Suppression Systems

Most modern data centers use non-water fire-suppression systems so they don't damage equipment if purposefully or accidentally triggered. But many older facilities still use wet fire-suppression systems in their data centers. Water leaks and accidental triggers of wet suppression systems have caused major outages. Proper maintenance and the inclusion of pre-action single or double-interlock systems help minimize risk.

(Image: Closa / Pixabay)
