Network Computing is part of the Informa Tech Division of Informa PLC


10 Reasons Data Centers Fail


    An "unplanned data center outage" is a polite way to say that a data center failed. Whether the root cause is a hardware failure, software bug, or human error, most failures can -- and should -- be prevented. With the high level of redundancy built into today's data center architectures, prevention is very much possible.

    Yet data center failures still happen all the time. Considering the staggering cost of every minute of a full outage, you'd think they would be far rarer. If data center managers simply focused on fixing the most common causes of failure, they would significantly reduce the risk of a catastrophic outage.

    The problem is that so many data center operators are heavily focused on growth instead of the care and feeding of what's already in place. If you watch administrators in many public and private data centers these days, you'll find that they are focused largely on increasing capacity, boosting server density, and retrofitting aging server rooms into more modern facilities with more efficient cooling systems. While all this is fantastic and shows the incredible growth in the data center industry, it also highlights why we commonly see outages.

    On the following pages, we're going to get back to data center basics. We'll present 10 common reasons why data centers fail. Click through and think about how these common outages might one day surface in your data center. While not every failure scenario may match your data center architecture, we're confident that at least a few topics we mention will hit home and make you think about what you can do to shore up your facility.

    And if you have any additional thoughts, tips, or stories that may help your fellow administrators avoid an outage, please share them in the comments below.

    (Image: 123Net / Wikimedia Commons)

  • Improper System Authorization

    Very few administrators, if any, should have full, unrestricted access to every system in a data center. Instead, access should be tightly scoped to each administrator's role. If it isn't, you may end up with an outage like the one Joyent experienced in 2014, when an admin unwittingly rebooted all virtual machines in the company's eastern data center with a few clicks of a mouse.
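    The principle of scoped access can be sketched as a deny-by-default permission check. The role and action names below are purely illustrative, not taken from any specific product:

```python
# Minimal sketch of role-scoped authorization: an action is allowed only if
# the role explicitly grants it. All role/action names are hypothetical.
ROLE_PERMISSIONS = {
    "operator": {"reboot_vm", "view_status"},
    "network_admin": {"view_status", "modify_vlan"},
    "site_admin": {"reboot_vm", "reboot_all_vms", "view_status", "modify_vlan"},
}

def is_authorized(role, action):
    """Deny by default: unknown roles and ungranted actions are refused."""
    return action in ROLE_PERMISSIONS.get(role, set())

# A fleet-wide reboot should require the most privileged role.
assert not is_authorized("operator", "reboot_all_vms")
assert is_authorized("site_admin", "reboot_all_vms")
```

    The key design choice is the default: an unknown role or an ungranted action falls through to "no," so a fat-fingered fleet-wide command fails closed.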

    (Image: Grassetto/iStockphoto)

  • Poor Fallback Procedures

    When planning a maintenance window, the step most often neglected is the fallback procedure. Too often, the documented process is never thoroughly vetted and fails to fully revert all changes to their original state.
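    One way to make fallback harder to neglect is to require every change step to declare a matching rollback up front, and unwind completed steps in reverse order if anything fails. A minimal sketch, with hypothetical step functions:

```python
# Sketch: each maintenance step is a pair (apply_fn, rollback_fn).
# If any apply step fails, roll back the completed steps in reverse order.
def run_with_fallback(steps):
    """steps: list of (apply_fn, rollback_fn) pairs. Returns True on success."""
    completed = []
    for apply_fn, rollback_fn in steps:
        try:
            apply_fn()
            completed.append(rollback_fn)
        except Exception:
            # Unwind in reverse so dependencies are undone last-in, first-out.
            for rollback in reversed(completed):
                rollback()
            return False
    return True
```

    Pairing every change with its rollback at planning time forces the fallback procedure to be written -- and reviewed -- before the window opens, rather than improvised at 3 a.m.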

    (Image: ClkerFreeVectorImages / Pixabay)

  • Making Too Many Changes

    During maintenance windows, administrators run into trouble when they attempt to make too many changes at once. First, administrators become rushed, because they have to complete a large number of tasks in a short period of time; this often leads to mistakes. Second, because so many changes occur within the same timeframe, troubleshooting any post-change problems becomes far more difficult.

    (Image: mtreasure/iStockphoto)

  • Insufficient, Old, Or Misconfigured Backup Power

    The most common reason a data center goes down is a power failure. Power outages happen all the time, which is why data centers are designed with redundant power sources in case the primary source goes away. Battery and/or generator power is commonly used as a backup. The problem is that batteries aren't replaced in a timely manner, generators aren't tested, and power-failure drills aren't performed. Any of these oversights can mean your redundant power isn't available when you need it the most.
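    Those oversights are easy to catch with a simple maintenance-age check. A minimal sketch, assuming hypothetical record dates and illustrative intervals (real replacement and test schedules should follow the vendor's and your facility's policy):

```python
from datetime import date, timedelta

# Illustrative policy thresholds -- not vendor recommendations.
BATTERY_MAX_AGE = timedelta(days=4 * 365)      # replace UPS batteries ~every 4 years
GENERATOR_TEST_INTERVAL = timedelta(days=30)   # load-test generators monthly

def overdue_items(today, battery_installed, last_generator_test):
    """Return the backup-power maintenance tasks that are past due."""
    issues = []
    if today - battery_installed > BATTERY_MAX_AGE:
        issues.append("replace UPS batteries")
    if today - last_generator_test > GENERATOR_TEST_INTERVAL:
        issues.append("run generator load test")
    return issues
```

    Run from a scheduled job, a check like this turns "we forgot to test the generator" into a ticket instead of an outage.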

    (Image: HebiFot / Pixabay)

  • Cooling Failures

    It's mind-boggling how much heat data centers generate. That's why cooling is so critical. A facility that feels like an icebox one minute can become a sweltering furnace the next; it really does happen that fast. And even with temperature sensors sending readings and alerts to admins, you have to make sure you have enough time to implement your backup cooling procedures before everything melts down.
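    The "do we have enough time" question can be estimated from the rate of temperature rise. A minimal sketch, with an illustrative critical threshold:

```python
def minutes_until_critical(current_c, rate_c_per_min, critical_c=40.0):
    """Estimate minutes before the room hits the critical temperature,
    assuming the current rate of rise holds. Threshold is illustrative."""
    if current_c >= critical_c:
        return 0.0                      # already past the limit
    if rate_c_per_min <= 0:
        return float("inf")             # temperature steady or falling
    return (critical_c - current_c) / rate_c_per_min

# At 30 C and rising 2 C per minute, you have roughly 5 minutes.
print(minutes_until_critical(30.0, 2.0))  # 5.0
```

    If that estimate is shorter than the time it takes to bring backup cooling online, the alert threshold needs to move lower.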

    (Image: Hans / Pixabay)

  • Malfunctioning Automated Failover Procedures

    Most service providers and enterprise organizations have a backup data center that mirrors the production data center. In the event of a major outage at the primary site, automated procedures kick in and move all traffic to the backup facility. Done properly, the process should be seamless to end users. Unfortunately, automated failovers often don't work as expected, and the usual cause is a lack of regular testing. Even minor changes within the production infrastructure can have major impacts on automated failover processes. So whenever infrastructure changes are made, the automated failover procedures should be retested to verify that nothing has broken.
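    A regular failover drill can be as simple as running the standby site's health checks and refusing to trust the failover path if any fail. A minimal sketch, where the probe functions are stand-ins for real checks (DNS, replica lag, VPN reachability, and so on):

```python
# Sketch of a failover drill: run every standby-site probe and collect failures.
# A probe is any zero-argument callable returning True (healthy) or False.
def failover_drill(checks):
    """checks: {name: probe_fn}. Returns the names of all failing probes."""
    failures = []
    for name, probe in checks.items():
        try:
            healthy = probe()
        except Exception:
            healthy = False             # a crashing probe counts as a failure
        if not healthy:
            failures.append(name)
    return failures
```

    Wiring a drill like this into the change process -- run it after every production infrastructure change -- catches the quiet breakage before the real outage does.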

    (Image: 742680 / Pixabay)

  • Changes Outside Maintenance Windows

    If you've ever worked in a data center, you've likely been in the situation where a request comes in to make a minor change to a server or piece of network equipment. And while data center protocol technically requires you to run it through the change-control committee, you feel the change can easily be made outside of a formal change-control process and maintenance window. And 99 times out of 100, you're absolutely correct. But every once in a while, a minor change has unexpected consequences. The end result is an outage and a data center administrator in hot water.
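    Tooling can back up the policy by refusing ad-hoc changes outside the approved window unless an emergency approval is attached. A minimal sketch, with an illustrative 01:00-04:00 window:

```python
from datetime import datetime, time

# Illustrative approved maintenance window (local time), not a real policy.
WINDOW_START, WINDOW_END = time(1, 0), time(4, 0)

def change_allowed(now, emergency_approved=False):
    """Allow a change only inside the window, unless explicitly approved."""
    return emergency_approved or WINDOW_START <= now.time() < WINDOW_END

# Mid-afternoon "quick fix" without approval: blocked.
assert not change_allowed(datetime(2016, 6, 1, 14, 0))
```

    The point isn't to make emergencies impossible -- the override exists -- but to make skipping change control a deliberate, logged decision rather than a habit.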

    (Image: StevePB / Pixabay)

  • Hanging Onto Legacy Hardware

    All hardware fails at some point, and the longer you keep it, the more likely it is to fail. Everyone knows this, yet all too often a critical application goes down because it's running on 10-year-old hardware. The problem commonly stems from the lack of a comprehensive plan for migrating to a new hardware or software platform -- or from a lack of budget. If it's a money problem, there's not much you can do. But if you're simply dragging your feet, get to work removing as much legacy hardware as possible before it dies.

    (Image: Ed Costello / Flickr)

  • Wet Fire-Suppression Systems

    Most modern data centers use non-water fire-suppression systems, so equipment isn't damaged if the system is purposefully or accidentally triggered. But many older facilities still use wet fire-suppression systems in their data centers. Water leaks and accidental discharges of wet suppression systems have caused major outages. Proper maintenance and the use of pre-action single- or double-interlock systems help minimize the risk.

    (Image: Closa / Pixabay)