Data center operators make common mistakes that can lead to outages. Most of these outages can be avoided through proper maintenance, sound procedures, and common sense.
-
{image 1}
An "unplanned data center outage" is a polite way to say that a data center failed. Whether the root cause is a hardware failure, software bug, or human error, most failures can -- and should -- be prevented. With the high level of redundancy built into today's data center architectures, prevention is very much possible.
The interesting thing is, data center failures still happen all the time. Considering the incredible cost of every minute lost during a full outage, you'd think they would be far rarer. If data center managers simply focused on fixing the main reasons failures commonly occur, they would significantly reduce the risk of a catastrophic outage.
The problem is that so many data center operators are heavily focused on growth instead of the care and feeding of what's already in place. If you watch administrators in many public and private data centers these days, you'll find that they are focused largely on increasing capacity, boosting server density, and retrofitting aging server rooms into more modern facilities with more efficient cooling systems. While all this is fantastic and shows the incredible growth in the data center industry, it also highlights why we commonly see outages.
On the following pages, we're going to get back to data center basics. We'll present 10 common reasons why data centers fail. Click through and think about how these common outages might one day surface in your data center. While not every failure scenario may match your data center architecture, we're confident that at least a few topics we mention will hit home and make you think about what you can do to shore up your facility.
And if you have any additional thoughts, tips, or stories that may help your fellow administrators avoid an outage, please share them in the comments below.
(Image: 123Net / Wikimedia Commons)
-
Improper System Authorization
Very few administrators, if any, should have full and unrestricted authorization to access every system in a data center. Instead, access should be tightly regulated. If it isn't, you may end up with an outage like the one Joyent experienced in 2014, when an admin unwittingly rebooted all of the virtual machines in the company's eastern data center with a few clicks of a mouse.
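One way to put that principle into practice is to make the most destructive operations impossible to trigger by accident. The sketch below is a minimal, hypothetical illustration -- the roles, names, and reboot call are invented, not any vendor's real API -- of gating a bulk reboot behind an explicit role check and a typed confirmation of the target data center.

    # Minimal sketch: gate a destructive bulk action behind a role check and
    # an explicit, typed confirmation. Roles, names, and the reboot call are
    # hypothetical placeholders, not any vendor's real API.

    ROLE_PERMISSIONS = {
        "vm-operator": {"reboot_single_vm"},
        "dc-admin":    {"reboot_single_vm", "reboot_all_vms"},
    }

    def is_authorized(role, action):
        """Return True only if the role explicitly includes the action."""
        return action in ROLE_PERMISSIONS.get(role, set())

    def reboot_all_vms(operator_role, datacenter):
        if not is_authorized(operator_role, "reboot_all_vms"):
            raise PermissionError(f"{operator_role} may not reboot every VM in {datacenter}")

        # Force the operator to type the target name; a mouse slip can't do this.
        typed = input(f"Type the data center name to confirm rebooting ALL VMs ({datacenter}): ")
        if typed.strip() != datacenter:
            print("Confirmation did not match; aborting.")
            return

        print(f"Rebooting all VMs in {datacenter}...")  # real orchestration call would go here

    if __name__ == "__main__":
        try:
            reboot_all_vms("vm-operator", "us-east-1")
        except PermissionError as err:
            print(f"Blocked: {err}")

A real deployment would tie the role check to your directory service and orchestration tooling, but even this much keeps a stray click from taking an entire facility offline.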
(Image: Grassetto/iStockphoto)
-
Poor Fallback Procedures
When planning for maintenance windows, the step that is most often neglected is the fallback procedure. Usually, the documented process hasn't been thoroughly vetted and fails to fully revert all changes to their original state.
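One way to keep a fallback procedure honest is to treat every change as a do/undo pair and rehearse the undo path before the window opens. The sketch below is purely illustrative; the plan entries are invented placeholders for whatever your change actually touches.

    # Minimal sketch of a change plan where every step carries its own rollback.
    # The steps here are placeholders; the point is that the undo path exists,
    # is ordered, and can be rehearsed before the maintenance window.

    maintenance_plan = [
        # (description,           apply step,                     rollback step)
        ("update switch config",  "copy new.cfg running-config",  "copy backup.cfg running-config"),
        ("upgrade app package",   "install app-2.0",              "install app-1.9"),
        ("change DNS record",     "set www -> 10.0.0.20",         "set www -> 10.0.0.10"),
    ]

    def execute(command):
        # Stand-in for actually running the command against a device or API.
        print(f"  running: {command}")
        return True  # pretend it succeeded

    def run_with_fallback(plan):
        completed = []
        for description, apply_cmd, rollback_cmd in plan:
            print(f"Applying: {description}")
            if execute(apply_cmd):
                completed.append((description, rollback_cmd))
            else:
                print("Step failed; rolling back everything done so far, newest first.")
                for desc, undo in reversed(completed):
                    print(f"Reverting: {desc}")
                    execute(undo)
                return False
        return True

    if __name__ == "__main__":
        run_with_fallback(maintenance_plan)

If a rollback step can't be written for a change, that's usually a sign the change itself needs more planning before it goes into a window.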
(Image: ClkerFreeVectorImages / Pixabay)
-
Making Too Many Changes
During maintenance windows, it can be problematic if administrators attempt to make too many changes at once. First, administrators become rushed because they have to complete a large number of tasks in a short period of time, which often leads to mistakes. Second, because so many changes occur within the same timeframe, troubleshooting post-change problems becomes far more difficult.
(Image: mtreasure/iStockphoto)
-
Insufficient, Old, Or Misconfigured Backup Power
The most common reason a data center goes down is a power failure. Power outages happen all the time, so data centers are designed with redundant power sources -- typically batteries and/or generators -- in case the primary source goes away. The problem is that batteries aren't replaced in a timely manner, generators aren't tested, and power-failure drills are never run. All of these oversights mean your redundant power may not be available when you need it most.
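Tracking this doesn't require anything elaborate. The sketch below is a minimal, hypothetical example of flagging overdue backup-power maintenance; the task names, dates, and intervals are made up and should come from your own maintenance records and vendor guidance.

    # Minimal sketch: track when backup-power maintenance tasks were last done
    # and flag anything overdue. Intervals and dates are made-up examples.
    from datetime import date, timedelta

    tasks = {
        # task name                   (last performed,    interval)
        "UPS battery replacement":  (date(2013, 6, 1),  timedelta(days=365 * 4)),
        "Generator load test":      (date(2015, 8, 15), timedelta(days=90)),
        "Full power-failure drill": (date(2014, 11, 1), timedelta(days=365)),
    }

    def overdue_tasks(tasks, today=None):
        today = today or date.today()
        report = []
        for name, (last_done, interval) in tasks.items():
            due = last_done + interval
            if due <= today:
                report.append((name, due, (today - due).days))
        return report

    if __name__ == "__main__":
        for name, due, days_late in overdue_tasks(tasks):
            print(f"OVERDUE: {name} was due {due} ({days_late} days ago)")

A shared calendar accomplishes the same thing; what matters is that battery replacements and generator tests have owners and due dates instead of living in someone's memory.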
(Image: HebiFot / Pixabay)
-
Cooling Failures
It's mind-boggling how much heat data centers generate, which is why cooling is so critical. A facility that feels like an icebox one minute can become a sweltering furnace the next; it really does happen that fast. And even when temperature sensors send readings and alerts to admins, you have to make sure there is enough time to bring backup cooling online before everything melts down.
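Knowing whether you have that time is partly a monitoring problem. The sketch below is a rough illustration -- the read_inlet_temp() function, thresholds, and intervals are invented stand-ins -- of estimating how many minutes remain before a critical temperature is reached, based on the current rate of rise.

    # Minimal sketch: poll a temperature reading and estimate how long until a
    # critical threshold is reached, so admins know whether backup cooling can
    # be brought online in time. read_inlet_temp() is a stand-in for however
    # your facility actually exposes sensor data (SNMP, DCIM API, etc.).
    import random
    import time

    WARN_C = 27.0      # warning threshold (example value)
    CRITICAL_C = 35.0  # temperature at which equipment is at real risk (example)

    def read_inlet_temp():
        # Placeholder: replace with a real sensor query.
        return 24.0 + random.random() * 8.0

    def monitor(interval_seconds=60, samples=5):
        last = read_inlet_temp()
        for _ in range(samples):
            time.sleep(interval_seconds)
            current = read_inlet_temp()
            rise_per_min = (current - last) / (interval_seconds / 60.0)
            if current >= WARN_C and rise_per_min > 0:
                minutes_left = (CRITICAL_C - current) / rise_per_min
                print(f"ALERT: {current:.1f} C, rising {rise_per_min:.1f} C/min, "
                      f"~{minutes_left:.0f} min until {CRITICAL_C} C")
            last = current

    if __name__ == "__main__":
        monitor(interval_seconds=1, samples=5)  # short interval for demonstration

The useful number in an alert isn't just the temperature; it's the estimated minutes of headroom, because that tells you whether your backup cooling procedure can realistically be completed in time.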
(Image: Hans / Pixabay)
-
Malfunctioning Automated Failover Procedures
Most service providers and enterprise organizations have a backup data center that mirrors the production facility. In the event of a major outage at the primary site, automated procedures kick in and move all traffic to the backup facility. Done properly, the process should be seamless to end users. Unfortunately, automated failovers often don't work as expected, and the usual cause is a lack of regular testing. Even minor changes within the production infrastructure can have major impacts on automated failover processes, so whenever infrastructure changes are made, the failover procedures should be retested to confirm nothing has broken.
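A lightweight post-change check can catch the obvious breakage before the next real failover does. The sketch below is illustrative only; the hostnames and ports are invented, and a reachability check is no substitute for a full failover rehearsal.

    # Minimal sketch: after an infrastructure change, verify that the services a
    # failover depends on still answer at the backup site before trusting the
    # automated process. Hostnames and ports here are invented examples.
    import socket

    BACKUP_SITE_CHECKS = [
        ("db.backup.example.net",  5432),
        ("web.backup.example.net", 443),
        ("dns.backup.example.net", 53),
    ]

    def port_reachable(host, port, timeout=3):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def verify_backup_site(checks):
        failures = [(h, p) for h, p in checks if not port_reachable(h, p)]
        if failures:
            for host, port in failures:
                print(f"FAIL: {host}:{port} unreachable -- failover would likely break here")
            return False
        print("Backup site checks passed; schedule a full failover rehearsal anyway.")
        return True

    if __name__ == "__main__":
        verify_backup_site(BACKUP_SITE_CHECKS)

Running something like this as part of the change-closure checklist makes "did we break failover?" a routine question rather than a surprise during the next outage.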
(Image: 742680 / Pixabay)
-
Changes Outside Maintenance Windows
If you've ever worked in a data center, you've likely been in the situation where a request comes in to make a minor change to a server or piece of network equipment. And while data center protocol technically requires you to run this through the change-control committee, you feel it can easily be made outside of a formal change-control process and maintenance window. And 99 out of 100 times, you're absolutely correct. But every once in a while, a minor change has unexpected consequences. The end result is an unexpected outage and a data center administrator who is in hot water.
(Image: StevePB / Pixabay)
-
Hanging Onto Legacy Hardware
All hardware is going to fail at some point, and the longer you keep it, the more likely it is to fail. Everyone knows this, yet very often a critical application goes down because it was running on 10-year-old hardware. These problems commonly arise from the lack of a comprehensive plan for migrating to a new hardware or software platform -- or from a lack of budget. If it's a money problem, there's not much you can do. But if you're simply dragging your feet, get to work retiring as much legacy hardware as possible before it dies.
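A simple inventory report can at least keep the problem visible. The sketch below is a hypothetical example -- the hosts, dates, and five-year service life are invented -- of flagging machines that are past their refresh date so they land on a migration plan before they fail.

    # Minimal sketch: scan a hardware inventory and flag machines past a
    # service-life threshold so they get onto a migration plan before they fail.
    # The inventory data and threshold are made-up examples.
    from datetime import date

    SERVICE_LIFE_YEARS = 5  # example policy; adjust to your own refresh cycle

    inventory = [
        # (hostname,      deployed,          runs a critical application?)
        ("erp-db-01",    date(2005, 3, 1),  True),
        ("web-cache-02", date(2013, 9, 12), False),
        ("file-srv-07",  date(2008, 1, 20), True),
    ]

    def flag_legacy(inventory, today=None):
        today = today or date.today()
        for host, deployed, critical in inventory:
            age_years = (today - deployed).days / 365.25
            if age_years > SERVICE_LIFE_YEARS:
                tag = "CRITICAL APP" if critical else "non-critical"
                print(f"{host}: {age_years:.1f} years old ({tag}) -- plan migration")

    if __name__ == "__main__":
        flag_legacy(inventory)

Sorting that report by which boxes carry critical applications is a reasonable way to decide what gets migrated first when the budget only covers part of the list.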
(Image: Ed Costello / Flickr)
-
Wet Fire-Suppression Systems
Most modern data centers use non-water fire-suppression systems so that a discharge, whether intentional or accidental, doesn't damage equipment. But many older facilities still use wet fire-suppression systems in their data centers. Water leaks and accidental discharges of wet suppression systems have caused major outages. Proper maintenance and the use of pre-action single- or double-interlock systems help minimize the risk.
(Image: Closa / Pixabay)
-
Accidental Activation Of Emergency Power-Off
The high levels of physical security implemented at most data centers aren't simply there to keep thieves out. They are also in place to keep out employees who have no understanding of how a data center works. All too often, a data center goes down because an application administrator waltzes into the facility and accidentally trips the emergency power-off (EPO), the big red button that cuts power to the entire facility. And apparently, for those who don't understand what it does, the impulse to push it is irresistible.
(Image: Antagain/iStockphoto)


Comments
marciasavage
Tue, 11/10/2015 - 10:52
Andrew, thanks for all these tips and common sense guidance. Are there tools that can help manage some tasks, like reminders on battery replacements and power failure testing?
Network Computing
Tue, 11/10/2015 - 11:20
Enterprise network monitoring tools have the ability to alert on battery replacements after a set time has elapsed. But for the most part, these types of things can be handled with a shared Outlook calendar and a data center manager with the proper mindset of proactive maintenance.
Network Computing
Wed, 11/11/2015 - 17:09
Power failure testing is something that makes stakeholders nervous; some data center owners don't perform this type of testing for fear of causing an outage. I remember working for a small data center early in my career where such infrastructure testing was hardly ever considered, and not because of cost but because of the fear of an outage or of losing revenue traffic.
marciasavage
Thu, 11/12/2015 - 07:13
I see, thanks for sharing that virsingh. Sounds like the testing is kind of a double-edged sword then.
Network Computing
Fri, 11/13/2015 - 00:34
Thank you for this interesting article.
I learned a lot about data centers :)
marciasavage
Fri, 11/13/2015 - 14:48
Glad you liked this @frankun! Thanks for chiming in.