Yet most companies have overlooked the most obvious way to reduce outages, said Eric Hanselman, chief analyst at 451 Research.
"Any time an enterprise is looking to make sure it has solid availability, the top three things it should be tackling are monitoring, monitoring and monitoring," Hanselman said during a phone interview. "I'm amazed by the number of enterprises that haven't invested much in this area."
Although applications have to be reliable enough to meet user expectations around the clock, many companies rely on employees to be their monitors.
"If you're learning about failures from your users, you have big problems," he said.
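As a rough illustration of Hanselman's point, automated monitoring at its simplest means running health probes on a schedule so that failures surface before users report them. The sketch below is not drawn from any product mentioned here; the function name `failing_services` and the probe names are hypothetical, and real probes would make network or API calls in place of the stand-in lambdas.

```python
from typing import Callable, Dict, List


def failing_services(probes: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run each health probe and return the names of services that failed.

    A probe is any zero-argument callable that returns True when the
    service is healthy. A probe that raises is treated as a failure,
    since a check that cannot complete is itself a warning sign.
    """
    failed = []
    for name, probe in probes.items():
        try:
            healthy = probe()
        except Exception:
            healthy = False
        if not healthy:
            failed.append(name)
    return failed


if __name__ == "__main__":
    # Hypothetical probes standing in for real checks.
    probes = {
        "web-frontend": lambda: True,
        "order-database": lambda: False,
    }
    for service in failing_services(probes):
        print(f"ALERT: {service} failed its health check")
```

In practice a scheduler would call a function like this every few seconds and send the results to a paging or alerting system.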
It doesn't have to be that way. Experts say there are clear approaches to preventing the kinds of incidents that cause so many data center outages, and lessons to be learned from companies that have suffered them.
For instance, Ponemon's research indicates that the three most common causes of unplanned outages are UPS equipment failure, human error and distributed denial-of-service attacks. In each case, Hanselman suggests an approach for reducing the likelihood of an outage:
• UPS equipment failures can be reduced through more intelligent systems design, he said. More specifically, he urges companies to spend whatever they can on redundancy to avoid the impact of an inevitable equipment failure. "You have to approach it with a realistic understanding of what the costs of an outage would be to your business," he said.
• Better automation is the best tonic for preventing human error. Even though automation is itself built by people, and is thus prone to errors, ensuring that systems automatically follow policies and procedures can only help.
"No organization out there should be manually changing anything more than a very small handful of what its infrastructure consists of," Hanselman said. "Routine tasks, bringing up systems, and configuring and managing them, should all be automated."
• Finally, when it comes to preventing DDoS attacks, it's all about making security investments. Attackers have gotten so sophisticated, Hanselman said, that they can make a server look like it's got plenty of availability even if it's been corrupted and is actually doing nothing.
Special attention should be paid to protecting the customer experience. "You've got to make sure that path from your end customer out to the entire interaction experience is protected," he said.
[The Ponemon Institute study found that data center operators are overwhelmed by outages and aren't sure of their ability to minimize their impact. Read the details in "Data Center Outages Haunt IT Pros."]
Meanwhile, Kelly Quinn, a research manager who monitors data center trends for IDC, believes a couple of recent incidents provide important reminders to companies that want to avoid similar fates.
First, there were the epic and well-documented struggles of the Healthcare.gov site, the central hub for Obamacare. Quinn said via email that a few weeks after the initial site outage occurred in October, she attended an investment conference at which a spokesperson for Verizon -- its Terremark subsidiary provides the site's underlying infrastructure -- admitted that the demand the site experienced was far in excess of what had been scoped.
In other words, said Quinn, the fiasco "could have been avoided had the Department of Health and Human Services better estimated the demand," thus enabling Terremark to prepare accordingly.
Another large outage that holds implications for many enterprises is the one that hit Turbine, Warner Bros.' online game unit, last month, knocking out the company's Lord of the Rings game for a large chunk of a day. Details of the outage's cause are scarce, Quinn said, but the message to data center operators was clear.
"Any company operating data centers should have the requisite backup battery power to keep the servers running for up to 15 minutes while the company works to a) get the main power back online, and b) get backup generators online in case main power is not available," she said.
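Quinn's 15-minute figure can be checked against a facility's own numbers with simple arithmetic: runtime is usable battery energy divided by the load it must carry. The sketch below is a simplification under stated assumptions. It ignores inverter losses, battery aging and the way capacity drops at high discharge rates, and the function names and example figures are hypothetical.

```python
def battery_runtime_minutes(usable_energy_wh: float, load_w: float) -> float:
    """Estimate how long a battery bank can carry a constant load.

    usable_energy_wh: energy the batteries can actually deliver, in watt-hours
                      (assumed already derated for losses and aging).
    load_w:           steady power draw of the protected equipment, in watts.
    """
    if load_w <= 0:
        raise ValueError("load_w must be positive")
    return usable_energy_wh / load_w * 60


def meets_runtime_target(usable_energy_wh: float, load_w: float,
                         target_minutes: float = 15.0) -> bool:
    """Return True if the estimated runtime reaches the target (15 minutes by default)."""
    return battery_runtime_minutes(usable_energy_wh, load_w) >= target_minutes


if __name__ == "__main__":
    # Hypothetical example: 5 kWh of usable energy carrying a 20 kW load.
    print(battery_runtime_minutes(5000, 20000))   # 15.0
    print(meets_runtime_target(5000, 20000))      # True
    print(meets_runtime_target(5000, 40000))      # False
```

A real sizing exercise would rely on the UPS vendor's runtime tables, since actual battery behavior under load is not linear.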