Human errors are the biggest cause of data center downtime, according to vendors and users at the NDCF-sponsored Data Center Forum 2004 in New York this week.
Research presented by American Power Conversion Corp. revealed that operator errors, along with shortcomings in data center design and construction, are responsible for more than 60 percent of all data center failures. Equipment failures accounted for around one third of outages, and external causes such as floods and earthquakes were responsible for just over 5 percent of downtime.
Data center failure is a sensitive subject; one IT manager attending the conference agreed to talk to NDCF on condition that his name was not published. Nonetheless, he was not exactly shocked by APC's findings. "Data centers are a very dynamic environment. There's a lot happening, and not everyone takes the time to update their documentation," he says.
"Documentation" refers to the manuals for the different pieces of data center kit, which are usually kept in a central directory that staff can refer to. The IT manager says some data centers are better at keeping these directories up to date than others. However, he admits that staff turnover and a lack of experienced IT staff contribute to the problem.
Neil Rasmussen, APC's founder and CTO, cites training as an area that could be improved to help reduce the risk of human error. Common standards such as the Data Center Markup Language (DCML) will also help by standardizing systems management, he adds, although this is still some way off.