Automation Not The Solution For Human Error

November 24, 2010

We've all had our share of misfortunes with IT devices and services that have failed to perform as expected in an increasingly information-centric world. But as much as we may want to fault the technology, it appears that we are to blame in the majority of cases, at least as far as data-center outages are concerned. The solution is not to replace humans with lights-out automation but to provide better training, processes and procedures, says Julian Kudritzki, vice president of the Uptime Institute. "It's the same things over and over causing the failures: either the lack of processes, procedures and training, or the procedures are not followed."

The institute recently published its Operational Sustainability standard to address the human factor. Unplanned outages are nearly universal: according to a recent survey from the Ponemon Institute, 95 percent of U.S. data centers have experienced one.

Respondents averaged 2.48 complete data-center shutdowns over the two-year period, with an average duration of 107 minutes. Partial outages were more frequent still: row- or rack-based outages occurred an average of 6.8 times, with an average duration of 152 minutes, and rack- and server-based downtime occurred an average of 11.2 times, with an average duration of 153 minutes. While not the biggest single factor, accidental EPO (emergency power off) and other human error accounted for 51 percent of the outages.

Kudritzki says human error is in fact a bigger problem, accounting for up to 70 percent of data-center outages. The institute has been gathering Abnormal Incident Reports from more than 100 of the largest, most critical sites globally since 1994; of the just under 5,000 reports collected so far, including 500 on full data-center shutdowns, more than 73 percent of events were attributed to human factors.

The problem of human error also seems to be worsening, he adds. "When looked at over the last one-and-a-half to two years, we've actually seen a slight uptick in process-related failures. There's a lot of work we need to do as an industry to address this."

IT management is aware of the problems but is overworked and under-funded, says Kudritzki. "The response has been we'll go with more automation. There are some slick solutions... (but) technology might not be the answer. It might come down to fundamentals, sufficient training and appropriate processes and procedures."

The first step in addressing this issue is to ascertain what you do and don't have, he says. "Once you've looked at what's there, determine whether it's effective and then sharpen your knives." The key is to run your data center the way emergency, fire or submarine crews run their operations. "That's why you see so many ex-military in data-center operations... the military is a process and procedure environment."

Created to provide third-party research, education, and consulting focused on improving data-center performance and efficiency, the institute serves enterprise and third-party operators, manufacturers, providers, and engineers. Late last year it was acquired by IT market analyst firm 451 Group.
