How To Avoid #fail In Storage

In the odd world that is Twitter, #fail is a tag you put on your tweet when something goes wrong in your life, at your job or when flying your least favorite airline. What do you do to avoid #fail in your storage infrastructure? The most important thing you can do when dealing with storage failure is to make sure you are prepared for something to go wrong before it ever happens.

George Crump

July 16, 2010

Network Computing

You're in IT. It is not a matter of if something will fail; it is a matter of when. The number one thing you can do to prepare for a failure is to know what you have in that infrastructure. Whether you try to fix the problem yourself or bring in an expert, the first thing anyone is going to ask for is an inventory of what you have so diagnosis can begin.

An inventory is not the latest copy of the data center diagram that you have spent hours on. While a good start, a diagram does not give the details that someone is going to need to begin diagnosing the problem. What is needed is a detailed configuration of every HBA, switch port and inter-switch link (ISL), along with how the storage ports and, of course, the storage itself are configured.
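To make the level of detail concrete, here is a minimal sketch of what one inventory capture might record. Every class and field name below is a hypothetical chosen for illustration; a real analysis tool will have its own schema and will collect far more attributes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hba:
    host: str
    wwpn: str             # world wide port name of the adapter
    firmware: str


@dataclass
class SwitchPort:
    switch: str
    port: int
    speed_gbps: int
    attached_wwpn: str    # WWPN seen on this port, "" if nothing is logged in
    is_isl: bool = False  # True when the port is one end of an inter-switch link


@dataclass
class StoragePort:
    array: str
    port: str
    wwpn: str
    luns: List[str] = field(default_factory=list)


@dataclass
class Inventory:
    captured_at: str      # ISO 8601 timestamp of the capture (assumed format)
    hbas: List[Hba] = field(default_factory=list)
    switch_ports: List[SwitchPort] = field(default_factory=list)
    storage_ports: List[StoragePort] = field(default_factory=list)


def hbas_not_on_fabric(inv: Inventory) -> List[Hba]:
    """Return HBAs whose WWPN does not appear on any switch port."""
    seen = {p.attached_wwpn for p in inv.switch_ports if p.attached_wwpn}
    return [h for h in inv.hbas if h.wwpn not in seen]
```

A question such as "which HBAs are not logged in to the fabric?" can only be answered from a capture like this, not from a diagram.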

It is also best if this information is captured frequently, preferably in real time by some sort of analysis tool (in other words, not in a spreadsheet). Spreadsheets are not IT diagnostic tools. We've seen troubleshooting projects where the inventory spreadsheet was more than six months old and had not been updated since before the server virtualization project started. Things had changed. Candidly, if your inventory is more than a few weeks old, especially in a virtualized environment, you probably shouldn't bother having one. A re-inventory is going to have to be performed, so you are better off just budgeting for that every time a problem arises in the environment.

The value of real-time capture is that it shows what was changing in the environment in the time leading up to the failure event, and those changes can often provide a clue to what went wrong. These tools can often also capture physical errors being logged by the system, which again can provide some insight into the cause. Most importantly, though, real-time capture can help you prevent a #fail before it ever happens.
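As a rough illustration of why frequent capture helps, the sketch below compares two inventory snapshots and reports what changed between them, and checks whether a snapshot is too old to trust. The snapshot layout (a dict of component ID to settings) and the three-week cutoff are assumptions made for the example, not something a particular tool prescribes.

```python
from datetime import datetime, timedelta


def diff_snapshots(old, new):
    """Compare two snapshots shaped like {component_id: {setting: value}}."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = {}
    for key in sorted(old.keys() & new.keys()):
        settings = sorted(old[key].keys() | new[key].keys())
        delta = {
            s: (old[key].get(s), new[key].get(s))
            for s in settings
            if old[key].get(s) != new[key].get(s)
        }
        if delta:
            changed[key] = delta  # setting -> (old value, new value)
    return {"added": added, "removed": removed, "changed": changed}


def is_stale(captured_at, now, max_age=timedelta(weeks=3)):
    """True when the snapshot is older than max_age (assumed cutoff)."""
    return now - captured_at > max_age
```

Run against the last capture before a failure and the one before that, the `changed` entries are the first places to look.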

The problem with most infrastructure hardware, storage hardware and their software components is not that they provide too little diagnostic information, but that they provide too much, and as a result the important information is lost in the shuffle. What real-time analysis tools can do is highlight when a message really needs your attention, or when a combination of loosely related messages is indicative of a failure. There is plenty more to do beyond developing an accurate inventory to help get through a storage failure, but knowing what you have is a critical first step.
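One simple way to picture that kind of highlighting is a sliding-window count: pass critical messages straight through, and flag a component only when several lesser warnings pile up within a short time. The sketch below assumes events arrive sorted by time as (timestamp, component, severity, message) tuples; the severity labels, the ten-minute window and the threshold of three are all illustrative assumptions, and commercial tools use far richer correlation.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def flag_events(events, window=timedelta(minutes=10), threshold=3):
    """Return (timestamp, component, reason) for events worth attention."""
    recent = defaultdict(deque)  # component -> timestamps of recent warnings
    flagged = []
    for ts, component, severity, message in events:
        if severity == "critical":
            flagged.append((ts, component, "critical: " + message))
            continue
        queue = recent[component]
        queue.append(ts)
        # Drop warnings that have aged out of the window.
        while ts - queue[0] > window:
            queue.popleft()
        # Flag once, at the moment the count reaches the threshold.
        if len(queue) == threshold:
            flagged.append((ts, component, "repeated warnings in window"))
    return flagged
```

A single CRC error on a port would be ignored here, while three on the same port within ten minutes would be surfaced.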
