Howard Marks

False Disk Drive Failures Are a Real Problem

New information shows disk drives report false failures with alarming frequency. False failures have real costs for data center operators. Here’s what the industry can do about it.

Disk drive vendors have been telling us for years that more than half the drives that get returned to them for warranty repairs fall into the category they call NPF--no problem found. As an IT professional, I assumed that the true cause of this phenomenon, like many of the problems that beset the helpdesk, was located between the keyboard and chair. A recent blog post by LSI's Rob Ober shines new light on the subject of false drive failures and has me wondering why this problem persists.

Ober notes that false failures are a major problem not just for hobbyists who buy bare drives from Fry's and Newegg but also for major data center operators. Data center operators, like thee and me, incur substantial costs when a drive fails. For instance:

• System performance drops off, often for days, as the RAID system rebuilds as much as 4TB of data onto a hot spare drive. In distributed environments using scale-out storage, this also affects network traffic as the rebuild data has to be consolidated across multiple storage nodes.

• Someone has to go change the drive.

• Because the drive has sensitive corporate data, it has to be sanitized or destroyed. If you're not big enough to have an agreement with your storage vendor to replace failed drives on your say-so, it may mean you also have to pay for the replacement as you can't return drives.

The problem is that today's disk drives are run by internal microcontrollers that have firmware. Just like your PC or Mac, that software occasionally gets confused and the processor freezes. The drive hits a series of requests and states that weren't completely debugged in the development process and its processor stops responding to commands from the host or RAID controller.

If a host or RAID controller reports such a drive as failed, that drive will work just fine when removed from the host and tested elsewhere. (We all know that turning the power off and on again solves a lot of computing problems.) In fact, studies have shown that drives that have suffered this kind of false failure are just as reliable, after they get a reset, as new drives fresh from the factory.

Mr. Ober actually got a large data center operator, who remains nameless, to share its drive failure statistics with him. This data center, while small by Google or Facebook standards, is pretty huge, with over 200,000 servers.

They found:

• 30+% of their SAS drive failures are false, adding up to 10-15 a day or a 1/1000 annual false failure rate.

• SATA drives, directly connected to server motherboards, have an even higher false failure rate, approaching the 50% number that drive vendors have long reported, and a frightening 1% annual false failure rate.
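
The figures in the list above can be sanity-checked with a little arithmetic. The Python sketch below is only a consistency check, not data from Ober's post: it assumes a 365-day year and reads "10-15 a day" as the count of false SAS failures. The operator's drive count is not given, so the 100,000-drive SATA fleet is a purely hypothetical number chosen for illustration.

```python
DAYS_PER_YEAR = 365  # assumption: simple calendar year


def implied_population(false_failures_per_day: int, one_in_n_per_year: int) -> int:
    """Drive population implied by a daily false-failure count and an
    annual false failure rate expressed as "1 in N drives per year"."""
    return false_failures_per_day * DAYS_PER_YEAR * one_in_n_per_year


def false_failures_per_year(drive_count: int, one_in_n_per_year: int) -> float:
    """Expected false failures per year for a fleet of a given size."""
    return drive_count / one_in_n_per_year


# SAS: 10-15 false failures a day at a 1/1000 annual rate.
sas_low = implied_population(10, 1000)
sas_high = implied_population(15, 1000)
print(f"Implied SAS population: {sas_low:,} to {sas_high:,} drives")

# SATA: 1% annual rate (1 in 100) on a HYPOTHETICAL fleet of 100,000 drives.
sata_yearly = false_failures_per_year(100_000, 100)
print(f"Hypothetical SATA fleet: {sata_yearly:,.0f} false failures a year, "
      f"about {sata_yearly / DAYS_PER_YEAR:.1f} a day")
```

Read this way, the SAS figures imply a population in the millions of drives, which is worth keeping in mind when comparing them with the 200,000-server count above.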

A few vendors have tried to address the problem. Five years ago Xiotech and Atrato were talking up "self-healing" disk arrays that would perform repair tasks rather than starting a RAID rebuild immediately when a drive stopped responding to commands. Xiotech, working closely with Seagate, could even keep running a drive with a damaged surface or failed head by mapping accesses around it. Of course the first step in this recovery process was to perform a hard reset on the drive.

With the industry turning its fickle attention to flash, self-healing arrays aren't cool anymore. Atrato has gone the way of all flesh and Xiotech, now renamed X-IO, has faded in relevance as its last independent competitors, Compellent, 3PAR and even Nexsan, were acquired.

Because the disk drive market is essentially a duopoly selling high-volume, low-margin products, I don't expect Seagate or Western Digital to build a highly redundant circuit board into drives that could detect a false failure and reset the drive. However, there are a few things industry players, including LSI, could do.

SAS controller vendors such as LSI could build false failure detection and reset into the controller. When a drive fails to respond, the controller could give it a quick kick before starting a RAID rebuild. This is harder on SATA drives as they lack some of the connections needed, but the folks who control the SATA spec could add a hard reset capability in the 6-12Gbps upgrade that's coming in the next few years. Short of that, array vendors could add the ability to cut the power to individual drives to force the reset.
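
The escalation described above (try a reset, then a power cycle, and only then rebuild) might look something like the following Python sketch. Everything here is hypothetical: `RecoveryStep`, `handle_unresponsive_drive`, and the `FakeDrive` simulator are illustrative names, not any vendor's controller API, and real firmware would also need timeouts, retry limits, and a way to resynchronize writes the drive missed while it was offline.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Outcome(Enum):
    RECOVERED = "recovered"  # drive came back; no rebuild needed
    REBUILD = "rebuild"      # every recovery step failed; rebuild started


@dataclass
class RecoveryStep:
    name: str
    action: Callable[[], None]  # hypothetical hook, e.g. hard reset or power cycle


def handle_unresponsive_drive(
    is_responding: Callable[[], bool],
    steps: List[RecoveryStep],
    start_rebuild: Callable[[], None],
) -> Outcome:
    """Try each recovery step in order; rebuild only if none revives the drive."""
    for step in steps:
        step.action()
        if is_responding():
            return Outcome.RECOVERED
    start_rebuild()
    return Outcome.REBUILD


class FakeDrive:
    """Simulated drive whose firmware stays hung until it is power cycled."""

    def __init__(self) -> None:
        self.hung = True
        self.events: List[str] = []

    def hard_reset(self) -> None:
        self.events.append("hard_reset")  # in this simulation, not enough

    def power_cycle(self) -> None:
        self.events.append("power_cycle")
        self.hung = False

    def responds(self) -> bool:
        return not self.hung


drive = FakeDrive()
rebuilds: List[str] = []
outcome = handle_unresponsive_drive(
    drive.responds,
    [RecoveryStep("hard reset", drive.hard_reset),
     RecoveryStep("power cycle", drive.power_cycle)],
    start_rebuild=lambda: rebuilds.append("rebuild"),
)
print(outcome, drive.events, rebuilds)
```

Ordering the steps from least to most disruptive keeps the common case cheap: a drive that wakes up after a reset never triggers the multi-day rebuild.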

No matter how you cut it, a 1% annual false failure rate is unacceptable. The industry should be working on real solutions, not just faster rebuilds.

User Rank: Apprentice
3/28/2013 | 10:22:01 PM
re: False Disk Drive Failures Are a Real Problem
Problem is SATA has no real hard reset function. A RAID vendor could put relays on the power for the drives and do an external power cycle to "wake" the drive. They'd also need a way to journal changes to bring the drive up to date after its little nap.

Re: SSDs, no one has 20,000 or more SSDs in their data center to report rates. The basic problem, that there's a little computer in each drive and that those computers sometimes go catatonic, is the same.
User Rank: Apprentice
3/27/2013 | 1:56:50 PM
re: False Disk Drive Failures Are a Real Problem
Howard, Is it feasible for IT in-house or a third party to build software to automatically delay a RAID rebuild and give the drives that quick kick? That sounds like a nice little niche. And, is this problem negated with SSDs? Lorna Garey, IW Reports