Special Coverage Series

Network Computing

Commentary

Howard Marks Network Computing Blogger

False Disk Drive Failures Are a Real Problem

New information shows disk drives report false failures with alarming frequency. False failures have real costs for data center operators. Here’s what the industry can do about it.

Disk drive vendors have been telling us for years that more than half the drives that get returned to them for warranty repairs fall into the category they call NPF--no problem found. As an IT professional, I assumed that the true cause of this phenomenon, like many of the problems that beset the helpdesk, was located between the keyboard and chair. A recent blog post by LSI's Rob Ober shines new light on the subject of false drive failures and has me wondering why this problem persists.

Ober notes that false failures are a major problem not just for hobbyists who buy bare drives from Fry's and Newegg but also for major data center operators. Data center operators, like thee and me, incur substantial costs when a drive fails. For instance:

• System performance drops off, often for days, as the RAID system rebuilds as much as 4TB of data onto a hot spare drive. In distributed environments using scale-out storage, this also affects network traffic as the rebuild data has to be consolidated across multiple storage nodes.

• Someone has to go change the drive.

• Because the drive has sensitive corporate data, it has to be sanitized or destroyed. If you're not big enough to have an agreement with your storage vendor to replace failed drives on your say-so, it may mean you also have to pay for the replacement as you can't return drives.

The problem is that today's disk drives are run by internal microcontrollers that have firmware. Just like your PC or Mac, that software occasionally gets confused and the processor freezes. The drive hits a series of requests and states that weren't completely debugged in the development process and its processor stops responding to commands from the host or RAID controller.

If a host or RAID controller reports such a drive as failed, that drive will work just fine when removed from the host and tested elsewhere. (We all know turning the power off and back on solves a lot of computing problems.) In fact, studies have shown that drives that have suffered this kind of false failure are just as reliable, after they get a reset, as new drives fresh from the factory.

Mr. Ober actually got a large data center operator, which remains nameless, to share its drive failure statistics with him. This data center, while small by Google or Facebook standards, is pretty huge, with over 200,000 servers.

They found:

• 30+% of their SAS drive failures are false, adding up to 10-15 a day, or a 1-in-1,000 annual false failure rate.

• SATA drives, directly connected to server motherboards, have an even higher false failure rate, approaching the 50% number that drive vendors have long reported, and a frightening 1% annual false failure rate.
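To get a feel for what those annual rates mean day to day, here is a rough back-of-the-envelope sketch. The fleet size is a made-up number for illustration (the operator's actual drive count isn't given), and the function name is my own; it simply assumes false failures are spread evenly across the year.

```python
DAYS_PER_YEAR = 365


def false_failures_per_day(drive_count: int, annual_false_failure_rate: float) -> float:
    """Expected false failures per day for a fleet.

    Assumes false failures are spread evenly over the year, which is a
    simplification for illustration only.
    """
    if drive_count < 0:
        raise ValueError("drive_count must be non-negative")
    if not 0.0 <= annual_false_failure_rate <= 1.0:
        raise ValueError("annual_false_failure_rate must be between 0 and 1")
    return drive_count * annual_false_failure_rate / DAYS_PER_YEAR


# HYPOTHETICAL fleet size, chosen only to make the arithmetic concrete.
HYPOTHETICAL_DRIVE_COUNT = 100_000

# Annual false failure rates quoted above: 1/1000 for SAS, 1% for SATA.
for label, rate in (("SAS", 0.001), ("SATA", 0.01)):
    per_day = false_failures_per_day(HYPOTHETICAL_DRIVE_COUNT, rate)
    print(f"{label}: about {per_day:.2f} false failures per day")
```

Swap in your own drive count to see how many unnecessary rebuilds and drive swaps those rates would imply for your shop.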

A few vendors have tried to address the problem. Five years ago Xiotech and Atrato were talking up "self-healing" disk arrays that would perform repair tasks rather than starting a RAID rebuild immediately when a drive stopped responding to commands. Xiotech, working closely with Seagate, could even keep running a drive with a damaged surface or failed head by mapping accesses around it. Of course the first step in this recovery process was to perform a hard reset on the drive.

With the industry turning its fickle attention to flash, self-healing arrays aren't cool anymore. Atrato has gone the way of all flesh, and Xiotech, now renamed X-IO, has faded in relevance as its last independent competitors, Compellent, 3PAR and even Nexsan, were acquired.

Because the disk drive market is essentially a duopoly selling high-volume, low-margin products, I don't expect Seagate or Western Digital to build a highly redundant circuit board into each drive that could detect a false failure and reset the drive. However, there are a few things industry players, including LSI, could do.

SAS controller vendors such as LSI could build false failure detection and reset into the controller. When a drive fails to respond, the controller could give it a quick kick before starting a RAID rebuild. This is harder on SATA drives as they lack some of the connections needed, but the folks who control the SATA spec could add a hard reset capability in the 6-12Gbps upgrade that's coming in the next few years. Short of that, array vendors could add the ability to cut the power to individual drives to force the reset.
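The "kick it before you rebuild" policy is simple enough to sketch in a few lines. This is only an illustration of the decision logic, not any vendor's firmware: the `SimulatedDrive` class, the `handle_unresponsive_drive` function and the `max_resets` parameter are all hypothetical names, and a real controller would issue the reset over the SAS link or by cycling drive power.

```python
from dataclasses import dataclass


@dataclass
class SimulatedDrive:
    """Toy stand-in for a drive; not a real controller API."""
    hung: bool = False   # firmware frozen; a hard reset clears this
    dead: bool = False   # genuine hardware failure; a reset does not help
    resets: int = 0

    def responds(self) -> bool:
        return not (self.hung or self.dead)

    def hard_reset(self) -> None:
        self.resets += 1
        self.hung = False


def handle_unresponsive_drive(drive: SimulatedDrive, max_resets: int = 1) -> str:
    """Return 'healthy', 'recovered' or 'rebuild'.

    Try a hard reset up to max_resets times before giving up on the
    drive and falling back to a RAID rebuild.
    """
    if drive.responds():
        return "healthy"
    for _ in range(max_resets):
        drive.hard_reset()
        if drive.responds():
            return "recovered"   # false failure: no rebuild needed
    return "rebuild"             # still silent: treat as a real failure


print(handle_unresponsive_drive(SimulatedDrive(hung=True)))
print(handle_unresponsive_drive(SimulatedDrive(dead=True)))
```

The point of the design is the ordering: the cheap reset comes first, and the expensive rebuild starts only if the drive stays silent afterward.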

No matter how you cut it, a 1% annual false failure rate is unacceptable. The industry should be working on real solutions, not just faster rebuilds.


