Special Coverage Series

Network Computing

Special Coverage Series

Commentary

Howard Marks
Howard Marks Network Computing Blogger

SSDs vs. RAID? Fix RAID

SSD write endurance limitations could cause problems for RAID. But rather than adjust SSDs to support RAID, we should tweak or replace RAID to suit SSDs.

In a recent blog post, my friend and fellow storage analyst Ray Lucchesi suggested we may be living dangerously by combining today's SSDs, with their limited write endurance, and RAID. Ray suggests changes to SSDs to make them work better with RAID.

I think Ray has it backward. We shouldn't be worried about how to make SSDs work better with RAID. We should think about whether RAID might need to be adjusted, or replaced, to support SSDs.

More Insights

Webcasts

More >>

White Papers

More >>

Reports

More >>

Ray wrote his post after reading Antifragile: Things That Gain from Disorder, a new bestseller by Nassim Nicholas Taleb, author of "The Black Swan." Taleb's premise is that there are systems that not only tolerate faults and other stressors but actually improve because of them. Fragile systems, by contrast, fail when stressed. Basically, antifragile systems are the embodiment of "that which doesn't kill you makes you stronger."

[Critical business apps can take advantage of SSD speeds. Find out how in "Flash Balances The Books With Atomic Writes."]

The problem Ray is concerned about is that limited write endurance would result in multiple SSDs failing in close enough succession to cause data loss. Because RAID was designed to deal with disk drives, which don't really wear out but just fail randomly, Ray argues that we should change SSDs to increase the normality, or randomness, of their failures. This would spread out failures over time. As a result, the RAID system could rebuild one failed SSD, and the operator could replace the failed drive before a second SSD failed.

Ray then suggests that storage professionals and the SSD vendors make some changes to our practices to spread the failures across enough time to make RAID work well:

• Intermix older and newer (fresher) SSDs in a single RAID so they don't all fail together

• Avoid writing the same amount of data to multiple SSDs by mirroring SSDs or wide striping even across multiple RAIDsets

• Mix SSDs with different write endurance levels in the same RAIDset

• Eliminate SSD wear leveling using defect skipping instead

These changes, especially eliminating wear leveling, would increase the random distribution of SSD failures. The problem is, they would also reduce the life of the SSDs. I for one would rather have a pile of SSDs that will all fail after 30,000 cycles than those that will fail randomly between 10,000 and 25,000 cycles.

Just about every SSD has SMART (Self-Monitoring, Analysis and Reporting Technology) counters that report not only how many flash pages have failed, but also the remaining percentage of the device's promised write endurance.

If our RAID controllers (and their software equivalents, as we move to software-defined storage) simply monitored these counters, they could send the operator a message; even better, they could send a message to the vendor's support group. The drives could be replaced before they exhaust their write endurance.

Part of our problem is that we've been running RAID so long it's become ingrained in our basic idea of storage. But RAID itself was a solution to a problem. Patterson, Gibson and Katz originally proposed RAID because making disk drives bigger and faster was getting too expensive. RAID was designed so that an array of inexpensive drives could be bigger and faster than a SLED (Single Large Expensive Drive).

SSDs are plenty fast, so we rarely need RAID to increase their speed. However, our reliability expectations have also been raised, so we do need some sort of redundancy to make them more reliable. I say it's better to accept that flash is different than disks and change our software than sacrifice flash's advantages to make it fit a 25-year old RAID design.

Those building post-RAID data protection schemes for SSDs, and hybrid environments, should modify the old mirroring, parity and double parity schemes to not only protect against a device failure, but also to minimize the amount of write amplification they create in the process. Avoiding writes, especially small writes, will extend SSD life and therefore reliability.

Rather than eliminating wear leveling, which would be difficult if not impossible given that SSD controllers need to constantly write data to blank pages, they should extend it to even wear across not only the flash in an SSD, but across the SSDs in the system.

We'll get higher availability if we focus on overall failure avoidance instead of enhanced failure recovery.

Am I onto something here, or is Ray's approach more sensible? Feel free to pick sides in the comments section.



Related Reading



Network Computing encourages readers to engage in spirited, healthy debate, including taking us to task. However, Network Computing moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing/SPAM. Network Computing further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | Please read our commenting policy.
 

Editor's Choice

Research: 2014 State of Server Technology

Research: 2014 State of Server Technology

Buying power and influence are rapidly shifting to service providers. Where does that leave enterprise IT? Not at the cutting edge, thatís for sure: Only 19% are increasing both the number and capability of servers, budgets are level or down for 60% and just 12% are using new micro technology.
Get full survey results now! »

Vendor Turf Wars

Vendor Turf Wars

The enterprise tech market used to be an orderly place, where vendors had clearly defined markets. No more. Driven both by increasing complexity and Wall Street demands for growth, big vendors are duking it out for primacy -- and refusing to work together for IT's benefit. Must we now pick a side, or is neutrality an option?
Get the Digital Issue »

WEBCAST: Software Defined Networking (SDN) First Steps

WEBCAST: Software Defined Networking (SDN) First Steps


Software defined networking encompasses several emerging technologies that bring programmable interfaces to data center networks and promise to make networks more observable and automated, as well as better suited to the specific needs of large virtualized data centers. Attend this webcast to learn the overall concept of SDN and its benefits, describe the different conceptual approaches to SDN, and examine the various technologies, both proprietary and open source, that are emerging.
Register Today »

Related Content

From Our Sponsor

How Data Center Infrastructure Management Software Improves Planning and Cuts Operational Cost

How Data Center Infrastructure Management Software Improves Planning and Cuts Operational Cost

Business executives are challenging their IT staffs to convert data centers from cost centers into producers of business value. Data centers can make a significant impact to the bottom line by enabling the business to respond more quickly to market demands. This paper demonstrates, through a series of examples, how data center infrastructure management software tools can simplify operational processes, cut costs, and speed up information delivery.

Impact of Hot and Cold Aisle Containment on Data Center Temperature and Efficiency

Impact of Hot and Cold Aisle Containment on Data Center Temperature and Efficiency

Both hot-air and cold-air containment can improve the predictability and efficiency of traditional data center cooling systems. While both approaches minimize the mixing of hot and cold air, there are practical differences in implementation and operation that have significant consequences on work environment conditions, PUE, and economizer mode hours. The choice of hot-aisle containment over cold-aisle containment can save 43% in annual cooling system energy cost, corresponding to a 15% reduction in annualized PUE. This paper examines both methodologies and highlights the reasons why hot-aisle containment emerges as the preferred best practice for new data centers.

Monitoring Physical Threats in the Data Center

Monitoring Physical Threats in the Data Center

Traditional methodologies for monitoring the data center environment are no longer sufficient. With technologies such as blade servers driving up cooling demands and regulations such as Sarbanes-Oxley driving up data security requirements, the physical environment in the data center must be watched more closely. While well understood protocols exist for monitoring physical devices such as UPS systems, computer room air conditioners, and fire suppression systems, there is a class of distributed monitoring points that is often ignored. This paper describes this class of threats, suggests approaches to deploying monitoring devices, and provides best practices in leveraging the collected data to reduce downtime.

Cooling Strategies for Ultra-High Density Racks and Blade Servers

Cooling Strategies for Ultra-High Density Racks and Blade Servers

Rack power of 10 kW per rack or more can result from the deployment of high density information technology equipment such as blade servers. This creates difficult cooling challenges in a data center environment where the industry average rack power consumption is under 2 kW. Five strategies for deploying ultra-high power racks are described, covering practical solutions for both new and existing data centers.

Power and Cooling Capacity Management for Data Centers

Power and Cooling Capacity Management for Data Centers

High density IT equipment stresses the power density capability of modern data centers. Installation and unmanaged proliferation of this equipment can lead to unexpected problems with power and cooling infrastructure including overheating, overloads, and loss of redundancy. The ability to measure and predict power and cooling capability at the rack enclosure level is required to ensure predictable performance and optimize use of the physical infrastructure resource. This paper describes the principles for achieving power and cooling capacity management.