In a recent blog post, my friend and fellow storage analyst Ray Lucchesi suggested we may be living dangerously by combining today's SSDs, with their limited write endurance, and RAID. Ray suggests changes to SSDs to make them work better with RAID.
I think Ray has it backward. We shouldn't be worried about how to make SSDs work better with RAID. We should think about whether RAID might need to be adjusted, or replaced, to support SSDs.
Ray wrote his post after reading Antifragile: Things That Gain from Disorder, a new bestseller by Nassim Nicholas Taleb, author of "The Black Swan." Taleb's premise is that there are systems that not only tolerate faults and other stressors but actually improve because of them. Fragile systems, by contrast, fail when stressed. Basically, antifragile systems are the embodiment of "that which doesn't kill you makes you stronger."
The problem Ray is concerned about is that limited write endurance would result in multiple SSDs failing in close enough succession to cause data loss. Because RAID was designed to deal with disk drives, which don't really wear out but just fail randomly, Ray argues that we should change SSDs to increase the normality, or randomness, of their failures. This would spread out failures over time. As a result, the RAID system could rebuild one failed SSD, and the operator could replace the failed drive before a second SSD failed.
Ray then suggests that storage professionals and the SSD vendors make some changes to our practices to spread the failures across enough time to make RAID work well:
• Intermix older and newer (fresher) SSDs in a single RAID so they don't all fail together
• Avoid writing the same amount of data to multiple SSDs by mirroring SSDs or wide striping even across multiple RAIDsets
• Mix SSDs with different write endurance levels in the same RAIDset
• Eliminate SSD wear leveling, using defect skipping instead
These changes, especially eliminating wear leveling, would increase the random distribution of SSD failures. The problem is, they would also reduce the life of the SSDs. I for one would rather have a pile of SSDs that will all fail after 30,000 cycles than a pile that will fail randomly between 10,000 and 25,000 cycles.
Just about every SSD has SMART (Self-Monitoring, Analysis and Reporting Technology) counters that report not only how many flash pages have failed, but also the remaining percentage of the device's promised write endurance.
If our RAID controllers (and their software equivalents, as we move to software-defined storage) simply monitored these counters, they could send the operator a message; even better, they could send a message to the vendor's support group. The drives could be replaced before they exhaust their write endurance.
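To make the idea concrete, here is a minimal sketch of that kind of check. It assumes the controller has already read each drive's counters into a simple record; the `DriveHealth` type, its field names and the 10 percent threshold are all hypothetical, and real SMART attribute names and scales vary by vendor.

```python
from dataclasses import dataclass


@dataclass
class DriveHealth:
    # Hypothetical, already-normalized view of a drive's SMART counters.
    name: str
    life_remaining_pct: int  # assumed scale: 100 = new, 0 = endurance used up
    failed_pages: int


def drives_needing_replacement(drives, min_life_pct=10):
    """Return names of drives at or below the remaining-endurance threshold."""
    return [d.name for d in drives if d.life_remaining_pct <= min_life_pct]


# Made-up readings for a three-drive array.
array = [
    DriveHealth("ssd0", life_remaining_pct=62, failed_pages=3),
    DriveHealth("ssd1", life_remaining_pct=8, failed_pages=41),
    DriveHealth("ssd2", life_remaining_pct=57, failed_pages=5),
]

for name in drives_needing_replacement(array):
    # A real controller would notify the operator or the vendor's support group here.
    print(f"ALERT: {name} is near the end of its rated write endurance")
```

The point of the sketch is that the decision to replace a drive is driven by a counter the drive already reports, not by waiting for a failure and a rebuild.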
Part of our problem is that we've been running RAID so long it's become ingrained in our basic idea of storage. But RAID itself was a solution to a problem. Patterson, Gibson and Katz originally proposed RAID because making disk drives bigger and faster was getting too expensive. RAID was designed so that an array of inexpensive drives could be bigger and faster than a SLED (Single Large Expensive Drive).
SSDs are plenty fast, so we rarely need RAID to increase their speed. However, our reliability expectations have also been raised, so we do need some sort of redundancy to make them more reliable. I say it's better to accept that flash is different from disk and change our software than to sacrifice flash's advantages to make it fit a 25-year-old RAID design.
Those building post-RAID data protection schemes for SSDs and hybrid environments should modify the old mirroring, parity and double parity schemes not only to protect against a device failure, but also to minimize the write amplification they create in the process. Avoiding writes, especially small writes, will extend SSD life and therefore reliability.
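A rough way to see why small writes hurt is to count device writes per block of user data. The sketch below assumes the classic read-modify-write update for a small write (the new data block plus every parity block gets rewritten) and compares it with a full-stripe write, where parity is written once for the whole stripe. The function names are mine, and the count covers only array-level writes, not the additional write amplification inside each SSD.

```python
def small_write_device_writes(parity_blocks):
    """Device writes for one small logical write: the data block plus each parity block."""
    return 1 + parity_blocks


def full_stripe_device_writes_per_block(data_drives, parity_blocks):
    """Device writes per logical block when a whole stripe is written at once."""
    return (data_drives + parity_blocks) / data_drives


# Assumed layout for illustration: 8 data drives per stripe.
for label, parity in (("single parity", 1), ("double parity", 2)):
    small = small_write_device_writes(parity)
    full = full_stripe_device_writes_per_block(8, parity)
    print(f"{label}: {small} writes per small write, {full} per block in a full stripe")
```

Under these assumptions, a scheme that gathers small writes into full stripes before committing them puts far fewer writes on the flash than one that updates parity on every small write.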
Rather than eliminating wear leveling, which would be difficult if not impossible given that SSD controllers need to constantly write data to blank pages, designers of these schemes should extend it to even out wear not only across the flash in an SSD, but also across the SSDs in the system.
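As a toy illustration of system-level wear leveling, the sketch below steers each new write to whichever SSD has consumed the fewest cycles so far. The wear figures and the `pick_target` and `place_writes` names are invented for the example, and it ignores the placement constraints a real mirroring or parity layout would impose.

```python
def pick_target(wear):
    """Return the name of the least-worn SSD (ties broken by name)."""
    return min(wear, key=lambda name: (wear[name], name))


def place_writes(wear, n_writes):
    """Send each write to the least-worn SSD and return the resulting wear counts."""
    wear = dict(wear)  # work on a copy so the caller's counts are untouched
    for _ in range(n_writes):
        wear[pick_target(wear)] += 1
    return wear


# Made-up starting wear: ssd0 is the most worn, ssd1 the least.
start = {"ssd0": 120, "ssd1": 100, "ssd2": 110}
end = place_writes(start, 30)
print(end)
```

With a policy like this the less-worn drives absorb new writes until the drives converge, which is the same idea a controller applies to flash blocks inside a single SSD.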
We'll get higher availability if we focus on overall failure avoidance instead of enhanced failure recovery.
Am I onto something here, or is Ray's approach more sensible? Feel free to pick sides in the comments section.