Howard Marks

Network Computing Blogger



In a recent blog post, my friend and fellow storage analyst Ray Lucchesi suggested we may be living dangerously by combining today's SSDs, with their limited write endurance, and RAID. Ray suggests changes to SSDs to make them work better with RAID.

I think Ray has it backward. We shouldn't be worried about how to make SSDs work better with RAID. We should think about whether RAID might need to be adjusted, or replaced, to support SSDs.


Ray wrote his post after reading "Antifragile: Things That Gain from Disorder," a new bestseller by Nassim Nicholas Taleb, author of "The Black Swan." Taleb's premise is that there are systems that not only tolerate faults and other stressors but actually improve because of them. Fragile systems, by contrast, fail when stressed. Basically, antifragile systems are the embodiment of "that which doesn't kill you makes you stronger."

[Critical business apps can take advantage of SSD speeds. Find out how in "Flash Balances The Books With Atomic Writes."]

The problem that concerns Ray is that limited write endurance could cause multiple SSDs to fail in close enough succession to cause data loss. Because RAID was designed for disk drives, which don't really wear out but just fail randomly, Ray argues that we should change SSDs to increase the normality, or randomness, of their failures. This would spread failures out over time, so the RAID system could rebuild one failed SSD, and the operator could replace it, before a second SSD failed.
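The correlation Ray worries about is easy to see with a toy simulation. The endurance figures and daily write load below are hypothetical, but they show why drives in a RAID set wearing at the same rate exhaust their endurance together, while randomized endurance spreads the failures out:

```python
# Toy model: SSDs in a RAID set see nearly identical write loads, so drives
# with the same rated endurance hit wear-out on the same day. Randomizing
# endurance (Ray's suggestion) spreads failures out. All numbers are
# hypothetical illustrations, not measurements of real drives.
import random

random.seed(42)

RATED_CYCLES = 30_000   # assumed program/erase endurance per drive
DAILY_CYCLES = 10       # assumed average P/E cycles consumed per day

def failure_days(endurances):
    """Day on which each drive exhausts its endurance under equal load."""
    return sorted(e // DAILY_CYCLES for e in endurances)

# Case 1: identical drives under identical wear -- all fail the same day.
uniform = [RATED_CYCLES] * 8
# Case 2: endurance randomized between 10,000 and 25,000 cycles.
randomized = [random.randint(10_000, 25_000) for _ in range(8)]

u = failure_days(uniform)
r = failure_days(randomized)
print("uniform   :", u)   # all eight drives fail on day 3000
print("randomized:", r)   # failures typically spread over many days
```

The uniform fleet gives RAID no rebuild window at all, which is exactly the scenario Ray wants to avoid.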

Ray then suggests that storage professionals and the SSD vendors make some changes to our practices to spread the failures across enough time to make RAID work well:

• Intermix older and newer (fresher) SSDs in a single RAID so they don't all fail together

• Avoid writing the same amount of data to multiple SSDs by mirroring SSDs or wide striping even across multiple RAIDsets

• Mix SSDs with different write endurance levels in the same RAIDset

• Eliminate SSD wear leveling, using defect skipping instead

These changes, especially eliminating wear leveling, would increase the random distribution of SSD failures. The problem is, they would also reduce the life of the SSDs. I for one would rather have a pile of SSDs that will all fail after 30,000 cycles than those that will fail randomly between 10,000 and 25,000 cycles.
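The cost of that tradeoff is easy to put in back-of-the-envelope terms, using the article's hypothetical endurance figures:

```python
# Rough comparison of total fleet endurance: a pile of drives that each last
# 30,000 cycles vs. drives randomized between 10,000 and 25,000 cycles.
# These are the article's illustrative numbers, not vendor specs.
drives = 8
uniform_cycles = 30_000
random_lo, random_hi = 10_000, 25_000
expected_random = (random_lo + random_hi) / 2   # 17,500 cycles on average

uniform_total = drives * uniform_cycles         # total drive-cycles available
random_total = drives * expected_random
given_up = 1 - random_total / uniform_total

print(f"uniform fleet   : {uniform_total:,} drive-cycles")
print(f"randomized fleet: {random_total:,.0f} drive-cycles")
print(f"endurance given up for randomness: {given_up:.0%}")
```

Under these assumptions, randomizing failures throws away roughly 40% of the fleet's total write endurance, which is the price of making RAID's old assumptions hold.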

Just about every SSD has SMART (Self-Monitoring, Analysis and Reporting Technology) counters that report not only how many flash pages have failed, but also the remaining percentage of the device's promised write endurance.

If our RAID controllers (and their software equivalents, as we move to software-defined storage) simply monitored these counters, they could send the operator a message; even better, they could send a message to the vendor's support group. The drives could be replaced before they exhaust their write endurance.
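A minimal sketch of that monitoring loop is below. The `smartctl -j -A` invocation (JSON output, smartmontools 7 and later) and the NVMe health-log field are real, but the alert threshold is an assumption, and many SATA SSDs report wear under vendor-specific attribute IDs instead, so check your drives' documentation:

```python
# Sketch of the monitoring the article proposes: poll each SSD's
# remaining-endurance counter and alert well before it reaches zero.
import json
import subprocess

ALERT_THRESHOLD = 10  # assumed policy: alert below 10% endurance remaining

def remaining_endurance_pct(smart_json):
    """Extract percentage of rated endurance remaining from smartctl JSON.

    NVMe drives report 'percentage_used' in the SMART/Health log; SATA SSDs
    use vendor-specific attributes, which this sketch does not cover.
    """
    nvme = smart_json.get("nvme_smart_health_information_log")
    if nvme and "percentage_used" in nvme:
        return max(0, 100 - nvme["percentage_used"])
    return None  # unknown drive type: fall back to vendor tooling

def check_drive(device):
    out = subprocess.run(
        ["smartctl", "-j", "-A", device],  # -j: JSON output, smartmontools 7+
        capture_output=True, text=True, check=True).stdout
    pct = remaining_endurance_pct(json.loads(out))
    if pct is not None and pct < ALERT_THRESHOLD:
        return f"REPLACE {device}: only {pct}% endurance left"
    return f"{device}: ok"

# Example using canned data instead of a live device:
sample = {"nvme_smart_health_information_log": {"percentage_used": 93}}
print(remaining_endurance_pct(sample))  # -> 7, which would trigger an alert
```

Wiring the alert into the vendor's phone-home channel, as the article suggests, is then just a transport detail.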

Part of our problem is that we've been running RAID so long it's become ingrained in our basic idea of storage. But RAID itself was a solution to a problem. Patterson, Gibson and Katz originally proposed RAID because making disk drives bigger and faster was getting too expensive. RAID was designed so that an array of inexpensive drives could be bigger and faster than a SLED (Single Large Expensive Drive).

SSDs are plenty fast, so we rarely need RAID to increase their speed. However, our reliability expectations have also risen, so we do need some sort of redundancy to make SSDs more reliable. I say it's better to accept that flash is different from disk and change our software than to sacrifice flash's advantages to make it fit a 25-year-old RAID design.

Those building post-RAID data protection schemes for SSDs, and hybrid environments, should modify the old mirroring, parity and double parity schemes to not only protect against a device failure, but also to minimize the amount of write amplification they create in the process. Avoiding writes, especially small writes, will extend SSD life and therefore reliability.
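The device-write cost of the classic schemes makes the point concrete. The figures below are the textbook per-host-write costs (counting device writes only, ignoring the reads a parity update also needs); the function name is mine:

```python
# Device writes generated per small host write under common redundancy
# schemes -- the write amplification the article says post-RAID designs
# should minimize. Read I/Os from parity read-modify-write are ignored here.
def device_writes_per_host_write(scheme, data_drives=4):
    costs = {
        "mirror": 2.0,             # write both copies
        "raid5_small": 2.0,        # new data block + updated parity block
        "raid6_small": 3.0,        # new data block + two parity updates
        # a full-stripe write amortizes parity across the whole stripe
        "raid5_full_stripe": (data_drives + 1) / data_drives,
    }
    return costs[scheme]

for s in ("mirror", "raid5_small", "raid6_small", "raid5_full_stripe"):
    print(f"{s:18s} {device_writes_per_host_write(s):.2f} device writes per host write")
```

A log-structured layout that turns small writes into full-stripe writes cuts the parity overhead from 3.0 device writes to 1.25 in this example, which is exactly the kind of flash-friendly change the old schemes need.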

Rather than eliminating wear leveling, which would be difficult if not impossible given that SSD controllers need to constantly write data to blank pages, they should extend it to even wear across not only the flash in an SSD, but across the SSDs in the system.
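System-level wear leveling of this kind amounts to a placement policy: steer each new write to the least-worn device in the pool. A minimal sketch, with hypothetical drive names and wear figures:

```python
# Sketch of pool-wide wear leveling as the article suggests: new writes go
# to the SSD that has consumed the smallest fraction of its endurance, so
# the whole pool ages evenly. Drive names and wear values are hypothetical.
import heapq

class WearAwarePool:
    def __init__(self, wear_pct_by_drive):
        # min-heap ordered by percentage of endurance already consumed
        self._heap = [(wear, name) for name, wear in wear_pct_by_drive.items()]
        heapq.heapify(self._heap)

    def place_write(self, wear_cost=0.01):
        """Choose the least-worn drive and charge it for this write."""
        wear, name = heapq.heappop(self._heap)
        heapq.heappush(self._heap, (wear + wear_cost, name))
        return name

pool = WearAwarePool({"ssd0": 42.0, "ssd1": 17.5, "ssd2": 23.0})
print(pool.place_write())  # -> "ssd1", the least-worn drive in the pool
```

A real implementation would refresh the wear figures from the SMART counters rather than estimating a per-write cost, but the placement logic is the same.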

We'll get higher availability if we focus on overall failure avoidance instead of enhanced failure recovery.

Am I onto something here, or is Ray's approach more sensible? Feel free to pick sides in the comments section.
