Putting Erasure Codes To Work

Erasure coding is used in scale-out object storage systems from a variety of vendors and open-source software projects. Just don't expect it to solve all your data protection problems.

Howard Marks

December 16, 2014


In my last blog post, I explained how advanced erasure codes can provide a higher level of reliability than more common RAID techniques. In this post, I’ll look at how vendors and open-source software projects are using erasure coding today, what we can look forward to in the future, and why erasure codes aren’t the data protection panacea some have made them out to be.

Erasure codes -- or, more specifically, erasure codes that provide data protection beyond double parity -- are today primarily used in scale-out object storage systems. These systems distribute the erasure-encoded data blocks across multiple storage nodes to provide protection against not just drive failures, but node failures as well. Since object stores frequently hold hundreds of terabytes to petabytes of data, the 20% to 40% overhead of erasure coding lets operators save racks of storage nodes compared to the alternative of three- or four-way mirroring/replication.
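To put that capacity math in perspective, here's a minimal sketch in plain Python. The 1 PB figure and the 10-of-16 layout (borrowed from the example later in this post) are purely illustrative:

```python
# Rough capacity math: 3-way replication vs. a 10-of-16 erasure code.
# The 1 PB of user data and the 10-of-16 layout are illustrative numbers.

def raw_capacity_needed(user_data_tb, data_strips, total_strips):
    """Raw TB required when each block is split into data_strips strips
    and expanded to total_strips coded strips."""
    return user_data_tb * total_strips / data_strips

user_data_tb = 1000                                      # 1 PB of user data

replicated = user_data_tb * 3                            # three full copies
erasure = raw_capacity_needed(user_data_tb, 10, 16)      # 10-of-16 code

print(f"3-way replication: {replicated:.0f} TB raw")     # 3000 TB
print(f"10-of-16 erasure:  {erasure:.0f} TB raw")        # 1600 TB
print(f"redundant share:   {(16 - 10) / 16:.0%}")        # 38% of raw capacity
```

At petabyte scale, the difference between those two raw-capacity numbers is the racks of storage nodes mentioned above.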

Over the past year or two, most object stores, from commercial solutions like Data Direct Networks WOS and Caringo Swarm to open-source projects such as Ceph and Swift, have joined the pioneers of erasure coding, Cleversafe and Amplidata. Some object stores, like Ceph, limit the erasure coding to a single storage pool and rely on replication between storage pools, and therefore datacenters, to provide geographic protection.

The most sophisticated systems extend the erasure coding scheme to disperse encoded data chunks across multiple datacenters. A system using a 10-of-15 encoding scheme could store three chunks in each of five datacenters. This would allow the system to survive a datacenter failure with less than 40% storage overhead.
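As a quick sanity check on that layout, here's a small sketch (same assumed numbers: a 10-of-15 code with three chunks in each of five datacenters) confirming that losing any one datacenter still leaves enough chunks to rebuild the data:

```python
# Survivability math for a dispersed 10-of-15 layout:
# three chunks in each of five datacenters; any 10 chunks rebuild an object.

data_needed   = 10
chunks_per_dc = 3
datacenters   = 5
total_chunks  = chunks_per_dc * datacenters              # 15

surviving_after_dc_loss = total_chunks - chunks_per_dc   # 12
assert surviving_after_dc_loss >= data_needed            # data is still readable

redundant_share = (total_chunks - data_needed) / total_chunks
print(f"{redundant_share:.0%} of raw capacity is redundancy")   # 33%
```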

There’s no such thing as a free lunch, and storage system architects do have to pay a price for the high reliability and storage efficiency of erasure coding. The most obvious cost is compute power: calculating Reed-Solomon or turbo codes takes a lot more horsepower than simple parity, so systems using erasure coding need more CPU cores per PB than those using simple RAID. Luckily, the ceaseless increases in compute power predicted by Moore’s Law have made those cores readily available.

But, unfortunately, regardless of how much CPU horsepower we throw at them, erasure codes will also have higher latency and require more back-end storage I/O operations than simpler data protection schemes like replication or parity RAID. Under normal conditions, a conventional RAID system can just read the data it needs, leaving its parity strips to be read only when it can’t read a data strip.

An erasure-coded system, even one using erasure codes across local drives, has to read at least as many strips as the code needs to recover a data block and then recalculate the original data from them. For data encoded in, say, a 10-of-16 scheme, even the smallest read requires 10 I/O operations on the back-end storage plus a delay for the calculations.

When writing data, the latency and I/O amplification created by erasure coding are even worse. Imagine a database writing random 8K blocks to a storage system that uses a 10-of-16 encoding scheme. To write 8K of data, the system has to read 10 strips, each of which is at least 4K to match today's drive sector size, recalculate the coded strips, and write 16 strips back out, turning one I/O request into 26 I/O operations on the back end.
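To make the amplification concrete, here's a minimal sketch (plain Python, using the 10-of-16 scheme and 4K strips from the example above) that counts the back-end operations behind one small read and one small overwrite:

```python
# Back-end I/O amplification for a small read and a small random overwrite
# under a k-of-n erasure code. Figures match the 10-of-16 / 4K example above;
# real systems vary in how they batch, cache, and log these operations.

DATA_STRIPS  = 10    # k: strips needed to reconstruct a block
TOTAL_STRIPS = 16    # n: strips stored per block
STRIP_BYTES  = 4096  # matches a 4K drive sector

def backend_ops_for_read():
    # Even a tiny read must gather k strips before the block can be decoded.
    return DATA_STRIPS

def backend_ops_for_overwrite():
    # Read k strips, re-encode, then write all n strips back out.
    return DATA_STRIPS + TOTAL_STRIPS

print("small read:  ", backend_ops_for_read(), "back-end I/Os")       # 10
print("8K overwrite:", backend_ops_for_overwrite(), "back-end I/Os")  # 26
```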

In today’s scale-out object systems, the access node responding to a read request sends requests for all 16 strips to the storage nodes holding them and uses the first 10 responses to recalculate the data. Pulling all those data strips across the network adds up to a significant amount of network traffic.
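Here's a sketch of that fan-out, take-the-first-10 read path, using Python's asyncio with simulated storage nodes. The node latencies are invented for illustration; a real access node would be issuing network requests and then decoding the strips it gets back:

```python
# Sketch of the access-node read path: request all n strips in parallel,
# keep the first k responses, cancel the rest. Latencies are simulated.
import asyncio
import random

K, N = 10, 16

async def fetch_strip(node_id):
    # Stand-in for a network round trip to one storage node.
    await asyncio.sleep(random.uniform(0.005, 0.050))
    return node_id, f"strip-{node_id}".encode()

async def read_block():
    tasks = [asyncio.create_task(fetch_strip(i)) for i in range(N)]
    strips = []
    for finished in asyncio.as_completed(tasks):
        strips.append(await finished)
        if len(strips) == K:          # enough strips to rebuild the block
            break
    for t in tasks:                   # the stragglers are no longer needed
        t.cancel()
    return strips

strips = asyncio.run(read_block())
print("decoding from strips:", sorted(node_id for node_id, _ in strips))
```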

While today’s high-bandwidth, low-latency datacenter networks minimize this impact for local access, systems using dispersal codes to spread data strips across multiple datacenters will be performance-limited by the bandwidth and latency of their WAN connections. Since these systems request all the data strips at once and use the first ones to arrive, they consume network bandwidth sending data strips that won’t be needed to reconstruct the original data.

I’m hoping that one or more of the vendors making dispersal-based systems comes up with a configuration where enough data strips to satisfy reads are stored in a primary datacenter, and strips from remote datacenters are recalled only as needed. Such a system with a 10-of-20 coding scheme could keep 10 strips in the primary datacenter and five in each of two remote datacenters. The system could survive the loss of a datacenter with just 50% overhead.
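Here's what that hypothetical read policy might look like in outline -- a thought experiment, not any vendor's implementation, using the 10-of-20 layout just described:

```python
# Hypothetical read policy: a 10-of-20 code with 10 strips in the primary
# datacenter and 5 in each of two remote sites. Local strips are tried
# first; remote strips are pulled only to cover local failures.

K = 10
PLACEMENT = {"primary": 10, "remote-a": 5, "remote-b": 5}

def strips_to_request(local_strips_available):
    """Return how many strips to fetch from each site for one read."""
    plan = {"primary": min(local_strips_available, K)}
    shortfall = K - plan["primary"]
    for site in ("remote-a", "remote-b"):
        take = min(shortfall, PLACEMENT[site])
        plan[site] = take
        shortfall -= take
    return plan

print(strips_to_request(10))  # healthy: all 10 strips come from the primary DC
print(strips_to_request(7))   # 3 local strips lost: pull 3 from remote-a
```

Under normal conditions, reads would never leave the primary datacenter; the WAN would only carry strips when local copies are missing.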

Regardless of how they’re implemented, high-reliability erasure codes are best suited to those applications that do large I/Os, just like the object stores where they’ve advanced from cutting-edge technology a few years ago to a standard feature today. Those looking to use them for more transactional applications will be disappointed at their performance.

Disclosure: Amplidata has been a client of DeepStorage LLC, and Mark Garos, CEO of Caringo, bought me a nice dinner the last time I saw him in New York.

About the Author(s)

Howard Marks

Network Computing Blogger

Howard Marks is founder and chief scientist at DeepStorage LLC, a storage consultancy and independent test lab based in Santa Fe, N.M., concentrating on storage and data center networking. In more than 25 years of consulting, Marks has designed and implemented storage systems, networks, management systems and Internet strategies at organizations including American Express, J.P. Morgan, Borden Foods, U.S. Tobacco, BBDO Worldwide, Foxwoods Resort Casino and the State University of New York at Purchase. The testing at DeepStorage Labs is informed by that real-world experience.

He has been a frequent contributor to Network Computing and InformationWeek since 1999 and a speaker at industry conferences including Comnet, PC Expo, Interop and Microsoft's TechEd since 1990. He is the author of Networking Windows and co-author of Windows NT Unleashed (Sams).

He is co-host, with Ray Lucchesi, of the monthly Greybeards on Storage podcast, where the voices of experience discuss the latest issues in the storage world with industry leaders. You can find the podcast at: http://www.deepstorage.net/NEW/GBoS
