More on Advanced Erasure Codes

As we've previously discussed in "What Comes After RAID? Erasure Codes," forward error correction coding is a leading contender to replace parity RAID as disk hardware evolves past the point where parity provides effective protection. The question remains: Are Reed-Solomon and related coding techniques the inevitable replacement for parity in the RAID systems of the future?

Howard Marks

January 10, 2011

I've had some interesting conversations on the subject with fellow members of the storage cognoscenti. My friend, and soon-to-be fellow Network Computing blogger, Stephen Foskett thinks that high-level erasure codes are ready for primary storage. While I'm relatively satisfied that advanced erasure codes are a better solution than even double parity for secondary storage applications like backups and archives, I still have a few reservations about them for latency-sensitive applications such as online transaction processing databases.

I'm most concerned about the overhead, and the resulting latency, that advanced erasure codes add to small writes. In any RAID scheme beyond mirroring, the system's write behavior depends on whether the data being written is smaller or larger than a full stripe across the entire RAID set. For large write requests--like those in backups, digital video or other multimedia applications--writing data across X drives generates X+N disk I/Os, where N is the number of additional ECC blocks. So for RAID 5 it's X+1, for RAID 6 it's X+2, and for a Reed-Solomon scheme like Cleversafe's 10 of 16 coding it's 10+6.
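To put numbers on that, here's a quick back-of-the-envelope sketch in Python. The eight-drive data width in the RAID 5 and RAID 6 rows is my own assumption for illustration; the 10 of 16 figures are the Cleversafe example above.

# Disk I/Os generated by a full-stripe write under different layouts.
# X is the number of data drives, N the number of additional ECC chunks.
def full_stripe_write_ios(data_drives: int, ecc_chunks: int) -> int:
    """A full-stripe write touches every data drive plus every ECC chunk."""
    return data_drives + ecc_chunks

layouts = {
    "RAID 5 (X+1)":        (8, 1),   # eight data drives assumed for illustration
    "RAID 6 (X+2)":        (8, 2),
    "Cleversafe 10 of 16": (10, 6),
}
for name, (x, n) in layouts.items():
    print(f"{name}: {full_stripe_write_ios(x, n)} disk I/Os per full-stripe write")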

Things get more complicated when the amount of data to be written is smaller than a full stripe--X times the per-drive chunk size, which is commonly 4-128KB. To write a single 4K database page, a RAID 5 system has to read the old data and parity blocks (or the rest of the stripe), recalculate the parity and then write the new data and parity back. The 10 of 16 system has to read 10 chunks, re-encode the stripe and then write all 16. Add in that the processor on the storage system has a lot more math to do, and this process can add substantial latency to every small write.
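A rough sketch of the difference, assuming the classic read-modify-write path for RAID 5 (read old data and old parity, write new data and new parity) and a full read-and-re-encode for the dispersed layout:

# Small-write penalty, counted as (reads, writes) per 4K update.
def raid5_small_write_ios() -> tuple[int, int]:
    return (2, 2)            # read old data + old parity, write new data + new parity

def erasure_small_write_ios(k: int, n: int) -> tuple[int, int]:
    return (k, n)            # read k chunks to decode the stripe, rewrite all n

print("RAID 5 4K write:  ", raid5_small_write_ios())          # (2, 2)  -> 4 I/Os
print("10 of 16 4K write:", erasure_small_write_ios(10, 16))  # (10, 16) -> 26 I/Os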

One solution to this is to front end the erasure code logic with a write-in-free-space file system, like WAFL, ZFS or Nimble Storage's CASL, that uses logging to maximize full-stripe writes. Since the file system already separates a block's logical location from its physical location, it can coalesce small writes into a journal, which can be stored in flash, and then write multiple, otherwise unrelated updates to disk as full-stripe writes. Since these systems typically use redirect-on-write snapshot technology, they want to leave the old data in place for snapshots anyway.
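Here's a minimal sketch of the idea--not any vendor's actual implementation--with the chunk and stripe sizes picked purely for illustration:

# Coalesce small writes in a journal (NVRAM/flash in a real system) and flush
# only full stripes to the erasure-code layer.
CHUNK_SIZE = 64 * 1024                   # per-drive chunk size, assumed
DATA_CHUNKS = 10                         # e.g. a 10 of 16 layout
STRIPE_SIZE = CHUNK_SIZE * DATA_CHUNKS

class WriteCoalescer:
    def __init__(self, write_full_stripe):
        self.journal = bytearray()
        self.write_full_stripe = write_full_stripe
    def write(self, data: bytes) -> None:
        """Append a small write to the journal; flush whenever a stripe fills."""
        self.journal += data
        while len(self.journal) >= STRIPE_SIZE:
            stripe = bytes(self.journal[:STRIPE_SIZE])
            del self.journal[:STRIPE_SIZE]
            self.write_full_stripe(stripe)   # one full-stripe write to disk

flushed = []
c = WriteCoalescer(lambda s: flushed.append(len(s)))
for _ in range(160):                     # 160 unrelated 4K writes...
    c.write(b"\x00" * 4096)
print(flushed)                           # ...become exactly one 640K full-stripe write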

Then there's the decoding overhead. Conventional RAID systems store the original data plus additional parity blocks. Under normal conditions, the system satisfies read requests by reading the original data from the disks that hold it and ignoring the parity, so the compute overhead in the RAID controller for reads is minimal.

With advanced erasure codes, the data and the forward error correction information are encoded together across every chunk. To recover the data, the system has to retrieve at least the minimum number of chunks its coding scheme requires and then decode them. So a Cleversafe system has to retrieve 10 of the 16 chunks it originally stored, and decode them, to satisfy a read request.
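To show what "decode" means without dragging in a full Reed-Solomon library, here's a simplified k-of-n sketch that uses Lagrange interpolation over a small prime field. It works on lists of small integers rather than real disk blocks, and it is not Cleversafe's actual dispersal algorithm--just an illustration that any 10 of the 16 stored chunks are enough to rebuild the data, and that rebuilding takes real arithmetic:

P = 257  # prime field; symbols are integers smaller than 257

def _lagrange_eval(points, x):
    """Evaluate the unique degree < k polynomial through `points` at x (mod P)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total

def encode(data, n):
    """Encode k data symbols into n chunks (the first k chunks are the data itself)."""
    points = list(zip(range(len(data)), data))
    return [_lagrange_eval(points, x) for x in range(n)]

def decode(chunks, k):
    """Rebuild the k data symbols from any k of the n chunks (index -> symbol)."""
    points = list(chunks.items())[:k]
    return [_lagrange_eval(points, x) for x in range(k)]

data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]      # k = 10 symbols
stored = encode(data, 16)                              # n = 16 chunks, 10 of 16
survivors = {i: c for i, c in enumerate(stored) if i not in {0, 3, 5, 8, 12, 15}}
assert decode(survivors, 10) == data                   # any 10 chunks will do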

The requirement to always retrieve a minimum number of chunks and then decode them substantially increases the compute load on a system using advanced erasure codes. It also increases the overhead of small writes, since the data that isn't being overwritten has to be decoded, combined with the new data and re-encoded.
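Reusing the toy encode and decode helpers (and the stored chunks) from the sketch above, the whole read-decode-merge-re-encode cycle for a small overwrite looks like this:

def small_write(stored, k, offset, new_symbols):
    """Overwrite part of an erasure-coded stripe: decode, merge, re-encode everything."""
    n = len(stored)
    data = decode(dict(enumerate(stored)), k)              # 1) read k chunks and decode
    data[offset:offset + len(new_symbols)] = new_symbols   # 2) merge in the new data
    return encode(data, n)                                 # 3) re-encode and rewrite all n chunks

stored = small_write(stored, 10, 2, [99, 98])              # overwrite two symbols
assert decode(dict(enumerate(stored)), 10)[2:4] == [99, 98]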

As a result, all the systems we've discussed that use high-level erasure codes also share a scale-out architecture. Having a Xeon processor for every four to 18 disk drives gives these systems the compute horsepower to encode and decode data with these more sophisticated ECC methods--horsepower that a conventional midrange array, with four Xeons for 800 or more disk drives, just doesn't have.
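The ratio is worth spelling out (the drive counts are the ones above; this is a rough comparison, not a benchmark):

# CPUs-to-drives arithmetic behind the scale-out argument.
scale_out = (4, 18)          # one Xeon per 4 to 18 drives
midrange = 800 / 4           # four Xeons for 800 or more drives -> 200+ drives per Xeon
print(f"Scale-out:      one CPU per {scale_out[0]}-{scale_out[1]} drives")
print(f"Midrange array: one CPU per {midrange:.0f}+ drives")
# Roughly 10x to 50x more CPU per drive available for encode/decode work.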

About the Author(s)

Howard Marks

Network Computing Blogger

Howard Marks is founder and chief scientist at Deepstorage LLC, a storage consultancy and independent test lab based in Santa Fe, N.M., concentrating on storage and data center networking. In more than 25 years of consulting, Marks has designed and implemented storage systems, networks, management systems and Internet strategies at organizations including American Express, J.P. Morgan, Borden Foods, U.S. Tobacco, BBDO Worldwide, Foxwoods Resort Casino and the State University of New York at Purchase. The testing at DeepStorage Labs is informed by that real-world experience.

He has been a frequent contributor to Network Computing and InformationWeek since 1999 and a speaker at industry conferences including Comnet, PC Expo, Interop and Microsoft's TechEd since 1990. He is the author of Networking Windows and co-author of Windows NT Unleashed (Sams).

He is co-host, with Ray Lucchesi, of the monthly Greybeards on Storage podcast, where the voices of experience discuss the latest issues in the storage world with industry leaders. You can find the podcast at: http://www.deepstorage.net/NEW/GBoS
