More on Advanced Erasure Codes
January 10, 2011
As we've previously discussed in "What Comes After RAID? Erasure Codes," forward error correction coding is a leading contender to replace parity RAID as disk hardware evolves past the point where parity provides effective protection. The question remains: Are Reed-Solomon and related coding techniques the inevitable replacement for parity in the RAID systems of the future?
I've had some interesting conversations on the subject with some fellow members of the storage cognoscenti. My friend, and soon-to-be fellow Network Computing blogger, Stephen Foskett thinks that high-level erasure codes are ready for primary storage. While I'm relatively satisfied that advanced erasure codes are a better solution than even double parity for secondary storage applications like backups and archives, I have a few reservations about them for latency-sensitive applications such as online transaction processing databases.
I'm most concerned about the overhead, and resulting latency, that advanced erasure codes present for small writes. In any RAID system beyond mirroring, the system's write behavior depends on whether the data being written is smaller or larger than a full stripe across the entire RAID set. For large write requests--like those in backups, digital video or other multimedia applications--writing full stripes of data across X drives generates X+N disk I/Os, where N is the number of additional ECC blocks. So for RAID 5 it's X+1, for RAID 6 it's X+2, and for a Reed-Solomon scheme like Cleversafe's 10 of 16 coding it's 10+6.
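To put rough numbers on that, here's a minimal Python sketch of the full-stripe write I/O count for each scheme; the eight-data-drive parity RAID layouts are assumed examples, not any particular array.

# Disk I/Os for a full-stripe write: one write per data chunk plus one per ECC chunk.
def full_stripe_write_ios(x, n):
    """x data chunks striped across x drives, n additional ECC chunks."""
    return x + n

print(full_stripe_write_ios(x=8, n=1))    # RAID 5, 8 data drives + 1 parity: 9 writes
print(full_stripe_write_ios(x=8, n=2))    # RAID 6, 8 data drives + 2 parity: 10 writes
print(full_stripe_write_ios(x=10, n=6))   # 10 of 16 dispersed coding: 16 writes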
Things get more complicated when the amount of data to be written is smaller than a full stripe--X times the chunk size, which is commonly 4-128KB, for the array. To write one 4K database page, a RAID 5 system has to read the old data block and the old parity block, recalculate the parity, and then write the new data and new parity back--the classic small-write penalty of four I/Os per page update. On the 10 of 16 system it would have to read 10 chunks and then write 16. Add in that the processor on the storage system has a lot more math to do, and this process could add substantial latency to every small write.
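Here's a similarly rough sketch of the small-write penalty, assuming read-modify-write for parity RAID and a worst-case full decode and re-encode cycle for the 10 of 16 code; real systems may cache or optimize some of this away.

# Disk I/Os for a single 4KB page update that lands inside an existing stripe.
def parity_small_write_ios(parity_blocks):
    # Read-modify-write: read the old data block and each old parity block,
    # then write the new data block and each new parity block.
    return (1 + parity_blocks) * 2

def dispersed_small_write_ios(k, n):
    # Assumed worst case with no stripe cache: read any k chunks to
    # reconstruct the stripe, re-encode it, then rewrite all n chunks.
    return k + n

print(parity_small_write_ios(1))          # RAID 5: 4 I/Os
print(parity_small_write_ios(2))          # RAID 6: 6 I/Os
print(dispersed_small_write_ios(10, 16))  # 10 of 16 coding: 26 I/Os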
One solution to this is to front end the erasure code logic with a write-in-free-space (write-anywhere) file system, like WAFL, ZFS or Nimble Storage's CASL, that uses logging to maximize full-stripe writes. Since the file system already separates a block's logical location from its physical location, it can coalesce small writes into a journal, which can be stored in flash, and then write multiple, otherwise unrelated blocks to disk as full-stripe writes. Since these systems often use redirect-on-write snapshot technology, they want to leave the old data in place for snapshots anyway.
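As a loose illustration of the idea--not WAFL's, ZFS's or CASL's actual code--here's a sketch of a front end that journals small writes and only hands full stripes to the erasure-code layer; the stripe geometry and the write_full_stripe() callback are assumptions.

# Minimal sketch of a log-structured front end that coalesces small writes.
STRIPE_BYTES = 10 * 64 * 1024   # assumed: 10 data chunks of 64KB each

class WriteCoalescer:
    def __init__(self, write_full_stripe):
        self.log = bytearray()            # journal, e.g. mirrored NVRAM or flash
        self.write_full_stripe = write_full_stripe

    def write(self, data: bytes):
        # Small writes are appended to the journal and acknowledged; the
        # logical-to-physical map (not shown) records where each block lives.
        self.log += data
        while len(self.log) >= STRIPE_BYTES:
            # Only full stripes ever reach the erasure-code layer, so the
            # back end never has to do a read-modify-write cycle.
            self.write_full_stripe(bytes(self.log[:STRIPE_BYTES]))
            del self.log[:STRIPE_BYTES]

coalescer = WriteCoalescer(lambda stripe: print(f"full-stripe write: {len(stripe)} bytes"))
for _ in range(200):
    coalescer.write(b"\0" * 4096)         # two hundred 4KB database pages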
Then there's the decoding overhead. Conventional RAID systems store the original data plus additional parity data. Under normal conditions, the system satisfies read requests by reading the original data from the disks that hold it and ignoring the parity, so the compute overhead in the RAID controller for reads is minimal.

With advanced erasure codes, the data and the forward error correction information are encoded together into every chunk. To recover the data, the system has to retrieve at least the minimum number of chunks its coding scheme requires, and then decode those chunks to get the data back. So a Cleversafe system has to retrieve 10 of the 16 chunks it originally stored, and decode them, to satisfy a read request.
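For the curious, here's a toy illustration of that any-k-of-n property--a systematic Reed-Solomon-style code over a small prime field, not the GF(2^8) table-driven arithmetic real dispersed-storage systems use, and not Cleversafe's actual implementation. The chunk counts mirror the 10 of 16 example.

# Toy k-of-n erasure code over the prime field GF(257), for illustration only.
P = 257  # small prime larger than any byte value

def _eval_at(points, x):
    """Lagrange-interpolate the unique degree<k polynomial through `points`
    (a list of (xi, yi) pairs) and evaluate it at x, all mod P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total

def encode(data, n):
    """Encode k data symbols (ints < 256) into n chunks; the first k chunks
    are the data itself (a systematic code)."""
    points = list(enumerate(data))                  # f(i) = data[i] for i = 0..k-1
    return [(x, _eval_at(points, x)) for x in range(n)]

def decode(chunks, k):
    """Recover the k data symbols from any k surviving (index, value) chunks."""
    assert len(chunks) >= k, "not enough chunks to decode"
    return [_eval_at(chunks[:k], i) for i in range(k)]

data = [3, 14, 15, 92, 65, 35, 89, 79, 32, 38]      # 10 data symbols
chunks = encode(data, n=16)                         # 16 stored chunks
survivors = chunks[6:]                              # lose any 6, keep 10
print(decode(survivors, k=10) == data)              # True

Even this toy version makes the extra arithmetic on every read obvious.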
The requirement to always retrieve a minimum number of chunks and then decode them substantially increases the compute load on a system using advanced erasure codes. It also adds to the overhead on small writes, since the data that isn't being overwritten has to be decoded, combined with the new data and re-encoded.
As a result, all the systems we've discussed that use high-level erasure codes share a scale-out architecture. Having a Xeon processor for every four to 18 disk drives gives these systems the compute horsepower to encode and decode data with these more sophisticated ECC methods--horsepower that a conventional midrange array, with four Xeons for 800 or more disk drives, simply doesn't have.