Commentary by Jim O'Reilly
RAID Vs. Erasure Coding

Erasure coding offers better data protection than RAID, but at a price.

RAID, or Redundant Array of Independent Disks, is a familiar concept to most IT professionals. It’s a way to spread data over a set of drives so that the failure of a single drive doesn't cause permanent data loss. RAID falls into two categories: either a complete mirror image of the data is kept on a second drive, or parity blocks are added to the data so that failed blocks can be recovered.
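To make the parity idea concrete, here is a minimal Python sketch (an illustration only, not any array vendor's implementation). It shows how a single parity block, computed as the XOR of the data blocks in a stripe, lets any one lost block be rebuilt from the survivors; the helper name and the block contents are made up for the example.

```python
# Minimal illustration of single-parity protection (RAID 5 style):
# the parity block is the XOR of all data blocks, so any one missing
# block can be rebuilt by XOR-ing the parity with the surviving blocks.

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            result[i] ^= b
    return bytes(result)

# Four data blocks in a hypothetical stripe.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data)

# Simulate losing block 2, then rebuild it from the survivors plus parity.
survivors = [blk for i, blk in enumerate(data) if i != 2]
rebuilt = xor_blocks(survivors + [parity])
assert rebuilt == data[2]
print("rebuilt block:", rebuilt)
```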

However, RAID comes with its own set of issues, which the industry has worked to overcome by developing new techniques, including erasure coding. Organizations need to consider pros and cons of the various data protection approaches when designing their storage systems.

First, let's look at the challenges that come with RAID. Both approaches described above increase the amount of storage used. Mirroring obviously doubles the data size, while parity typically adds about one-fifth more, though the exact overhead depends on how many drives are in a set. There are also performance penalties. Writing an updated block involves at least two drive writes (the data and its parity), and computing the new parity may require reading blocks from all the drives in the set.
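As a rough illustration of those capacity numbers (a back-of-envelope sketch only; the drive counts are assumptions chosen to match the figures in the text, not a recommendation):

```python
# Back-of-envelope capacity overhead for the schemes described above.
# Overhead is expressed as extra storage relative to the usable data capacity.

def mirroring_overhead(copies=2):
    # Each extra copy adds 100% of the data size.
    return float(copies - 1)

def parity_overhead(data_drives, parity_drives=1):
    # Overhead is parity capacity relative to data capacity; wider sets dilute it.
    return parity_drives / data_drives

print(mirroring_overhead())      # 1.0 -> mirroring doubles the data
print(parity_overhead(5, 1))     # 0.2 -> the "one-fifth more" case (5 data + 1 parity)
print(parity_overhead(10, 1))    # 0.1 -> a wider set reduces the overhead
```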

When a drive fails, things start to get a bit rough. Typically an array has a spare drive or two for this contingency; otherwise the failed drive has to be replaced first. The next step involves rebuilding the data from the good drives onto the spare or replacement drive. This is easy enough with mirroring, but there is the risk that a defect among the billions of blocks on the good drive causes an unrecoverable data loss.

Parity recovery takes much longer, since all the data on all the drives in the set has to be read to regenerate the missing blocks. The loss of another drive, or a bad block on any drive, during the rebuild will cause data loss.

In response to these potential data loss risks, the industry created a dual-parity approach, where two non-overlapping parities are created for each drive set. This raises the capacity overhead to roughly two-sevenths in typical arrays. Again, recovery involves reading one of the parities and all of the remaining data, and can take a long while.

RAID 6, the dual-parity approach, took a major hit with the release of 4TB and larger drives. Rebuild is measured in days, and the risk of another failure, or of an unrecoverable bad block, during that window is on the threshold of unacceptable.
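Some rough arithmetic shows why. The rebuild rates below are illustrative assumptions, not measurements; under production I/O load the effective rate is usually a small fraction of a drive's streaming speed, which is what pushes rebuilds out to days.

```python
# Rough arithmetic on why rebuilds of 4TB+ drives take so long.
# The throughput figures are illustrative assumptions, not measurements.

def rebuild_hours(drive_tb, rebuild_mb_per_s):
    drive_mb = drive_tb * 1_000_000           # decimal TB -> MB
    return drive_mb / rebuild_mb_per_s / 3600

print(f"{rebuild_hours(4, 150):.1f} h")   # ~7.4 h at an idle-array 150 MB/s streaming rate
print(f"{rebuild_hours(4, 20):.1f} h")    # ~55 h (days) at 20 MB/s under production load
```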

This put the industry at a fork in the road. We needed a solution that maintained integrity and minimized bad-block issues. Fortunately, there are answers! One solution is to mirror, but with more copies. This speeds up rebuild of a failed drive, since there are multiple sources for the data to be rewritten. Thinking evolved further: virtualizing the drives so that data is spread over a large number of them makes sense, since all of those drives can contribute to a rebuild.

This data distribution, coupled with replication of the dataset, is the technique used by most object storage systems, including services in the cloud. It’s common to make one copy a remote one. This provides disaster recovery if the other two copies go down for any reason, and it has allowed sites to get back online quickly in several “zone outages” where a whole datacenter segment stopped operating.

Still, replication costs a lot of capacity: keeping three copies takes 3x the raw data. As a result, with the economics of very large datacenters on the line, RAID alternatives have been explored. One solution is erasure coding, which uses some complex math to add a set of parity-like blocks, creating a robust protection scheme that can tolerate high levels of failure. Here again, it's possible to virtualize the drives so that the virtual drive is spread over more physical drives, thus speeding recovery.

Erasure coding is usually specified in an N+M format: 10+6, a common choice, means that data and erasure codes are spread over 16 (N+M) drives, and that any 10 of those can recover the data. In other words, up to six drives can fail without losing data. If the drives are on different appliances, the protection extends to appliance failures, so six appliance boxes could go down without stopping operations.
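To see how an N+M code behaves, here is a small, self-contained Python sketch of a Reed-Solomon-style code over a prime field. It is a teaching toy, not a production codec (real systems use optimized GF(256) arithmetic), but it demonstrates the 10+6 property: encode 10 data symbols into 16 shares, throw any 6 away, and the remaining 10 still reconstruct the data exactly. The encode/decode function names and the sample payload are invented for the example.

```python
# A tiny Reed-Solomon-style erasure code over a prime field, to make the
# N+M idea concrete. Teaching sketch only, not a production codec.

P = 257  # prime field size; one byte of data fits in one field element

def encode(data, n, m):
    """Treat n data symbols as polynomial coefficients; emit n+m shares.
    Each share is (x, poly(x)); any n shares suffice to recover the data."""
    assert len(data) == n
    shares = []
    for x in range(1, n + m + 1):
        y = sum(c * pow(x, i, P) for i, c in enumerate(data)) % P
        shares.append((x, y))
    return shares

def decode(shares, n):
    """Recover the n data symbols from any n shares via Lagrange interpolation."""
    shares = shares[:n]
    coeffs = [0] * n
    for j, (xj, yj) in enumerate(shares):
        basis = [1]      # build the j-th Lagrange basis polynomial incrementally
        denom = 1
        for k, (xk, _) in enumerate(shares):
            if k == j:
                continue
            # multiply the basis polynomial by (x - xk)
            new = [0] * (len(basis) + 1)
            for i, c in enumerate(basis):
                new[i] = (new[i] - c * xk) % P
                new[i + 1] = (new[i + 1] + c) % P
            basis = new
            denom = denom * (xj - xk) % P
        scale = yj * pow(denom, P - 2, P) % P   # division via Fermat inverse
        for i, c in enumerate(basis):
            coeffs[i] = (coeffs[i] + c * scale) % P
    return coeffs

data = [ord(c) for c in "0123456789"]    # 10 data symbols
shares = encode(data, n=10, m=6)         # a 10+6 layout: 16 shares
survivors = shares[6:]                   # lose any 6 shares (here, the first 6)
assert decode(survivors, n=10) == data
print("recovered:", "".join(map(chr, decode(survivors, n=10))))
```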

Getting remote protection for primary storage is more complex with erasure coding than with replication. We would need a full recovery set of N (10) local drives for speed. For the remote copy, additional blocks have to be created that duplicate others, giving a 10+10 configuration. In practical terms, primary storage probably needs a couple more drives in each partial set to absorb the performance impact; likely we are talking 12+12. This is better than the replica approach, but it increases WAN traffic a bit.

Anyone choosing an approach has to weigh the capacity savings of erasure coding over replication to decide whether they are worth the extra complexity. With commodity drives and cheaper appliances on the near horizon, this needs careful planning, since the cost of capacity may drop dramatically. The technical trade-off is that replication recovers from errors much faster, but a three-copy scheme can only survive two drive failures, versus six for a 10+6 erasure code.
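As a quick back-of-envelope comparison of the capacity side of that trade-off (the 1 PB figure is just an example):

```python
# Raw capacity needed to protect 1 PB of data under each scheme (illustrative).
# Three-way replication stores every byte three times; a 10+6 erasure code
# stores 16 fragments for every 10 fragments of data.

data_pb = 1.0
replication_raw = data_pb * 3            # 3.0 PB of raw capacity
erasure_10_6_raw = data_pb * 16 / 10     # 1.6 PB of raw capacity
erasure_12_12_raw = data_pb * 24 / 12    # 2.0 PB for the dual-site 12+12 layout above

print(replication_raw, erasure_10_6_raw, erasure_12_12_raw)
```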

Replication is appearing in BlockIO RAID boxes as an option. In essence, this is an extension of mirroring to make three copies, and one of those can be remote. The spreading of data over a larger drive set to speed recovery will appear at some point, which will bring RAID up to par with object storage protection.

Erasure code capability is available in open-source object stores such as Ceph, with Inktank support as well, so the choice will become available across the board in a few months.

My recommendation would be to consider replication for active primary and secondary data and use erasure coding for archived storage, where performance is not an issue. In archiving, it may not be necessary to go beyond 10+6, because the drives could be spread over as many as 16 locations, and because recovery is at a much lower performance level compared to primary storage.

Jim O'Reilly was Vice President of Engineering at Germane Systems, where he created ruggedized servers and storage for the US submarine fleet. He has also held senior management positions at SGI/Rackable and Verari; was CEO at startups Scalant and CDS; headed operations at PC ...
Comments
timwessels, User Rank: Apprentice, 7/16/2014 | 9:26:39 AM
Re: Good short explanation of RAID vs. Erasure Coding
Object storage is considered secondary storage, and therefore performance has not been a major criterion. One object storage vendor, Scality, claims that tests run on their RING storage clusters equal the performance of primary data storage systems. The object storage software vendors who offer erasure coding in addition to replication include Caringo, Cloudian and Scality. Ceph's commercial sponsor, Inktank, was recently purchased by Red Hat. Sage Weil, who developed Ceph as part of his PhD work, is a genius kind of guy, but Ceph has not seen widespread deployment in commercial environments yet. Ditto for Swift. Other object storage vendors, like Amplidata and Cleversafe, base their object storage solely on the use of erasure codes. While some erasure codes are proprietary, many are based on or derived from Reed-Solomon, which has been around since the days of X.25 packet switching networks. I recall back in the day that a lot of 1/4-inch cartridge tape drives and 4mm DAT drives used Reed-Solomon ECC to reliably write data to tape.
Brian.Dean, User Rank: Ninja, 7/16/2014 | 6:17:19 AM
Re: Good short explanation of RAID vs. Erasure Coding
Erasure coding is an excellent frontier. In large setups it can create a load on CPU power, but I feel that this is minimal considering the rate of CPU advancement.

As the price war for cloud storage seems to be slowing down, this would be a good opportunity to offer cloud data protection services to datacenters. However, some enterprises keep an in-house-only datacenter because of the efforts of intelligence services and other security needs; for them, encryption requirements for data in transit would create a higher CPU load.
joreilly925, User Rank: Ninja, 7/15/2014 | 2:36:59 PM
Re: Good short explanation of RAID vs. Erasure Coding
Aditshah, I think the hardest part was finding a mathematical structure that reflects the erasure code calculation! It's amenable to some hardware assist logic. Roll on a few SoCs?
joreilly925, User Rank: Ninja, 7/15/2014 | 2:34:57 PM
Re: Good short explanation of RAID vs. Erasure Coding
Tim,

I agree with your assessment, but there are plenty of people still in denial. I cringe, for instance, every time an SSD is compared price-wise with a $59 bulk SATA terabyte drive. It's the wrong comparison. MLC SSD is now cheaper than "enterprise" HDD, and there is no excuse for not moving to solid-state primary storage.

The migration to object storage is a bit more complex, since the vendors didn't pay enough attention until recently to performance and features. Open-source code like Ceph does now support erasure coding. We need the Chinese ODMs to hit their stride in the US channel market, and then object storage will take over the secondary, bulk-storage tier pretty quickly.
aditshar1, User Rank: Ninja, 7/15/2014 | 1:43:56 PM
Re: Good short explanation of RAID vs. Erasure Coding
When I think about use cases for erasure coding, the first thing that comes to mind is object-based cloud storage, although understanding its mathematical calculation has always been a puzzle for me.
timwessels, User Rank: Apprentice, 7/15/2014 | 11:37:51 AM
Good short explanation of RAID vs. Erasure Coding
Well, I agree that hardware RAID as we have known it for over 20 years is dead, especially RAID levels that use parity to protect data. Mirroring is still viable as there is no parity calculation needed. In the near future there are only going to be two types of storage: flash and object. All primary or "hot" data will be stored in flash, and everything else, which is 80% unstructured data, will be in object storage, with the "warmer" data stored using replication for durability and faster reads, while the "colder" data will be stored using erasure codes for durability, as it will seldom, if ever, be read.