STORAGE

  • 07/14/2014
    7:00 AM

RAID Vs. Erasure Coding

Erasure coding offers better data protection than RAID, but at a price.

RAID, or Redundant Array of Independent Disks, is a familiar concept to most IT professionals. It’s a way to spread data over a set of drives to prevent the loss of a drive causing permanent loss of data. RAID falls into two categories: Either a complete mirror image of the data is kept on a second drive; or parity blocks are added to the data so that failed blocks can be recovered.
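The parity idea can be illustrated with a minimal sketch, assuming simple XOR parity over equally sized blocks (the usual basis for single-parity RAID). The function name `xor_blocks` and the block contents are made up for illustration and are not taken from any particular RAID implementation.

```python
from functools import reduce

def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data blocks, one per drive (illustrative contents only).
data = [b"AAAA", b"BBBB", b"CCCC"]

# The parity block is stored on a fourth drive.
parity = xor_blocks(data)

# Simulate losing the second drive, then regenerate its block
# from the surviving data blocks plus the parity block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

assert rebuilt == data[1]
```

Because XOR is its own inverse, the same function both creates the parity block and regenerates a missing one, which is why every surviving drive must be read during a rebuild.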

However, RAID comes with its own set of issues, which the industry has worked to overcome by developing new techniques, including erasure coding. Organizations need to consider pros and cons of the various data protection approaches when designing their storage systems.

First, let's look at the challenges that come with RAID. Both processes described above increase the amount of storage used. Mirroring obviously doubles data size, while parity typically adds one-fifth more data, though it is dependent on how many drives are in a set. There also are performance penalties. Writing an updated block involves two drive write operations, and parity may require blocks from all the drives in a set to be read.
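The capacity figures above can be checked with a little arithmetic. The sketch below assumes the "one-fifth" figure corresponds to a set of five data drives plus one parity drive; the helper names are hypothetical.

```python
from fractions import Fraction

def mirror_overhead(copies=2):
    """Extra capacity, as a fraction of the user data, for N-way mirroring."""
    return Fraction(copies - 1, 1)

def parity_overhead(data_drives, parity_drives=1):
    """Extra capacity, as a fraction of the user data, for a parity set."""
    return Fraction(parity_drives, data_drives)

print(mirror_overhead())       # 1   (mirroring doubles the data)
print(parity_overhead(5))      # 1/5 (five data drives, one parity drive)
print(parity_overhead(7, 2))   # 2/7 (seven data drives, two parity drives)
```

The overhead shrinks as the set gets wider, but a wider set also means more drives to read during a rebuild.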

When a drive fails, things start to get a bit rough. Typically an array has a spare drive or two for this contingency, or else the failed drive has to be replaced first. The next step involves copying data from the good drives to the spare or replacement drive. This is easy enough with mirroring, but there is the risk that a single defect among the billions of blocks on the good drive causes an unrecoverable data loss.

Parity recovery takes much longer to do, since all the data on all the drives in the set has to be read to allow generation of the missing blocks. The loss of another drive, or a bad block on any drive, will cause data loss.
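To get a feel for the bad-block risk during a single-parity rebuild, here is a rough sketch. It assumes an unrecoverable read error rate of one per 10^14 bits, a figure often quoted on drive data sheets, and treats errors as independent. Both are simplifying assumptions, not measurements, and real arrays tend to do better than this model suggests.

```python
import math

def p_unreadable_block(drives_read, drive_tb, ure_per_bit=1e-14):
    """Probability of at least one unreadable bit while reading every
    surviving drive in full. Assumes independent errors at ure_per_bit."""
    bits = drives_read * drive_tb * 1e12 * 8
    # 1 - (1 - rate)**bits, computed in a numerically stable way.
    return -math.expm1(bits * math.log1p(-ure_per_bit))

# Hypothetical set: five surviving 4TB drives must be read to rebuild the sixth.
p = p_unreadable_block(drives_read=5, drive_tb=4)
print(f"{p:.1%}")  # 79.8%
```

Under these assumptions the chance of hitting a bad block grows quickly with drive size and set width, which is the pressure that led to dual parity.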

In response to potential data loss risks, the industry created a dual-parity approach, where two independent parities are created for each drive set. This increases the capacity overhead to two-sevenths or thereabouts in typical arrays. Again, recovery involves reading one of the parities and all of the remaining data, and can take a long while.

RAID 6, the dual-parity approach, took a major hit with the release of 4TB and larger drives. Rebuilds are measured in days, and the risk that another drive fails in that window, leaving the array one bad block away from data loss, is on the threshold of unacceptable.

This put the industry at a fork in the road. We needed a solution that maintained integrity and minimized bad block issues. Fortunately, there are answers! One solution is to mirror, but with more copies. This speeds up rebuild of a failed drive, since there are multiple sources for the data to be rewritten. Thinking evolved, and virtualizing the drives so that data is spread over a large number of drives made sense, since all of the drives could contribute to a rebuild.

This data distribution, coupled with replication of the dataset, is the technique used by most object storage systems, including services in the cloud. It’s common to keep three copies and make one of them remote. This provides disaster recovery if the other two copies go down for any reason, and has allowed sites to get back online quickly in several “zone outages” where a whole datacenter segment has stopped operating.

Still, replication costs a lot of capacity. It takes as much as 3x the capacity of the raw data. As a result, with the economics of very large datacenters on the line, RAID alternatives have been explored. One solution is to use erasure coding, which uses more complex math to add a set of parity-like blocks, creating a robust protection scheme that can tolerate high levels of failure. Here, again, it's possible to virtualize the drives so that the virtual drive is spread over more drives, thus speeding recovery.

Erasure coding is usually specified in an N+M format: 10+6, a common choice, means that data and erasure codes are spread over 16 (N+M) drives, and that any 10 of those can recover data. That means any six drives can fail. If the drives are on different appliances, the protection includes appliance failures, so six appliance boxes could go down without stopping operations.
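The N+M notation translates directly into drive count, failure tolerance, and capacity cost. The following is a minimal sketch of that bookkeeping only; it does not implement the coding math, and the class name `ErasureLayout` is hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ErasureLayout:
    n: int  # fragments needed to recover the data
    m: int  # additional erasure-code fragments

    @property
    def total_drives(self) -> int:
        return self.n + self.m

    @property
    def raw_per_user_byte(self) -> float:
        """Raw capacity consumed per byte of user data."""
        return self.total_drives / self.n

    def survives(self, failed_drives: int) -> bool:
        """Data is recoverable as long as any n fragments remain."""
        return 0 <= failed_drives <= self.m

layout = ErasureLayout(n=10, m=6)
print(layout.total_drives)        # 16
print(layout.raw_per_user_byte)   # 1.6
print(layout.survives(6))         # True
print(layout.survives(7))         # False
```

A 10+6 layout therefore tolerates six failures at 1.6x raw capacity, where three-way replication tolerates two failures at 3x.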

Getting remote protection for primary storage is more complex with erasure coding than in replication. We would need a full recovery set of N (10) local drives for speed. For remote, additional blocks have to be created that are duplicates of others, so that we would have a 10+10 configuration. In practical terms, for primary storage there probably needs to be a couple more in each partial set, since performance impacts have to be handled; likely, we are talking 12+12. This is better than the replica approach, but it increases WAN network traffic a bit.
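The trade-off can be put in numbers. This sketch assumes each fragment is one-tenth the size of the original data (N = 10) and reads "12+12" as 12 fragments stored locally and 12 remotely, with any 10 still sufficient for recovery. That reading, and the scheme labels, are assumptions made for illustration.

```python
FRAGMENTS_NEEDED = 10  # N: fragments required to recover the data

def multiplier(fragments: int, needed: int = FRAGMENTS_NEEDED) -> float:
    """Capacity (or traffic) relative to the size of the user data."""
    return fragments / needed

# (raw capacity multiplier, WAN traffic multiplier) for each scheme.
schemes = {
    "3x replication, 1 remote": (3.0, 1.0),
    "10+10, 10 remote": (multiplier(20), multiplier(10)),
    "12+12, 12 remote": (multiplier(24), multiplier(12)),
}

for name, (raw, wan) in schemes.items():
    print(f"{name:26s} raw {raw:.1f}x   WAN {wan:.1f}x")
```

Under these assumptions, 12+12 uses 2.4x raw capacity against 3x for replication, while sending 1.2x the data over the WAN instead of 1.0x.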

Anyone choosing an approach has to weigh the capacity savings of erasure coding over replication to see if they are worth the extra complexity. With commodity drives and cheaper appliances on the near horizon, this needs some careful planning, since the cost of capacity may drop dramatically. The technical factors to consider are that replication is much faster on recovery from errors, but with three copies it can tolerate only two drive failures rather than six.

Replication is appearing in block-I/O RAID boxes as an option. In essence, this is an extension of mirroring to make three copies, one of which can be remote. The spreading of data over a larger drive set to speed recovery will appear at some point, which will bring RAID up to par with object storage protection.

Erasure code capability is available in open-source object stores such as Ceph, with Inktank support as well, so the choice will become available across the board in a few months.

My recommendation would be to consider replication for active primary and secondary data and use erasure coding for archived storage, where performance is not an issue. In archiving, it may not be necessary to go beyond 10+6, because the drives could be spread over as many as 16 locations, and because recovery is at a much lower performance level compared to primary storage.
