Even though the storage market as we know it is based entirely on RAID, it’s become fashionable in recent years for storage vendors to proudly proclaim that their shiny new toys don’t use old, broken-down RAID, but some other data protection scheme. While it’s true that today’s multi-terabyte drives expose problems with the traditional RAID levels, abandoning the term RAID seems to create more confusion than clarity.
Instead, I propose that we redefine the D in RAID (redundant array of independent disks) to mean data, or more properly, data objects.
One of my problems is that, even if they use similar technology, some vendors will scream, “We have a No-RAID solution!” while others will describe their technology as RAID. HP’s StoreVirtual, the VSA formerly known as Lefthand, will mirror, or stripe data with parity, across multiple storage nodes in a process the company calls “Network-RAID.” Other vendors similarly mirroring data across nodes will fly the "No-RAID" banner.
My new definition for RAID would therefore include technologies like the object-level n-way mirroring via replication that forms the basis of many cloud-scale storage systems. Just as vendors in the past two decades created RAID-0, RAID-6, RAID-DP, and other RAID extensions, systems using sophisticated erasure codes for high reliability can create monikers like RAID-EC.
The truth is, old-fashioned RAID is ill-equipped to deal with disks like HGST’s 8 TB helium-filled behemoths. These drives create two distinct problems for traditional RAID: extended rebuild times and an unrecoverable error rate that’s uncomfortably close to the drive’s capacity.
The Ultra320 disk of the early 2000s could transfer 125 MB/s, so reading or writing a whole 300 GB drive would take under an hour. Today’s drives transfer data only a little faster: 195 MB/s for the He8, which stretches the best-case full-disk write time to 11.4 hours, assuming, of course, that the system can write data to your hot spare drive at the full rated speed of the disk.
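The arithmetic behind those two figures is easy to check. Here is a minimal sketch, assuming decimal units (as drive vendors quote them) and a sustained transfer at the full rated speed; the function name `full_drive_hours` is my own, purely for illustration.

```python
def full_drive_hours(capacity_bytes: float, rate_bytes_per_s: float) -> float:
    """Best-case hours to read or write an entire drive at a sustained rate."""
    return capacity_bytes / rate_bytes_per_s / 3600


# Decimal units, matching vendor spec sheets.
MB = 10**6
GB = 10**9
TB = 10**12

old_drive = full_drive_hours(300 * GB, 125 * MB)  # 300 GB drive at 125 MB/s
he8_drive = full_drive_hours(8 * TB, 195 * MB)    # 8 TB He8 at 195 MB/s

print(f"300 GB at 125 MB/s: {old_drive:.2f} hours")
print(f"8 TB at 195 MB/s:   {he8_drive:.1f} hours")
```

Capacity has grown by a factor of about 27 while the transfer rate has grown by less than a factor of two, which is the whole problem in one line.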
In the real world, where the rebuild will run as a background process, rebuilding an 8 TB drive will take several days. If you’re using two-way mirroring (RAID-1) or single parity (RAID-5), your data will be unprotected unless and until the RAID rebuild process completes.
The other problem is that your RAID controller may not be able to rebuild a RAID set after a drive failure because one of the remaining drives has an unrecoverable read error. The spec sheet for today’s capacity-oriented enterprise drives says those drives will have an uncorrectable read error for every 10^15 bits read, regardless of whether those drives have SAS or SATA interfaces.
Since 8 TB is 6.4x10^13 bits, the probability of a read failure during a rebuild is too large to ignore, especially since these drives are typically used to support sequential workloads with lots of data, where the efficiency of large RAID sets is attractive. In a 4+1 RAID-5 set of 8 TB drives, the probability of a read failure during a rebuild is more than 22%. Grow that to a 10+1 set, and your rebuild will fail almost half the time.
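For readers who want to check those percentages, here is a rough sketch of the calculation. It assumes every bit read fails independently at exactly the spec-sheet rate of one error per 10^15 bits, which is a simplification, and the function name is mine.

```python
import math

URE_RATE = 1e-15             # assumed: one unrecoverable error per 10^15 bits read
DRIVE_BITS = 8 * 10**12 * 8  # 8 TB (decimal) expressed in bits = 6.4 x 10^13


def rebuild_failure_probability(surviving_drives: int,
                                drive_bits: int = DRIVE_BITS,
                                ure_rate: float = URE_RATE) -> float:
    """Chance of at least one URE while reading every surviving drive in full.

    Evaluates 1 - (1 - p)^n using log1p/expm1 so the tiny per-bit
    probability is not lost to floating-point rounding.
    """
    bits_read = surviving_drives * drive_bits
    return -math.expm1(bits_read * math.log1p(-ure_rate))


p_4_plus_1 = rebuild_failure_probability(4)    # 4 surviving drives in a 4+1 set
p_10_plus_1 = rebuild_failure_probability(10)  # 10 surviving drives in a 10+1 set

print(f"4+1 RAID-5:  {p_4_plus_1:.1%}")
print(f"10+1 RAID-5: {p_10_plus_1:.1%}")
```

A rebuild has to read all of the surviving drives, so the number of bits at risk scales with the width of the RAID set, not with the size of the one drive that failed.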
Luckily, RAID solutions have evolved since Patterson, Gibson, and Katz’s seminal paper, "A Case for Redundant Arrays of Inexpensive Disks," was published in 1988. After all, the academics thought that vendors would use bit- or byte-level striping, as defined in levels 2 and 3.
While I can’t remember ever seeing a RAID controller that supported RAID levels 2 or 3, the RAID controller in the Compaq SystemPro -- the first real x86 server -- used special Conner Peripherals IDE drives with synchronized spindle motors so it could write a sector to all the drives in the array simultaneously. Fortunately, today’s drives have DRAM buffers that eliminate the need for synchronized drives.
The most common solution to the rebuild-time problem, pioneered by 3Par and Compellent before their assimilation into HP and Dell, respectively, is to distribute the data and parity across all the drives in the array. This “chunklet” RAID, to use 3Par’s term, writes the same data and parity as traditional RAID, but rotates which drives the data gets written to as each stripe is written.
When a drive fails in one of these systems, the data that drive used to hold is rebuilt to free space on the remaining drives. Rather than copying from the n other drives in the RAID set to a single spare drive, the rebuild process in these systems is a many-to-many process, which allows it to complete much more quickly, especially for big drives.
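To make the many-to-many idea concrete, here is a toy sketch of a distributed layout and the rebuild that follows a drive failure. It is not how 3Par or Compellent actually place data; the simple round-robin rotation and the function names `place_stripes` and `rebuild_plan` are assumptions made purely for illustration.

```python
from collections import Counter


def place_stripes(num_drives, stripe_width, num_stripes):
    """Rotate the starting drive for each stripe so chunks spread over the array."""
    layout = []
    for s in range(num_stripes):
        start = s % num_drives
        layout.append([(start + i) % num_drives for i in range(stripe_width)])
    return layout


def rebuild_plan(layout, failed, num_drives):
    """Count reads and writes per drive needed to rebuild the failed drive's chunks.

    Each affected stripe is read from its surviving members, and the rebuilt
    chunk goes to the least-loaded drive that is not already in that stripe.
    """
    if failed not in range(num_drives):
        raise ValueError("failed drive is not in the array")
    reads, writes = Counter(), Counter()
    for stripe in layout:
        if failed not in stripe:
            continue  # this stripe lost nothing
        for drive in stripe:
            if drive != failed:
                reads[drive] += 1
        candidates = [d for d in range(num_drives)
                      if d != failed and d not in stripe]
        if not candidates:
            raise ValueError("no spare space outside this stripe")
        target = min(candidates, key=lambda d: (writes[d], d))
        writes[target] += 1
    return reads, writes


# Twelve drives carrying 4+1 stripes; drive 0 fails.
layout = place_stripes(num_drives=12, stripe_width=5, num_stripes=120)
reads, writes = rebuild_plan(layout, failed=0, num_drives=12)
print("drives read from: ", sorted(reads))
print("drives written to:", sorted(writes))
```

Even with this crude rotation, the rebuild reads from more drives than a fixed 4+1 set contains and spreads its writes over many targets instead of funneling everything into one hot spare. Real systems use smarter placement to spread the load more evenly still.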
Solving the Unrecoverable Read Error (URE) problem is a bit more complicated. First, the system has to be able to recognize when a read error occurs by checking the data it reads against a stored checksum, either via T10 DIF, which stores a CRC with each sector, or via a hash the system stores separately from the data, the way WAFL or ZFS does.
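The detection step can be sketched in a few lines. This is only an illustration of the principle: the CRC-32 from Python's `zlib` stands in for whatever checksum a real system uses, and `write_sector`, `read_sector`, and `ChecksumError` are hypothetical names, not any vendor's API.

```python
import zlib


class ChecksumError(Exception):
    """Raised when the data read back does not match its stored checksum."""


def write_sector(data: bytes):
    """Return the data together with the checksum stored alongside it."""
    return data, zlib.crc32(data)


def read_sector(stored_data: bytes, stored_crc: int) -> bytes:
    """Recompute the checksum on read and refuse to return bad data."""
    if zlib.crc32(stored_data) != stored_crc:
        raise ChecksumError("checksum mismatch on read")
    return stored_data


sector = bytes(range(256)) * 2  # 512 bytes of sample data
data, crc = write_sector(sector)
assert read_sector(data, crc) == sector  # a clean read passes

corrupted = bytearray(data)
corrupted[100] ^= 0x01  # flip a single bit, as silent corruption might
try:
    read_sector(bytes(corrupted), crc)
except ChecksumError as err:
    print("detected:", err)
```

Keeping the checksum somewhere other than next to the data, as WAFL and ZFS do, also catches the case where the drive returns the wrong sector entirely, which a checksum stored in that same sector would happily validate.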
It also requires the system to have somewhere else to go for the data when it discovers a read error, such as the second parity stripe of RAID-6 or a third mirror copy. In the best of all possible worlds, an array would be smart enough to know that the data was replicated to your DR site and could request the data it couldn’t read locally from the remote copy.
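Here is a minimal sketch of that "somewhere else to go" logic, using a three-way mirror as the example. The copies are plain Python values, with `None` standing in for a drive that reports a URE; `read_with_fallback` is a hypothetical helper, and a remote DR copy would simply be one more entry in the list.

```python
import zlib
from typing import Optional, Sequence, Tuple


class UnrecoverableReadError(Exception):
    """Raised when no copy of the data passes its checksum."""


def read_with_fallback(copies: Sequence[Optional[bytes]],
                       expected_crc: int) -> Tuple[int, bytes]:
    """Try each copy in order and return the first one that verifies."""
    for index, data in enumerate(copies):
        if data is None:
            continue  # this drive reported a read error
        if zlib.crc32(data) == expected_crc:
            return index, data
    raise UnrecoverableReadError("every copy failed verification")


block = b"payload" * 64
crc = zlib.crc32(block)

# First copy hits a URE, second is silently corrupted, third is good.
copies = [None, block[:-1] + b"\x00", block]
source, data = read_with_fallback(copies, crc)
print(f"served from copy {source}")
```

With only two mirrors, the same double failure would have ended in an exception rather than a successful read, which is the argument for a third copy or a second parity stripe.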
My friend, and fellow Tech Field Day delegate, Matt Simmons wrote a blog post on UREs that goes a little deeper into the math, although he focused on the home NAS environment with consumer drives and their higher error rates.
Of course, all this math assumes that drives are as reliable as the vendors' spec sheets indicate, which they probably are, and that drive failures are independent events. Add in any form of failure grouping, as may be caused by a bad batch of drives or drives failing under the heavier workload of a rebuild, and your backups will become even more important.
Disclosure: HP and Dell have been clients of DeepStorage LLC in the past.