On Large Drives And RAID
Posted by Howard Marks on August 24, 2009
Xyratex's recent announcement that they've qualified Hitachi's 2 TB nearline drive for their disk systems got me thinking about how the RAID techniques of the past don't really address the needs of systems with many ginormous drives. As drives get bigger I worry that basic RAID-5 protection isn't sufficient for these beasts.
For a company that isn't a household name in even the geekiest of households, Xyratex plays a strategic role in the storage industry. Many of the big names in the business, most significantly former parent IBM, OEM Xyratex RAID arrays as their low to midrange products. Even more vendors use Xyratex as a supplier of JBODs and SBODs or as a contract manufacturer. We should start seeing 2TB drives in arrays from better known vendors in the next few months.
My concerns aren't based on the quality or failure rate of big drives, but on the time it must take to rebuild onto a hot spare after a drive failure. Just as scary, the quantity of data that has to be read, processed and written to the replacement drive is getting large enough to be comparable to the amount of data a drive can be expected to read between errors.
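To put a rough number on that second worry, here is a back-of-the-envelope sketch in Python. The error rate of one unrecoverable bit per 10^14 bits read is an assumed, illustrative figure, not one taken from this article or from any particular drive's data sheet, and the 5-drive RAID-5 set is likewise just an example.

```python
import math

def p_read_error(bytes_read: float, bit_error_rate: float) -> float:
    """Probability of at least one unrecoverable error while reading
    bytes_read bytes, ASSUMING independent errors per bit."""
    bits = bytes_read * 8
    return -math.expm1(bits * math.log1p(-bit_error_rate))

TB = 10**12                  # decimal terabyte
ASSUMED_BER = 1e-14          # hypothetical: 1 error per 10^14 bits read
surviving_drives = 4         # hypothetical 5-drive RAID-5 set, one failed

# A RAID-5 rebuild has to read every surviving drive end to end.
p = p_read_error(surviving_drives * 2 * TB, ASSUMED_BER)
print(f"Chance of a read error during rebuild: {p:.0%}")
```

Under those assumptions the odds are uncomfortably close to a coin flip; with a better assumed error rate, or fewer drives, the number drops accordingly.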
In the few short years that capacity oriented drives, mostly but not necessarily with SATA interfaces, have worked their way into the data center, their capacity has increased eightfold while their throughput has barely doubled. The ~130MB/s sustained data transfer rate that 1 and 2TB drives deliver is sufficient for the backup and archiving applications enterprises use them for.
However, even if a RAID controller could rebuild a failed drive at 130MB/s, it would take over 4 hours to rebuild a 2TB drive. In the real world, I'd expect it to take at least 12 hours, even longer if the array is busy, since rebuilding is a lower priority task.
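The four-hour figure is simple arithmetic, sketched below in Python. It assumes decimal units (2TB = 2 x 10^12 bytes, 130MB/s = 130 x 10^6 bytes/s) and a rebuild that runs at the drive's full sustained rate with no competing I/O; the function name `rebuild_hours` is just for illustration.

```python
def rebuild_hours(capacity_bytes: float, rate_bytes_per_s: float) -> float:
    """Hours needed to write a whole drive at a constant rate (best case)."""
    return capacity_bytes / rate_bytes_per_s / 3600

TB = 10**12   # decimal terabyte, as drive vendors count
MB = 10**6    # decimal megabyte

best_case = rebuild_hours(2 * TB, 130 * MB)
print(f"2TB at 130MB/s: {best_case:.1f} hours")

# Working backwards: a 12-hour rebuild implies this effective rate.
effective_rate = 2 * TB / (12 * 3600) / MB
print(f"12-hour rebuild: about {effective_rate:.0f} MB/s effective")
```

In other words, a 12-hour rebuild means the controller is only managing roughly a third of what the drive could sustain.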
With an MTBF of 1.2 million hours, one could be lulled into a false sense of security by calculating that the probability of 2 drives in the 5-20 in a RAID set failing is somewhat lower than that of winning the Publishers Clearing House Sweepstakes. Someone wins the sweepstakes every year. Drive failures come in bunches because the environmental problems, either in manufacturing or in deployments, that cause drive failures affect not just one drive but often a whole array or data center.
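For reference, the comforting calculation looks something like the sketch below. It assumes independent failures at a constant rate of 1/MTBF (the exponential model), which is exactly the assumption that failures arriving in bunches breaks. The 12-hour window is the real-world rebuild estimate from earlier, and the function name is illustrative.

```python
import math

MTBF_HOURS = 1.2e6  # vendor MTBF quoted in the article

def p_second_failure(drives_in_set: int, window_hours: float) -> float:
    """Chance that at least one surviving drive fails during the rebuild
    window, ASSUMING independent failures at a constant 1/MTBF rate."""
    survivors = drives_in_set - 1
    return -math.expm1(-survivors * window_hours / MTBF_HOURS)

for n in (5, 20):
    print(f"{n} drives, 12h rebuild: {p_second_failure(n, 12):.6f}")
```

Numbers this small are what make single-parity protection look safe on paper; they say nothing about correlated failures.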




Comment by Bill on August 24, 2009 3:49 PM
I agree with your assessment. BTW, it's "3PAR", not 3par. You don't call the other guys Ibm, Hp or Emc, do you? (grin)
Comment by Mike Young on August 24, 2009 4:59 PM
I also agree with the final assessment, although I was wondering why you were leaving out RAID6 till I got to the end. :)
However, drive rebuilds don't always get the lower priority. Most RAID engines let you dial in the priority for new I/Os vs rebuilds.
Back in 2006, it seemed like the addition of RAID6 was such a hot priority with all of the RAID guys, it's hard to imagine folks might be building large arrays with RAID5. Is this really a problem still?
Mike Young
CEO, Cachengo
http://cachengo.com/blog
Comment by C.R.US.H. on August 24, 2009 5:56 PM
The smart folks at U.C. Santa Cruz Storage Systems Research Center wrote an academic paper on this issue called:
CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data.
Traditional RAID 5 and RAID 6 algorithms limit rebuild speed to the write speed of a single drive, and CRUSH removes this limitation. I believe several vendors have delivered products based on their research.
http://www.ssrc.ucsc.edu/Papers/weil-sc06.pdf
Comment by Guerre on August 24, 2009 6:44 PM
Don't most CRUSH systems suffer from Swiss Cheese Syndrome? They handle a single drive failure very well, but a double drive failure that happens before the first drive rebuild finishes will cause millions of random holes spread around every disk drive in the data center. This was debated ad nauseam between EMC bloggers and the XIV team, which uses the CRUSH approach. A dual drive failure within a single RAID 5 array will cause the loss of one RAID set and the loss of the applications associated with it, but the loss of two disks anywhere in a CRUSH or XIV deployment will result in the complete loss of all IT resources in the data center. At least, that was the analysis by the EMC people and the argument seemed logical to me at the time.
Comment by CRUSH on August 24, 2009 8:52 PM
I don't know. It's been a while since I read the paper.
Take a look and let me know if you think it's vulnerable to the Swiss cheese effect.
There may (or may not) be differences between a vendor's implementation and the academic paper.
Comment by anon on August 25, 2009 2:00 PM
If the copy in one data center becomes corrupt, you can rebuild from the copy in the other data center.
Keeping only one copy in one data center is not safe, no matter what RAID mode is implemented.
Comment by HomeBuiltRaid on August 25, 2009 3:12 PM
Howard, nice article. We're debating and testing RAID 5, 6 and 10. Anyone have any thoughts on RAID 10? With SATA drives being so inexpensive, we're leaning towards RAID 10 and having 2 SANs, one in the Data Center and one at our Hot Site.
Comment by Anonymous on August 25, 2009 5:08 PM
Good article.
With RAID rebuild times being so long for large drives, I'm seeing some end users mirroring their RAID 5 or RAID 6 subsystems in order to handle 2+ simultaneous drive failures.
This approach is great for drive manufacturers, but expensive and maybe overly complicated for end users.
Is there a better and simpler way to handle 2+ failures?
Graham Irving
www.storageclarity.com
Comment by storage guy on August 25, 2009 9:02 PM
It is important to understand that CRUSH can be implemented to be a RAID-6 (or dare I say RAID7) protection level. This pretty much covers the Swiss cheese issue. Remember that "buckets" in CRUSH can have different protection rules applied, such as on-site/off-site replication.
I like to think that getting around ALL of the issues facing an antiquated data protection layer does not rely on the DRIVE protection layer alone. As was pointed out in the article above, the true threat is more from an undetected data error on rebuild or use.
The smart companies are splitting the overall DPL into three layers: potentially CRUSH for disk failure protection, T10-PI (formerly DIF) for path corruption, and UCE protection coupled to hardened filesystems/application layer.
Rebuilding a 2TB drive will take over 22 hours on an idle system. If someone would be foolish enough to place a SATA drive in a high duty cycle IOPS environment, expect that number to be 10+ DAYS!
Oh and 8TB drives will be here in the next 2-3 years....
Comment by Error rates and SSDs on August 26, 2009 7:40 PM
I wonder if published error rates are really up to date with the new larger drives. If so, what makes us think there are no errors in the data we are raiding? In other words, the data already has errors in it.
One solution for higher data certainty would be to move some of the error checking to the application level - let every database record have a SHA-1 fingerprint at the end of the record, and check it on both reads and writes. (I won't patent it, but please call this "Jensen's Thumbprint" in my honor.) I had a custom SHA-1 DLL written, and it processes 600 MB/sec, so speed shouldn't be an issue.
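A minimal sketch of that record-level fingerprint idea in Python, using the standard library's hashlib. The record layout (payload followed by a 20-byte SHA-1 digest) and the function names `seal` and `unseal` are assumptions for illustration; the comment doesn't specify a format.

```python
import hashlib

DIGEST_LEN = hashlib.sha1().digest_size  # 20 bytes for SHA-1

def seal(payload: bytes) -> bytes:
    """Return the record with its SHA-1 fingerprint appended."""
    return payload + hashlib.sha1(payload).digest()

def unseal(record: bytes) -> bytes:
    """Verify the trailing fingerprint and return the payload.
    Raises ValueError if the record is too short or does not match."""
    if len(record) < DIGEST_LEN:
        raise ValueError("record too short to hold a fingerprint")
    payload, stored = record[:-DIGEST_LEN], record[-DIGEST_LEN:]
    if hashlib.sha1(payload).digest() != stored:
        raise ValueError("fingerprint mismatch: record is corrupt")
    return payload

record = seal(b"customer=42;balance=100.00")
assert unseal(record) == b"customer=42;balance=100.00"

# Flip one bit to simulate silent corruption on disk.
damaged = bytes([record[0] ^ 0x01]) + record[1:]
try:
    unseal(damaged)
except ValueError as err:
    print("detected:", err)
```

Note that this only detects a bad record; getting the good data back still needs a second copy or parity somewhere.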
Second, I wonder about adding a hybrid solution for faster rebuilds. Add some low cost SSDs - the Crucial M225 256 GB at $600, for example - that have high read/write speeds (270/200 MB/sec) and fairly high IOPS. Rebuild to the SSDs first, which will be far faster than the hard drive rebuild. Then you are protected from dual drive failures (or in RAID 6, a third drive failing).
Then use the SSDs as the 'live drive', and take your time rebuilding to the new hard drive in the array. When done with the latter, take the SSDs back offline.
Obviously this can't be patented either.
Please don't whine about MLC not lasting 50 years, you are going to replace it in the next three or four years anyways. And you are using it for extremely light duty.
How about Build Raid After Destruction with SSD (BRADS).
-Brad Jensen
Comment by Neil Cameron on August 28, 2009 2:02 AM
The late Tom Treadway wrote an interesting article on the effect of drive count on RAID 5 etc.
http://storageadvisors.adaptec.com/2007/07/10/effect-of-drive-count-on-raid-5
Complicated but interesting reading on this subject.
Regards,
Neil Cameron
Adaptec
Comment by akro on August 28, 2009 11:46 AM
I think triple mirrors with solid state caches on the drives is the way to go...
Comment by Tom Ruwart on February 18, 2010 2:58 PM
Yes - RAID6 particularly if it is an "always-active" RAID6 implementation like Data Direct has. This means that all data drives and parity drives are read for every access and the data is checked for correctness.
One thing that is becoming more of a problem with large drives is the problem of silent data corruption - the data you read is not the data you wrote but there were no errors in either the write or the read to indicate a problem. The only mechanism that can detect *and* correct for this is RAID6 at the moment. There are other, more complicated techniques employed by NetApp and other NAS vendors that are more efficient but I am not aware of those techniques being available in commodity RAID controllers.
That's my 2 sense.
-Tom Ruwart
www.ioperformance.com