Recently, a couple of pioneering vendors have introduced disk arrays that will require no maintenance over their three- to five-year lifetimes, and come with warranties that include on-site repairs for the same time span.
If these self-healing devices can live up to their promise of no maintenance, this technology could become a standard feature of disk arrays--and one we'll wonder how we ever did without. For now, organizations with some tolerance for risk could save themselves a bundle by adopting this technology sooner rather than later.
RETHINKING RAID
Xiotech's Intelligent Storage Element (ISE), used in its Emprise 5000 and 7000 drive arrays, and Atrato's Sealed Array of Independent Drives (SAID), the key component of its Velocity 1000 array, combine revamped mechanical design, advanced RAID technology, built-in spares, drive scrubbing, and, most significantly, drive rehabilitation to achieve high performance and low maintenance costs, potentially saving customers thousands of dollars per year.
The most visible difference between these systems and a typical midrange disk array is that they don't have front-accessible hot-swappable drives. The typical arrangement of 12 to 16 3.5-inch drives in a 3U package reduces mean time to repair and allows failed drives to be hot-swapped but also limits airflow. Because all the drives are mounted facing the same way, this setup allows multiple drives' rotational vibrations to reinforce one another, causing data errors and premature failure.
Rather than building RAID sets from whole drives, both systems break data into chunks, then create a logical RAID drive by distributing the data, parity, and 10% to 15% spare space across all physical drives. So a 4+1 RAID-5 set will have one parity chunk for every four data chunks, but the data will be spread across all the drives in the system.
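To make the chunk layout concrete, here is a minimal Python sketch. It is not either vendor's actual placement algorithm: the round-robin placement, the rotating parity slot, and the function names place_stripes and xor_chunks are all assumptions made for illustration, and the 10% to 15% spare space is not modeled. The parity arithmetic itself is the standard RAID-5 XOR.

```python
def place_stripes(num_drives, num_stripes, data_chunks=4):
    """Spread 4+1 style stripes over a pool of drives (hypothetical layout).

    Each stripe gets data_chunks data chunks plus one parity chunk, and
    successive stripes start on different drives so the whole pool is used.
    """
    width = data_chunks + 1  # one parity chunk per stripe
    if width > num_drives:
        raise ValueError("stripe is wider than the drive pool")
    layout = []
    for stripe in range(num_stripes):
        start = (stripe * width) % num_drives
        drives = [(start + i) % num_drives for i in range(width)]
        parity_slot = stripe % width  # rotate which slot holds parity
        layout.append({
            "stripe": stripe,
            "parity": drives[parity_slot],
            "data": [d for i, d in enumerate(drives) if i != parity_slot],
        })
    return layout


def xor_chunks(chunks):
    """XOR equal-length byte strings together (standard RAID-5 parity)."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        if len(chunk) != len(out):
            raise ValueError("chunks must be the same length")
        for i, byte in enumerate(chunk):
            out[i] ^= byte
    return bytes(out)
```

Because XOR is its own inverse, a lost chunk can be rebuilt by passing the surviving data chunks and the parity chunk to xor_chunks. With 20 drives, eight stripes in this sketch already touch every drive in the pool.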
Spreading RAID sets across all those spindles isn't a huge breakthrough: Hewlett-Packard's EVA and Xiotech's Magnitude, among others, have been doing it for years. This kind of data distribution has several advantages, though. First is the performance boost from having large reads and writes dispersed across 20 or more drives with independent positioners active at the same time. Second, because the spare space is spread across the drives rather than set aside on idle hot spares, every drive in the system is working to deliver data.
The real innovation is what these arrays do when a drive fails. When a typical RAID controller encounters any drive error bigger than a single bad sector that it can remap to another location on the same drive, it marks the entire drive as bad and stops using it. If a spare drive is available, the controller starts rebuilding the RAID set.
When a self-healing array sees a drive error, it starts the rebuild process but also sends the drive that generated the error to rehab. First, it cycles the power, just to see if there was a firmware glitch--CPUs and firmware on drives occasionally hang. Then, it runs diagnostics to determine exactly what's wrong. The array then tries to rehabilitate the drive by low-level formatting it. If it still finds a bad head after that, it can take the surface that head serves out of use and return the rest of the drive's space to service.
This level of rehab requires close cooperation between the array and drive manufacturers so that, if nothing else, the controller knows how logical block addresses or SCSI blocks map to a particular head.
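As a rough illustration of why that mapping matters, the Python sketch below assumes the drive maker has supplied a table of extents, each giving a starting block address and the head that serves it. The table format, the sample numbers, and the function names are hypothetical; real drive geometry is vendor-specific and far more involved.

```python
from bisect import bisect_right

# Hypothetical extent table: (first_lba, head), sorted by first_lba.
# Each extent runs up to the start of the next one.
EXTENTS = [(0, 0), (1000, 1), (2000, 2), (3000, 3), (4000, 0), (5000, 1)]
TOTAL_LBAS = 6000


def head_for_lba(extents, lba):
    """Return the head that serves a given logical block address."""
    starts = [first for first, _ in extents]
    index = bisect_right(starts, lba) - 1
    if index < 0:
        raise ValueError("LBA is below the first extent")
    return extents[index][1]


def usable_ranges(extents, total_lbas, bad_heads):
    """List the (start, end) LBA ranges that avoid the bad heads.

    End values are exclusive; adjacent good extents are merged.
    """
    ranges = []
    for i, (start, head) in enumerate(extents):
        end = extents[i + 1][0] if i + 1 < len(extents) else total_lbas
        if head in bad_heads:
            continue  # skip everything this head would have to read
        if ranges and ranges[-1][1] == start:
            ranges[-1] = (ranges[-1][0], end)
        else:
            ranges.append((start, end))
    return ranges
```

With head 2 marked bad in this made-up table, the controller would keep blocks 0 through 1999 and 3000 through 5999 in service and stop allocating chunks in between.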
It also means that self-healing arrays won't trust a failed drive just because it powers up and says it's OK. They put a suspect drive through its paces with reads, writes, and long and short seeks, then bring it up to operating temperature and run some diagnostics before releasing it from rehab and using it to store data again.
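The rehab sequence described above can be sketched as a short routine. This is an illustration only, not Xiotech's or Atrato's firmware: FakeDrive is a hypothetical stand-in for a real drive, and the rule that a drive with more than one bad head is retired is an assumption, since the vendors' actual thresholds aren't given here.

```python
from enum import Enum


class Outcome(Enum):
    RETURNED = "returned to service"
    PARTIAL = "returned with a bad head masked out"
    RETIRED = "retired"


class FakeDrive:
    """Hypothetical drive model used only to exercise the routine."""

    def __init__(self, hung=False, bad_heads=(), fixable_by_format=(),
                 passes_exercise=True):
        self.hung = hung
        self.bad_heads = set(bad_heads)
        self.fixable_by_format = set(fixable_by_format)
        self.passes_exercise = passes_exercise

    def power_cycle(self):
        self.hung = False  # clears a hung CPU or firmware glitch

    def diagnose(self):
        return set(self.bad_heads)

    def low_level_format(self):
        self.bad_heads -= self.fixable_by_format

    def exercise(self, skip_heads):
        # Stands in for reads, writes, long and short seeks, and
        # diagnostics at operating temperature.
        return self.passes_exercise and not self.hung


def rehabilitate(drive, max_masked_heads=1):
    """Run the rehab steps in order and report what happened."""
    drive.power_cycle()
    bad = drive.diagnose()
    if bad:
        drive.low_level_format()
        bad = drive.diagnose()
    if len(bad) > max_masked_heads:  # assumed retirement threshold
        return Outcome.RETIRED, bad
    if not drive.exercise(skip_heads=bad):
        return Outcome.RETIRED, bad
    return (Outcome.PARTIAL if bad else Outcome.RETURNED), bad
```

The point of the ordering is that the cheapest fix is tried first, and that no drive goes back into the pool until it has passed the exercise step, whatever the earlier steps reported.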
Duty-cycle management is another way to prevent drive failures. Self-healing arrays will throttle back requests to a drive if the controller sees that the drive temperature is too high.
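One simple way to picture this kind of throttling is a controller that shrinks the number of requests it keeps outstanding to a drive as the drive heats up. The Python sketch below assumes a linear ramp; the temperature thresholds, the queue depth of 32, and the function name are illustrative assumptions, not figures from either vendor.

```python
def allowed_queue_depth(temp_c, max_depth=32, warn_c=50.0, critical_c=60.0):
    """Return how many requests may be outstanding to a drive.

    Below warn_c the drive gets the full queue. Between warn_c and
    critical_c the depth ramps down linearly. At or above critical_c
    the drive is held to one request at a time.
    """
    if temp_c <= warn_c:
        return max_depth
    if temp_c >= critical_c:
        return 1
    fraction = (critical_c - temp_c) / (critical_c - warn_c)
    return max(1, int(max_depth * fraction))
```

With these assumed thresholds, a drive at 55 degrees Celsius would be allowed half the normal queue depth, which gives it a chance to cool before the heat causes an error.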