Restoration Is The Issue
Many tape-backup-and-restore software vendors concede that the industry has always emphasized backup capabilities over restoral capabilities. Before the advent of SANs, which provide a means of handling large-scale data movements (backups, data replication and so on) on a network separate from the production LAN, the primary concern of end users with respect to backups was time. As organizations demanded 24x7 operation from their applications, windows of opportunity to take systems offline to perform backups were shrinking. Backup speeds were paramount.
Tape-drive vendors began producing higher-capacity tape formats, faster drives, robotic libraries, and even parallel channels and tape RAID to support efforts to move data off disk drives and onto tape much more quickly. Restoration speed was often an afterthought in customers' purchasing decisions; backups were viewed as a necessary burden -- an insurance policy that, one hopes, would never have to be accessed.
The speed of data restoration (time to data) is affected by several factors. Some people would include several "prequel" steps -- obtaining the tapes from off-site storage, transporting them to the recovery site, configuring them into a robotic library and verifying their data integrity -- as important components of the time-to-data calculation. When a robotic library is used, time must be added for the proper tape to be identified, or "picked," and positioned in the drive by the robot. Depending on the manufacturer, robotic picking and loading require up to 30 seconds per tape. Getting to the data on the tape and beginning a transfer takes another 30 seconds to one minute, depending on the type of tape. All these time factors need to be considered before the actual data-transfer process is evaluated.
Once the tape is positioned for reading and data restoration, the speed at which the tape drive is capable of reading data recorded on the tape is a significant issue. In the open-systems world, rated speeds for popular tape formats range from 6 MB per second for DLT8000 drives to 15 MB per second for LTO Ultrium tape. A second factor is the rate at which the drive interface or interconnect (for example, SCSI or Fibre Channel) can transmit data to the destination device. Interconnects are approaching 1 GB per second, but the capability of disk devices to use all that bandwidth varies widely.
Many vendors rate their data-restoration capabilities on the basis of these two factors alone, creating a rather deceptive impression of the speed with which a terabyte of data can be transferred from a tape to a disk. In the real world, according to backup-and-restore software vendors, most users can consider themselves fortunate if the actual performance of their restore operation approaches 30 percent of the vendor's rated speed. To understand why, consider the following complicating factors.
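To put the gap between rated and real-world speed in perspective, the arithmetic can be sketched in a few lines of Python. This is only an illustration: it uses the rated drive speeds and the 30 percent figure quoted above, and it assumes decimal units (1 TB = 1,000,000 MB). The function and constant names are made up for this example.

```python
# Rated drive speeds quoted in the article, in MB per second.
RATED_MB_PER_SEC = {"DLT8000": 6.0, "LTO Ultrium": 15.0}

# Assumption: decimal units, so 1 TB = 1,000,000 MB.
MB_PER_TB = 1_000_000


def transfer_hours(terabytes, mb_per_sec, efficiency=1.0):
    """Hours needed to stream `terabytes` at a fraction of the rated speed."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = terabytes * MB_PER_TB / (mb_per_sec * efficiency)
    return seconds / 3600


for name, rate in RATED_MB_PER_SEC.items():
    rated = transfer_hours(1, rate)
    realistic = transfer_hours(1, rate, efficiency=0.30)
    print(f"{name}: {rated:.1f} h at rated speed, {realistic:.1f} h at 30 percent")
```

The figures cover a single drive streaming without interruption; tape handling, the interconnect and the target disk are ignored here.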
If tape data is being sent to a storage device directly, the speed at which that device can record the tape-data stream is important. While calculating the speed at which data can be written to a single disk drive is easy, determining the data-restoral speed of a RAID array can be more challenging. This is due to the operation of the RAID controller itself. A RAID 5 controller, for example, must work to ensure that parity information is recorded with data. It performs two write operations for every write command it receives, resulting in a measurable slowdown of the data-writing process. Without special workarounds, this delay can double tape transfer times in data restoral. (This "write penalty" dilemma existed for nearly two decades before SANs, but it anticipated some of the potential problems with SAN virtualization schemes.) In a high-end array that performs RAID tasks as well as on-array virtualization, the controller may also slow down data restoration as it directs data streams to logical partitions configured within the array.
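A rough way to see how the write penalty can double a restore is to treat the slower of the two devices as the bottleneck. The sketch below is a simplified model, not a description of any real controller: it assumes that each write command costs a fixed number of physical writes and that the array's raw write rate (a hypothetical 15 MB per second here) is divided evenly among them.

```python
def effective_restore_rate(tape_mb_per_sec, array_mb_per_sec, writes_per_command=1):
    """Restore rate in MB/s when the slower of tape and array is the bottleneck.

    writes_per_command models the write penalty: 2 for the RAID 5 case
    described above, 1 for a plain disk with no parity to record.
    """
    if writes_per_command < 1:
        raise ValueError("writes_per_command must be at least 1")
    array_ingest = array_mb_per_sec / writes_per_command
    return min(tape_mb_per_sec, array_ingest)


# Hypothetical array whose raw write rate matches an LTO Ultrium drive.
plain = effective_restore_rate(15.0, 15.0)
raid5 = effective_restore_rate(15.0, 15.0, writes_per_command=2)
print(f"no parity: {plain} MB/s, RAID 5: {raid5} MB/s")
print(f"restore takes {plain / raid5:.0f}x as long with the write penalty")
```

If the array can absorb data much faster than the tape can deliver it, the penalty disappears in this model, which is one reason real results vary so widely.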
The storage device's capacity is also a factor. Fifteen years ago, tape capacities exceeded the average disk drive capacity by a ratio of 23 to 1. By 1994, the year that Exabyte Corp. introduced its MammothTape technology, providing an unprecedented 20 GB of native capacity, Seagate Technology was just bringing its 2.1-GB Barracuda server disk drive to market. Today, the highest-capacity tape in the open-systems space is the 110-GB SuperDLT from Quantum Corp., while a single large server disk, such as Seagate's Cheetah, holds about 74 GB, for a ratio of about 1.5 to 1. With capacities this close, restoring even a modest amount of data practically mandates the use of tape libraries, an important consideration when calculating the time to data for a multiple-disk array.
Although direct tape-to-disk (sometimes called server-free or LAN-free) backup and restore strategies are increasing in popularity, chances are that a server will be involved in the restore process. The server will host a software application such as Veritas NetBackup, Legato NetWorker or Computer Associates ARCserve 2000 to facilitate the data restoration, and both the server processor and the bus architecture can impose latencies in the restore process.
Moreover, software components such as the server operating system, file system and restore application may exact a toll by limiting the number of concurrent or parallel data streams that can be supported in the restoration. (Concurrency and parallelism have been introduced to reduce the time for backups and restores by increasing the number of connections between the tape and the disk devices.) Server file systems can also generate overhead in the restore process as they record and organize data that's being written to attached or tethered drives.
Taken together, these factors throw several wrenches into vendor estimates regarding data-restoration speed. Without factoring in software-, server- or virtualization-related overhead, vendors estimate that restoring 5 TB of data from LTO Ultrium tapes in a tape library to disk requires 24 hours. That estimate includes 24 minutes to swap 13 100-GB tapes. Performing the same task with DLT8000 technology requires 60 hours, with 42 minutes consumed in swapping the 40-GB tape cartridges. With virtualization schemes added to the equation, the time required to restore 5 TB of data could be much greater.
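Readers who want to run their own numbers can start from a simple estimator like the Python sketch below. It is not the vendors' method, which the estimates above do not spell out. The number of drives working in parallel and the 90 seconds allowed per tape swap (30 seconds to pick and load plus up to a minute to position) are assumptions, and the function name is invented for this example.

```python
import math


def time_to_data(data_gb, tape_gb, drive_mb_per_sec, drives=1, swap_seconds=90.0):
    """Return (tapes, swap_rounds, total_hours) for a restore from a library.

    Assumptions: decimal units (1 GB = 1,000 MB), every drive streams at its
    rated speed, and all drives swap cartridges at the same time.
    """
    tapes = math.ceil(data_gb / tape_gb)
    swap_rounds = math.ceil(tapes / drives)
    transfer_seconds = data_gb * 1000 / (drive_mb_per_sec * drives)
    total_seconds = transfer_seconds + swap_rounds * swap_seconds
    return tapes, swap_rounds, total_seconds / 3600


# 5 TB restored with a hypothetical four drives working in parallel.
for name, tape_gb, rate in [("LTO Ultrium", 100, 15.0), ("DLT8000", 40, 6.0)]:
    tapes, rounds, hours = time_to_data(5000, tape_gb, rate, drives=4)
    print(f"{name}: {tapes} tapes, {rounds} swap rounds, {hours:.1f} hours")
```

With four drives the sketch lands in the neighborhood of the vendor figures, but that drive count is a guess, not something the estimates state. Multiply the result by the real-world efficiency of your own configuration before relying on it.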
Given all the variables associated with a specific storage configuration, it's easy to see why vendors don't offer binding performance guarantees on data-restore operations. As with automobiles, your mileage will vary.
Mirror, Mirror
With all the inherent limitations of tape-restore operations, it's no surprise that many recovery-service vendors offer disk mirroring as a solution. Both traditional disaster-recovery vendors and newcomers to the high-availability market, such as SSPs (storage service providers) and Web-based hosting companies, seek to leverage the value of disk mirroring to become the guarantors of corporate survival.
Mirroring encompasses a number of strategies, ranging from symmetric, near-real-time storage replication, to asymmetric, time-delayed solutions, to site replication. All of these strategies have two things in common: They provide time-to-data advantages over tape-based storage-recovery strategies, and they usually carry a hefty price tag.
Symmetric and asymmetric mirroring are not new technologies. In a symmetric-mirroring configuration, data writes are made to two (or more) arrays at nearly the same time. Data is written to Array 1, then further updates to that array are held in queue until the same write can be made to Array 2. In the past, symmetric mirroring was practical only within the confines of the data center, with mirrored arrays collocated with one another. Placing the secondary array a significant distance from the primary introduced latency and degraded the performance of the overall system. As high-speed MANs (metropolitan area networks) are rolled out in certain major cities, the greater bandwidth makes symmetric mirroring between sites practical. (See our case study, "Mirrored Redundancy Is Key to Utility's Storage-Recovery Strategy".)
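The distance penalty is easy to approximate. Because each write to Array 1 waits on the matching write to Array 2, every write pays for a round trip between the sites. The Python sketch below assumes that light travels through fiber at roughly 200,000 km per second (about 200 km per millisecond) and uses hypothetical 0.5-millisecond service times for each array; switch, protocol and queuing delays are left out.

```python
# Approximate propagation speed of light in optical fiber.
FIBER_KM_PER_MS = 200.0


def mirrored_write_ms(distance_km, local_ms=0.5, remote_ms=0.5):
    """Latency of one symmetrically mirrored write, in milliseconds.

    Serial model: write locally, send to the remote array, write there,
    and wait for the acknowledgment to come back.
    """
    round_trip_ms = 2 * distance_km / FIBER_KM_PER_MS
    return local_ms + remote_ms + round_trip_ms


for km in (0, 10, 100, 1000):
    latency = mirrored_write_ms(km)
    print(f"{km:>5} km: {latency:.2f} ms per write, "
          f"about {1000 / latency:.0f} serialized writes per second")
```

In this model, bandwidth does nothing to shorten the round trip; it only keeps the link from becoming an additional bottleneck.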
Both SSPs and traditional recovery-site vendors report an uptick in subscriptions to their symmetric mirroring services, though the customers tend to be Fortune 1000 companies or other enterprises that stand to realize significant losses for even a modicum of downtime.
If a company does not have access to a fiber optic MAN operating at core carrier speeds and can't afford to "light the fiber" itself, asymmetric mirroring is an alternative. Asymmetric mirroring usually entails three arrays: Arrays 1 and 2 are collocated and symmetrically mirrored. Arrays 2 and 3, which are geographically remote from each other, communicate mirrored data across a slower-speed network. As a separate mirror operation, the exchanges between Arrays 2 and 3 do not impose a latency penalty on the production system and mirror.
Of course, in an asymmetric mirror, the data on Array 3 is always out of sync with Array 2. The size of the "mirror gap" -- the difference in data between the arrays -- is determined by the distance between the arrays and the bandwidth offered by the interconnecting network. A company considering this option must weigh the costs of some lost transactions against the advantages of having most data ready for use in the event of an unplanned interruption. Moreover, the cost of not just two but three storage arrays, and of the network that connects them, must be factored into the cost-benefit analysis.
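One way to reason about the mirror gap is to add the data still in transit to the backlog that builds whenever production writes outpace the link. The Python sketch below is a deliberately crude model under stated assumptions: a steady write rate during a burst, a fixed link speed, and propagation through fiber at about 200 km per millisecond. All of the figures in the example call are hypothetical.

```python
# Approximate propagation speed of light in optical fiber.
FIBER_KM_PER_MS = 200.0


def mirror_gap_mb(write_mb_per_sec, link_mb_per_sec, distance_km, burst_seconds):
    """Estimate the data (in MB) on Array 2 that has not yet reached Array 3."""
    # Backlog grows only while writes arrive faster than the link can carry them.
    backlog = max(0.0, write_mb_per_sec - link_mb_per_sec) * burst_seconds
    # Data already sent but still traveling between the sites.
    one_way_seconds = distance_km / FIBER_KM_PER_MS / 1000
    in_flight = min(write_mb_per_sec, link_mb_per_sec) * one_way_seconds
    return backlog + in_flight


# Hypothetical case: a one-minute burst of 10 MB/s writes over a 5 MB/s link
# between sites 1,000 km apart.
print(f"{mirror_gap_mb(10.0, 5.0, 1000, 60):.1f} MB at risk")
```

In this example nearly all of the gap comes from the bandwidth shortfall, not the distance, which suggests that link speed is the first variable to price when weighing lost transactions against cost.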
The above also applies to mirroring arrangements involving SANs. With a SAN, data can be routed to multiple targets via a switch. The major expenses involved include the cost for an identical (or compatible) storage infrastructure at a remote location and the bandwidth used in the solution. A growing number of recovery vendors and SSPs have introduced a menu of service offerings that provide a certain time to data at a specified price.
Not to be ignored are Web-based hosting services, the so-called new-age data centers. Increasingly, these organizations are leveraging their multilocation infrastructures, interconnected by core carrier networks, as a panacea for companies that need high-availability storage- and facility-recovery strategies. If your IT infrastructure uses a rack-mount data-center model, a Web-hosting company may be able to fulfill all or part of your recovery strategy with mirrored failover services.
Mirroring, it should be stressed, is not a replacement for tape, contrary to the position adopted by some array vendors. Mirroring is prone to data downtime, particularly from corruption wrought by malicious and misbehaving software. When erroneous data is written to one mirror, it is replicated on another. Without some means of restoring data to a precorrupted form, such as via tape backups, a mirroring strategy may not provide the airtight storage-recovery mechanism that business-continuity planners expect.
No Panaceas
Mirroring is not a comprehensive solution in any environment in which multiple storage topologies are in play. Moreover, poorly managed data storage -- in which a significant volume of noncritical data (or replicated data) is mixed inextricably with critical data -- multiplies the cost of mirrored solutions.
In the end, a sound storage-recovery strategy must begin with a sound storage-management strategy. Companies need to take inventory of their storage, establish a migration path to a strategic storage infrastructure, and invest in the skills and technology required to put storage into a recoverable form. Even if your backup plan involves straightforward tape backups, coming up with a regular scheme and managing it actively are essential. (See our tape backup workshop, "The Hows and Whens of Tape Backup.")
In the past, contingency planners were presented with a problem and told to do their best with the hand of cards they were dealt. In the modern world of storage, contingency planning and recoverability requirements must be factored into the design process.
Jon William Toigo is an independent consultant and author of 10 books, including The Holy Grail of Data Storage Management and Disaster Recovery Planning (second edition, both from Prentice Hall PTR, 2000). Send your comments on this article to him at jtoigo@intnet.net.