Don't Skip A Beat

It's no longer enough to have a backup solution and a business continuity plan. The newest goal for disaster recovery is to provide the exact level of recovery each application needs.

April 1, 2005

Comair, a subsidiary of Delta Air Lines, knows the toll that inadequate disaster recovery technology can take. When the airline's flight assignment system couldn't handle the thousands of weather delays and cancellations that came in on Christmas Day, a computer error occurred that forced the company to cancel all 1,100 of its flights, leaving 30,000 passengers stranded in 118 cities, says Fred Cohen, an analyst at the Burton Group. Episodes like this, coupled with the possibility that a terrorist attack, power grid failure, hurricane, volcanic eruption, flood, or computer failure could strike at any moment, are making network managers all over the country reconsider their business continuity infrastructure.

Fortunately, technology is rising to the occasion. Replication and backup technologies are converging to make sophisticated recovery management doable beyond the mainframe world. In the near future, software will help you calculate the cost of downtime for any application or department and apply that information toward replication and backup decisions. For example, you might decide to protect a business-critical application that needs to be up and running within three minutes with state-of-the-art replication, whereas a less-critical application that can be down for an hour without significant loss can be protected with less-expensive backup.

For those who can't afford full-scale server replication or the services of providers who host them, two relatively new technologies--replication appliances and virtual servers--are bringing lower-cost replication to Windows and Unix environments. Replication appliances allow specific volumes of data to be replicated across a network, rather than replicating the entire storage system itself. Virtual servers provide a hardware-independent and potentially more affordable mechanism for handling replication.

On the traditional backup side of business continuity, technology improvements are bringing better performance, easier maintenance, and greater user-friendliness. The falling price of disk is letting companies of all sizes move their backups from tape to disk, which speeds recovery and gives IT a break by letting users recover their own files.

The newest and least-proven business continuity technology is workflow or business process automation. This promises to help IT managers map all the steps required to recover systems and execute those steps in a crisis.

With technology solutions for business continuity becoming more sophisticated and affordable, it's now more a matter of deciding which pieces best fit your IT infrastructure.

SPEEDY RECOVERIES

A business continuity strategy once meant backing up your e-mail and file servers. Today, companies that have been burned by power or hardware outages are looking for extremely high availability and lightning-fast recovery times. Storage vendors are responding.

"We're moving our conversations from backup windows and data protection to a focus on recovery management service levels," says Sheila Childs, chairperson of the Storage Networking Industry Association (SNIA)'s Data Management Forum. This means letting customers set specific recovery time and recovery point objectives, then using the right mix of replication and backup technologies to make those objectives reachable. A brokerage firm's trading system might have a recovery time objective of two minutes and thus require a replica, for example, whereas a marketing program might call for a one-hour recovery time, in which case disk-based backup might suffice.

Recovery management capabilities have existed on mainframes for some time, but enterprise recovery management software is just about to enter the open systems world. Today, there's no single software program that will let you view and set recovery objectives across every application and system in your network, although some software backup products will allow point-in-time recovery. Childs predicts that in the future, there'll be a universal console for monitoring and altering system recovery time expectations. Industry statistics are available to help calculate the cost of system outages. For instance, according to the Yankee Group, an hour of downtime costs $4.5 million to a brokerage firm, $2.6 million to a bank, and $1.2 million to a major media firm.

The key to fast recovery is, of course, replication. Whereas replication used to be a luxury reserved for high-end data centers, replication appliances are making it easier and more affordable for small and mid-sized organizations. When the primary storage array in California-based Cuesta College's 2TB SAN failed, its replication appliance--FalconStor's IPStor--kicked in and redirected storage traffic to other disk arrays, automatically promoting mirrored disks to primary status. There was no change in server performance, and no data was lost.

Replication appliances like FalconStor's provide replication over existing LAN and WAN networks and cost around $10,000 to $100,000 per site. Other replication appliances on this level include McData's MirrorStore and UltraNet, and Xiotech's TimeScale.

CONTINUOUS DATA PROTECTION

The next best thing to replication for meeting recovery management service levels is Continuous Data Protection (CDP), a time-addressable form of backup that not only makes continuous copies of every transaction, but can also recover a system to any point in time, including milliseconds ago. CDP enables rapid restoration by maintaining synchronization points and preserving every write made against protected files.
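As a rough mental model of that time-addressable idea (a toy sketch, not Revivio's or any other vendor's implementation), a CDP engine can be pictured as a timestamped journal of writes that is replayed up to any chosen instant:

```python
from typing import Dict, List, Tuple

class TinyCDPJournal:
    """Toy continuous-data-protection journal: every write is kept with its
    timestamp, so a volume can be rebuilt as of any moment in the past."""

    def __init__(self) -> None:
        self._log: List[Tuple[float, int, bytes]] = []  # (timestamp, block, data), time-ordered

    def write(self, ts: float, block: int, data: bytes) -> None:
        self._log.append((ts, block, data))

    def restore_as_of(self, ts: float) -> Dict[int, bytes]:
        """Replay every write at or before ts to produce a point-in-time image."""
        image: Dict[int, bytes] = {}
        for when, block, data in self._log:
            if when > ts:
                break               # log is time-ordered, so stop at the cutoff
            image[block] = data     # later writes overwrite earlier ones
        return image

journal = TinyCDPJournal()
journal.write(1.0, block=7, data=b"good")
journal.write(2.5, block=7, data=b"corrupted")
print(journal.restore_as_of(2.0))   # {7: b'good'} -- the state just before the bad write
```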

A notable product in the CDP category is Revivio's Continuous Protection System, which Forbes.com is using to restore applications, re-create data in the event of an archival backup failure, and automatically audit archival data and conduct forensic analysis of failure events. Other vendors in this space include Alacritus, Hitachi Data Systems, InMage Systems, Mendocino Software, Mimosa Systems, NetApp, TimeSpring, Scentric, Storactive, Sun Microsystems, and XOsoft.

An almost-CDP product coming out this summer is Microsoft's Data Protection Server. The server will make many copies of data and let users decide which ones they want to restore.

THIRD-PARTY HELP

For high-end disaster recovery, many companies turn to third-party service providers such as IBM, SunGard, or HP for help. For around $30,000 a month (depending on software, hardware, and service levels), these providers offer state-of-the-art, geographically remote facilities with temperature and humidity control, top-notch security, and redundant everything. For companies such as banks and brokerage firms that don't have geographic diversity but require high availability in the case of a disaster, such services make a lot of sense. If your building is ruined, you declare an emergency, then send key staff to the provider's facility with your backup tapes and whatever else you need to get your network going again. During the Florida hurricanes last year, many companies in the state relied on such services, with generally positive results.

A potential downside of third-party disaster recovery services is that in a large crisis, you're not the sole client. When a 6.8 earthquake shook Washington State in early 2001, Mutual of Enumclaw called its mainframe disaster recovery provider within five minutes of the event, only to find that six other large customers, including Microsoft, had declared ahead of it.

"We didn't think we had any damage, but we weren't sure. The building still had to be inspected, and we had evacuated it," recalls John Weeks, IT director at the insurance provider. As it turned out, Mutual of Enumclaw was able to run off its generator at a redundant site without skipping a beat.But when the company was scheduled to perform a disaster recovery test on September 17th that same year, the provider was tied up with 39 declarations it had gotten from huge Fortune 100 companies following the September 11th attack on the World Trade Center. Weeks was told he could come and do his test, but his staff wouldn't have any resources or support. These incidents were enough to cause Weeks to explore other options.

In general, companies that get the best results from such services put thought and care into the drafting and monitoring of the service level contract. There have been cases of clients who thought they were getting an exact, one-to-one duplicate of their hardware infrastructure, when the contract in fact stated that the provider would supply "equal or like" hardware. For custom, mission-critical code written down to the hardware level, "equal to" may not be good enough.

Other times, companies have added large numbers of new servers to their IT infrastructure without letting the service provider know or making the appropriate changes to the contract. It's helpful to spend time with the provider discussing disaster scenarios and stepping through exactly what will take place and what technology will be available before signing or renewing a contract.

VIRTUALIZATION

Not everyone can afford replication or high-end disaster recovery services. An option that some companies are turning to is virtualization. Virtual servers can perform load balancing, make fuller use of existing servers, and provide disaster recovery at the same time. Virtualization has its roots in the mainframe world of the 1970s, but is new to Linux and Windows. (For more on virtualization, see "Linux Virtually Ready For the Data Center".)

Virtual servers have the advantage of being hardware-agnostic. Whereas changing the network card on your primary server usually means making the same change on your backup server, that's not so with a virtual server. And instead of having a one-to-one mapping between primary servers and secondary servers, virtual servers let you cluster physical servers to virtual servers in such a way that one secondary virtual server could, for example, support three physical primary servers.

Mutual of Enumclaw runs Citrix across its WAN, which includes 15 other offices in Washington, Idaho, Oregon, and Utah. The application resides on a virtual server farm composed of IBM xSeries servers that run VMware software. Virtualization has reduced the number of physical servers the company runs by about 35 percent. Disaster recovery is now provided by virtual servers, also running on fewer machines.

On the downside, having virtual servers within one location or nearby won't help in a large-area catastrophe such as a regional power grid failure. And depending on how you control the workload on the virtual servers, it's possible to overtax a physical server and cause all the virtual servers on it to crash. But with the right configuration, virtual servers can provide efficient disaster protection. You might, for example, have physical servers in Chicago and Tampa, each running two virtual machines, that fail over to each other. If one machine goes down, the other can handle the processing for both locations. Meanwhile, neither server is going to waste.
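A simplified sketch of that active-active setup follows. The host names, port, and crude TCP health check are assumptions for illustration; in practice the hypervisor's own high-availability tooling would handle this rather than a hand-rolled monitor.

```python
import socket
import time

# Hypothetical sites; in the example above these would be Chicago and Tampa.
SITES = {
    "chicago": {"host": "esx-chicago.example.com", "port": 443, "vms": ["app-vm1", "app-vm2"]},
    "tampa":   {"host": "esx-tampa.example.com",   "port": 443, "vms": ["app-vm3", "app-vm4"]},
}

def site_is_up(host: str, port: int, timeout: float = 3.0) -> bool:
    """Crude health check: can we open a TCP connection to the site's hypervisor?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_and_fail_over() -> None:
    for name, site in SITES.items():
        if not site_is_up(site["host"], site["port"]):
            peer = next(n for n in SITES if n != name)   # the surviving site
            for vm in site["vms"]:
                # A real monitor would call the hypervisor API here.
                print(f"{name} unreachable: would restart {vm} at {peer}")

if __name__ == "__main__":
    while True:
        check_and_fail_over()
        time.sleep(30)   # poll every 30 seconds
```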

SELF-SERVICE RECOVERY

While virtualization is providing a new option for disaster recovery at the server level, laptops, PCs, and non-mission-critical applications still rely on ordinary backup. The good news is that the latest backup tools are replacing tape with inexpensive disk and letting end users recover their own files. Both improvements should save IT time and labor.

At Telekurs Financial, about 50 percent of the company's critical information resides on the laptops of offsite developers and sales representatives in the field. A hosted application from storage software provider Connected backs that data up on its servers in Natick, MA. Users can restore whichever files they choose. At least 25 percent of Telekurs' users, including the CEO, CTO, and CFO, have had some sort of personal disaster, such as a virus or a hard disk crash, and have recovered their own data from Connected.

BUSINESS PROCESS MANAGEMENT

Whether you use virtualization or replication in your business continuity efforts, you might want to consider a new technology for managing the IT processes required to get the organization running again. Business process management software lets IT managers draw a process map of everything that must be done in an emergency. This workflow should theoretically kick into gear when catastrophe occurs. Once again, this technology already exists for mainframes--a notable example being IBM's Geographically Dispersed Parallel Sysplex, which provides high-end business continuity--but it's new to Linux and Windows.
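To show the shape of such a process map, here is a sketch with made-up step names; real products like Optinuity's C20 and IBM's GDPS are far more elaborate. Recovery steps are modeled as tasks with dependencies that the software executes in order:

```python
from typing import Callable, Dict, List, Set, Tuple

class RecoveryPlan:
    """Toy workflow engine: each step declares the steps it depends on,
    and run() executes every step in dependency order."""

    def __init__(self) -> None:
        self._steps: Dict[str, Tuple[List[str], Callable[[], None]]] = {}

    def step(self, name: str, depends_on: List[str], action: Callable[[], None]) -> None:
        self._steps[name] = (depends_on, action)

    def run(self) -> None:
        done: Set[str] = set()

        def execute(name: str) -> None:
            if name in done:
                return
            deps, action = self._steps[name]
            for dep in deps:
                execute(dep)        # run prerequisites first
            print(f"running: {name}")
            action()
            done.add(name)

        for name in self._steps:
            execute(name)

# Hypothetical runbook, loosely mirroring the tasks described above.
plan = RecoveryPlan()
plan.step("shut down primary applications", [], lambda: None)
plan.step("fail over to backup site", ["shut down primary applications"], lambda: None)
plan.step("restore data", ["fail over to backup site"], lambda: None)
plan.step("run integrity checks", ["restore data"], lambda: None)
plan.step("restart applications", ["run integrity checks"], lambda: None)
plan.run()
```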

Credit Agricole, France's largest bank, operates in 60 countries, with its IT department spread over three locations. The bank bought C20 software from software start-up Optinuity to automate production tasks, such as the controlled shutdown and restart of all major applications, including backup and failover site procedures used to ensure service continuity. Now the bank says one person can manage the whole disaster recovery process, whereas before it required 10 specialists in the various platforms to rebuild the software environment in the backup site, restore data, and run integrity checks.

One challenge to this type of software is that it requires IT staff to document their jobs, roles, and responsibilities, so issues of unwillingness and lack of time for such forward-thinking projects may crop up. But properly created and executed, it could serve as a living, active IT handbook that helps make sure nothing falls through the cracks when disaster strikes.

FUTURE PROJECTS

Most people contacted for this story feel there's more work to do on disaster recovery. "We treat it as a life insurance policy, and we've not done justice to the infrastructure," says Mutual of Enumclaw's Weeks. "We're finding that to be adequately prepared, typically you have to spend 15 to 20 percent of your non-labor IT budget on a disaster recovery solution. We spend around 3 to 5 percent."

Wally Bedoe, vice president of operations at Telekurs, estimates that about 95 percent of his company's data is being protected. He's looking into Connected's server backup software for protecting centralized servers that aren't mirrored--those for the intranets, for example. He's also rolling out a disaster recovery Web site this year that will provide clients with emergency information such as who to call in a disaster.

Network Magazine readers who responded to a recent survey on disaster recovery said they would slightly increase their disaster recovery spending in 2005. Over the next three years, they plan to buy more backup storage, cell phones, emergency contact databases, and mirrored servers, as well as establish remote data centers.

A promising technology to look out for as you plan ahead is storage grids. "Vendors are starting to say this is their future," says Randy Kerns, senior partner at the Evaluator Group. "The idea behind the storage grid is to federate a lot of granular storage systems and have them replicate data amongst themselves so that you never have an exposure. Then you add intelligence into the grid to control that replication. We'll see this develop over the next two to three years." HP and NetApp are working on storage grid offerings.

Senior Technology Editor Penny Lunt Crosman can be reached at [email protected].

Disaster Recovery Tips

There's much to consider about business continuity and how to handle people, processes, and technology when storms, terrorists, or aggressive spambots strike. Experts and survivors offer a few suggestions to help your recovery effort:

Have a well-thought-out plan. While most network managers are already overworked, certain basics must be defined, documented, reviewed, and practiced, says John Medaska, vice president at Relational Technology Services, an IT consulting firm. If the data center goes down, which executives need to be notified? Who's responsible for finding the tapes and getting systems backed up? Which servers need to be restored first? Such procedures should be determined with the feedback and support of C-level executives, written down, and kept in more than one safe place. The plan should also be kept up-to-date with a current inventory of all hardware and software.

Don't wait to declare an emergency. When the first hurricane hit Florida last year, says Peggy Pinkerton, PLC systems manager at Proudfoot Consulting in Palm Beach Gardens, "All that happened was that we lost power. We expected to get it back within a day or two, but it took much longer than we thought." In fact, the company's servers were down for a couple of days. Lesson learned: Don't wait for things to improve--declare at the point of disaster.

Plan for the worst. The 2004 hurricanes affected several states, as did the September 11th attack and other disasters of the last few years. Having a redundant data center in the same city or state is no longer enough protection. Replication and other disaster recovery technologies should ideally be located 200 miles or more away and tested at least every six months. Employees should know who to call and where to report if your company's building is destroyed.

The Trouble With Tape

Many network and IT managers are disillusioned with tape because it's unreliable and cumbersome. Ian Butler, technical analyst for the University of Oklahoma's Biological Survey research department, is a prime example. As at many colleges and small businesses, professors and researchers at the university keep files and data on PCs and laptops, not always bothering to save to a network server or perform disciplined backups. Network servers used to be backed up with a single 8mm tape drive, but two tape disasters made Butler sick of tape: First, a disk holding a configuration file that hadn't been backed up crashed; the data was eventually restored, but only with considerable effort. Then a SCSI failure on a tape drive mangled the tape.

As data volumes grew, the drives needed to be fed more and more tapes. Butler had many other things to do and needed a more hands-off solution. He settled on two Adaptec Snap Servers--a 4400 and a 4500--in a RAID 5 configuration with one-way mirroring. They were each up and running within a few minutes. Users could use their normal Windows logins, and network users appeared in the configuration immediately.

While the Snap Servers haven't been completely trouble-free, their problems are much easier to fix. When a disk crashed recently, a hot spare filled in. When a fan on the other Snap died, Adaptec sent a new one (the university purchased support along with the hardware).
