Understanding Continuous Data Protection

A critical element of total backup systems, CDP products can help you find that needle in your data haystack. More importantly, they offer restoration capabilities that tape, replication and snapshot

June 23, 2005


Replication--saving changes on a regular or even near-constant basis to files on a separate local or remote array, using products such as Veritas Replication Exec and Computer Associates BrightStor High Availability--has become standard. If you lose a disk, replication will get you up and running quickly; and in the event your data center burns down, you'll have an up-to-the-minute backup if you replicated to a remote site. But replication has some drawbacks. If someone overwrites a file or a virus infects your data center, replication systems will copy these unwanted changes along with everything else, and your backup copy will be corrupted too. And if you want to restore a file to a certain state, replication is useless. It holds the file in its current state.


[Figure: Continuous Data Protection Vendors]

Snapshots have been growing in popularity, with disk-array vendors EMC and Adaptec leading the charge. Snapshots are regular, incremental backups to disk. Some snapshot implementations copy the entire area being backed up each time, though most newer ones copy only the changed data. A single disk can hold many snapshots, so you can choose the newest, uncorrupted version for your restore. But with a snapshot, as with tape, restores occur at the file or volume level. Also, a valid snapshot of the data before corruption may not be available when you need it, and after you restore a file to precorrupt status, you lose the changes made since the last snapshot was taken.
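To make the two limitations concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the dictionaries standing in for arrays, the `write` and `take_snapshot` helpers and the file name are hypothetical and do not reflect how any vendor's product works.

```python
primary = {}      # live data on the production array
replica = {}      # continuously replicated copy
snapshots = []    # (timestamp, frozen copy of primary)


def write(name, content):
    """Write to the primary; replication mirrors the write immediately."""
    primary[name] = content
    replica[name] = content


def take_snapshot(timestamp):
    """Freeze a point-in-time copy of the primary."""
    snapshots.append((timestamp, dict(primary)))


write("report.txt", "draft 1")
take_snapshot(timestamp=1)
write("report.txt", "draft 2")      # legitimate edit after the snapshot
write("report.txt", "GARBAGE")      # accidental overwrite or virus

latest_snapshot = snapshots[-1][1]
print(replica["report.txt"])          # the replica mirrors the bad write
print(latest_snapshot["report.txt"])  # the snapshot predates "draft 2"
```

The replica faithfully carries the bad write, and the only snapshot on hand was taken before "draft 2" existed, so restoring from it gives up that edit.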

CDP to the Rescue

For a total storage solution, we must retain information about every change to a file over its life and be able to restore any version of that file. That's the promise of continuous data protection. Toward that end, the Storage Networking Industry Association has put together a working group to develop its Data Protection Initiative, a set of standards for CDP and other disk-to-disk technologies.

CDP products let you go back (scroll back or rewind in vendor-speak) to any version of a file that's been saved. Restores aren't limited to certain times or directories. In fact, some CDP products let you restore individual e-mail messages or database transactions. You can choose to restore only the data that has been lost or corrupted, and restore it to the exact version you need by selecting from a list of date/time modifications.

Sounds appealing, doesn't it? It is, yet CDP, too, comes with a cost and should be considered just one component of a total backup system.

CDP requires a repository where all file changes can be maintained. Once the same file has been updated many times, restorations are bound to slow down, because the system applies all the changes ever made to the base document. Some CDP implementations require agents on specific applications; it's far easier to get to the single-message restore level if you have an agent on the mail server that sees each message as it is changed within the e-mail system.
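A rough Python sketch shows why restore time tracks the number of changes. It assumes a repository that stores a base copy plus a time-ordered log of byte-range writes; the `Change` and `FileJournal` names, the integer timestamps and the byte-patch format are all hypothetical, not any product's actual design.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    timestamp: int   # when the write happened (hypothetical integer clock)
    offset: int      # byte offset within the file
    data: bytes      # bytes written at that offset


@dataclass
class FileJournal:
    baseline: bytes                               # contents when tracking began
    changes: list = field(default_factory=list)   # time-ordered Change records

    def record(self, timestamp, offset, data):
        """Append a write to the journal; timestamps must not go backwards."""
        if self.changes and timestamp < self.changes[-1].timestamp:
            raise ValueError("changes must be recorded in time order")
        self.changes.append(Change(timestamp, offset, data))

    def restore(self, as_of):
        """Rebuild the file as of time `as_of`.

        Returns the contents and the number of changes replayed.
        """
        content = bytearray(self.baseline)
        replayed = 0
        for change in self.changes:
            if change.timestamp > as_of:
                break
            end = change.offset + len(change.data)
            if end > len(content):
                # Pad with zero bytes if the write extends the file.
                content.extend(b"\x00" * (end - len(content)))
            content[change.offset:end] = change.data
            replayed += 1
        return bytes(content), replayed


journal = FileJournal(baseline=b"hello world")
journal.record(10, 0, b"HELLO")
journal.record(20, 6, b"WORLD")
journal.record(30, 0, b"XXXXX")   # the unwanted overwrite

good, replayed = journal.restore(as_of=20)
print(good, replayed)
```

Restoring to time 20 replays two changes and skips the overwrite at time 30. In this model the work done by `restore` grows with the number of logged changes up to the chosen point, which is the slowdown described above.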

CDP is good for protecting a limited subset of transactional applications, but what if you need to restore a single e-mail message or database entry instead of the entire file set? With CDP for e-mail and databases, you can select a message that was accidentally deleted and restore it, or you can bring back a single database table to the state it was in at a specific time on a specific day. To determine if CDP fits your needs, the best question to ask is, "Will I ever need to restore a single piece of this data?"

CDP is ideal for retrieving a file or piece of data, but you wouldn't want to use it for a complete system or volume restoration. The time to restore is based on the number of modifications--the more changes that have occurred, the longer the restore will take. Most CDP systems let you say "baseline from this date," which equates to committing all tracked changes to a file. Once the baseline file is updated, it becomes the new baseline. This functionality can be automated to keep restore times down. But creating these new baselines reintroduces the question most common to snapshots: How far back should you maintain the file? Most IT shops stick to the time lines tape rotations provide, but the flexibility of CDP and the ever-falling cost of storage will likely make it possible to store more data for longer periods of time. Balancing restore times with length of historical tracking is sure to become a hot CDP topic.

The idea of restoring to any point in time is so appealing that everyone seems to be jumping into the CDP arena. Conventional backup software vendors such as Veritas, replication vendors like FalconStor, start-ups such as Revivio, and even Microsoft--which announced a servers-only replication and CDP product named Data Protection Manager for a 2006 release--have announced or shipped CDP products.
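Returning to the "baseline from this date" operation described above, the trade-off between restore time and history can be sketched in a few lines of Python. This is an illustration under assumptions: changes are modeled as hypothetical `(timestamp, offset, data)` tuples already sorted by time, and `apply_changes` and `rebaseline` are names invented here rather than any vendor's API.

```python
def apply_changes(baseline, changes):
    """Replay time-ordered (timestamp, offset, data) writes onto a baseline."""
    content = bytearray(baseline)
    for _timestamp, offset, data in changes:
        end = offset + len(data)
        if end > len(content):
            content.extend(b"\x00" * (end - len(content)))
        content[offset:end] = data
    return bytes(content)


def rebaseline(baseline, changes, cutoff):
    """Commit every change at or before `cutoff` into a new baseline.

    Returns (new_baseline, remaining_changes). Versions older than
    `cutoff` can no longer be restored once the old log is discarded.
    """
    committed = [c for c in changes if c[0] <= cutoff]
    remaining = [c for c in changes if c[0] > cutoff]
    return apply_changes(baseline, committed), remaining


baseline = b"aaaa"
changes = [(1, 0, b"b"), (2, 1, b"c"), (3, 2, b"d"), (4, 3, b"e")]

new_baseline, remaining = rebaseline(baseline, changes, cutoff=3)
print(new_baseline, remaining)
```

After the call, a restore to the latest state replays one change instead of four and produces the same bytes, but the states at times 1 and 2 are gone. That is the history-versus-speed balance in miniature.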

Vendors have implemented CDP in three different ways. The first and most common is what we'll call file-replication CDP, though vendors also call it file-level CDP. A file-replication CDP system watches the drive and logs all changes to files as they occur or shortly thereafter. This approach requires no special agents or drivers for individual applications whose files you wish to support. Single files are easily distinguished in the repository and during restores.

However, file-replication CDP is relatively dumb. It knows nothing about database log files or commits, and you risk putting your database or other transactional applications in an unstable state upon restore. Your database and e-mail administrators may have to reapply logs to restore the application to the point at which it became corrupt. To be fair, most backup technologies have similar drawbacks.

The second approach is application-aware CDP, which requires an agent on an application or host. That agent has deep knowledge of the application's file structures. Databases, e-mail servers and even Microsoft Office documents are common targets of application-aware CDP agents, whose goal is to increase the granularity of restores beyond the file level, down to the data level. These agents let you restore just a part of a file--say, a single deleted e-mail message or the deleted mailbox of a terminated employee. And why restore an entire 200-GB Oracle database if all you need is one little 40-row reference table?
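As a rough illustration of data-level granularity, the Python sketch below assumes an agent that logs each mailbox operation as a hypothetical `(timestamp, operation, message_id, body)` tuple. The event format and the `message_as_of` helper are invented for this example and are not taken from any shipping agent.

```python
def message_as_of(events, message_id, as_of):
    """Return one message's body as of time `as_of`, or None if absent then.

    `events` must be sorted by timestamp.
    """
    body = None
    for timestamp, op, mid, payload in events:
        if timestamp > as_of:
            break
        if mid != message_id:
            continue            # other messages are never touched
        if op == "put":
            body = payload
        elif op == "delete":
            body = None
    return body


events = [
    (1, "put", "msg-1", "Quarterly numbers attached"),
    (2, "put", "msg-2", "Lunch on Friday?"),
    (3, "delete", "msg-1", None),
    (4, "put", "msg-3", "Re: Lunch on Friday?"),
]

recovered = message_as_of(events, "msg-1", as_of=2)
print(recovered)
```

Only the events for the requested message are applied, so the deleted message comes back without rolling the rest of the mailbox back to time 2.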

Application-aware CDP agents, however, are this technology's strength and weakness. Adding an agent that will do something each time a transaction occurs implies a performance hit. We haven't done a comparative review of these products yet, but our initial experiences have caused us to worry about restoration speed in an environment that's updating hundreds of thousands of database rows per hour. Because the data must be re-created by applying changes in the order they occurred before the restore can be completed, we expect restore times to rise rapidly in a higher-performance environment.

A third alternative is block-replication CDP. This type of CDP knows nothing about files or directory structures and just logs all changes to the disk image as they occur. Block-replication CDP is a decent solution for bare-metal restores, but is little more than mirroring that can be rolled back. We don't expect this approach to last very long because it's not granular enough. Entire volumes are protected, but there's no provision for protecting a file or directory. Block-replication CDP will probably be rolled into implementations that use file-replication and application-aware CDP for disaster recovery, but because of the volume of change that generally occurs on an entire volume, we don't expect customers to use it much. Replication and snapshots are much better at this type of protection.
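"Mirroring that can be rolled back" can be sketched as a volume that saves the previous contents of each block before overwriting it. The Python below is a simplified model under assumptions: the `BlockVolume` class, its tiny four-byte blocks and its undo log are hypothetical and chosen only to keep the example short.

```python
class BlockVolume:
    def __init__(self, block_count, block_size=4):
        self.block_size = block_size
        self.blocks = [bytes(block_size)] * block_count   # zero-filled volume
        self.undo_log = []   # (timestamp, block_number, previous_contents)

    def write(self, timestamp, block_number, data):
        """Overwrite one block, remembering what it held before."""
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block")
        self.undo_log.append((timestamp, block_number, self.blocks[block_number]))
        self.blocks[block_number] = data

    def rollback(self, to_time):
        """Undo every write made after `to_time`, newest first."""
        while self.undo_log and self.undo_log[-1][0] > to_time:
            _timestamp, block_number, previous = self.undo_log.pop()
            self.blocks[block_number] = previous


volume = BlockVolume(block_count=3)
volume.write(1, 0, b"AAAA")
volume.write(2, 1, b"BBBB")
volume.write(3, 0, b"ZZZZ")   # unwanted write

volume.rollback(to_time=2)
print(volume.blocks)
```

The rollback knows only block numbers. It returns the whole volume to its state at time 2 and has no way to single out one file or directory, which is the granularity problem noted above.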

Where is the future of CDP? Eventually, all competitive markets follow the desires of customers, who want a combination of application-aware CDP for their transactional systems and file-level CDP for most other data. Most vendors we've interviewed still expect the market to move one way or the other, or for these two technologies to stay in totally separate markets: enterprise-class application-aware CDP and small-business file-level CDP. We, on the other hand, believe these markets will converge over time. If the cost of processing power and raw storage continues its downward trend, CDP could even wind up on the desktop. How nice it would be to be able to restore word processing and spreadsheet files for any user at any time to any state.

Appliance or Application?

As is true with many relatively new technologies, most CDP products today are sold as appliances. Vendors market their products this way because the repository and all the necessary supporting software can be sold as a package on commoditized hardware. Customers can't install the software on slow machines, and the vendor need not support multiple databases for repositories. FalconStor chose this approach for its IPStore with the TimeMark Option. However, a few CDP products, such as XOSoft's Enterprise Rewinder, are sold as software that you install on your servers--some with a built-in repository database, some using your databases for a repository.

Eventually, CDP may even be integrated into SAN fabrics--either with a side-armed appliance, as Revivio's Continuous Protection System already does, or directly in the switching fabric. CDP requires hefty processing power, though, and intelligent SAN fabric hardware doesn't yet have that level of processing capability built in.

The Big Picture

CDP products offer a level of granularity in backup and restore not available a few years ago. The amount of processing power required and the complexity of ensuring correct file restores from logs of changes made the initial implementations difficult, but there are enough products in use every day that the category is coming of age. Most disk-to-disk backup solutions provide quick restores, and frequent snapshots can offer point-in-time restores. CDP brings recovery to the next level: true file-level and data-level restoration.

Our biggest concern is the processing power required to restore multiple files saved many times since the baseline image was taken. Vendors have taken steps to minimize this time commitment or give you more control over it, but this is a function of the CDP architecture. Make certain you know what you're getting, and how long it will take to perform a restore in a large repository with frequently modified files. We recommend applying CDP only to data you need highly granular restores for--most notably, databases and e-mail. Tape is still best for archiving, replication or snapshots are best for disaster recovery, and CDP is best for point-in-time or "best known version" data item, file or folder restores.

Don MacVittie is a senior technology editor at Network Computing. Previously he worked at WPS Resources as an application engineer. Write to him at [email protected].
