Ever-growing storage farms, shrinking backup windows, and increasing demand for nearly instantaneous data restores have led to some impressive innovation in the past half decade. Four short years ago, disk-to-disk backup was the hot new technology. The falling price and rising quality of commodity disk arrays made backing up to disk cost-effective, and the advantages, especially on the restore side, were obvious.
Into this hot market in 2004 came a startup called Data Domain, making what seemed to be outlandish claims that its DD400 appliance could not only hold users' backups but also achieve data reduction ratios of 10- or 20-to-1. Users, accustomed to getting less than the 2-to-1 data compression that tape drive vendors had been promising for years, had a hard time believing Data Domain's claim.
Fast-forward to the present, and data deduplication has become de rigueur for backup targets. With the possible exception of high-end virtual tape libraries (VTLs) for petabyte data centers, where high performance trumps capacity, the question isn't "Does your backup target deduplicate data?" but "How does your backup target deduplicate data?"
Deduplication technology also is escaping the backup ghetto, spreading to other applications, from archiving to primary storage and WAN optimization. Before moving on to those state-of-the-art applications, let's take a look at how backup data gets duplicated in the first place.
Duplicate data makes its way into your backup data store in two primary ways. Data gets duplicated in the temporal realm as the same files are backed up repeatedly over time. This could be identical copies of an unchanged file held in each of the four weekly full backups that a vault with a 30-day retention policy retains. It might also be the first 900 MB of the 1-GB mailbox file that stores your CEO's e-mail. Since she receives new mail every day, the file changes daily, and the nightly incremental backup makes a new copy of it every night. Most of the file remains unchanged, but the whole file is backed up anyway.
Data also can be duplicated in the geographical realm. If you back up the system drive of 50 Windows servers, you now have 50 copies of the Windows server distribution in your backup vault taking up space. At the subfile level, think about how many copies of your company's logo are embedded within the hundreds of thousands of memos, letters, and other documents that fill your file servers.
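To make the two kinds of duplication concrete, here is a minimal Python sketch of a chunk-level deduplicating store. It is an illustration under stated assumptions, not any vendor's method: the `ChunkStore` class, the `make_block` helper, the fixed 4-KB chunk size, and the scaled-down mailbox and system-image sizes are all hypothetical. Commercial products may use variable-size chunks and very different indexing.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size, chosen only for illustration


def make_block(seed: int) -> bytes:
    """Build one deterministic CHUNK_SIZE-byte block of stand-in data."""
    return hashlib.sha256(str(seed).encode()).digest() * (CHUNK_SIZE // 32)


class ChunkStore:
    """Toy deduplicating store: keeps one copy of each unique chunk."""

    def __init__(self):
        self.chunks = {}        # chunk digest -> chunk bytes
        self.logical_bytes = 0  # total bytes the backups asked us to store

    def backup(self, data: bytes) -> list:
        """Store data; return the ordered chunk digests needed to rebuild it."""
        recipe = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # keep only the first copy
            recipe.append(digest)
        self.logical_bytes += len(data)
        return recipe

    def restore(self, recipe: list) -> bytes:
        """Rebuild a backup from its list of chunk digests."""
        return b"".join(self.chunks[digest] for digest in recipe)

    def physical_bytes(self) -> int:
        """Bytes actually held after deduplication."""
        return sum(len(chunk) for chunk in self.chunks.values())


store = ChunkStore()

# Temporal duplication: a 100-chunk mailbox whose first 90 chunks
# are unchanged when it is backed up again the next night.
monday = b"".join(make_block(i) for i in range(100))
tuesday = monday[:90 * CHUNK_SIZE] + b"".join(make_block(i) for i in range(200, 210))
store.backup(monday)
tuesday_recipe = store.backup(tuesday)

# Geographical duplication: the same 20-chunk system image on 50 servers.
system_image = b"".join(make_block(i) for i in range(1000, 1020))
for _ in range(50):
    store.backup(system_image)

ratio = store.logical_bytes / store.physical_bytes()
print(f"logical={store.logical_bytes} physical={store.physical_bytes()} ratio={ratio:.1f}:1")
```

With these made-up inputs, 1,200 logical chunks collapse to 130 stored chunks, a reduction of roughly 9-to-1. Fixed-size chunking works here only because the unchanged mail sits at the front of the file; an insertion in the middle would shift every later chunk boundary, which is one reason real systems often chunk by content instead.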
While data deduplication is generally accepted as valuable technology, there are still a few debates over how and where it should be applied, and none is as heated as the in-line vs. post-process debate. In-line devices, such as those from Data Domain or VTLs based on Diligent Technologies' ProtecTier software, process and deduplicate data in real time, storing only deduplicated data. Since deduplication takes a lot of compute power, the rate at which these devices can deduplicate data caps how fast they can accept backups.
In addition, an in-line device begins replicating deduplicated data to a remote site for disaster recovery while it is still accepting additional backup data. A post-process device, which first stores backups in raw form and deduplicates them afterward, could delay replication for several hours. Post-processing also could reduce performance or otherwise complicate spooling secondary copies off to tape.
The nightmare scenario for post-processing is one in which deduplication is too slow to finish processing last night's backups before today's backups begin. The system would then need even more space to hold today's backups until they can be deduped, and it will run out of space unless the deduplication process catches up.
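The arithmetic behind that scenario can be sketched in a few lines of Python. This is a rough model, not a description of any product: the `staging_backlog` function, its parameters, and the terabyte figures are hypothetical, and the model assumes the same backup volume and deduplication rate every day.

```python
def staging_backlog(nightly_backup_tb, dedupe_tb_per_day, staging_capacity_tb, days):
    """Return the un-deduplicated backlog (TB) left at the end of each day.

    Raises RuntimeError when raw backups no longer fit in the staging area.
    """
    backlog = 0.0
    history = []
    for day in range(1, days + 1):
        backlog += nightly_backup_tb  # tonight's backups land in raw form
        if backlog > staging_capacity_tb:
            raise RuntimeError(f"staging area full on day {day}")
        # The post-process pass works through as much as it can before nightfall.
        backlog = max(0.0, backlog - dedupe_tb_per_day)
        history.append(backlog)
    return history


# Deduplication keeps up: the backlog is cleared every day.
print(staging_backlog(10, 12, 20, 5))

# Deduplication falls 2 TB behind each day: the staging area eventually overflows.
try:
    staging_backlog(10, 8, 20, 10)
except RuntimeError as err:
    print(err)
```

In the second call, the backlog grows by 2 TB a day, so even a staging area twice the size of a nightly backup fills up within about a week.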
The question users have to ask when choosing between an in-line and a post-process approach is whether the in-line system is fast enough for their backups to complete within the backup window. If it is, the simplicity of the in-line approach is a clear winner. If not, post-processing may be easier than managing multiple deduplication domains on multiple devices.
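That question reduces to simple division, sketched below in Python. The function names, the decimal unit convention (1 GB = 1,000 MB), and the sample figures are assumptions for illustration; a real sizing exercise would use the sustained ingest rate measured on your own data.

```python
def backup_hours(data_gb: float, ingest_mb_per_s: float) -> float:
    """Hours needed to ingest data_gb at a sustained ingest_mb_per_s."""
    seconds = (data_gb * 1000) / ingest_mb_per_s  # decimal units: 1 GB = 1,000 MB
    return seconds / 3600


def fits_window(data_gb: float, ingest_mb_per_s: float, window_hours: float) -> bool:
    """True if the nightly backup finishes within the backup window."""
    return backup_hours(data_gb, ingest_mb_per_s) <= window_hours


# Hypothetical site: an 8-hour window and a device sustaining 300 MB per second.
for nightly_gb in (6000, 12000):
    hours = backup_hours(nightly_gb, 300)
    verdict = "fits" if fits_window(nightly_gb, 300, 8) else "does not fit"
    print(f"{nightly_gb} GB takes {hours:.1f} h and {verdict} in an 8-hour window")
```

If the answer comes back "does not fit," the choice is between a faster in-line device, several devices, or a post-process system that accepts data at full disk speed and deduplicates later.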
Today, higher-end in-line systems can ingest data at 200 Mbps to 400 Mbps per node. Data Domain and Quantum can build multinode clusters that are faster, but each node in the cluster represents its own deduplication domain, using more disk space and complicating management. Since the process is mostly CPU-bound, every time a new generation of processors comes out, vendors can boost performance by about 50%. Later this year, Diligent plans to release a two-node cluster version of ProtecTier that will share storage and act as a single deduplication domain while ingesting data at about 700 Mbps and eliminating the gateway as a single point of failure.
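Why separate deduplication domains use more disk space can be shown with a small Python sketch. It assumes, purely for illustration, that backups are spread round-robin across nodes and that each node knows only about the chunks it has stored itself; the `unique_chunks_stored` function and the chunk labels are hypothetical.

```python
def unique_chunks_stored(backups, num_domains):
    """Count chunks kept when backups are spread round-robin across
    independent deduplication domains that share no chunk index."""
    domains = [set() for _ in range(num_domains)]
    for i, backup in enumerate(backups):
        domains[i % num_domains].update(backup)  # dedupe only within one domain
    return sum(len(domain) for domain in domains)


# Four servers share the same OS and application chunks but have unique data.
backups = [{"os", "app", f"data{i}"} for i in range(4)]

print(unique_chunks_stored(backups, 1))  # single domain
print(unique_chunks_stored(backups, 2))  # two independent domains
```

A single domain keeps six chunks here, while two independent domains keep eight, because each node stores its own copy of the shared OS and application chunks.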