Source vs. Target Deduplication: Scale Matters
February 3, 2011
I had a nice conversation with the CEO of a backup software vendor, who shall remain nameless, at last week's Exec Event storage industry schmooze-fest. He asked why I thought target deduplication appliances like those from Data Domain, Quantum and Sepaton were still around. Why, he wondered, doesn't everyone shift to source deduplication since it's so much more elegant?
By running agents on the hosts, source deduplication leverages the CPU horsepower of all the hosts being backed up to do some of the heavy lifting inherent in data deduplication. This should reduce the CPU horsepower needed in the target system and thus hold down its cost. While all deduplication schemes minimize the disk space your backup data consumes, deduplicating at the source also minimizes the network bandwidth required to send backups from source to target, because duplicate data is identified before it ever leaves the host.
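To make the mechanics concrete, here's a minimal sketch of the source-side logic in Python. It isn't any vendor's actual protocol; it assumes a hypothetical target that can answer "which of these chunk hashes have you already seen?", and it uses simple fixed-size chunks where real products typically use variable-size, content-defined chunking.

```python
import hashlib

CHUNK_SIZE = 8 * 1024  # fixed-size chunks, for illustration only


def chunk_hashes(data):
    """Split the backup stream into chunks and fingerprint each one."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    return [(hashlib.sha256(c).hexdigest(), c) for c in chunks]


def backup(data, target_index, target_store):
    """Source-side dedupe: send only chunks the target hasn't seen.

    target_index and target_store stand in for the appliance's hash
    index and chunk store; a real agent would query them over the
    network rather than touching them directly.
    """
    recipe = []        # ordered list of hashes needed to rebuild this backup
    sent_bytes = 0
    for digest, chunk in chunk_hashes(data):
        if digest not in target_index:    # target has never seen this chunk
            target_store[digest] = chunk  # so ship it across the wire
            target_index.add(digest)
            sent_bytes += len(chunk)
        recipe.append(digest)             # duplicates cost only a hash
    return recipe, sent_bytes


# Example: back up the same data twice; the second pass sends no new chunks.
index, store = set(), {}
data = b"the quick brown fox " * 10_000
_, first_pass = backup(data, index, store)
_, second_pass = backup(data, index, store)
print(first_pass, second_pass)  # second_pass is 0 bytes transferred
```

The hashing and index lookups are exactly the work the agent offloads from the target, and only the `sent_bytes` ever cross the network, which is why the approach is attractive when bandwidth is the scarce resource.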
Since most branch offices run a single shift--leaving servers idle for a 12-hour backup window--and WAN bandwidth from the branch office to the data center comes dear, source deduplication is a great solution to the ROBO (remote office, branch office) backup problem.
As a result, and because of the generally abysmal state of ROBO backup at the time, early vendor marketing for source deduplication products such as EMC's Avamar and Symantec's PureDisk pitched them as ROBO solutions.
Source dedupe fits well wherever CPU cycles are available during the backup window. If bandwidth is constrained, such as in a virtual server host backing up 10 guests at a time, even better. Since it's just software, the price is usually right. And since vendors have started building source deduplication into the agents for their core enterprise backup solutions, users don't even need to junk NetWorker, Tivoli Storage Manager or NetBackup to dedupe at the source.

One major sticking point remaining is that most source deduplication systems can't hold as much data as a DD890 or other big honking backup appliance. By all accounts I can find, Avamar is the best-selling source deduplicating software package today. However, an Avamar data store can only grow to a 16-node RAIN (Redundant Array of Independent Nodes) with a total capacity of about 53TBytes (net after RAID but before deduplication), while a DD890, also from EMC, can hold about 300TBytes.
53TBytes of capacity, which at typical deduplication ratios of roughly 6:1 to 15:1 works out to 300TBytes to 800TBytes of actual backup data, is a lot of storage for most of us. Large enterprises with more data than that would have to create multiple repositories. Those repositories, in addition to being more work to manage, will reduce the deduplication rate, because each repository is an independent deduplication realm.
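A quick back-of-the-envelope sketch of that math, assuming (my numbers, not the vendors') deduplication ratios in the common 6:1 to 15:1 range:

```python
# Rough capacity math: how much logical backup data fits in a dedupe
# store of a given physical size, for an assumed range of dedupe ratios.
def protectable_tb(physical_tb, dedupe_ratio):
    """Logical backup data (TB) a store can hold at a given dedupe ratio."""
    return physical_tb * dedupe_ratio

AVAMAR_RAIN_TB = 53   # 16-node Avamar RAIN, net after RAID, per the figures above
DD890_TB = 300        # DD890 usable capacity cited above

for ratio in (6, 10, 15):  # assumed ratios; real ratios vary with data type
    print(f"{ratio}:1 dedupe -> Avamar RAIN ~{protectable_tb(AVAMAR_RAIN_TB, ratio):.0f} TB,"
          f" DD890 ~{protectable_tb(DD890_TB, ratio):.0f} TB of backups")
```

The same arithmetic shows why splitting data across multiple independent repositories hurts: a chunk already stored in one realm still has to be stored again in the other, so the effective ratio drops.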
Disclaimer: EMC, Quantum and Symantec are, or have been, clients of DeepStorage.net, of which I am founder and chief scientist. Of course, they're on both sides of this question.