Source Side De-duplication

Why don't we see source-side data de-duplication everywhere?

George Crump

October 31, 2008

2 Min Read

One of the best places to do data de-duplication is at its source. At least that seems to make a lot of sense when you draw it up on the whiteboard.

Think about it -- if you can de-duplicate data at its source, you can minimize the amount of data that travels across the network and still get the storage efficiencies of the target-side de-duplication solutions. Sounds brilliant. So why don't we see source-side data de-duplication everywhere?

The problem is that when you move technology from the whiteboard to the real world, things don't always translate. First, to pull this off you must either have this capability integrated into your backup application or, more likely, get a new backup application. This new application is essentially a fairly heavy agent that has to be installed on each server that needs protection. As part of the backup process, the agent scans its local server and compares the bytes/blocks it holds against the other component in the solution, the de-duped storage area.
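
In broad strokes, the agent's comparison works like the sketch below: chunk the data, fingerprint each chunk, and ship only the chunks the de-duped storage area has not seen before. This is an illustrative sketch only; the InMemoryDedupStore class, the fixed 64-Kbyte chunk size, and the SHA-256 fingerprint are assumptions for the example, not any vendor's actual protocol (real products typically use variable-size chunking and their own wire format).

```python
# Minimal sketch of source-side de-duplication (illustrative assumptions:
# fixed 64 KB chunks, SHA-256 fingerprints, an in-memory stand-in for the
# de-duped storage area).
import hashlib

CHUNK_SIZE = 64 * 1024  # illustrative fixed chunk size


class InMemoryDedupStore:
    """Stand-in for the de-duped storage area (illustrative only)."""

    def __init__(self):
        self.chunks = {}      # digest -> chunk bytes
        self.manifests = {}   # path -> ordered list of digests

    def has_chunk(self, digest):
        return digest in self.chunks

    def put_chunk(self, digest, chunk):
        self.chunks[digest] = chunk

    def append_to_manifest(self, path, digest):
        self.manifests.setdefault(path, []).append(digest)


def backup_file(path, store):
    """Ship only the chunks the store has not already seen."""
    sent = skipped = 0
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            digest = hashlib.sha256(chunk).hexdigest()
            if store.has_chunk(digest):
                skipped += 1  # only the reference crosses the network
            else:
                store.put_chunk(digest, chunk)
                sent += 1
            store.append_to_manifest(path, digest)
    return sent, skipped
```

The point of the design is visible in the `has_chunk` check: for data the target already holds, nothing but a small fingerprint travels over the network, which is where the bandwidth savings come from.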

Moving from one backup application to another is not a decision taken lightly, and it's not one you'll typically make just to pick up a single feature. Plus, the application you switch to must still provide all, or at least most, of the capabilities you have in your current application.

Another issue with source-side data de-duplication is the amount of client-side work that has to be done. Having the client run the comparison, especially in large data centers with high file counts and capacities, can cause problems. Specifically, the comparison process is slow: we have seen backup and recovery speeds of about 3 to 5 Mbytes/s, and there is a measurable impact on CPU utilization on the local host. Restore performance is hindered to about those same speeds (the back-of-the-envelope calculation below shows what those rates mean in practice).

Source-side de-duplication seems to be most often deployed in remote office backup configurations, where WAN bandwidth is constrained, or in VMware Inc. (NYSE: VMW) environments via integration with VMware Consolidated Backup. EMC Corp. (NYSE: EMC)'s Avamar, for example, does a good job of integrating with VMware's VCB, an environment that is almost tailor-made for data de-duplication.
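
Returning to those throughput figures, here is the quick calculation referenced above. The 3 to 5 Mbytes/s rates are the ones we observed; the 500-Gbyte data set size is a hypothetical figure chosen for illustration.

```python
# Back-of-the-envelope backup window at the 3-5 Mbytes/s observed above;
# the 500 GB data set size is an assumption for illustration.
def backup_hours(data_gb, mbytes_per_sec):
    return data_gb * 1024 / mbytes_per_sec / 3600

for rate in (3, 5):
    print(f"500 GB at {rate} Mbytes/s: {backup_hours(500, rate):.0f} hours")
# -> roughly 47 hours at 3 Mbytes/s and 28 hours at 5 Mbytes/s
```

Even at the high end of that range, a single modest server blows well past a weekend backup window, which is why these rates matter in large data centers.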

Source-side data de-duplication is often confused with or compared to Block Level Incremental Backup solutions, which we will talk about next.

George Crump is founder of Storage Switzerland, which provides strategic consulting and analysis to storage users, suppliers, and integrators. Prior to Storage Switzerland, he was CTO at one of the nation's largest integrators. Previous installments of his discussion on data de-duplication can be found here.
