Network Computing is part of the Informa Tech Division of Informa PLC


Source Side De-duplication

One of the best places to do data de-duplication is at its source. At least that seems to make a lot of sense when you draw it up on the whiteboard.

Think about it -- if you can de-duplicate data at its source, you can minimize the amount of data that travels across the network and still get the storage efficiencies of the target-side de-duplication solutions. Sounds brilliant. So why don't we see source-side data de-duplication everywhere?

The problem is that when you move technology from the whiteboard to the real world, things don't always translate. First, to pull this off you must either have this capability integrated into your backup application or, more likely, switch to a new backup application. This new backup application is essentially a fairly heavy agent that has to be installed on each server that needs protection. As part of the backup process, this agent scans its local server and compares the bytes or blocks it finds against what is already held by the other component in the solution, the de-duped storage area.
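The scan-and-compare step can be sketched in a few lines of Python. This is only an illustration, not how any particular product works: the fixed 4 KB block size, the SHA-256 fingerprints, and the names `DedupStore`, `backup`, and `restore` are all assumptions made for the example, and the "storage area" is just an in-memory dictionary.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; real products may chunk differently


class DedupStore:
    """Hypothetical in-memory stand-in for the de-duped storage area."""

    def __init__(self):
        self.blocks = {}  # fingerprint -> block bytes

    def missing(self, digests):
        """Return the fingerprints the store does not hold yet."""
        return {d for d in digests if d not in self.blocks}

    def put(self, digest, block):
        self.blocks[digest] = block


def split_blocks(data, size=BLOCK_SIZE):
    """Cut the data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def backup(data, store):
    """Agent side: fingerprint local blocks, send only the ones the store lacks.

    Returns (recipe, bytes_sent), where recipe is the ordered list of
    fingerprints needed to rebuild the data.
    """
    blocks = split_blocks(data)
    digests = [hashlib.sha256(b).hexdigest() for b in blocks]
    needed = store.missing(digests)  # the comparison against the store
    bytes_sent = 0
    for digest, block in zip(digests, blocks):
        if digest in needed:
            store.put(digest, block)   # only new blocks cross the "network"
            needed.discard(digest)     # do not resend a repeat within this run
            bytes_sent += len(block)
    return digests, bytes_sent


def restore(recipe, store):
    """Rebuild the original data from the ordered fingerprints."""
    return b"".join(store.blocks[d] for d in recipe)
```

Backing up the same data a second time sends nothing, which is the network saving the whiteboard promises. Note that every block is still read and hashed on the client on every run, and that work is where the client-side cost comes from.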

Moving from one backup application to another is not a decision to take lightly, and it is not one you would typically make just to pick up a single feature. Plus, the application you switch to must still provide all, or at least most, of the capabilities you have in your current application.

Another issue with source-side data de-duplication is the amount of client-side work that has to be done. Having the client run the comparison can cause issues, especially in large data centers with large file counts and capacities. Specifically, the process of making these comparisons is slow. We have seen both backup and restore speeds of only about 3 Mbytes/s to 5 Mbytes/s. Additionally, there is a measurable impact on the CPU utilization of the local host.
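To put those speeds in perspective, here is a rough back-of-the-envelope calculation. The 1 TB server is a hypothetical example, and the sketch assumes the quoted rate applies to the whole data set, as it would for a full restore; decimal units are used throughout (1 TB = 1,000 GB, 1 GB = 1,000 MB).

```python
def hours_to_transfer(size_gb, rate_mb_per_s):
    """Hours needed to move size_gb gigabytes at rate_mb_per_s megabytes/second."""
    seconds = size_gb * 1000 / rate_mb_per_s
    return seconds / 3600


# Hypothetical 1 TB server at the low and high end of the observed range.
for rate in (3, 5):
    print(f"{rate} Mbytes/s: {hours_to_transfer(1000, rate):.1f} hours for 1 TB")

# prints:
# 3 Mbytes/s: 92.6 hours for 1 TB
# 5 Mbytes/s: 55.6 hours for 1 TB
```

Under these assumptions, a full pass over a single terabyte takes somewhere between two and four days.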
