CommVault was one of the first enterprise backup software vendors to integrate deduplication into their product offering. While it wasn't a surprise that they did this, it was a surprise that they were able to add the ability to deduplicate to tape. In my most recent regular blog post on Network Computing, I covered the trend to deduplicate on media other than traditional disk-based systems. CommVault's hybrid approach is unique compared to how others perform deduplication, and is worth examining.
Traditionally, deduplication is done in just one area of the backup tier: at the client, at the backup server (called a "media agent" in a CommVault environment), or on the backup storage target. While some vendors let you choose which of these three locations deduplicates the data, the deduplication workload runs entirely on the tier you select. CommVault uses a more hybrid approach. At the client, they have always segmented the data to be backed up into blocks, and they have had the ability to compress each block prior to sending it. Now with deduplication, they add one more step: executing the algorithm that generates the hash code needed for deduplication. This offloads hash creation from the backup or media server. Unlike source-side deduplication products, however, the CommVault agent does not perform a look-up to see if the block already exists on the backup server. The block is sent along with its hash, and the backup or media server performs the deduplication look-up as it receives the block.
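The division of labor described above can be sketched in a few lines of Python. This is a hypothetical illustration of the general technique, not CommVault's actual implementation: the block size, the use of zlib compression, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
import zlib

BLOCK_SIZE = 128 * 1024  # illustrative block size; real products vary


def client_prepare(data: bytes):
    """Client side: segment the data into blocks, compress each block,
    and compute its hash. No look-up happens here; every block is sent."""
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()  # hash created at the client
        yield digest, zlib.compress(block)          # block sent regardless


class MediaAgent:
    """Backup/media-server side: perform the deduplication look-up as
    each block arrives, storing only blocks not seen before."""

    def __init__(self):
        self.store = {}  # hash -> compressed block

    def receive(self, digest, compressed_block):
        if digest in self.store:
            return False                 # redundant block: just reference it
        self.store[digest] = compressed_block
        return True                      # unique block: stored


# Three identical blocks followed by one different block.
agent = MediaAgent()
payload = b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE
stored = sum(agent.receive(d, b) for d, b in client_prepare(payload))
# Four blocks cross the "network", but only two unique blocks are stored.
```

The point of the split is visible in the last two lines: the client pays only the hashing cost, while the look-up table and the storage decision live entirely on the media-agent side.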
By distributing the workload across the backup tiers, CommVault feels that it alleviates the performance concerns of both source-side and target-side deduplication methods. Although the respective providers of those solutions will debate the point, source-side deduplication may impact performance of the client while performing the look-up for redundant data, and target-side deduplication can become a bottleneck when receiving that data. CommVault potentially gets around both problems by spreading the load throughout the backup tier. For clients where even the five percent load of creating hashes is too great an impact, there is an option to have the media server create the hash instead, though that increase in performance requirements would need to be accounted for before configuring hash creation on the media agent(s).
The downside of this hybrid approach versus a pure source-side deduplication method is that CommVault must transmit all the data (compressed or uncompressed), redundant or not, across the network between the server and the media agent. This is not unlike target-side deduplication technologies, which must also receive the entire backup. The result is a continued need for a robust backup infrastructure, at least between the server and the media agent. CommVault will deduplicate the data sent from the media agent to the target storage device, and at that stage places a network load similar to what we would see from a source-based deduplication approach. To CommVault's advantage, they can manage data holistically, meaning they can use incremental backups along with a synthetic full capability that is compatible with their deduplication. This reduces outbound data at the server/source level; combined with data compression, it means only new or modified blocks and files need to cross the network, not the entire data set.
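The synthetic full idea can be illustrated with a short, hypothetical sketch: the media agent constructs a new full backup by merging the last full with the incrementals taken since, so the client never retransmits unchanged data. The dict-of-file-hashes catalog used here is an assumption made for illustration, not CommVault's actual on-disk format.

```python
def synthetic_full(previous_full: dict, incrementals: list) -> dict:
    """Merge the last full backup with each incremental, oldest first,
    producing a new full without re-reading anything from the client."""
    full = dict(previous_full)
    for inc in incrementals:
        full.update(inc)  # new or modified entries overwrite older versions
    return full


# Catalog entries map file name -> block/content hash (illustrative).
full_v1 = {"fileA": "hash1", "fileB": "hash2"}
inc_mon = {"fileB": "hash3"}   # fileB was modified Monday
inc_tue = {"fileC": "hash4"}   # fileC was added Tuesday
full_v2 = synthetic_full(full_v1, [inc_mon, inc_tue])
# full_v2 now represents a complete full backup, yet only the Monday and
# Tuesday changes ever crossed the network.
```

This is why the heavy network load in CommVault's model is confined to the server-to-media-agent hop, and even there only for data that is actually new or changed.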
CommVault's software-based deduplication can also be configured to work with its own replication capability, or it can leverage the replication capabilities of the underlying storage systems, which can be any standard disk subsystem. With either approach, hardware replication or software-based replication, only the changed blocks of data are sent across the WAN. While this provides a lot of choices for getting your backup data off-site, it does not provide global dedupe capability across multiple sites, something that we have discussed in the past. So redundant data arriving from different sites will be stored separately, but redundant data sent from the same site will be deduplicated across multiple backup cycles.