As an industry, we have fallen into the trap of thinking of data deduplication as a single technology. When NetApp and EMC were in their bidding war for Data Domain, some analysts wondered why EMC, which already had deduplication technology in Avamar, would want Data Domain's technology. Now that data deduplication is gaining traction in the primary storage market, I thought I would point out that a deduplication system designed for primary data may not be as effective with backup data.
All deduplication systems work by breaking files, or other objects like virtual tapes, into smaller blocks. They then identify the blocks that contain the same data, like the corporate logo on every PowerPoint slide, and use links in their internal file system so that the single block of data they store can stand in for every other copy of that data across the file system. Breaking the data down into blocks is easy; the hard part is figuring out which block alignment will result in the best data reduction. The simplest systems, like NetApp's or ZFS's deduplication, simply break each file into fixed-size blocks. This works reasonably well for primary storage file systems that hold a large number of small files, because each file starts on a block boundary. It works especially well for applications like VDI hosting, where there are a lot of duplicate files.
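The fixed-block approach can be sketched in a few lines of Python. This is a toy illustration, not NetApp's or ZFS's implementation: the `FixedBlockStore` class, the 4 KB block size, the SHA-256 fingerprint, and the file names are all assumptions chosen for the example.

```python
import hashlib
import random

BLOCK_SIZE = 4096  # assumed block size, for illustration only


class FixedBlockStore:
    """Toy fixed-block deduplicating store (hypothetical, not a vendor design)."""

    def __init__(self, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.blocks = {}  # SHA-256 digest -> block bytes, each stored once
        self.files = {}   # file name -> ordered list of digests (the "links")

    def write(self, name, data):
        digests = []
        for start in range(0, len(data), self.block_size):
            block = data[start:start + self.block_size]
            digest = hashlib.sha256(block).digest()
            # Keep the block only if no identical block is already stored.
            self.blocks.setdefault(digest, block)
            digests.append(digest)
        self.files[name] = digests

    def read(self, name):
        return b"".join(self.blocks[d] for d in self.files[name])

    def stored_bytes(self):
        return sum(len(b) for b in self.blocks.values())


rng = random.Random(0)
logo = bytes(rng.getrandbits(8) for _ in range(2 * BLOCK_SIZE))  # shared "logo" data

store = FixedBlockStore()
store.write("deck_a.ppt", logo + b"A" * BLOCK_SIZE)
store.write("deck_b.ppt", logo + b"B" * BLOCK_SIZE)

logical = 2 * (len(logo) + BLOCK_SIZE)
print(f"logical {logical} bytes, stored {store.stored_bytes()} bytes")
```

Both files begin with the same two logo blocks on the same block boundaries, so the store keeps four unique blocks instead of six. The saving depends entirely on the duplicate data lining up with the block grid.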
Since the vast majority of today's backup applications create a small number of what are essentially tarballs or .ZIP files when they back up to disk, deduplicating backup targets have to work harder to determine where the block boundaries are. Content-aware systems like Sepaton's and ExaGrid's reverse engineer the backup application's file and/or tape formats so they can identify each source file in the stream and compare it to other copies of that file they've already stored. Other vendors have their own secret sauce, and while Data Domain's hash-based, variable block-size approach made sense when Hugo Patterson, their CTO, explained it to me last week, it's a bit too complicated to describe here.
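For readers who want a feel for the general idea behind variable block sizes, here is a minimal sketch of generic content-defined chunking. It is not Data Domain's algorithm, which is not described here; the function name, the 16-byte window, the CRC32 hash, the 10-bit mask, and the 100-byte insertion are all assumptions made for illustration. A boundary is placed wherever a hash of the last few bytes matches a pattern, so boundaries follow the content rather than the byte offset.

```python
import random
import zlib

WINDOW = 16            # trailing bytes that decide a boundary (assumed value)
MASK = (1 << 10) - 1   # cut when the low 10 hash bits are zero, ~1 KB average chunk


def content_defined_chunks(data, window=WINDOW, mask=MASK):
    """Split data wherever the hash of the trailing window matches the pattern."""
    chunks = []
    start = 0
    for end in range(window, len(data) + 1):
        # Rehashing the window at every position is slow but easy to follow;
        # practical systems typically use a rolling hash here, and also
        # enforce minimum and maximum chunk sizes, omitted for brevity.
        if zlib.crc32(data[end - window:end]) & mask == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])
    return chunks


rng = random.Random(1)
old = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
prefix = bytes(rng.getrandbits(8) for _ in range(100))
new = prefix + old     # the same data, pushed 100 bytes later in the stream

old_chunks = content_defined_chunks(old)
new_chunks = content_defined_chunks(new)
shared = set(old_chunks) & set(new_chunks)
print(f"{len(shared)} of {len(old_chunks)} chunks reused after the insertion")
```

Because each boundary depends only on the bytes just before it, every chunk after the first one in the old stream reappears unchanged in the new stream, no matter how many bytes were inserted ahead of it.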
Now imagine using a simple fixed-block deduping system with a backup stream. Your backup app backs up the C: (system) and E: (data) drives of your server in a single backup job to a single virtual tape file. The system logs, which have grown since yesterday, are backed up early in the process, so all the data that follows is offset 513 bytes from where it was in yesterday's backup. While there may still be some duplicate blocks, there won't be nearly as many as there would be if the system could reset the alignment.
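The effect of that 513-byte offset is easy to reproduce. The sketch below is a rough illustration under stated assumptions: a 4 KB block size (none is given here), SHA-256 block fingerprints, hypothetical helper names, and random bytes standing in for file data. Random data is the worst case; real backups with long runs of repeated content, such as zero-filled regions, would keep some duplicates.

```python
import hashlib
import random

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def block_hashes(data, block_size=BLOCK_SIZE):
    """Fingerprint each fixed-size block, counted from the start of the stream."""
    return [hashlib.sha256(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]


def duplicate_blocks(yesterday, today):
    """Count today's blocks whose fingerprint already exists in yesterday's backup."""
    known = set(block_hashes(yesterday))
    return sum(1 for h in block_hashes(today) if h in known)


def random_bytes(rng, n):
    return bytes(rng.getrandbits(8) for _ in range(n))


rng = random.Random(0)
data = random_bytes(rng, 64 * BLOCK_SIZE)          # unchanged file data, 64 blocks

yesterday = data
shifted = random_bytes(rng, 513) + data            # logs grew by 513 bytes
aligned = random_bytes(rng, BLOCK_SIZE) + data     # growth of exactly one block

print(duplicate_blocks(yesterday, shifted), "of 64 blocks match after a 513-byte shift")
print(duplicate_blocks(yesterday, aligned), "of 64 blocks match after a one-block shift")
```

With the 513-byte shift, none of the 64 unchanged blocks lines up with yesterday's block grid, so none is recognized. When the shift happens to be a whole block, all 64 match, which is the result a system able to reset its alignment would get in both cases.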
The moral of the story is that not all deduplication schemes are alike. Use primary storage deduplication with the wrong backup app, and you may not see the 20:1 data reduction you're looking for. You'll see some data reduction, but we'll have to try them in the lab to see how much.
Disclosure: I am currently working on projects for NetApp and EMC/Data Domain.
Howard Marks is founder and chief scientist at Deepstorage LLC, a storage consultancy and independent test lab based in Santa Fe, N.M., and concentrating on storage and data center networking. In more than 25 years of consulting, Marks has designed and implemented storage ...