Now that data deduplication for primary storage is going mainstream, I'm starting to get questions from students at backup school about using deduplicating primary storage systems as backup targets. While the idea of using the same technology for your primary and backup data sounds attractive, some of the folks I've spoken to who have tried substituting a ZFS box for a Data Domain appliance have seen disappointing results.
For a block-and-hash deduplication scheme to identify, and therefore deduplicate, duplicate data, it has to break data up into blocks so that the same data always falls into blocks the same way. If the data in one block is identical to the data in another but offset by just one byte, the two blocks will generate different hashes and the deduplication system will store them both. Some backup dedupe vendors, including Quantum and Data Domain, have sophisticated algorithms for figuring out where their variable-size blocks should begin and end to maximize the probability of recognizing duplicate data. Quantum even holds a patent on the technique, which Data Domain licensed in exchange for stock before EMC bought Data Domain.
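To make the one-byte-offset problem concrete, here is a minimal sketch of fixed-block hashing. The 4,096-byte block size, the SHA-256 hash, and the helper names `block_hashes` and `shared_blocks` are assumptions chosen for illustration, not any vendor's actual implementation.

```python
import hashlib
import random

BLOCK_SIZE = 4096  # assumed fixed block size, for illustration only


def block_hashes(data, block_size=BLOCK_SIZE):
    """Split data into fixed-size blocks and return one SHA-256 digest per block."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def shared_blocks(stored, incoming):
    """Count the blocks of `incoming` whose hash matches a block of `stored`."""
    known = set(block_hashes(stored))
    return sum(1 for h in block_hashes(incoming) if h in known)


rng = random.Random(0)  # seeded so the example is repeatable
original = bytes(rng.getrandbits(8) for _ in range(16 * BLOCK_SIZE))
shifted = b"\x00" + original  # the same data, pushed one byte to the right

aligned_matches = shared_blocks(original, original)
shifted_matches = shared_blocks(original, shifted)

print("blocks matched when aligned:", aligned_matches)
print("blocks matched when offset by one byte:", shifted_matches)
```

Every block of the shifted copy now straddles two blocks of the original, so each hash changes even though all but one byte of the payload is identical.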
Variable block deduplication requires a lot of compute power to figure out where the block boundaries should be. Since primary storage deduplication has to balance write latency against dedupe efficiency, and because it's usually implemented as an extension of an existing file system that wasn't designed to store one block of 3,654 bytes and another of 1,892, primary storage deduplication systems usually use fixed block sizes.
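For contrast, the sketch below shows one generic way to pick variable block boundaries from the content itself, using a rolling hash over a small window. This is not the patented Quantum or Data Domain algorithm. The window size, mask, chunk limits, and the function names `content_defined_chunks` and `shared_chunks` are all assumptions made for illustration. Note that the rolling hash is updated for every single byte written, which hints at where the compute cost comes from.

```python
import hashlib
import random

WINDOW = 48               # bytes covered by the rolling hash (assumed)
MASK = 0x3FF              # cut when the low 10 bits are zero (assumed)
MIN_CHUNK = 256           # assumed lower bound on chunk size
MAX_CHUNK = 8192          # assumed upper bound on chunk size
BASE = 257
MOD = (1 << 61) - 1
TOP = pow(BASE, WINDOW, MOD)  # weight of the byte leaving the window


def content_defined_chunks(data):
    """Split data where a rolling hash of the last WINDOW bytes hits a pattern."""
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        h = (h * BASE + byte) % MOD
        if i >= WINDOW:
            # Drop the byte that just slid out of the window.
            h = (h - data[i - WINDOW] * TOP) % MOD
        length = i + 1 - start
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # whatever is left after the last cut
    return chunks


def shared_chunks(stored, incoming):
    """Count chunks of `incoming` whose hash matches a chunk of `stored`."""
    known = {hashlib.sha256(c).digest() for c in content_defined_chunks(stored)}
    return sum(
        1 for c in content_defined_chunks(incoming)
        if hashlib.sha256(c).digest() in known
    )


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
shifted = b"\x00" + original  # the same data, offset by one byte

total = len(content_defined_chunks(shifted))
matched = shared_chunks(original, shifted)
print(f"{matched} of {total} chunks in the shifted copy were already stored")
```

Because a cut point depends only on the bytes inside the window, the boundaries move along with the data, and the chunks after the inserted byte line up with the originals again.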
That existing file system provides some assistance to the dedupe system by aligning the beginning of each file at the beginning of a block. This means primary storage dedupe will always identify duplicate files and work well with file systems that have a large number of small files. In addition, since many applications, like databases, read and write fixed-size pages, duplicate data in a database or across databases will also be detected if the page size is a multiple of the underlying block size.
The problem primary storage dedupe systems have with backup data is that most conventional backup applications don't write the files they're backing up to the storage system intact, but instead create aggregate files that are the logical equivalent of .tar or .zip files. The media management functions of the backup apps then pretend that each aggregate file is a tape cartridge.
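One way to see why aggregation could matter to a fixed-block system is sketched below. The container layout produced by `aggregate` (a text header followed by the raw file contents) is a made-up stand-in, not the format of any real backup application, and the file names and the 4,096-byte block size are likewise assumptions.

```python
import hashlib
import random

BLOCK_SIZE = 4096  # assumed fixed block size of the primary storage system


def block_hashes(data, block_size=BLOCK_SIZE):
    """Return the set of SHA-256 digests of the fixed-size blocks in data."""
    return {
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    }


def aggregate(files):
    """Hypothetical aggregate format: 'name:length' header, then the content."""
    out = bytearray()
    for name, content in files:
        out += f"{name}:{len(content)}\n".encode()
        out += content
    return bytes(out)


rng = random.Random(1)
big = bytes(rng.getrandbits(8) for _ in range(32 * BLOCK_SIZE))

# Two nightly backups: only the small file changes, big.bin is untouched.
night1 = [("notes.txt", b"hello"), ("big.bin", big)]
night2 = [("notes.txt", b"hello, world"), ("big.bin", big)]

# Stored as individual files, big.bin starts on a block boundary both nights,
# so every one of its blocks is a duplicate.
file_level_shared = len(block_hashes(big) & block_hashes(big))

# Inside the aggregates, the longer notes.txt pushes big.bin to a new offset.
aggregate_shared = len(
    block_hashes(aggregate(night1)) & block_hashes(aggregate(night2))
)

print("duplicate blocks when stored as files:", file_level_shared)
print("duplicate blocks when stored as aggregates:", aggregate_shared)
```

In this toy case the unchanged file sits at a different byte offset in each aggregate, so the file system's start-of-file alignment no longer helps.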