Primary Storage Dedupe For Backup?
February 22, 2011
Now that data deduplication for primary storage is going mainstream, I'm starting to get questions from students at backup school about using primary storage systems that deduplicate the files stored on them as backup targets. While the idea of using the same technology for your primary and backup data sounds attractive, some of the folks I've spoken to who have tried substituting a ZFS box for a Data Domain appliance have seen disappointing results.
For a block-and-hash deduplication scheme to identify, and therefore deduplicate, duplicate data, it has to break the data into blocks so that identical data lands in identically bounded blocks. If the data in one block is identical to the data in another but offset by just one byte, the two blocks will generate different hashes and the deduplication system will store them both. Some backup dedupe vendors, including Quantum and Data Domain, have sophisticated algorithms for figuring out where their variable-size blocks should begin and end to maximize the probability of recognizing duplicate data. Quantum even holds a patent on the technique, which Data Domain licensed in exchange for stock before EMC bought Data Domain.
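Neither Quantum nor Data Domain publishes the details of its boundary-picking logic, but the general idea, usually called content-defined chunking, can be sketched with a simple rolling hash. Everything below, the window size, the average chunk size, and the hash function, is an invented stand-in for illustration, not any vendor's implementation:

```python
import hashlib

BASE = 257                      # rolling-hash base
MOD = 1 << 32                   # keep the hash in 32 bits
WINDOW = 48                     # bytes the rolling hash "sees" at once
OUT = pow(BASE, WINDOW, MOD)    # weight of the byte leaving the window
MASK = 0x1FFF                   # cut when the low 13 bits are zero (~8 KB average)
MIN_CHUNK, MAX_CHUNK = 2048, 65536

def variable_chunks(data: bytes):
    """Split data where the content says to, not at fixed offsets."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = (h * BASE + b) % MOD
        if i - start >= WINDOW:                     # slide the window forward
            h = (h - data[i - WINDOW] * OUT) % MOD
        length = i - start + 1
        if length >= MIN_CHUNK and ((h & MASK) == 0 or length >= MAX_CHUNK):
            chunks.append(data[start:i + 1])        # boundary found: emit a chunk
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])                 # whatever is left at the end
    return chunks

def fingerprints(chunks):
    """Hash each chunk; a matching hash means the chunk is already stored."""
    return [hashlib.sha256(c).hexdigest() for c in chunks]
```

Because a boundary is declared whenever the last few dozen bytes hash to a particular pattern, inserting or deleting a byte only disturbs the chunks immediately around the change; the chunker resynchronizes a little further downstream.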
Variable-block deduplication requires a lot of compute power to figure out where the block boundaries should be. Since primary storage deduplication has to balance write latency against dedupe efficiency, and since it's usually implemented as an extension of an existing file system that wasn't designed to store one block of 3,654 bytes and another of 1,892, primary storage deduplication systems usually use fixed block sizes.
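A fixed-block scheme is much simpler, which is why it's cheap enough to run in the write path of primary storage. The sketch below assumes a 4 KB block size and a plain in-memory hash table purely for illustration; a real system would also need reference counting, an on-disk index, and so on:

```python
import hashlib

BLOCK = 4096  # an assumed, typical fixed block size

def fixed_blocks(data: bytes):
    """Split data into fixed-size, offset-aligned blocks."""
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]

class FixedBlockStore:
    """Toy dedupe store: a file becomes a list of references to unique blocks."""

    def __init__(self):
        self.blocks = {}                        # block hash -> block contents

    def write(self, data: bytes):
        refs, new_bytes = [], 0
        for blk in fixed_blocks(data):
            digest = hashlib.sha256(blk).hexdigest()
            if digest not in self.blocks:       # only unseen blocks consume space
                self.blocks[digest] = blk
                new_bytes += len(blk)
            refs.append(digest)
        return refs, new_bytes
```

Writing the same file twice stores its blocks only once; writing a copy whose contents are shifted by a byte stores everything again, which is where the trouble with backup data starts.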
That existing file system gives the dedupe system some help by aligning the beginning of each file with the beginning of a block. This means primary storage dedupe will always identify duplicate files and will work well with file systems that hold a large number of small files. In addition, since many applications, such as databases, read and write fixed-size pages, duplicate data within a database or across databases will also be detected as long as the page size is a multiple of the underlying block size.
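A quick way to convince yourself of the alignment point: because every file starts on a block boundary, an 8 KB database page that shows up in two different files dedupes cleanly against 4 KB fixed blocks. The sizes and data here are just illustrative:

```python
import hashlib, os

BLOCK, PAGE = 4096, 8192
shared_page = os.urandom(PAGE)                  # the same page appears in both files
db_a = os.urandom(PAGE) + shared_page + os.urandom(PAGE)
db_b = shared_page + os.urandom(PAGE)

def block_hashes(data):
    return {hashlib.sha256(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)}

shared = block_hashes(db_a) & block_hashes(db_b)
print(f"blocks shared between the two files: {len(shared)}")   # 2: the shared page
```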
The problem primary storage dedupe systems have with backup data is that most conventional backup applications don't write the files they're backing up to the storage system intact. Instead, they create aggregate files that are the logical equivalent of .tar or .zip files, and the media management functions of the backup apps then pretend that each aggregate file is a tape cartridge. Data within one of these aggregate files, even when it's the same data we backed up last week, isn't necessarily in the same place it was last time. If a new 27-byte file, say c:\aaaa.txt, is backed up at the beginning of the job, all the rest of the data will be offset by 27 bytes, confusing the fixed-block dedupe system. To add insult to injury, the aggregate file formats include metadata about both the data being backed up and the progress of the backup job. This metadata, which is interspersed with the real data, can also confuse a simple dedupe process.
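Here's a toy version of that 27-byte problem. Prepending a small file to an otherwise unchanged aggregate shifts every fixed block, so none of last week's block hashes match this week's (the sizes and random payload are made up for the demonstration):

```python
import hashlib, os

BLOCK = 4096
payload = os.urandom(BLOCK * 64)        # stand-in for last week's aggregate file
this_week = b"x" * 27 + payload         # a new 27-byte file lands at the front

def block_hashes(data):
    return {hashlib.sha256(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)}

old, new = block_hashes(payload), block_hashes(this_week)
print(f"fixed {BLOCK}-byte blocks shared: {len(old & new)} of {len(old)}")
# Prints 0 shared blocks: a 27-byte offset defeats fixed-block dedupe entirely.
```

A content-defined chunker like the earlier sketch would resynchronize a few chunks past the inserted file, which is exactly what the variable-block backup appliances are counting on.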
The vendors that make deduping appliances for backup data spend a lot of time reverse-engineering the common backup apps' aggregate formats to ensure that their systems get the maximum deduplication ratios for the data they're storing. Vendors like Sepaton and ExaGrid that do delta-differential, as opposed to block-and-hash, deduplication get much, if not most, of their data reduction by examining the backup stream, seeing that WINSOCK.DLL is being backed up yet again, and saving a pointer to the last copy.
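Sepaton and ExaGrid don't publish their internals either, but the file-level flavor of the idea is easy to picture: once the appliance can parse the aggregate format back into individual files, an unchanged file can be replaced by a pointer to the copy already on disk. A deliberately oversimplified sketch:

```python
import hashlib

class FileLevelStore:
    """Toy illustration only: store a file once, point back to it next time."""

    def __init__(self):
        self.last_copy = {}     # file path -> (content hash, stored object id)
        self.objects = {}       # object id -> file contents
        self.next_id = 0

    def backup_file(self, path: str, contents: bytes):
        digest = hashlib.sha256(contents).hexdigest()
        seen = self.last_copy.get(path)
        if seen and seen[0] == digest:
            return ("pointer", seen[1])           # WINSOCK.DLL again: just point back
        self.objects[self.next_id] = contents     # new or changed file: store it
        self.last_copy[path] = (digest, self.next_id)
        self.next_id += 1
        return ("stored", self.next_id - 1)
```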
So while we can dedupe both primary storage and backup data, a tool built for the kind of data you're trying to store will probably give you a better combination of data reduction and performance than one designed for another task, or a Swiss Army knife that tries to do both.