
Primary Storage Dedupe For Backup?

Now that data deduplication for primary storage is going mainstream, I'm starting to get questions from students at backup school about using storage systems designed to deduplicate the files stored on them as backup targets. While the idea of using the same technology for your primary and backup data sounds attractive, some of the folks I've spoken to who have tried substituting a ZFS box for a Data Domain appliance have seen disappointing results.

For a block-and-hash deduplication scheme to identify, and therefore deduplicate, duplicate data, it has to break the data up into blocks so that identical data falls into identically aligned blocks. If the data in one block is identical to the data in another, but offset by just one byte, the two blocks will generate different hashes and the deduplication system will store them both. Some backup dedupe vendors, including Quantum and Data Domain, have sophisticated algorithms for figuring out where their variable-size blocks should begin and end to maximize the probability of recognizing duplicate data. Quantum even holds a patent on the technique, one Data Domain gave it stock to license before EMC bought Data Domain.
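To see why alignment matters, here's a minimal Python sketch. The 4 KB block size and SHA-256 hashing are illustrative assumptions, not any vendor's implementation: it hashes the same data twice, the second time shifted by a single byte, and with naive fixed-size blocks none of the hashes line up.

```python
import hashlib
import os

BLOCK = 4096  # hypothetical fixed dedupe block size, for illustration only

def block_hashes(data, block=BLOCK):
    """Hash each fixed-size block of a byte stream."""
    return [hashlib.sha256(data[i:i + block]).hexdigest()
            for i in range(0, len(data), block)]

payload = os.urandom(BLOCK * 4)       # 16 KB of "file" data
original = payload
shifted = b"\x00" + payload           # identical data, offset by one byte

# Every block boundary now cuts the data in a different place, so no
# block hash matches and a naive fixed-block engine stores both copies.
common = set(block_hashes(original)) & set(block_hashes(shifted))
print(len(common))                    # 0
```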

Variable-block deduplication requires a lot of compute power to figure out where the block boundaries should be. Since primary storage deduplication has to balance write latency against dedupe efficiency, and because it's usually implemented as an extension of an existing file system that wasn't designed to store one block of 3,654 bytes and another of 1,892, primary storage deduplication systems usually use fixed block sizes.
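For contrast, here's a toy content-defined chunker, loosely in the spirit of rolling-hash schemes but not any vendor's patented algorithm. It has to compute a hash over every byte written just to decide where the boundaries go, which is exactly the per-write work a latency-sensitive primary array tries to avoid, but because the boundaries come from the data itself, a one-byte shift only disturbs one chunk.

```python
import hashlib
import os
import random

# Toy content-defined chunker: a gear-style rolling hash cuts a chunk
# wherever the low bits of the hash hit zero, subject to min/max sizes.
random.seed(0)
GEAR = [random.getrandbits(32) for _ in range(256)]   # per-byte random values
MASK = 0x1FFF                                         # ~8 KB average chunk
MIN_CHUNK, MAX_CHUNK = 2048, 65536

def chunks(data):
    start, h = 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF      # rolling hash update
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            yield data[start:i + 1]                   # variable-size chunk
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]

def chunk_hashes(data):
    return {hashlib.sha256(c).hexdigest() for c in chunks(data)}

payload = os.urandom(256 * 1024)
shifted = b"\x00" + payload

# Most chunk hashes still match despite the one-byte shift; only the
# chunk containing the inserted byte changes.
print(len(chunk_hashes(payload) & chunk_hashes(shifted)),
      "of", len(chunk_hashes(payload)))
```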

That existing file system provides some assistance to the dedupe system by aligning the beginning of each file with the beginning of a block. This means primary storage dedupe will reliably identify duplicate files and will work well with file systems that hold large numbers of small files. In addition, since many applications, databases in particular, read and write fixed-size pages, duplicate data within a database, or across databases, will also be detected as long as the page size is a multiple of the underlying block size.
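A quick sketch of that happy case, again assuming a hypothetical 4 KB dedupe block: a 16 KB database page is an exact multiple of the block size, so the same page written at block-aligned offsets in two different files produces the same four block hashes wherever it lands.

```python
import hashlib
import os

BLOCK = 4096  # hypothetical fixed dedupe block size

def block_hashes(data, block=BLOCK):
    return [hashlib.sha256(data[i:i + block]).hexdigest()
            for i in range(0, len(data), block)]

# A 16 KB page (four 4 KB blocks) lands block-aligned in both files
# because each file starts on a block boundary and the page size is a
# multiple of the block size.
page = os.urandom(16 * 1024)
db_a = os.urandom(4 * BLOCK) + page    # page at offset 16 KB
db_b = os.urandom(32 * BLOCK) + page   # page at offset 128 KB

shared = set(block_hashes(db_a)) & set(block_hashes(db_b))
print(len(shared))                     # 4 -- the page's four blocks dedupe
```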

The problem primary storage dedupe systems have with backup data is that most conventional backup applications don't write the files they're backing up to the storage system intact; instead they create aggregate files that are the logical equivalent of .tar or .zip files. The backup applications' media management functions then treat each aggregate file as if it were a tape cartridge.
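Here's a rough illustration of the consequence, using Python's tarfile module as a stand-in for a backup application's aggregate format (the file names and sizes are made up). When a small file near the front of the aggregate changes size between backups, everything behind it shifts to different offsets, and almost none of the unchanged data's fixed-size blocks hash the same as they did the night before.

```python
import hashlib
import io
import os
import tarfile

BLOCK = 4096  # hypothetical fixed dedupe block size

def block_hashes(data, block=BLOCK):
    return {hashlib.sha256(data[i:i + block]).hexdigest()
            for i in range(0, len(data), block)}

def make_backup(members):
    """Build a tar-style aggregate in memory from (name, payload) pairs."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in members:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()

big_file = os.urandom(64 * BLOCK)      # unchanged between the two backups
monday = make_backup([("changed.log", os.urandom(1000)), ("big.db", big_file)])
tuesday = make_backup([("changed.log", os.urandom(3000)), ("big.db", big_file)])

# The small log file grew, so big.db's bytes shifted inside the aggregate
# and essentially none of its fixed-size blocks hash the same as Monday's.
print(len(block_hashes(monday) & block_hashes(tuesday)))
```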
