De-Duplication in Primary Storage

The next frontier for de-duplication may produce the biggest disagreement

George Crump

November 15, 2008

10:45 AM -- Primary storage is the next frontier for de-duplication technologies, and it may be where we see the biggest disagreement over how best to optimize this storage. At a minimum, there will be multiple approaches deployed to fully address the problem of storage growth. Remember that the fantastic de-duplication rates we see in backup storage have much to do with the reality that most users run a full backup job every weekend and there is comparatively little change in the data between those jobs. This is not the case in primary storage.
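As a rough illustration of the arithmetic (all figures below are hypothetical, not vendor benchmarks), a short Python sketch shows how quickly repeated full backups inflate the ratio:

    # Back-of-the-envelope de-duplication ratio for repeated full backups.
    # All figures are hypothetical illustrations, not vendor benchmarks.
    full_backup_tb = 10.0      # size of one full backup
    weekly_change_rate = 0.05  # fraction of the data that changes each week
    weeks_retained = 12        # weekly fulls kept on the de-dupe target

    # Logical data written: one full backup per week.
    logical_tb = full_backup_tb * weeks_retained
    # Physical data stored: the first full, plus only the changed
    # blocks from each subsequent full.
    physical_tb = full_backup_tb * (1 + weekly_change_rate * (weeks_retained - 1))

    print("de-dupe ratio: %.1f:1" % (logical_tb / physical_tb))  # about 7.7:1

Take away that weekly repetition and the ratio collapses.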

While there is some redundant data in primary storage, it is not present to the degree that it is in backups. In addition, there are other technologies that may be a better fit than de-duplication for certain use cases. For example, writeable snapshots can be used to give developers working copies of a database without actually duplicating it. While some storage systems have problems handling more than a few snapshots, the number with that limitation is decreasing every year.
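The space savings come from copy-on-write: the writeable snapshot shares the parent volume's blocks and consumes new capacity only for the blocks it modifies. A minimal Python sketch of the idea (a toy model, not any vendor's implementation):

    # Toy copy-on-write model of a writeable snapshot.
    class Volume:
        def __init__(self, blocks):
            self.blocks = dict(enumerate(blocks))  # block number -> data

    class WritableSnapshot:
        def __init__(self, parent):
            self.parent = parent
            self.overrides = {}  # only the blocks written after cloning

        def read(self, n):
            # Unmodified blocks are served from the parent volume.
            return self.overrides.get(n, self.parent.blocks[n])

        def write(self, n, data):
            # A write consumes new space for just that one block.
            self.overrides[n] = data

    prod = Volume([b"tbl0", b"tbl1", b"tbl2", b"tbl3"])  # production database
    dev = WritableSnapshot(prod)     # a "copy" for development work
    dev.write(2, b"test")            # the developer changes one block
    print(dev.read(0), dev.read(2))  # b'tbl0' b'test'
    print("extra blocks consumed:", len(dev.overrides))  # 1, not 4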

Primary storage also holds a higher proportion of modified data types that don't de-duplicate well. Consider image files that have been edited -- removing red-eye from a photo is a simple example. When the edited image is saved, the original is often kept. While to the human eye the two images look similar, to the de-duplication system they look entirely different. Companies like Ocarina Networks are beginning to offer systems tailored to specific data environments to handle de-duplication of these file types. If an organization holds a lot of this kind of data, an environment-specific de-duplication tool could easily be cost-justified by the reduction in storage requirements.
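A short sketch makes the problem concrete. Assuming a hash-based de-duplicator with fixed-size chunks (the chunk size and the stand-in image data below are illustrative), re-encoding a file after even a one-byte edit changes the stored byte stream almost everywhere, so few, if any, chunk fingerprints still match:

    import hashlib, zlib

    CHUNK = 512  # illustrative chunk size

    def fingerprints(data):
        return {hashlib.sha256(data[i:i + CHUNK]).digest()
                for i in range(0, len(data), CHUNK)}

    # Deterministic stand-in for decoded pixel data, not a real image format.
    pixels = b"".join(b"pixel-row-%06d " % (i % 997) for i in range(20000))
    edited = pixels[:100] + b"\x00" + pixels[101:]  # a tiny "red-eye" fix

    # Saving re-encodes the whole file; zlib stands in for the image codec.
    file_a = zlib.compress(pixels)
    file_b = zlib.compress(edited)

    shared = fingerprints(file_a) & fingerprints(file_b)
    print("chunks shared after the edit and re-save: %d of %d"
          % (len(shared), len(fingerprints(file_a))))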

There are cases where de-dupe on primary storage makes sense. The primary target, especially for NetApp Inc. (Nasdaq: NTAP), has been the VMware image, where there is plenty of redundant data. The other area is the user home directory, and NetApp is a solution here as well. Riverbed Technology Inc. (Nasdaq: RVBD) has announced plans to extend its WAN de-duplication technology by providing in-line de-duplication of primary storage; the initial offering will also focus on user home directories. Hifn Inc. (Nasdaq: HIFN) offers de-dupe on a board that can be installed in a Linux server to provide in-line de-duplication of the primary storage attached to that server.

All of this data is what can best be described as semi-active: data that is not being updated frequently but is not quite ready to be archived. Reduction of active primary storage, such as databases and email stores, remains elusive. In-line compression solutions like those from Storwize Inc. are viable candidates here. Testing has shown that compressing active Oracle databases on NFS mounts causes no performance impact while reducing the footprint by 60 percent or more. Interestingly, compression does not impact the de-duplication process; de-duplication solutions are still able to de-duplicate the data in its compressed form. A system that can compress active data, then de-duplicate it, and finally archive it may be the best way to reduce the size of primary storage. Use cases where the two techniques have been used in concert have shown a 95 percent reduction in capacity consumed.
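One way to see why the two techniques can compose: if each fixed-size block is compressed independently and deterministically, identical blocks still produce identical compressed output, so the de-duplication fingerprints still collide. A minimal Python sketch under those assumptions (the block size and layout are illustrative, not Storwize's actual design):

    import hashlib, zlib

    BLOCK = 4096
    store = {}  # fingerprint -> compressed block (the de-dupe store)

    def write(data):
        # Compress each fixed-size block, then de-duplicate the result.
        logical = physical = 0
        for i in range(0, len(data), BLOCK):
            compressed = zlib.compress(data[i:i + BLOCK])  # deterministic
            fp = hashlib.sha256(compressed).digest()
            logical += len(compressed)
            if fp not in store:          # identical blocks compress identically,
                store[fp] = compressed   # so they still de-duplicate
                physical += len(compressed)
        return logical, physical

    # Two copies of the same compressible data, e.g. a cloned volume.
    volume = (b"customer-record " * 256) * 100  # 100 identical 4 KB blocks
    l1, p1 = write(volume)
    l2, p2 = write(volume)
    print("unique compressed blocks stored:", len(store))  # 1
    print("logical %d bytes -> physical %d bytes" % (l1 + l2, p1 + p2))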

— George Crump is founder of Storage Switzerland, which provides strategic consulting and analysis to storage users, suppliers, and integrators. Prior to Storage Switzerland, he was CTO at one of the nation's largest integrators.
