Deduplication As An API


George Crump

June 11, 2010

3 Min Read

In my last entry I discussed the value of primary storage deduplication and argued that, for the technology's benefits to be realized, storage vendors would have to get it implemented. This can be done via a third-party appliance, of course, but many vendors are trying to figure out how to do it themselves. If they don't have the technology fully baked at this point, the development cycle may be too long and the storage system supplier may be at risk of missing the boat. To fill that need, we are seeing a small handful of vendors address this market with deduplication as an API.

Deduplication, even primary storage deduplication, is not a brand-new feature; several operating systems and NAS vendors have had the capability for a year or so, though it is certainly newer than the backup use case. It's clear, though, that users are interested in the capability because of the value it can potentially bring to the environments we discussed in our last entry. Deduplication as an API allows vendors to embed the technology in their existing storage source code. This not only gives the vendor a shortcut to offering what will become a must-have capability but also, and maybe more importantly, more control over how that data is stored.
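To make the idea concrete, here is a minimal sketch of the kind of primitive such an API would wrap: split incoming data into chunks, fingerprint each chunk, and store each unique chunk only once. All names here are hypothetical; real vendor APIs differ widely and typically use variable-size chunking and on-disk indexes.

```python
import hashlib

class DedupIndex:
    """Toy fingerprint index: each unique chunk is stored exactly once.

    Hypothetical illustration only -- not any vendor's actual API.
    """

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.store = {}     # fingerprint -> chunk bytes (kept once)
        self.refcount = {}  # fingerprint -> number of logical references

    def write(self, data):
        """Split data into fixed-size chunks and dedup each one.

        Returns the list of fingerprints ("recipe") needed to read it back.
        """
        recipe = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in self.store:        # unseen chunk: store it
                self.store[fp] = chunk
            self.refcount[fp] = self.refcount.get(fp, 0) + 1
            recipe.append(fp)
        return recipe

    def read(self, recipe):
        """Reassemble the original data from its chunk fingerprints."""
        return b"".join(self.store[fp] for fp in recipe)
```

Writing 8 KB of identical data produces a two-chunk recipe but consumes only one 4 KB chunk of physical storage, which is the whole point of embedding the index beneath the write path.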

This control over, and knowledge of, the deduplication process could prove very valuable. Think of it the same way Symantec's OST support changed the way backup applications interact with disk storage devices: once the backup application had control over the device, the process became much smoother. In the same way, once the storage system has control over the deduplication process, it can make better use of the technology. For example, the storage system could process all data inline while there is no measurable impact on performance, then shift to post-process if storage I/O begins to be measurably affected. Vendors could also leverage the API to provide smarter, more efficient SAN replication, not re-sending data that has already been sent from another site, as some backup deduplication products do today.
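The inline-versus-post-process switch described above can be sketched in a few lines. This is a policy illustration under assumed names (the `engine` object, the latency budget, the method names are all invented for the example): deduplicate in the write path while latency is within budget, otherwise land the write raw and queue it for a later background pass.

```python
import collections

LATENCY_BUDGET_MS = 2.0  # hypothetical threshold; a real array would tune this

class AdaptiveDedup:
    """Sketch of an inline/post-process deduplication policy.

    Dedup inline while write latency stays within budget; once I/O is
    measurably impacted, defer the work to a background pass.
    """

    def __init__(self, engine):
        self.engine = engine                 # any object with a dedup(data) method
        self.deferred = collections.deque()  # blocks awaiting post-processing

    def write(self, data, current_latency_ms):
        if current_latency_ms <= LATENCY_BUDGET_MS:
            return self.engine.dedup(data)   # inline: dedup in the write path
        self.deferred.append(data)           # overloaded: defer the work
        return None

    def background_pass(self):
        """Run during idle time to dedup everything that was deferred."""
        while self.deferred:
            self.engine.dedup(self.deferred.popleft())
```

Because the storage system, not an external appliance, sees the live latency numbers, it is the natural place to make this inline/post-process decision.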

The questions for the suppliers of these APIs are: what is the impact on system performance, and how complex is the API? In other words, how long will it take to integrate the API set? The other issue is going to be the data-modification impact. While an API makes it easy to turn something like deduplication on, will you be able to turn it off, and what are the effects of doing so? That is going to be a critical issue.
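The "turning it off" concern comes down to simple arithmetic: every logical copy that deduplication collapsed has to be rehydrated, written back out in full. A rough illustration, with invented numbers purely for the example:

```python
def rehydrated_capacity(stored_bytes, dedup_ratio):
    """Rough capacity needed to turn deduplication off.

    Illustrative arithmetic only: the physical footprint grows back by
    roughly the achieved dedup ratio once every shared chunk must be
    stored in full for each logical copy.
    """
    return stored_bytes * dedup_ratio

# A hypothetical 10 TB deduplicated footprint at a 5:1 ratio would need
# roughly 50 TB of free capacity to rehydrate in place.
```

That capacity may simply not exist on the array, which is why the ability (and cost) of disabling the feature deserves as much scrutiny as the ability to enable it.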

I believe deduplication will be an expected feature on primary storage within the next one to two years, just as snapshots are today. If vendors can't get a primary deduplication product out within that time frame, they need to be looking at an API-type solution ASAP. You don't want to be the only vendor bringing a knife to a gunfight.

