Primary Storage Deduplication: NetApp

George Crump

September 13, 2010


One of the first entrants into the primary storage deduplication market was NetApp, with their Advanced Single Instance Storage (A-SIS, commonly known as NetApp deduplication). To my knowledge, NetApp was the first to provide deduplication of active storage as opposed to data that had been previously stored. NetApp deduplication has certainly gained traction within the NetApp customer base; the company recently claimed that more than 87,000 deduplication-enabled storage systems have been deployed, with about 12,000 customers benefiting from its storage efficiency technology.

NetApp deduplication is somewhat unique in that it is part of a vertically integrated stack of software built on their operating system, Data ONTAP, and their file system, Write Anywhere File Layout (WAFL). WAFL, like any other file system, uses a series of inodes and pointers, commonly called extents, to manage the information that the file system holds. Everything stored on a NetApp system is stored as a file, whether it is actual file data or a blob presenting itself as an iSCSI or FC LUN. All of these files are broken down into blocks, or chunks of data, and in the WAFL file system all of the blocks are 4KB in size.
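To make that file-and-block relationship concrete, here is a minimal sketch in Python of a block-pointer layout. It is a toy model, not WAFL itself; the class names and in-memory structures are purely illustrative, with only the 4KB block size taken from the description above.

```python
BLOCK_SIZE = 4 * 1024  # WAFL's 4KB block size, as described above


class BlockStore:
    """Toy block store: 4KB blocks kept in a dict keyed by block number."""

    def __init__(self):
        self.blocks = {}       # block number -> bytes
        self.next_block = 0

    def write_block(self, data: bytes) -> int:
        self.blocks[self.next_block] = data
        self.next_block += 1
        return self.next_block - 1


class File:
    """Toy inode: a name plus an ordered list of pointers to blocks."""

    def __init__(self, name: str):
        self.name = name
        self.block_pointers = []


def store_file(store: BlockStore, name: str, payload: bytes) -> File:
    """Break a payload (file data or a LUN blob) into 4KB blocks and
    record one pointer per block, in order."""
    f = File(name)
    for offset in range(0, len(payload), BLOCK_SIZE):
        chunk = payload[offset:offset + BLOCK_SIZE]
        f.block_pointers.append(store.write_block(chunk))
    return f
```

In this model a LUN is simply a very large payload handed to store_file, which mirrors the article's point that block storage is just another file to the system.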

As a result, each time a file is stored, its blocks are associated with a system of pointers. NetApp leverages these same 4KB chunks to implement technologies like snapshots and cloning. NetApp deduplication is enabled at the volume level. When deduplication is enabled on a volume, the system begins an inline process of gathering fingerprints for each of these 4KB chunks via a proprietary deduplication hashing algorithm. At intervals, either specified by the user or automatically triggered by data growth rates, a post-processing routine kicks in to look for matching fingerprints, which indicate that redundant data has been found.
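Continuing that toy model, the sketch below captures the two phases in spirit: fingerprints are gathered for every 4KB block, and a later pass flags any fingerprint that appears more than once. NetApp's actual fingerprint algorithm is proprietary, so a generic SHA-1 hash stands in here purely for illustration, and `blocks` is the same hypothetical block-number-to-bytes mapping used above.

```python
import hashlib
from collections import defaultdict


def fingerprint(block: bytes) -> bytes:
    # Stand-in for NetApp's proprietary fingerprint; SHA-1 is used here
    # purely for illustration.
    return hashlib.sha1(block).digest()


def gather_fingerprints(blocks: dict) -> dict:
    """Inline phase: record a fingerprint for every 4KB block on the volume.
    `blocks` maps block number -> bytes, as in the toy model above."""
    table = defaultdict(list)          # fingerprint -> list of block numbers
    for block_no, data in blocks.items():
        table[fingerprint(data)].append(block_no)
    return table


def find_candidates(table: dict) -> dict:
    """Post-process phase: any fingerprint seen more than once marks a set
    of potentially redundant blocks."""
    return {fp: nos for fp, nos in table.items() if len(nos) > 1}
```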

After a byte-level validation check confirms the data is identical, the pointer to the redundant block is updated to point back to the original block, and the block identified as redundant is released in the same way a block attached to an expired snapshot is released. The fingerprint itself leverages NetApp's existing write block checksum code, which WAFL has used since its inception. The bottom line is that NetApp should be commended for leveraging the capabilities of its existing operating system to deliver a modern capability.
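The final step of that cycle might look something like the following, again using the toy structures from above rather than anything NetApp ships: a byte-for-byte comparison confirms the match, file pointers are repointed at the surviving block, and only then is the redundant block released.

```python
def deduplicate(blocks: dict, files: list, candidates: dict) -> int:
    """For each group of candidate blocks, byte-compare against the first
    (original) block; only on an exact match are file pointers repointed
    to the original and the redundant block released. Returns blocks freed."""
    freed = 0
    for block_nos in candidates.values():
        original = block_nos[0]
        for duplicate in block_nos[1:]:
            # Byte-level validation guards against fingerprint collisions.
            if blocks[duplicate] != blocks[original]:
                continue
            # Repoint every file reference from the duplicate to the original.
            for f in files:
                f.block_pointers = [original if p == duplicate else p
                                    for p in f.block_pointers]
            # Release the redundant block, much as an expired snapshot's
            # blocks are released.
            del blocks[duplicate]
            freed += 1
    return freed
```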

There is a two-step process to adding deduplication, which, according to NetApp and in our own experience, should take about 10 minutes in total. The first step is to enable deduplication by installing the license. NetApp still does not charge for deduplication, so enabling the license is mostly a reporting function that lets NetApp know who is using the feature. Once the license is enabled, there is no change in the behavior of the box; it simply allows the system to execute the various deduplication commands. The second step is to run deduplication on a volume-by-volume basis, at the user's discretion. This can take a while, depending on the size of the volume and the number of blocks to be analyzed, but it should not be a major time sink and the process can be scheduled. NetApp provides a best practices guide on the types of workloads you should run deduplication against. Not surprisingly, these are workloads where the chance for redundancy is fairly high.
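For readers who want a feel for those two steps, here is a hedged sketch that drives them from Python over SSH, assuming a Data ONTAP 7-mode filer and the `sis` command family behind A-SIS. The hostname, volume path, license code, and schedule are placeholders, and exact command syntax can vary by ONTAP release, so treat this as an outline rather than a runbook.

```python
import subprocess

FILER = "filer1.example.com"      # placeholder filer hostname
VOLUME = "/vol/vm_datastore"      # placeholder volume path


def ontap(command: str) -> str:
    """Run a Data ONTAP 7-mode CLI command on the filer over SSH."""
    result = subprocess.run(["ssh", FILER, command],
                            capture_output=True, text=True, check=True)
    return result.stdout


# Step 1: install the (free) deduplication license; the code is a placeholder.
ontap("license add XXXXXXX")

# Step 2: enable deduplication on the volume and run the first pass;
# -s also scans data that already exists on the volume.
ontap(f"sis on {VOLUME}")
ontap(f"sis start -s {VOLUME}")

# Optionally schedule nightly post-process runs and check progress.
ontap(f"sis config -s sun-sat@23 {VOLUME}")
print(ontap(f"sis status {VOLUME}"))
```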

At the top of the list are server virtualization images and user home directories. Because the NetApp system treats LUNs as files alongside regular files, support for virtualized environments extends beyond just NFS-based VMware images. On home directories, NetApp's deduplication typically delivers about 30 to 35 percent savings.

Home directories generally see better results with combined compression and deduplication, which NetApp does not currently offer. The third class is mid-tier applications that are business-critical but not mission-critical, such as Exchange and SharePoint. As with virtualized images, there is a high chance of redundant data in these environments.
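As a quick back-of-the-envelope on the 30 to 35 percent home-directory figure cited above, the arithmetic is simply logical capacity times one minus the savings rate; the 10TB example below is my own, not a NetApp number.

```python
def physical_tb_needed(logical_tb: float, savings_pct: float) -> float:
    """Physical capacity required after deduplication at a given savings rate."""
    return logical_tb * (1 - savings_pct / 100)


# 10TB of home-directory data at the savings range cited above:
print(physical_tb_needed(10, 30))   # -> 7.0 TB
print(physical_tb_needed(10, 35))   # -> 6.5 TB
```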

The applications that NetApp advises users to stay away from are those that are mission-critical and have high-performance storage I/O needs. NetApp admits that there is some performance overhead with their deduplication, and that you need to be careful about which workloads you have it handle. Most of the performance impact is the result of walking the file system and validating the duplicate data. Reading from deduplicated pools is an extent management task for the operating system, very similar to reading from a snapshot, and imposes no significant overhead.

The two potential limitations that are up for debate are NetApp's use of post-process deduplication and the current lack of compression. In both cases, NetApp cites concerns about performance overhead versus any potential added value of inline deduplication or compression. It is up to the user to decide whether post-process deduplication is a real limitation and how much value compression might bring versus the potential for performance loss, which is critical in primary storage. In addition, older versions of Data ONTAP place restrictions on the amount of deduplicated data that any single volume can contain, currently up to 16TB.

In the future, NetApp will be raising this bar to about 50TB per volume. Potentially, most of the current performance limitations will become less of an issue as NetApp rides the same faster-processor wave that any Intel-based storage system benefits from. Deduplication also works only at the per-volume level, not across volumes; as a result, data on one volume that is identical to data on another volume will not be identified as redundant.

For many environments these are minor limitations, and despite them NetApp clearly has the market lead in primary storage deduplication and the most user case studies to reference. As stated earlier, the company deserves credit for leveraging the underpinnings of its existing operating system instead of reinventing the whole process. I believe this approach gives customers greater comfort as they use the deduplication feature.
