Data Reduction for Primary Storage
Compression and de-duplication are about to enter the primary storage arena
May 7, 2008
In the past few years, data reduction technologies like compression and more recently data de-duplication have become quite popular, especially for use in backup and archiving. Can this trend continue into primary storage?
In backup, where there is a great deal of redundant data, data reduction technologies have seen mass adoption. In just a few short years, data de-duplication has gone from an obscure term to a well-known one in the data center. Its ability to eliminate redundant segments of data has provided great benefit to backup storage and some types of archive storage. For backup data, assuming a weekly full backup, a 20X storage efficiency quotient is not uncommon.
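To make that math concrete, here is a rough sketch of a toy block-level de-duplicating store in Python. The 4-Kbyte fixed block size and SHA-256 fingerprints are illustrative choices, not tied to any particular product; the point is simply why repeated full backups reduce so well while unique data barely reduces at all.

```python
import hashlib, os

BLOCK_SIZE = 4096  # illustrative fixed block size; real products vary

class DedupStore:
    """Toy content-addressed store: an identical block is kept only once."""
    def __init__(self):
        self.blocks = {}   # fingerprint -> block data
        self.logical = 0   # bytes written by clients
        self.physical = 0  # bytes actually stored

    def write(self, data: bytes):
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            fp = hashlib.sha256(block).hexdigest()
            if fp not in self.blocks:      # only unique blocks consume capacity
                self.blocks[fp] = block
                self.physical += len(block)
            self.logical += len(block)

    def ratio(self):
        return self.logical / max(self.physical, 1)

# Twenty weekly fulls of a data set that barely changes reduce to roughly 20X.
backup = DedupStore()
full_backup = os.urandom(1_000_000)        # stand-in for one full backup image
for week in range(20):
    backup.write(full_backup)
print(f"backup-style efficiency: {backup.ratio():.1f}X")   # ~20X

# Primary storage, where almost every block is unique, barely reduces at all.
primary = DedupStore()
primary.write(os.urandom(1_000_000))
print(f"primary-style efficiency: {primary.ratio():.1f}X") # ~1X
```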
Primary storage is different
Unfortunately, moving de-duplication into primary storage isn't as simple as shifting its location. The following outlines the particular requirements of primary storage that need to be considered when planning de-duplication:
1. Primary storage is performance-sensitive. Primary storage is active, and if the implementation of data de-duplication impacts performance in the production environment, it will not be acceptable. Either the de-duplication technology must be so efficient and fast that it does not affect production performance, or it has to be done out of band, on files that are not immediately active.
The ideal is a near-production data set that is de-duplicated as a background process, removing the possibility of any performance impact. It would also make sense for this technology to de-duplicate and compress at different levels of efficiency -- the greater the data reduction level, the greater the chance of a performance impact when the data is read back in. While it would be great to have an inline system fast enough to reduce the data set without impacting performance, that technology does not exist today.

2. Primary storage is unique. The other challenge to reducing data on primary storage is that the data is generally unique. This is a very different situation from backup data. In a backup, especially when doing a full backup every day or week, there is a high level of data redundancy. While production data may have some commonality -- for example, “extra” copies of the same database -- for the most part it is not nearly as redundant as backup data or even archive data.
As disk-based archiving and disk backups become more common, they are actually causing even less redundant data to be kept on primary storage. In the past there was value in keeping a couple of extra copies of a database or set of files on primary storage “just in case.” Now those copies can be very easily sent to disk archives or disk backup devices. (This is a good thing!)
Note: The current user expectation of 20X or greater storage efficiency should not even be considered for primary storage. A more realistic goal is 3X to, at most, 5X.
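To illustrate the out-of-band approach described in the first point, the sketch below walks a file tree as a background task, skips anything modified recently (the "immediately active" data), and fingerprints the rest. The 24-hour idle threshold and the /primary/projects mount point are assumptions for illustration, not a description of any shipping product.

```python
import hashlib, os, time

AGE_THRESHOLD = 24 * 3600   # assumption: only touch files idle for a full day

def background_crawl(root, index):
    """Walk a tree out of band, fingerprinting whole files that are no longer
    'immediately active,' so production I/O never waits on the hashing work."""
    now = time.time()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue                    # file vanished mid-crawl; skip it
            if now - st.st_mtime < AGE_THRESHOLD:
                continue                    # still active; leave it alone
            digest = hashlib.sha256()
            with open(path, "rb") as fh:
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
            index.setdefault(digest.hexdigest(), []).append(path)

# Files are duplicates of one another if they share a fingerprint; a real
# product would replace the extras with references rather than just list them.
index = {}
background_crawl("/primary/projects", index)   # hypothetical mount point
duplicates = {fp: paths for fp, paths in index.items() if len(paths) > 1}
print(f"{len(duplicates)} groups of duplicate files found")
```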
3. Primary storage is compressed. In addition to being unique, much of primary storage data is already in some pre-compressed format. Files such as images, media files, and industry-specific data sets like SEG-Y are already pre-compressed. Even the data files from the latest version of popular office productivity applications are pre-compressed. These pre-compressed files often represent the largest data set in the enterprise and the one with the fastest data growth.
To deal with this uniqueness and the pre-compressed nature of production data, a successful primary storage data reducer will have to “dig a little deeper.” While inline data reduction has a clear advantage in the backup and archive categories, production storage is an area where out-of-band management of the process might be more valuable.

Without the pressure to reduce data at inline speeds, time can be taken to examine a complex compound document and look for similarities within a file and across the millions of files in the storage environment. This behind-the-scenes treatment of data also allows time to be invested in understanding how specific formats -- .jpg, for example -- are stored; how that data becomes embedded into another document (for instance, a PowerPoint presentation); and how both the original data and its embedded occurrences might best be optimized for data reduction.
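As a hedged illustration of what "digging a little deeper" could look like: the newer office document formats are ZIP containers, so an out-of-band optimizer can open them, fingerprint the embedded media (the .jpg pasted into a PowerPoint deck, for example), and run a quick trial compression to decide whether a stream is already compressed and not worth squeezing again. The 95 percent threshold and the quarterly.pptx file name below are illustrative assumptions.

```python
import hashlib, zipfile, zlib

def already_compressed(sample, threshold=0.95):
    """Cheap heuristic: if a trial deflate barely shrinks a sample, the data
    is almost certainly pre-compressed (a JPEG or other media stream, say)."""
    if not sample:
        return False
    return len(zlib.compress(sample, 6)) / len(sample) > threshold

def embedded_media_fingerprints(path):
    """Newer office documents (.pptx, .docx, .xlsx) are ZIP containers; hash
    the embedded media parts so the same image pasted into many documents can
    be recognized as a single object."""
    results = {}
    with zipfile.ZipFile(path) as zf:
        for name in zf.namelist():
            if "/media/" in name:               # e.g. ppt/media/image1.jpg
                data = zf.read(name)
                fp = hashlib.sha256(data).hexdigest()
                results[name] = (fp, already_compressed(data[:65536]))
    return results

# Hypothetical usage: the same logo embedded in hundreds of presentations shows
# up as one fingerprint, and parts that are already compressed are left alone.
for name, (fp, skip) in embedded_media_fingerprints("quarterly.pptx").items():
    print(name, fp[:12], "skip recompression" if skip else "reduction candidate")
```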
4. Primary storage is getting cheaper. The final challenge to data de-duplication on primary storage is the continual erosion of disk drive prices. The very condition that essentially killed HSM and later ILM may also be a detriment to the implementation of data reduction on primary storage. With 1 Tbyte SATA drives becoming available from the top-tier storage manufacturers, it may be deemed easier to simply buy larger capacity shelves of storage.
Getting value from reducing primary storage
First, to see an appreciable return on investment from de-duplicating primary storage, the data set being processed will have to be large -- probably greater than 20 Tbytes. For example, reducing 50 Tbytes to 10 Tbytes is far more interesting than reducing 10 Tbytes to 2 Tbytes, even though both represent a 5X reduction.
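A quick back-of-the-envelope calculation shows why the absolute capacity freed matters more than the ratio; the cost per terabyte used here is purely an assumed figure for illustration.

```python
COST_PER_TB = 500.0   # hypothetical fully burdened cost per Tbyte, in dollars

def savings(before_tb, after_tb):
    ratio = before_tb / after_tb
    freed = before_tb - after_tb
    return ratio, freed, freed * COST_PER_TB

for before, after in [(50, 10), (10, 2)]:
    ratio, freed, dollars = savings(before, after)
    print(f"{before} TB -> {after} TB: {ratio:.0f}X, "
          f"{freed} TB freed, roughly ${dollars:,.0f} avoided")
# Both cases are 5X, but 50 TB -> 10 TB frees five times as much capacity.
```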
Second, factors other than physical storage cost have to be considered. By increasing storage efficiency, power and floor space requirements are reduced as well, and in many data centers the single biggest challenge is finding space and power.

Additionally, and especially with an out-of-band solution, if the data can optionally be read back in its reduced form, this could have a significantly positive impact on backup storage and network bandwidth utilization.
Transmitting and storing heavily compressed and optimized data should enable a measurable reduction in backup windows and backup storage. Data could still be sent to a disk-based inline data de-duplication technology that can eliminate redundant occurrences of the compressed data (multiple weekly full backups, as an example). Data that is compressed in this fashion becomes more portable, and it is more conducive to being sent across a WAN segment.
Theoretically, if you can make a 500-Gbyte external drive store 2 Tbytes worth of data, this is ideal for companies that need to send large projects from one facility to another. Finally, this data can be recovered very quickly for the same reasons; compressed data will consume less bandwidth going back across the network.
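As a rough illustration -- assuming a 100-Mbit/s WAN link, which is an assumption rather than anything from the vendors discussed here -- a 4:1 reduction cuts the transfer of that project set from nearly two days to well under half a day.

```python
LINK_MBPS = 100                      # assumed WAN link speed, in megabits/sec

def transfer_hours(gigabytes, mbps=LINK_MBPS):
    bits = gigabytes * 8 * 1000**3   # decimal gigabytes to bits
    return bits / (mbps * 1000**2) / 3600

raw_gb, reduced_gb = 2000.0, 500.0   # 2 Tbytes reduced 4:1 onto a 500-Gbyte drive
print(f"raw:     {transfer_hours(raw_gb):.1f} hours over the WAN")      # ~44 hours
print(f"reduced: {transfer_hours(reduced_gb):.1f} hours over the WAN")  # ~11 hours
```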
In addition, the technology to do this cannot be restricted to a single volume or limited to a single array controller. It will have to be leveraged across multiple array controllers from multiple manufacturers to increase the chances of redundant matches.
Implementation methods
Given all the considerations we’ve itemized, let’s look at a range of potential methods for implementing primary storage data reduction.

Storage system-based implementations will be attempted by the primary storage providers and will typically rely on a background crawl of the data; some products may also feature inline, live data reduction. Some of these methods will be limited to comparing data at the volume level, or at most comparing volumes within a single array. This limits the amount of redundancy that can be found, since there is a smaller sample set to compare against. (When attempting data de-duplication, the broader the data set, the better the chances of higher storage efficiency.) The de-duplication announced earlier this year by NetApp is an early example of this method.
The inline compression method will be attempted by independent suppliers not tied to the big storage providers. The challenge with this approach will be one of focus. If a reduction product is involved in every disk I/O transaction and attempts to compress everything, latency may result. Also, if a system works inline, it may not be able to discern pre-compressed data and find a way to optimize it further. If a system ignores pre-compressed data, it may be ignoring the largest and fastest growing segments of data. Startup Storwize typifies this approach.
Mixed mode optimization will also be brought to market by third parties. With a mixed mode system, an out-of-band “system walk” will identify data suitable for reduction. Specific data types will be identified and the appropriate level of data reduction selected for each. This solution also has the ability to collect its data set across multiple volumes and storage systems, even if those storage systems are from different manufacturers. And a mixed mode system will be able to apply different levels of data reduction based on access patterns, increasing efficiency the older the file becomes.
A mixed mode system will also feature an in-band reader, so that when a file needs to be accessed, users won’t experience the read delay that would otherwise result from retrieving the file from the reduced data set.
Mixed-mode architectures will also be able to move compressed data to an alternate volume on the same storage system or to a different system altogether. This capability allows the same tool used for optimal storage efficiency of near-production data to be leveraged as the primary mover of data for archiving to a less expensive tier. The result is a highly optimized data management strategy that not only moves the data to a less expensive disk tier, but also optimizes that tier for reduced data storage.

If moving 50 Tbytes of data from Fibre Channel disk to 50 Tbytes of SATA disk is interesting, then moving that 50 Tbytes from FC drives to 10 Tbytes of SATA storage ought to be irresistible. Startup Ocarina offers an example of this method.
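To sketch how the access-pattern-driven part of a mixed mode system might decide what to do with each file, the example below maps last-access age to a reduction level during an out-of-band walk. The tier boundaries, level names, and path are illustrative assumptions, not any vendor's actual policy engine.

```python
import os, time

# Illustrative policy tiers: (minimum days since last access, reduction level)
POLICY = [
    (365, "heavy"),     # format-aware recompression plus cross-file de-duplication
    (90,  "moderate"),  # block-level de-duplication plus standard compression
    (30,  "light"),     # compression only, cheap to read back
    (0,   "none"),      # recently active files are left untouched
]

def reduction_level(path, now=None):
    """Pick a data-reduction level from the file's last-access age."""
    now = time.time() if now is None else now
    age_days = (now - os.stat(path).st_atime) / 86400
    for min_days, level in POLICY:
        if age_days >= min_days:
            return level
    return "none"

# An out-of-band walker would tag every file with its level, then hand the
# "heavy" and "moderate" candidates to the reducer -- or to an archive tier.
for dirpath, _dirs, files in os.walk("/primary/projects"):   # hypothetical path
    for name in files:
        path = os.path.join(dirpath, name)
        print(reduction_level(path), path)
```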
Conclusion
For data de-duplication to deliver a meaningful ROI, it needs to be deployed across multiple storage system platforms, handling specific nuances of the production data set without being intrusive to the environment. As the technology stands right now, the mixed mode optimization method comes closest to meeting these goals.
NetApp Inc. (Nasdaq: NTAP)
Ocarina Networks
Storwize Inc.