Network Computing is part of the Informa Tech Division of Informa PLC

FalconStor Heats up VTL Deduplication -- and the Data Deduplication War

In these difficult economic times, vendors have to do two things: 1) show prospective customers that spending scarce budget dollars on a vendor's products is economically justifiable and 2) show why a vendor's products should be preferred to those of its competitors.

Data Deduplication Attracts a Lot of Interest

One of the storage-related areas that have achieved increasing market success among enterprise customers over the last few years is disk-based backup, especially in conjunction with a virtual tape library (VTL). Lengthy downtime for 24x7 business-critical applications (and more and more apps seem to fall into that category) is unacceptable for both operational recovery and disaster recovery scenarios. VTLs offer significant performance and reliability advantages over tape (although tape is not going away) that have led to their increased adoption.

And what has proven to be an essential add-on for disk-based backups is data deduplication. Even though most recovery requests are for data created within the previous 24 hours, enterprises might like to keep backup data online for up to three weeks, a month or even longer. Some even choose to keep disk-based backups available for a year or more for compliance and other regulatory purposes. However, storing weekly full backups for any extended period of time is economically unsustainable -- particularly for companies counting their IT budget pennies.

Data deduplication offers organizations a way to back up their cake and eat it, too. For example, four weeks of full backups require approximately (with some caveats) 4X of data, where X is the data for one weekly backup. However, assuming that the weekly rate of change is moderate (as it is for many organizations), most of the data backed up from week to week remains the same, and so the amount of original data may be 1.1X or 1.2X of a full backup (where the average change is between 2% and 4% daily). Data deduplication is the process of squeezing the redundant data "water" out of backups for storage purposes; i.e., backing up only new or non-redundant information. This ensures a much more effective use of disk-based resources and a more efficient use of storage budget dollars.
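The arithmetic in that example can be sketched in a few lines of Python. This is only an illustration: the 10 TB size for one weekly full backup is a hypothetical value, and the 1.1X to 1.2X footprint is taken as an input from the text rather than derived from the daily change rate.

```python
def backup_storage(full_backup_tb, weeks, unique_factor):
    """Return (raw, deduplicated) capacity in TB for retained weekly fulls.

    unique_factor is the deduplicated footprint expressed as a multiple of
    one full backup (1.1X to 1.2X in the text); it is an input here.
    """
    raw = full_backup_tb * weeks
    deduped = full_backup_tb * unique_factor
    return raw, deduped


if __name__ == "__main__":
    x = 10.0  # hypothetical size of one weekly full backup, in TB
    for factor in (1.1, 1.2):
        raw, deduped = backup_storage(x, weeks=4, unique_factor=factor)
        print(f"{factor}X: raw={raw:.0f} TB, deduped={deduped:.0f} TB, "
              f"ratio={raw / deduped:.1f}:1")
```

Under those assumptions, four retained weekly fulls shrink from 40 TB to roughly 11 or 12 TB, a reduction of a little better than 3:1. Longer retention periods would push the ratio higher, since the raw total grows with every retained full while the unique data grows far more slowly.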

So data deduplication is a critical tool for disk-based backup and has also attracted attention for active production and active archiving purposes. Vendors "get" the value of the technology, too. The dueling offers between NetApp (currently $1.9 B, up from $1.5 B) and EMC ($1.8 B) for Data Domain have validated data deduplication and Data Domain's claim to fame as a provider of deduplicated disk-based backup storage for multi-vendor environments. Both EMC and NetApp have a generally solid track record with acquisitions, coupling a strong vetting process with a successful assimilation process. This brings to mind the old saying "Talk is cheap, but whiskey costs money," leading to the conclusion that whoever purchases Data Domain should buy a lot of high-quality Data Domain "whiskey."

But NetApp is not the first player to pursue acquisitions in this space. EMC acquired Avamar in 2006 and has been actively expanding the use of that technology across its solution portfolio, such as its recently announced use with NetWorker. And just over a year ago, IBM acquired Diligent Technologies, which is also noted for its data deduplication capabilities. There are a number of other key players in the space, including CommVault, FalconStor, HP, Quantum, and Sepaton. The Data Protection Initiative of the Data Management Forum within the Storage Networking Industry Association (SNIA) recently published the Data Protection Initiative Members Buyer's Guide Fifth Edition Spring 2009, which, among other things, talks about VTLs and data deduplication. (SNIA is also the source of a lot of other good information on data deduplication.)

FalconStor Throws Down the Performance Gauntlet

Today, let's focus on one of the leading players in the data deduplication space, FalconStor. With its notable OEM relationships (in addition to direct sales), FalconStor can rightfully claim market share leadership for VTLs, even though there are other strong players in the space. But the market is fiercely competitive and FalconStor is not resting on its laurels.

In fact, the company made a recent announcement that's likely to roil the industry, claiming that its deduplication solutions deliver the fastest total time to disaster recovery (DR) in virtual tape library (VTL) environments. Total time to DR can be expressed as the sum of the time to back up a given amount of storage, the time to produce physical tape if required, the time to deduplicate the data to a repository (as well as the time to replicate the deduplication repository to a remote site, which can be simultaneous with the deduplication process), and the time to restore from the deduplication repository.
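As a rough sketch, that definition can be written as a small Python function. The class name, field names, and the choice to model simultaneous replication as the longer of the two overlapping durations are assumptions made for illustration, not FalconStor's published formula.

```python
from dataclasses import dataclass


@dataclass
class DrPhases:
    """Durations in hours for each phase; all field names are illustrative."""
    backup: float
    dedupe: float
    replicate: float
    restore: float
    tape: float = 0.0  # zero when no physical tape copy is required


def total_time_to_dr(p: DrPhases, replicate_during_dedupe: bool = True) -> float:
    # When replication overlaps deduplication, the longer of the two bounds
    # that stage; otherwise the two run back to back.
    if replicate_during_dedupe:
        dedupe_stage = max(p.dedupe, p.replicate)
    else:
        dedupe_stage = p.dedupe + p.replicate
    return p.backup + p.tape + dedupe_stage + p.restore


if __name__ == "__main__":
    # Hypothetical durations, chosen only to show the effect of overlap.
    phases = DrPhases(backup=8, dedupe=10, replicate=6, restore=9)
    print(total_time_to_dr(phases))                                 # 27
    print(total_time_to_dr(phases, replicate_during_dedupe=False))  # 33
```

With these made-up durations, overlapping replication with deduplication removes the six replication hours from the total entirely, which is why the ability to run the two at once matters to the overall figure.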

While most benchmarks are performed in (and apply to) the hardware industry, FalconStor felt that, as a software company, it needed a reference point for establishing the capabilities of its solutions. Hence, creating a reference architecture for a customer became the impetus for FalconStor to build a reference environment in which tests could be conducted to measure and define the performance characteristics of its products in enterprise-class data center environments.

FalconStor used the latest version of its VTL with deduplication in conjunction with Symantec NetBackup using Symantec's OpenStorage (OST) API. Now, although FalconStor sometimes bundles software into an appliance for the convenience of its customers, it is primarily a software company. Consequently, for this test, FalconStor simply used open systems servers and commodity SATA storage. The test data set was 100 terabytes.

Measuring Performance

FalconStor uses a process called target deduplication, meaning that the backup is written to disk and that the dedupe process works on data that has been written to that disk. The ingest speed (i.e., how fast the backup data is written to disk) is the critical first step of the backup life cycle as, even today, meeting backup windows remains the number-one backup challenge in the enterprise (and is a fundamental reason for the growth in popularity of disk-based backup).

The disk on which the backup is written can be considered a "scratch pool" as it holds only the latest backup (but might also be used to restore data). The actual pool that holds the deduplicated data is a block-level single instance repository (SIR) in FalconStor terminology. A SIR is not to be confused with single instance storage (SIS). SIS is a data reduction technique that keeps single instances at the file level, whereas a SIR keeps a single instance at the block level, in keeping with the fact that data deduplication operates at the sub-file level.
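The difference between file-level and block-level single instancing can be illustrated with a toy Python sketch. It assumes fixed-size blocks and SHA-256 fingerprints purely for illustration; the article does not describe how FalconStor's SIR chunks or fingerprints data, and the function names are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4  # tiny, for illustration only; real systems use far larger blocks


def file_level_store(files):
    """File-level single instancing: keep one copy per distinct whole file."""
    store = {}
    for data in files:
        store.setdefault(hashlib.sha256(data).hexdigest(), data)
    return store


def block_level_store(files, block_size=BLOCK_SIZE):
    """Block-level single instancing: keep one copy per distinct block."""
    store = {}
    for data in files:
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            store.setdefault(hashlib.sha256(block).hexdigest(), block)
    return store


def stored_bytes(store):
    return sum(len(v) for v in store.values())


if __name__ == "__main__":
    week1 = b"AAAABBBBCCCCDDDD"
    week2 = b"AAAABBBBCCCCEEEE"  # same file with only the last block changed
    print(stored_bytes(file_level_store([week1, week2])))   # 32
    print(stored_bytes(block_level_store([week1, week2])))  # 20
```

Because the two versions differ in a single block, file-level instancing treats them as two unrelated files and stores all 32 bytes, while block-level instancing stores the three shared blocks once plus the two distinct final blocks, for 20 bytes.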

This scratch pool is the price (up to 1X of extra storage) of performing target deduplication instead of source deduplication. Without inciting a religious war, this is a price that FalconStor customers and many other enterprises may very well be willing to pay instead of front-end server overhead.

FalconStor's reference environment used paired clustered VTL nodes. Each node handled its own workload, but the nodes were paired in an active-active configuration so that if one node went down, the other would pick up the workload (albeit at degraded performance). In total, four SIR nodes were used to deduplicate data from the scratch pool to the single global repository. Replication to a remote site (which can be simultaneous with the deduplication process) was at 4 Gb/s, which is 500 MB/s. FalconStor felt that there was no bottleneck, as there was ample headroom to replicate data as it was created.

For the test, FalconStor states that the 100 TB backup was completed in less than 10 hours and that the 100 TB pool was deduped in 14 hours. Note that these times are parallel and not sequential. The dedupe process has to wait until the first virtual tape has been written to disk, but then it can start to operate. Note that a customer could also elect to write to physical tape before the dedupe process starts. The policy engine that governs these processes also allows excluding certain backup jobs that are not suitable for deduplication (video streams, for example).
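To see why the parallel-versus-sequential distinction matters, here is a minimal Python sketch using the stated figures. It treats 10 hours as an upper bound on the backup time, and the half-hour delay before deduplication starts is a hypothetical value, since the article does not say how long the first virtual tape took to write. The overlap model also assumes deduplication never stalls waiting for data.

```python
def implied_rate_tb_per_hour(data_tb, hours):
    """Average throughput implied by moving data_tb in the given time."""
    return data_tb / hours


def elapsed_hours(backup_h, dedupe_h, start_delay_h=0.0, parallel=True):
    """Elapsed time until both backup and deduplication are finished."""
    if not parallel:
        return backup_h + dedupe_h
    # Overlapped: dedupe starts after a short delay and runs alongside backup.
    return max(backup_h, start_delay_h + dedupe_h)


if __name__ == "__main__":
    print(f"{implied_rate_tb_per_hour(100, 10):.1f} TB/h ingest (at least)")
    print(f"{implied_rate_tb_per_hour(100, 14):.1f} TB/h dedupe")
    print(elapsed_hours(10, 14, parallel=False))     # 24
    print(elapsed_hours(10, 14, start_delay_h=0.5))  # 14.5
```

Run back to back, the two phases would take up to 24 hours. Overlapped, the elapsed time is governed by the slower deduplication pass, so under these assumptions both finish in roughly 14.5 hours.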

At the deduplication repository (either locally or remotely at a DR site), the data restore speed was 1.2 GB/s per node or 4.3 TB/hour per node. The time to restore the full 100 TB via two VTL nodes was 11.6 hours (and could have been performed faster if additional nodes were added to decrease restore time).
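The restore figures can be checked with a short Python calculation. It assumes decimal units (1 TB = 1000 GB), and the four-node case further assumes that throughput scales linearly with node count, which the article suggests but does not quantify.

```python
def gb_per_s_to_tb_per_hour(gb_per_s):
    # Decimal units: 1 TB = 1000 GB, 1 hour = 3600 s.
    return gb_per_s * 3600 / 1000


def restore_hours(data_tb, nodes, gb_per_s_per_node):
    """Hours to restore data_tb, assuming linear scaling across nodes."""
    return data_tb / (nodes * gb_per_s_to_tb_per_hour(gb_per_s_per_node))


if __name__ == "__main__":
    print(f"{gb_per_s_to_tb_per_hour(1.2):.2f} TB/hour per node")  # 4.32
    print(f"{restore_hours(100, 2, 1.2):.1f} hours on two nodes")  # 11.6
    print(f"{restore_hours(100, 4, 1.2):.1f} hours on four nodes") # 5.8
```

The first two results match the stated 4.3 TB/hour per node and 11.6 hours for two nodes. The four-node result is only a projection under the linear-scaling assumption.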

While there are no objective benchmarks (such as the SPEC benchmarks for measuring server performance) to help evaluate FalconStor's test, the company's performance numbers seem very impressive, particularly when one considers that 1 TB/hour backups were state of the art not that many years ago. So what does this all mean for the VTL market with data deduplication? First, we believe FalconStor deserves credit for being willing to step forward publicly and make performance claims (although at least one other vendor also makes performance claims related to its deduplication solutions).

This can be of particular value to businesses attempting to quantify the technical and economic value of deduplication. However, it may also signal the beginning of another "SPEC benchmark" war. Although those benchmarks eventually became a formal, well-recognized process, no recognized process exists for measuring the effect of data deduplication on backup and restore, so some vendors may consider FalconStor's test an explicit challenge. How will they respond? They might: 1) claim that performance doesn't matter, 2) claim that the test was not comparable to what a real configuration would require, 3) ignore it entirely, or 4) claim that their performance is good enough, equal, or superior to FalconStor's.

How should existing or prospective deduplication customers evaluate those responses? To say that performance doesn't matter flies in the face of the evidence. To ignore it gives FalconStor a competitive advantage in seeking business. To fault the test procedures is a common tactic (with SPEC tests, special configurations that were unrealistic in the real world were sometimes used). But nothing on a prima facie basis suggests that FalconStor did anything out of the ordinary.

So what is really left for competitors is to create and perform similar analyses of their own products and present the data in ways that best represent their case. And that debate process can only be good for the industry and customers. Thanks, FalconStor!