Rapid Backup And Retrieval With Riverbed's Whitewater


Howard Marks

November 28, 2011


Cloud storage brings cost-effective offsite backup and retrieval capabilities to small and midsize businesses, but using cloud storage as an external disk is not as effective as it seems, particularly as the data set grows. It takes time to send and retrieve files to a cloud storage provider and often means using another set of tools to do so. Riverbed's Whitewater appliance makes cloud storage appear as a backup target and balances the competing needs of speedy read/writes and long-term bulk storage. By combining local file caching and deduplication with cloud storage replication, IT gets the best of both worlds.

Small and midsize organizations frequently face a challenge getting their backup data offsite on a regular basis. The traditional solution has been to periodically send backup tapes to a records warehouse run by Iron Mountain, Recall or the like. That creates a lot of work for backup administrators, as they have to create and manage additional jobs to duplicate backup data to tapes for offsite storage, then box up the tapes and deal with the courier. The result is that most midsize companies send tapes offsite only once a week, leaving their data vulnerable.

Organizations with more than one data center can install deduplicating backup appliances and replicate their backup data from one data center to another while deduplicating over the WAN, assuming, of course, that they're using the same brand appliance in each data center. Organizations with just one data center have to either stick to tape or use an Internet-based backup service.

Online backup services like LiveVault and eVault have been available for 15 years or more, but many organizations have been uncomfortable converting from the backup software they've spent months getting to work properly and handing their backup process to a third party lock, stock and barrel. As general-purpose cloud storage providers like Amazon S3 and Nirvanix have emerged, IT professionals have looked for the option to use their existing backup software to send their backup data offsite to a cloud provider. Add to all this the fact that most cloud storage providers charge significantly less than their online backup brethren to store each gigabyte for a month, and cloud backup starts looking very attractive.

Backup applications have started adding the ability to duplicate backup data to cloud providers, but most don't deduplicate your data before sending it to the cloud. Since you're going to be paying the cloud provider of your choice for each and every gigabyte of data they store for you each month, deduplication would definitely be a good idea.

Riverbed's Whitewater appliances act as gateways between your backup applications and the cloud provider of your choice. They look to your backup application like network-attached storage, accepting data via SMB and/or NFS. Once you send your data to the Whitewater appliance, it compresses, deduplicates and encrypts your data using AES and sends it off to your cloud provider. Unlike Riverbed's network deduplication products, Whitewater does not re-hydrate your data at the cloud storage provider.
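To make that data path concrete, here is a minimal Python sketch of what a dedupe-to-cloud gateway does per block. It is our illustration rather than Riverbed's code; the 4-Kbyte chunking and AES come from the product description, while the SHA-256 fingerprints, AES-GCM mode and the `cloud_put` callback are assumptions made for the example.

```python
import hashlib
import os
import zlib

# Requires the third-party "cryptography" package for the AES illustration.
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

CHUNK_SIZE = 4 * 1024                       # Whitewater works on 4-Kbyte blocks
KEY = AESGCM.generate_key(bit_length=256)   # on the appliance the key is configured, not generated per run

seen = set()                                # fingerprints of blocks already stored

def ingest(stream, cloud_put):
    """Read a backup stream; dedupe, compress and encrypt new blocks, then ship them."""
    aes = AESGCM(KEY)
    while True:
        block = stream.read(CHUNK_SIZE)
        if not block:
            break
        fingerprint = hashlib.sha256(block).hexdigest()
        if fingerprint in seen:             # duplicate block: nothing leaves the appliance
            continue
        seen.add(fingerprint)
        payload = zlib.compress(block)      # shrink the block before encrypting
        nonce = os.urandom(12)              # a real system would store the nonce with the object
        cloud_put(fingerprint, nonce + aes.encrypt(nonce, payload, None))
```

Because nothing is re-hydrated at the provider, the cloud only ever holds compressed, encrypted blocks keyed by fingerprint.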

The Whitewater product line ranges from a virtual appliance starting at a list price of $7,995 to three physical appliances with 2, 4 and 8 Tbytes of usable disk space, respectively, starting at a list price of $23,995. Since the disk in the appliance is just a cache, there's no technical limit on how much data you can store on a Whitewater appliance. Whitewater appliances can store their data on most public cloud providers, including AT&T Synaptic Storage, Amazon S3, Microsoft Windows Azure, Rackspace Cloud Files and Nirvanix. You can also use them to front-end a private object storage infrastructure built on EMC's Atmos or OpenStack Swift.

Whitewater deduplicates backup data inline, storing it to the local cache and sending the deduplicated 4-Kbyte data blocks to the cloud storage provider as quickly as it can over your Internet connection. When the local Whitewater cache fills, it overwrites the least recently accessed data on the local cache first, on the assumption that the oldest data is the least likely to be recalled.
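The eviction policy just described is, in effect, least-recently-used (LRU). A minimal sketch of that bookkeeping, again ours rather than Riverbed's and assuming the cache is a simple fingerprint-to-block map:

```python
from collections import OrderedDict

class BlockCache:
    """LRU cache of deduplicated 4-Kbyte blocks kept on the appliance's local disk."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()                # fingerprint -> block, oldest first

    def get(self, fingerprint):
        block = self.blocks.get(fingerprint)
        if block is not None:
            self.blocks.move_to_end(fingerprint)   # touching a block makes it "recent" again
        return block

    def put(self, fingerprint, block):
        self.blocks[fingerprint] = block
        self.blocks.move_to_end(fingerprint)
        while len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)        # evict the least recently used block
```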

The Whitewater appliance compares incoming data only to the local cache to see if a data block is a duplicate; it does not compare incoming data to blocks already stored in the cloud service. If you perform a backup containing data blocks that were deduplicated once but have since aged out of the local cache, those blocks will be stored to the cache and the cloud provider again, since the Whitewater appliance sees them as new data. Ultimately, you can end up with duplicate data blocks in the cloud, and the longer you retain backups, the more new and duplicate data blocks will be stored. Naturally, if the backup file is deleted locally, it is deleted from the cache and the cloud as well.
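That behavior falls directly out of checking duplicates against the local index only. In sketch form, reusing the `BlockCache` above and a hypothetical `upload` function:

```python
def store_block(cache, fingerprint, block, upload):
    """Dedupe against the local cache only, never against what the cloud already holds."""
    if cache.get(fingerprint) is not None:
        return                      # duplicate of a cached block: nothing is sent
    # The block may well already sit in the cloud from an earlier backup,
    # but once it has aged out of the cache the appliance can't know that,
    # so it is cached and uploaded again.
    cache.put(fingerprint, block)
    upload(fingerprint, block)
```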

At restore time, the Whitewater appliance will reassemble your data using data chunks from its cache, which will probably include all the blocks from last night's backup, and the performance is what you would expect for a deduplicating backup appliance. When you restore older backup data, some blocks may have aged out of the cache, so the Whitewater will retrieve those chunks from your cloud storage provider. If more than 80% of the data you're restoring is in cache, you should get good restore performance. As the amount of data that needs to be retrieved from the cloud increases, performance will be limited by your Internet connection.
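Restores follow the same split: blocks still in the cache come back at local-disk speed, while anything that has aged out has to come over the Internet link. A simplified sketch, our code with a hypothetical `cloud_get`:

```python
def restore(cache, fingerprints, cloud_get):
    """Reassemble a backup image from cached blocks, falling back to the cloud for misses."""
    data, misses = [], 0
    for fp in fingerprints:
        block = cache.get(fp)
        if block is None:
            block = cloud_get(fp)   # slow path: fetch over the Internet connection
            cache.put(fp, block)
            misses += 1
        data.append(block)
    hit_rate = 1 - misses / len(fingerprints)
    return b"".join(data), hit_rate # hit rates above ~80% restore at near-local speed
```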

In the event of a Whitewater appliance failure or bigger disaster, users can just set up a new Whitewater and start restoring their data. Since there's a virtual appliance version, you can start restoring data without waiting for Riverbed to overnight a replacement appliance.

Riverbed sent us the top-of-the-line Whitewater 2010 to test, and we ran it through its paces at the DeepStorage.net lab in Lyndhurst, N.J. We backed up our production file server, which holds about 720 Gbytes of assorted Office documents, software install points and our collection of World War II training films. We then used the Whitewater GUI and an SNMP monitor on our Internet gateway to see how the Whitewater reduced the data and sent it to the Amazon S3 instance Riverbed set up for our testing.

On the initial backup, the Whitewater's GUI indicated that it reduced the data about 2.1-to-1. Given that we had sent it a collection of compressed files and media, along with the Office files, in our initial backup, we got the level of deduplication we expected. We then ran some scripts to introduce about 2% new and changed files and backed the data up again, for a total of two full backups. With each backup, the Whitewater reported our deduplication ratio climbing as it should, ending up at about 4-to-1.
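Those figures hang together. A quick back-of-the-envelope check, assuming the 2% of new and changed data reduced at roughly the same 2.1-to-1 as the initial full backup:

```python
full = 720             # Gbytes in each full backup of our file server
initial_ratio = 2.1    # reduction the Whitewater reported on the first full
changed = 0.02         # fraction of new and changed files introduced before the second full

stored = full / initial_ratio + (full * changed) / initial_ratio
logical = 2 * full                                  # two full backups presented to the appliance
print(round(stored), round(logical / stored, 1))    # ~350 Gbytes stored, ~4.1-to-1 overall
```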

Checking the S3 site, we saw that Amazon also reported roughly 350 Gbytes of data stored for our 1.4 Tbytes of backups, which told us the Whitewater UI wasn't lying to us. Of course, sending that initial 350 Gbytes of data took a while over the lab's 50-Mbps down/10-Mbps up cable modem connection, but the Whitewater kept the link at 90%-plus saturation over the time it took to upload. We would advise Whitewater users to set priorities in their routers to allow the Whitewater to send data without bogging down other applications, though that wasn't a problem on our asymmetrical link.
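To put "a while" in perspective, a rough calculation assuming decimal gigabytes and a sustained 90% of the 10-Mbps uplink:

```python
stored_gb = 350                                # what ended up in S3 after deduplication
uplink_mbps = 10 * 0.9                         # ~90% sustained utilization of the 10-Mbps uplink
seconds = stored_gb * 8 * 1000 / uplink_mbps   # Gbytes -> gigabits -> megabits, divided by Mbps
print(round(seconds / 86400, 1))               # roughly 3.6 days for the initial upload
```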

When we switched to performance testing, we had difficulty getting our test system to ingest data at its full rated speed of 1 Tbyte per hour, though we did manage to back up data at more than 720 Gbytes per hour. We had this problem at least in part because we had to manually allocate backup streams across the Whitewater's four Ethernet ports. Since we completed our testing, Riverbed has updated the Whitewater software to support NIC teaming, which should simplify the process and make it easier to cram data into the higher-end Whitewaters quickly.

All in all, we're quite pleased with the Whitewater but would like to see Riverbed enhance the reporting functions. You can see how much data is flowing into and out of the appliance, but not per Ethernet port. You can see how much data is waiting to be replicated, but not how long that replication is going to take. Finally, we'd like to see some reporting on how data is distributed between the cache and the cloud storage back end. Knowing that restores from any backup made in the last 10 days will come completely from cache would be reassuring.

How We Tested

We connected all four of the 1-Gbps Ethernet ports on the Whitewater 2010 Riverbed provided for testing to an Extreme Networks X480 switch. We then configured four CIFS shares on the Whitewater for use as disk targets from our media servers.

We used two servers, one Dell PowerEdge 2950 and one Dell PowerEdge R710, as media servers running Backup Exec 2011 R3 to send data to the Whitewater. Each server had four Gigabit Ethernet ports in two NIC teams. One NIC team was used to collect data from the source servers while the other was used to send data to the Whitewater.

In addition to backing up data from the media servers' local disks, we backed up test data from an additional eight servers running Windows Server 2008 R2, including three Supermicro Xeon E3 servers equipped with Micron SSDs to maximize backup throughput.

About the Author(s)

Howard Marks

Network Computing Blogger

Howard Marks is founder and chief scientist at DeepStorage LLC, a storage consultancy and independent test lab based in Santa Fe, N.M., concentrating on storage and data center networking. In more than 25 years of consulting, Marks has designed and implemented storage systems, networks, management systems and Internet strategies at organizations including American Express, J.P. Morgan, Borden Foods, U.S. Tobacco, BBDO Worldwide, Foxwoods Resort Casino and the State University of New York at Purchase. The testing at DeepStorage Labs is informed by that real-world experience.

He has been a frequent contributor to Network Computing and InformationWeek since 1999 and a speaker at industry conferences including Comnet, PC Expo, Interop and Microsoft's TechEd since 1990. He is the author of Networking Windows and co-author of Windows NT Unleashed (Sams).

He is co-host, with Ray Lucchesi, of the monthly Greybeards on Storage podcast, where the voices of experience discuss the latest issues in the storage world with industry leaders. You can find the podcast at: http://www.deepstorage.net/NEW/GBoS
