De-Duping VM images

Unexpected rewards can come from combining data de-duplication, virtualization, and extended read caching using RAM and flash memory

April 6, 2009


We've all suffered from time to time with the law of unintended consequences. Your company buys you a shiny new smartphone, and now you're expected to reply to emails in the middle of the first game at the new Yankee Stadium. You upgrade to new, faster tape drives and discover your Exchange backups take longer because the server can't keep up and the drive is shoe-shining.

Occasionally, however, we get blessed by the computing gods and rather than being stuck with unexpected consequences we get unexpected rewards. Such is the case when we combine three of the hottest technologies in the market today -- data de-duplication, virtualization, and extended read caching -- using RAM and flash memory.

Virtual machine images in most organizations, regardless of the hypervisor the organization chooses, contain lots of duplicate data. Fifty Windows VMs will each have 2 GB to 4 GB of common DLLs and other system files that any decent de-duplication scheme can reduce by at least 90 percent.
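To make that claim concrete, here's a toy sketch of block-level de-duplication: hash fixed-size blocks and store each unique block once. This is not any vendor's actual algorithm, and the image contents, block size, and 1 MB of "common system files" are made-up stand-ins for those shared DLLs.

```python
import hashlib
import random

def dedupe_ratio(images, block_size=4096):
    """Fraction of fixed-size blocks that must actually be stored
    after de-duplicating across a set of VM image byte strings."""
    seen, total = set(), 0
    for img in images:
        for off in range(0, len(img), block_size):
            total += 1
            seen.add(hashlib.sha256(img[off:off + block_size]).hexdigest())
    return len(seen) / total

# Fifty hypothetical "VM images": 1 MB of identical system files each,
# plus one unique 4 KB block per VM.
random.seed(0)
common = random.randbytes(1 << 20)
images = [common + bytes([i]) * 4096 for i in range(50)]
print(f"stored fraction: {dedupe_ratio(images):.3f}")  # only ~2% of blocks survive
```

With 50 near-identical images, only the first copy of each shared block is kept, which is where the 90-percent-plus reduction comes from.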

When I first started thinking about de-duping VM images, I thought condensing 50 or more virtual machine images -- especially in VDI-like desktop VM environments -- would create an I/O hot spot as all those VMs try to access the same data. As I researched an article on primary storage data reduction for next week's issue of InformationWeek, I realized, with a little help from some vendors, two truths about frequently accessed, de-duped data like VM images.

The first is that de-duped data is, almost by definition, static data that's heavily accessed only for reads. As soon as someone changes one copy of de-duped data, that copy ceases to be de-duped. The second is that, since de-duping increases how often a given data block is accessed, it also increases the probability that block will be cached. Combine de-dupe with a big read cache and you could really boost performance.

Vendors have started coming up with fancy ways to extend their read cache, too. NetApp's PAM uses standard DRAM, as opposed to the sexier but slower flash, to provide 16 GB of extended read cache on a PCI-E card. Lower-end filers can take up to two PAM cards, but the top-of-the-line FAS6070 can handle up to 10 for 160 GB of additional cache.
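The second point -- that de-dupe concentrates accesses onto fewer physical blocks and so raises cache hit rates -- can be illustrated with a small LRU-cache simulation. The workload below is invented (50 VMs each reading 1,000 "system file" blocks), not a measurement of any real array.

```python
import random
from collections import OrderedDict

def hit_rate(accesses, cache_size):
    """Simulate an LRU read cache; return the fraction of accesses that hit."""
    cache, hits = OrderedDict(), 0
    for block in accesses:
        if block in cache:
            hits += 1
            cache.move_to_end(block)        # mark as most recently used
        else:
            cache[block] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)   # evict least recently used
    return hits / len(accesses)

# 50 VMs each reading 1,000 system-file blocks. Without de-dupe every VM
# touches its own physical copies; with de-dupe all 50 VMs resolve to the
# same 1,000 shared blocks.
random.seed(0)
no_dedupe = [(vm, b) for vm in range(50) for b in range(1000)]
deduped = [b for vm in range(50) for b in range(1000)]
random.shuffle(no_dedupe)
random.shuffle(deduped)

cache = 2000  # cache far smaller than the 50,000 un-deduped blocks
print(f"without de-dupe: {hit_rate(no_dedupe, cache):.2f}")  # 0.00
print(f"with de-dupe:    {hit_rate(deduped, cache):.2f}")    # 0.98
```

Un-deduped, each physical block is read once and the cache never helps; deduped, everything after the first read of each shared block is a hit.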

Sun's Readzilla technology uses up to six 100-GB flash SSDs to build an extended read cache in their ZFS-based storage servers, including the X4540 that GreenBytes uses as the platform for their Cypress de-duping server. Six hundred gigabytes of read cache should be enough to cache the system drives for a substantial VM deployment.

Since both GreenBytes' ZFS-based Cypress and NetApp's filers store iSCSI -- and, in NetApp's case, Fibre Channel -- LUNs in their file systems, the benefits of caching and de-duping are available regardless of the storage protocol you choose for your VMs.

InformationWeek Analytics has published an independent analysis of the challenges around enterprise storage. Download the report here (registration required).

Howard Marks is chief scientist at Networks Are Our Lives Inc., a Hoboken, N.J.-based consultancy where he's been beating storage network systems into submission and writing about it in computer magazines since 1987. He currently writes for InformationWeek, which is published by the same company as Byte and Switch.
