Content-Addressable Storage

Content-addressable storage provides a solid foundation for data archiving, and major storage vendors are rolling out product offerings. We discuss the drivers for implementing CAS as well as what the

October 6, 2006

There's been a lot of buzz in legal circles recently about United States v KPMG LLP. Short story, the feds accused the accounting firm of cooking up illegal tax shelters for rich clients. What caught our eye isn't the $456 million the firm will pay or even the $2.5 billion in evaded taxes. We noticed the case thus far has generated, in electronic or paper form, 5 million to 6 million pages of discoverable documents, of all shapes, sizes and types. That, my friends, is a prime example of why data-retention and digital-discovery requirements have lit a fire under the normally staid archival market.

Vendors are touting CAS (content-addressable storage), a once-boutique technology, as a way to make discovery requests more manageable. In a nutshell, a CAS system locates data by an address that the array derives from the content itself, rather than by physical address or directory path. Because the CAS device completely abstracts data from the hardware on which it resides, documents can be found based on content, rather than by storage location.

The earliest entry into this space, EMC's Centera, first released in 2002, is still the clear market leader in terms of CAS-capable units--mainly because EMC was first with a strong play. We use the term market loosely: IT's initial reaction to the Centera product--and CAS storage in general--was apathy with a side of confusion. Lacking the compliance impetus that drives today's business, enterprises weren't inclined to tweak their shiny new storage-area networks. EMC also bore some blame: Centera then required extensive software modifications to systems.

Today, competitors big and small, including Caringo, Hewlett-Packard, Hitachi, IBM, Nexsan and Sun Microsystems, are bullish on CAS. We expect every major storage vendor to provide some iteration of CAS, albeit under the guise of a "complete archive-management system." Some have entries already, and we expect others to follow suit in the next 24 months.


[Figure: By the Numbers]

How It Works

A CAS system comprises storage nodes, where data is physically kept, and access nodes, where metadata and information on the data's location on the storage nodes are kept. As new documents are passed to a CAS device, they are hashed, then stored based on that hash rather than any kind of directory table. Data is retrieved by requesting the resulting hash. CAS can cut down on duplication, and thus storage space requirements. Documents with even a small change will be saved separately from the original copy--the new version will hash differently--providing digital fingerprinting and versioned storage. Some vendors use this capability to keep only one copy of a given data set, removing the duplicates usually found on standard location-addressed storage.
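The store-by-hash behavior described above can be sketched in a few lines of Python. This is a toy, in-memory illustration under our own assumptions, not any vendor's implementation: the `MiniCAS` class, its method names and the choice of SHA-256 are all hypothetical.

```python
import hashlib


class MiniCAS:
    """Toy in-memory content-addressed store (hypothetical, for illustration)."""

    def __init__(self):
        # Maps content address (hex digest) -> stored bytes.
        self._blobs = {}

    def put(self, data: bytes) -> str:
        """Hash the content and store it under that hash; return the address."""
        address = hashlib.sha256(data).hexdigest()  # SHA-256 is an assumption
        # Identical content hashes to the same address, so it is kept only once.
        self._blobs.setdefault(address, data)
        return address

    def get(self, address: str) -> bytes:
        """Retrieve content by requesting its hash."""
        return self._blobs[address]

    def __len__(self) -> int:
        return len(self._blobs)


store = MiniCAS()
original = store.put(b"Q3 invoice: total $1,000")
duplicate = store.put(b"Q3 invoice: total $1,000")  # same bytes, same address
revised = store.put(b"Q3 invoice: total $1,001")    # one character changed

print(original == duplicate)  # duplicates collapse to a single stored object
print(original == revised)    # an edited document gets its own address
print(len(store))             # number of distinct objects actually stored
```

Because the address is computed from the bytes, the duplicate costs no extra space, while the revised invoice is kept alongside the original rather than overwriting it.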

As you may have deduced, because of the additional hash and metadata processing, CAS is best suited for static documents. Thus the main use for CAS is data archiving. CAS' ability to track data, eliminate duplication and provide a foundation for archive management has never been more relevant: Companies are digitizing thousands of data types previously kept in analog format while also storing customer calls, security surveillance videos, invoices and more. CAS provides the rich metadata and data-change integrity features enterprises need to keep track of disparate data as it is marked for retention until a certain date and migrated to other storage tiers.

Another area that could benefit from CAS is the company e-mail store. Duplicate and litigation-sensitive data travels the e-mail system every moment of every day. Most e-mail archiving systems have the necessary hooks to work with offerings from major CAS vendors. On a larger scale, consider incorporating financial and company-created documents--or any data deemed vital to the business--from sources as varied as accounting programs and Word. Companies that face legal-discovery processes can benefit from the rich metadata tags CAS supplies.

Not So Fast

The story isn't all positive: Many CAS devices have significant shortcomings. Metadata standardization is nonexistent, for example. The SNIA (Storage Networking Industry Association) is creating a standard that will allow for the migration of XML-based metadata between different CAS systems, but those efforts are not yet complete. Keep an eye on SNIA and ask your vendors about plans to implement eventual CAS standards.

In addition, some vendors, such as Hitachi with its Archivas-based Content Archive Platform and Caringo with its forthcoming software, do not support the tracking and removal of duplicate data.

No single product available provides all the metadata, data manipulation and industry standards required for widespread use. However, development has been proceeding quickly, and we'll be watching upcoming versions of these products closely.

Acronyms Across America

In one of the curious dichotomies that happens primarily in the technology sphere, the market for what CAS provides is getting red hot just as the term is losing its luster. Make no mistake, the technology behind CAS is relevant and being implemented as the foundation of archival systems by virtually every major storage vendor. But some are avoiding the term, choosing to focus on sophisticated archival-management systems that have CAS as their foundation. That's OK--storage and archiving are confusing enough without tossing in another term. We're less concerned with labels than we are with making sure you understand the ramifications of adding CAS technology to your archival strategy.

First, let's examine its main benefits: the ability to track changes to business data, which provides a verifiable method of ensuring that data hasn't been altered for legal-discovery purposes; the ability to use metadata to track disparate file types, which lets IT migrate data to appropriate storage media as needed and retrieve it efficiently; and the ability to remove duplicate data, which can save disk space.


[Figure: Duplicate Data]

» Change tracking: By using change tracking, companies can show the evolution of a document. This is useful during legal discovery. Change tracking and content addresses are created from a hash routine. Because the paranoia police have declared hashing unreliable, nearly every CAS system allows for a new hashing algorithm to be applied if the one in use proves out of date.


[Figure: Disk Technologies]

The hashing function is the primary bottleneck of a CAS system in terms of performance, but many vendors are dedicating hardware to hashing functions or conducting background hashing during non-peak usage.
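One way to picture the swappable hashing algorithm mentioned under change tracking is to record the algorithm name as part of each content address. The sketch below is an assumption about how that could look, not a description of any shipping product; the `algorithm:digest` address format and the function names are hypothetical.

```python
import hashlib


def make_address(data: bytes, algorithm: str = "sha256") -> str:
    """Build an address that records which hash algorithm produced it."""
    digest = hashlib.new(algorithm, data).hexdigest()
    return f"{algorithm}:{digest}"  # hypothetical address format


def verify(data: bytes, address: str) -> bool:
    """Re-hash the content and confirm it still matches its address."""
    algorithm, _, expected = address.partition(":")
    return hashlib.new(algorithm, data).hexdigest() == expected


def readdress(data: bytes, old_address: str, new_algorithm: str) -> str:
    """Issue a new address under a newer algorithm, after checking integrity."""
    if not verify(data, old_address):
        raise ValueError("content no longer matches its recorded address")
    return make_address(data, new_algorithm)


memo = b"Draft engagement letter, v1"
old = make_address(memo, "sha1")        # address under the older algorithm
new = readdress(memo, old, "sha256")    # same content, newer algorithm

print(verify(memo, new))                # unaltered content still verifies
print(verify(memo + b" (edited)", old)) # altered content fails the check
```

The same `verify` step is what makes the stored hash useful as evidence that a document has not been altered. Re-hashing an entire archive is expensive, which is one reason to run it in the background during non-peak hours.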

» Metadata: When an enterprise has a rich metadata environment, the possibilities for search, categorization and mining of vital data extend as far as the eye can see. Location-addressable OSs don't store enough metadata to be useful in archiving. CAS serves as the foundation by which archiving operations can be performed.
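To make the metadata point concrete, here is a minimal sketch of a catalog that keys metadata records by content address and supports simple search and retention checks. The record fields (`content_type`, `source`, `retain_until`) and the helper names are our own illustrative assumptions; real products define their own metadata schemas.

```python
import hashlib
from dataclasses import dataclass
from datetime import date


@dataclass
class ArchiveRecord:
    """Hypothetical metadata kept alongside each stored object."""
    address: str
    content_type: str
    source: str
    retain_until: date


def archive(catalog, data, content_type, source, retain_until):
    """Hash the content and file its metadata under the resulting address."""
    address = hashlib.sha256(data).hexdigest()
    catalog[address] = ArchiveRecord(address, content_type, source, retain_until)
    return address


def find(catalog, **criteria):
    """Return records whose fields match every given criterion."""
    return [
        record for record in catalog.values()
        if all(getattr(record, key) == value for key, value in criteria.items())
    ]


def past_retention(catalog, today):
    """Return records whose retention date has passed."""
    return [r for r in catalog.values() if r.retain_until < today]


catalog = {}
call = archive(catalog, b"call recording 0001", "audio/wav",
               "call-center", date(2013, 10, 6))
invoice = archive(catalog, b"Invoice 1234", "application/pdf",
                  "accounting", date(2009, 10, 6))

print([r.address == invoice for r in find(catalog, source="accounting")])
print([r.source for r in past_retention(catalog, date(2010, 1, 1))])
```

Because the metadata hangs off the content address rather than a file path, the same record stays valid when the underlying object is moved to a different storage tier.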

CAS addresses other problems inherent in long-term archiving, such as media rot. Media rot is not simply the degradation of physical storage media; it also describes the ephemeral nature of technology. Many media last much longer than the equipment used to read them, for example. CAS makes it easy to move data from one repository to another, be that disk, optical or tape, eliminating most media rot issues. For more on media rot, see Strategic Info Management: Long-Term Storage.

» De-duplication: Data de-duplication--when only one copy of a given file is kept on the storage system--is not yet universally available on CAS devices. That's unfortunate because the implications for the efficient use of storage and cost savings are clear. We recommend asking about de-duplication if you're considering a device that has CAS functionality.

Who Needs CAS

Although the technology has been around for years, CAS offerings are relatively immature. Just as with storage virtualization, it's not the storage that gives the real business benefits, but the software that works on top of specialized CAS storage systems. For the foreseeable future, CAS adoption will likely be confined to large enterprises and specific vertical organizations, including government, health care, insurance, financial services, video/audio production and schools, simply because of the costs of implementation. Caringo is bucking that trend, hoping smaller companies will adopt its new CAStor software product, which runs on general servers and storage hardware.

Steven J. Schuchart, Jr., a former NWC technology editor, is an analyst for competitive intelligence firm Current Analysis. Write to him at [email protected].
