Disc Storage Choices for Archives

Earlier in our ongoing saga, I spelled out all but one of the attributes I'd like to see in the storage system that holds my archival data so later generations can read this blog and bask in my insight and brilliance. OK, more like so LucasFilm can access all the outtakes from "Indiana Jones and the Geriatric Crusades" for the 30th anniversary limited-edition, director's-cut ultraviolet-Ray disc. Now let's look at how these features fit the architectures vendors have presented to IT managers as solutions.

Howard Marks

June 4, 2009

3 Min Read

While a NetApp filer checks the integrity of each block with hashes as it's read, it doesn't do it in the background and can't call up a copy of a locally corrupted block from a remote copy when it finds a problem. Scalability and long-term expansion are also issues, as adding drive trays and migrating data every 5-7 years when your vendor will no longer support it aren't great solutions.
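To make the verify-on-read limitation concrete, here's a minimal sketch (illustrative Python, not any filer's actual code) of a block store that only checks a stored checksum at read time. Corruption in blocks nobody reads goes unnoticed, and without another replica there's nothing to repair from -- the gaps background scrubbing and remote copies are meant to close.

```python
import hashlib

# Hypothetical sketch: checksums verified only on the read path.
class ChecksummedBlockStore:
    def __init__(self):
        self._blocks = {}  # block_id -> (data, checksum)

    def write(self, block_id: int, data: bytes) -> None:
        self._blocks[block_id] = (data, hashlib.sha256(data).hexdigest())

    def read(self, block_id: int) -> bytes:
        data, checksum = self._blocks[block_id]
        if hashlib.sha256(data).hexdigest() != checksum:
            # Without a remote copy, all we can do here is report the error;
            # repair would require reading from another replica.
            raise IOError(f"block {block_id} failed integrity check")
        return data
```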

The other major player for disk-based archiving is Content Addressable Storage (CAS), which uses the hash of each stored object (file, email message, etc.) as the primary identifier for that object, rather than the file's location as NAS systems do. Contrary to popular belief, CAS systems don't use full-text indexes as their addressing scheme -- just object hashes. In fact, most CAS systems, including EMC's Centera, Nexsan's Assureon and Caringo's CAStor, don't index their contents.
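Here's a minimal sketch of the content-addressing idea, assuming SHA-256 as the digest (the real products make their own choices); the names are illustrative, not any vendor's API. The point is that the object's address *is* its hash, not a path or inode:

```python
import hashlib

# Hypothetical CAS sketch: the hash of the content is the object's address.
class ContentAddressedStore:
    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()
        self._objects[address] = data  # identical content -> identical address
        return address

    def get(self, address: str) -> bytes:
        return self._objects[address]

store = ContentAddressedStore()
addr = store.put(b"outtake reel 7, take 42")
assert store.get(addr) == b"outtake reel 7, take 42"
```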

They do inherently implement single-instance storage (since multiple copies of the same file will generate the same hash) and typically check file hashes for integrity in the background. Most CAS systems also have the ability to store extended metadata beyond the name, owner and timestamps most file systems support. As a result, most have complex APIs for file storage and retrieval, requiring archiving software vendors to write and test interfaces. SNIA has a standard XML API called XAM that should start appearing on CAS and other fixed-content storage systems in the next year or so.
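Both properties fall out of the addressing scheme. Single-instance storage is automatic in the sketch above (a second put of the same bytes lands on the same key), and a background scrub needs no separate checksum, because the address doubles as one. A hedged sketch, assuming objects are keyed by their SHA-256 digest as above:

```python
import hashlib

# Sketch of a background integrity scrub over a hash-addressed object map.
def scrub(objects: dict[str, bytes]) -> list[str]:
    """Return addresses whose content no longer matches its hash,
    flagging those objects for repair from another copy."""
    return [addr for addr, data in objects.items()
            if hashlib.sha256(data).hexdigest() != addr]
```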

CAS vendors make a big deal of the extended metadata, and they do have a point. Functions like e-discovery and data classification for ILM (yes, it's a real concept, just not a viable product) need more than just names and dates to make decisions. I'm just not convinced that a special file system with an API is needed to store it. Archiving software or content management systems can just as easily put the metadata, and that all-important full-text index, in a database independent of the file system.
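A quick sketch of that alternative, using SQLite purely for illustration (table and column names are mine, and FTS5 availability depends on how SQLite was built): extended metadata and the full-text index live in an ordinary database keyed by path or hash, with the file system none the wiser.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Extended metadata lives in a plain table, outside the file system.
conn.execute("""CREATE TABLE archive_meta (
    path TEXT PRIMARY KEY, owner TEXT, retained_until TEXT, classification TEXT)""")
# The all-important full-text index is just another table.
conn.execute("CREATE VIRTUAL TABLE archive_text USING fts5(path, body)")

conn.execute("INSERT INTO archive_meta VALUES (?,?,?,?)",
             ("/archive/2009/q2/report.doc", "hmarks", "2016-06-04", "financial"))
conn.execute("INSERT INTO archive_text VALUES (?,?)",
             ("/archive/2009/q2/report.doc", "quarterly results and outlook"))

# e-discovery-style query: full-text hit joined back to retention metadata.
rows = conn.execute(
    """SELECT m.path, m.retained_until
       FROM archive_text JOIN archive_meta AS m ON m.path = archive_text.path
       WHERE archive_text MATCH 'quarterly'""").fetchall()
print(rows)
```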

Some vendors have built NAS-like devices that use hashes to ensure data integrity and identify data as unique the way CAS systems do, while not using the hash as the primary address of the data object. Systems like Data Domain's appliances and NEC's Hydrastor are thought of as backup targets, but their feature sets match up to archiving applications as well. Data Domain's boxes do data retention and destruction just like any CAS. Permabit's Enterprise Archive similarly uses hashes to help manage NAS data.

Most of these systems use the RAIN (redundant array of independent nodes) architecture, where a cluster/grid of 1U and 2U servers with internal storage holds and manages the data distributed across the array. Some use ingest/retrieval nodes that handle hashing and accepting data, and storage nodes that just hold data. Others have many peers that do both.
If fully implemented, the RAIN model can accommodate hundreds of nodes for huge scalability. It also allows new nodes, with faster processors and bigger disks, to be added to an array -- the data on older, slower and/or sick nodes is automatically relocated, and then the old nodes are removed, all with a small number of clicks or commands. However, most RAIN systems have a relatively high processor-to-spindle ratio, which adds to power consumption and may make large archives that are rarely accessed expensive.
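One way a grid can pull off that graceful grow-and-retire trick is consistent hashing, sketched below with illustrative names (no particular vendor's scheme): adding or removing a node relocates only the objects whose ring position changes, not the whole archive.

```python
import bisect
import hashlib

# Hypothetical sketch of consistent-hash placement across RAIN nodes.
class HashRing:
    def __init__(self, nodes, vnodes=64):
        # Each node gets many virtual points on the ring for even spread.
        self._ring = sorted(
            (self._h(f"{n}:{i}"), n) for n in nodes for i in range(vnodes))
        self._keys = [k for k, _ in self._ring]

    @staticmethod
    def _h(s: str) -> int:
        return int(hashlib.sha256(s.encode()).hexdigest(), 16)

    def node_for(self, obj_address: str) -> str:
        # First ring point at or after the object's hash, wrapping around.
        i = bisect.bisect(self._keys, self._h(obj_address)) % len(self._ring)
        return self._ring[i][1]

ring = HashRing(["node1", "node2", "node3"])
print(ring.node_for("9f2c0b...objecthash"))  # e.g. 'node2'
```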

Well, I thought I was going to get to passive storage this time, but I've babbled on long enough for one post. Next time: Don't be afraid of optical disk and tape for archiving.


About the Author(s)

Howard Marks

Network Computing Blogger

Howard Marks is founder and chief scientist at Deepstorage LLC, a storage consultancy and independent test lab based in Santa Fe, N.M., concentrating on storage and data center networking. In more than 25 years of consulting, Marks has designed and implemented storage systems, networks, management systems and Internet strategies at organizations including American Express, J.P. Morgan, Borden Foods, U.S. Tobacco, BBDO Worldwide, Foxwoods Resort Casino and the State University of New York at Purchase. The testing at DeepStorage Labs is informed by that real-world experience.

He has been a frequent contributor to Network Computing and InformationWeek since 1999 and a speaker at industry conferences including Comnet, PC Expo, Interop and Microsoft's TechEd since 1990. He is the author of Networking Windows and co-author of Windows NT Unleashed (Sams).

He is co-host, with Ray Lucchesi, of the monthly Greybeards on Storage podcast, where the voices of experience discuss the latest issues in the storage world with industry leaders. You can find the podcast at: http://www.deepstorage.net/NEW/GBoS
