Storing Archival Data - Part Deux

Now that you've decided to build a real archive, you need to figure out where, both physically and technically, you're going to keep it. Archives are data Roach Motels -- data goes in but doesn't check out for a long time, which means it will outlast the 5-7 year useful life of most disk systems. Archive systems need to ensure data integrity beyond vendors' end-of-life declarations.

Howard Marks

May 27, 2009

4 Min Read

Now that you've decided to build a real archive you need to figure out where, both physically and technically, you're going to keep it. In the old days, archival storage meant hard copy. From the dawn of the computing age until at least the late eighties, storing digital data electronically was both too expensive and too risky. After all, it was way too easy to screw up a 9-track tape as you threaded it onto a drive.
Now hard copy doesn't just mean green bar printouts. By the mid '70s, computer output to microform (COM) systems were in wide use holding archival copies of financial statements and other important reports. As user-created data like word processing documents and emails had to be stored, tape and magneto-optical disks came to the fore. Today most organizations have moved to systems based on spinning magnetic disks -- but is that the right choice? To answer that question, let's start by looking at how archival data is different from active data and laying out the attributes that make a good storage system for archival data.
The primary difference between archival data and active data is that active data's contents are dynamic while an archive contains objects like emails and documents that have fixed contents. A user may edit and update a document for weeks, saving multiple interim revisions to a primary NAS, and each version will replace the previous version -- once the file is archived, each new version has to be saved, cataloged and indexed.
This nature of archival data means the storage device that holds it should be designed to store multiple versions of files. While this function can be implemented in the archiving software, having a storage system whose file system can store and track multiple versions expands the range of data movers you can choose from to include HSM or ILM tools not explicitly designed for archiving.
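To make the idea concrete, here's a minimal sketch in Python of a versioned, fixed-content store -- every ingest appends a new version rather than replacing the old one. The class and method names are invented for illustration, not any particular vendor's API.

```python
import hashlib
import time
from dataclasses import dataclass, field


@dataclass
class ArchivedVersion:
    """One immutable version of an archived object."""
    content: bytes
    sha256: str
    stored_at: float


@dataclass
class FixedContentStore:
    """Hypothetical versioned store: every save adds a version, nothing is overwritten."""
    versions: dict = field(default_factory=dict)  # object_id -> list of ArchivedVersion

    def ingest(self, object_id: str, content: bytes) -> ArchivedVersion:
        version = ArchivedVersion(
            content=content,
            sha256=hashlib.sha256(content).hexdigest(),
            stored_at=time.time(),
        )
        # Append, never replace -- earlier versions stay cataloged and retrievable.
        self.versions.setdefault(object_id, []).append(version)
        return version

    def history(self, object_id: str):
        """Return every cataloged version of an object, oldest first."""
        return list(self.versions.get(object_id, []))
```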
Retention enforcement, on the other hand, has to be a function of the storage device. In fact, retention enforcement was the raison d'etre for both the use of magneto-optical WORM disks and the development of specialized storage systems for archival data. Users shouldn't be allowed to delete or modify files and emails in the archive to cover their tracks when the boss is looking to find out who caused us to lose the Johnson account. Some industries, most significantly SEC-regulated broker-dealers and pharmaceutical companies (notice both traditionally organizations with money to spend), are required to keep their data in forms that can't be deleted or modified by anyone -- even administrators being threatened by senior executives. Therefore, storage systems should have two levels of retention enforcement. The first allows an administrator, hopefully after jumping through more hoops than just logging on as admin, clearing the RO flag and deleting the file, to remove the backup of Julie's iPod the data mover shifted to the archives. The second, for those highly regulated industries and lawsuit magnets like tobacco companies, shouldn't allow deletion or modification of any kind.
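A rough sketch of those two levels of enforcement, using made-up "governance" (privileged override allowed, with extra hoops) and "compliance" (nobody can delete early) modes -- the terminology and logic here are my own shorthand, not a specific product's:

```python
import time
from enum import Enum


class RetentionMode(Enum):
    GOVERNANCE = "governance"   # privileged admins can delete early, after extra hoops
    COMPLIANCE = "compliance"   # nobody, not even admins, can delete or modify


class RetentionError(Exception):
    pass


class RetainedObject:
    def __init__(self, data: bytes, retain_until: float, mode: RetentionMode):
        self._data = data
        self.retain_until = retain_until
        self.mode = mode

    def delete(self, is_privileged_admin: bool = False, override_approved: bool = False):
        if time.time() >= self.retain_until:
            self._data = None               # retention period expired: normal delete
            return
        if self.mode is RetentionMode.COMPLIANCE:
            raise RetentionError("compliance hold: no early deletion, period")
        # Governance mode: allow early deletion only with privilege plus an approval step.
        if is_privileged_admin and override_approved:
            self._data = None
            return
        raise RetentionError("governance hold: privileged, approved override required")
```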
The flip side of data retention is data destruction. At the end of the specified retention period, a fixed content storage system should actually destroy the data by discarding encryption keys or overwriting data blocks, index entries and other metadata -- not just marking the object deleted and the space as re-usable. Of course, this should be an option, as some organizations would rather have an un-delete function.
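The key-discarding approach is often called crypto-shredding: give each object its own encryption key and throw the key away when retention expires. A sketch of the idea, using the `cryptography` library's Fernet recipe as a stand-in (my choice of library, not something the storage systems discussed here specify):

```python
from cryptography.fernet import Fernet  # pip install cryptography


class CryptoShreddedObject:
    """Each object is encrypted under its own key; discarding the key destroys the data."""

    def __init__(self, plaintext: bytes):
        self._key = Fernet.generate_key()
        self._ciphertext = Fernet(self._key).encrypt(plaintext)

    def read(self) -> bytes:
        if self._key is None:
            raise ValueError("object destroyed: encryption key was discarded")
        return Fernet(self._key).decrypt(self._ciphertext)

    def destroy(self):
        """Crypto-shred: discard the key, leaving the stored blocks unreadable gibberish."""
        self._key = None
```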
Data integrity assurance goes hand in hand with retention enforcement. Retrieving a document from the archive only to discover it's corrupted, and that the critical paragraph that would prove the company followed all the rules and the CEO shouldn't be wearing an orange jumpsuit is now gibberish, would be bad. Data objects should be hashed going into the archival store, and the storage system should check data against these hashes periodically and on retrieval. If the hashes don't match, the system should retrieve another copy.
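That hash-on-ingest, verify-on-read pattern (often called a fixity check) looks roughly like this; `fetch_replica` is a placeholder for whatever mechanism pulls another independent copy:

```python
import hashlib


def ingest_hash(content: bytes) -> str:
    """Compute the fixity hash stored alongside the object at ingest time."""
    return hashlib.sha256(content).hexdigest()


def verified_read(content: bytes, expected_sha256: str, fetch_replica):
    """Return content only if it still matches its ingest hash; otherwise try a replica."""
    if hashlib.sha256(content).hexdigest() == expected_sha256:
        return content
    # Corruption detected: retrieve another independent copy and check that instead.
    replica = fetch_replica()
    if hashlib.sha256(replica).hexdigest() != expected_sha256:
        raise IOError("all retrieved copies failed their fixity check")
    return replica
```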
Which of course implies the system should store multiple independent copies, preferably in multiple locations. This can be through data scatter-and-gather technology like Cleversafe's or simple replication between multiple systems. Policies should allow admins to specify keeping x copies in each of y locations.
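A "keep x copies in each of y locations" policy could be expressed and checked along these lines; the location names and policy shape are invented for illustration:

```python
from collections import Counter

# Hypothetical policy: keep 2 copies in each of 3 locations.
POLICY = {"copies_per_location": 2, "locations": ["nyc", "chicago", "denver"]}


def placement_gap(current_placements: list[str], policy: dict) -> dict:
    """Return how many additional copies each location still needs to satisfy the policy."""
    have = Counter(current_placements)
    want = policy["copies_per_location"]
    return {loc: max(0, want - have[loc]) for loc in policy["locations"]}


# e.g. placement_gap(["nyc", "nyc", "chicago"], POLICY) -> {"nyc": 0, "chicago": 1, "denver": 2}
```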
Archives are data Roach Motels -- data goes in but it doesn't check out for a long time. While SarbOx and other general business regulations require five or so years of data retention, HIPAA and OSHA regulations require data be retained for 30 years or more under some conditions. Since the volume of data in an archive 20 years from now isn't something you can predict, the system has to be extremely scalable. Just supporting 1,000 hard drives in many shelves on a small processor cluster like most NASes isn't enough. This scalability can be provided with removable storage or a RAIN architecture, where many processing and storage nodes create a single storage cloud.
Long data retention also means data will live longer than the useful life of the hardware it's initially stored on. The system should accommodate this requirement by supporting multiple generations of storage media on the same device (say CD, DVD and Blu-Ray for 20 years of optical storage) or multiple generations of storage nodes in a RAIN configuration, so new nodes can be added to the cluster and old ones cycled out when the vendor declares them end of life.
Next time we'll look at the architectures vendors are using for fixed content storage, including CAS, RAIN, NAS and removable storage media.

About the Author(s)

Howard Marks

Network Computing Blogger

Howard Marks is founder and chief scientist at Deepstorage LLC, a storage consultancy and independent test lab based in Santa Fe, N.M., concentrating on storage and data center networking. In more than 25 years of consulting, Marks has designed and implemented storage systems, networks, management systems and Internet strategies at organizations including American Express, J.P. Morgan, Borden Foods, U.S. Tobacco, BBDO Worldwide, Foxwoods Resort Casino and the State University of New York at Purchase. The testing at DeepStorage Labs is informed by that real-world experience.
He has been a frequent contributor to Network Computing and InformationWeek since 1999 and a speaker at industry conferences including Comnet, PC Expo, Interop and Microsoft's TechEd since 1990. He is the author of Networking Windows and co-author of Windows NT Unleashed (Sams).
He is co-host, with Ray Lucchesi, of the monthly Greybeards on Storage podcast, where the voices of experience discuss the latest issues in the storage world with industry leaders. You can find the podcast at: http://www.deepstorage.net/NEW/GBoS
