State of the Art: Data Naming
Employees create the files, but IT must design a workable naming policy to comply with regulations and avoid storage chaos.
September 16, 2005
The easiest approach, and the one many vendors prefer, is to add more junk drawers: segregate stored files by user, department or some other granular criterion into more discretely named file folders or storage volumes. That way, the thinking goes, we won't waste time searching for that PowerPoint file with the complex drawing we'd like to reuse in a new deck of slides we're developing.
[Chart: Data-Naming Techniques]
Over time, however, file proliferation defeats even the best "segregate by storage device" methodology. File-folder contents proliferate, and seeking a specific file becomes futile. Backup processes strain to the breaking point: not knowing which files are critical, IT is forced to back up everything. The result? Backups can't be accomplished within operational windows--and, worse, restores take forever.
Bottom line: Storage-centric approaches don't attack the core problem of data management--data naming.
What's in a Name? Everything
If files were named consistently and intelligently, with a bit or code that lets you sort them into coherent classes, much of the pain of file management would go away. Unfortunately, in a data democracy, users run the show when it comes to data naming, and this authority seldom comes with a sense of responsibility. Users name files whatever they want, often in ways that provide no guidance to automated data-management processes such as hierarchical storage management, backup/restore and content tracking.
[Chart: Classification Scheme]
Early efforts to get users to comply with a consistent data-naming scheme have taken the form of corporate policies combined with descriptions of file types. One large oil company, for example, issued users 2-inch binders containing mock-ups of typical documents--memos, letters, PowerPoint presentations--with instructions to compare files they've created to these templates. When the users found a close match, they were to include a 16- to 32-character alphanumeric string in the file name corresponding to the template. The idea was that any automated process could reference the string and cherry-pick files to be included in backups or moved to different storage repositories based on automated data-management schemes.
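As a rough sketch of how an automated process might cherry-pick files by such an embedded code (the bracketed token format, the class codes and the backup tiers below are invented for illustration, not the oil company's actual scheme):

```python
import re
from pathlib import Path

# Hypothetical convention: a 16- to 32-character class code embedded in
# the file name, e.g. "q3-forecast[FIN-CRIT-RET7Y-001].ppt". The token
# format and the class table below are illustrative assumptions.
CLASS_CODE = re.compile(r"\[([A-Z0-9-]{16,32})\]")

BACKUP_TIERS = {
    "FIN-CRIT-RET7Y-001": "nightly",  # critical financials, 7-year retention
    "MKT-REF-RET1Y-002": "weekly",    # marketing reference material
}

def backup_tier(path: Path) -> str:
    """Return the backup tier implied by a file's embedded class code."""
    match = CLASS_CODE.search(path.name)
    if match is None:
        return "bulk"                 # unclassified files get swept up whole
    return BACKUP_TIERS.get(match.group(1), "bulk")

print(backup_tier(Path("q3-forecast[FIN-CRIT-RET7Y-001].ppt")))  # nightly
```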
The users balked at the strategy's intrusiveness. Road warriors bristled at having to carry the binder in their cramped airplane carry-ons, while cubicle-bound employees simply ignored the binder because it added a time-consuming step to their daily practices. Lacking any draconian enforcement measures, such as firing uncooperative employees, the strategy went nowhere. Today, the oil company continues to search for a less disruptive method of imposing a data-naming scheme on end-user files.
File Classification
For many other companies, developing a naming policy is stymied at an even earlier stage: coming up with a coherent data-classification scheme. Files can't be treated as an anonymous data set; they vary in importance and utility, depending on the business context in which they're created and how frequently they're accessed. Creating a classification scheme for naming data starts with a painstaking analysis of the context of file creation to discern common classification attributes, such as file criticality, privacy requirements, access and update frequency, and retention needs.
Files inherit their classification attributes from the business processes that generate and use them. For example, file criticality--how important a file is in a disaster-recovery situation--is linked to the criticality of the business process itself. Files that support a critical business process in the wake of an unplanned interruption must be identified to ensure they're included in backups and assigned a higher priority for restoration.
Such a context analysis also can help identify files that must be afforded special privacy guarantees or retained for legal or regulatory reasons for a set time. Still other files must be pegged for deletion after a certain period to demonstrate an orderly program of "data decommissioning" in accordance with Securities and Exchange Commission rules and other legal mandates. Files containing or describing intellectual property or trade secrets also must be treated in a special manner.
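To make the retention idea concrete, here's a minimal sketch of how class-level retention attributes might drive automated decommissioning; the class names and retention periods are invented, not drawn from the regulations themselves:

```python
from datetime import date, timedelta

# Illustrative retention rules; real periods come from legal and
# records-administration review, not from IT.
RETENTION_DAYS = {
    "sec-correspondence": 6 * 365,   # e.g. an SEC-style multiyear mandate
    "patient-record": None,          # retain indefinitely
    "scratch": 90,                   # routine decommissioning candidate
}

def disposition(data_class: str, created: date, today: date) -> str:
    days = RETENTION_DAYS.get(data_class)
    if days is None:
        return "retain"
    if today - created > timedelta(days=days):
        return "delete"              # eligible for orderly decommissioning
    return "retain"

print(disposition("scratch", date(2005, 1, 1), date(2005, 9, 16)))  # delete
```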
Wouldn't you love to be able to segregate truly useful file classes from junk files and contraband data that too often consume a large percentage of expensive storage resources? Simplistic strategies for dealing with contraband data, such as performing undifferentiated sweeps to delete all music and video files, increasingly interfere with legitimate business applications that sometimes require the use of such data.
It's a huge challenge to identify a file's utility over time. In general, files are accessed far less often as they age, so a classification scheme must include frequency of access over time as a file-classification attribute to migrate stale data out of the storage repository.

Access requirements are tough to assess with current tools. HSM (hierarchical storage management) software for the distributed environment, for example, migrates data based on file-creation date, date last modified or date last accessed. Distributed HSM lacks a critical function that exists in the mainframe world: the ability to count the number of times a file or data set has been accessed since it was last checked. Absent this information, it's difficult to distinguish rarely accessed files that can be safely migrated to archival storage from those that are referenced frequently but not modified--so-called "reference data," such as Web site content.
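If distributed file systems did expose a mainframe-style access counter, an HSM policy could use it to tell stale files from reference data. A minimal sketch under that assumption--the field names and thresholds are invented:

```python
from dataclasses import dataclass

@dataclass
class FileStats:
    days_since_modified: int
    accesses_since_last_check: int   # the counter distributed HSM lacks

def migration_target(stats: FileStats) -> str:
    """Illustrative HSM tiering using a hypothetical access counter."""
    if stats.days_since_modified < 180:
        return "primary"      # still changing: leave it where it is
    if stats.accesses_since_last_check > 10:
        return "reference"    # read often, written rarely: keep online
    return "archive"          # genuinely stale: safe to migrate off

print(migration_target(FileStats(400, 0)))    # archive
print(migration_target(FileStats(400, 250)))  # reference
```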
Juxtapose this challenge against another: Certain legal requirements, such as those embodied in HIPAA (Health Insurance Portability and Accountability Act) and various state laws, require that patient files be retained for decades or, in some cases, for as long as the patient is alive. Such retention periods far outstrip the life expectancy of any contemporary storage device, so the data must be named in a way that lets it be migrated from one platform to another in an orderly fashion over a span of many years. Without a data-naming scheme, the only way for companies to meet this requirement is to tag the data in question. Currently, the industry is providing "sticky technologies" for accomplishing this goal, requiring you to buy expensive proprietary CAS (content-addressable storage) products such as EMC's Centera or Nexsan Technologies' forthcoming solution. Such offerings create even more complexity in IT environments.
Defining a file-classification scheme requires groupthink. IT must work hand in hand with business-process owners, corporate legal departments, business-continuity planners and records administrators. Think of it as the ultimate crossover comic book: alien versus predator, but without the fun special effects.
Management buy-in is essential. In part, this is because IT typically lacks the authority to cross boundaries between lines of business to arrive at a comprehensive enterprisewide management solution. Additionally, a senior management mandate is often required to convince operations managers that the effort is more than academic. Some managers perceive a threat in the scrutiny of their business processes and workflows, as it might expose underlying operational inefficiencies. The data-classification effort, with its methodological similarities to business-process re-engineering efforts of the past, conjures bad memories. Therefore, we must convince participants that the analytical effort is intended not to evaluate efficiency, but only to identify data inputs and outputs.
To derive a data-classification scheme, organize business processes' component tasks and workflows into a coherent list. Once those elements are pinned down, their supporting applications and the data they produce can be carefully mapped.

The result forms a grid: List data objects in rows. Then, after discussions with stakeholders, organize attributes into columns. Criticality, access frequency, retention requirements, deletion requirements, special handling requirements and other attributes should be defined and associated with each data object (see "Classification Scheme").
When you sort this list to find attributes held in common by different data objects, data classes will emerge. Deriving a data-classification scheme is really that simple--and that complex.
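A toy illustration of that sorting step: represent the grid as attribute tuples keyed by data object, then group objects whose tuples match. The object names and attribute values are invented:

```python
from collections import defaultdict

# Toy classification grid: rows are data objects, columns are attributes
# (criticality, access frequency, retention). All values are invented.
grid = {
    "ar-invoices":    ("critical",    "daily",  "7y"),
    "gl-ledger":      ("critical",    "daily",  "7y"),
    "press-releases": ("noncritical", "rarely", "1y"),
    "draft-memos":    ("noncritical", "rarely", "1y"),
}

# Grouping objects by shared attribute tuples makes the classes emerge.
classes = defaultdict(list)
for obj, attrs in grid.items():
    classes[attrs].append(obj)

for attrs, objs in classes.items():
    print(attrs, "->", objs)
# ('critical', 'daily', '7y') -> ['ar-invoices', 'gl-ledger']
# ('noncritical', 'rarely', '1y') -> ['press-releases', 'draft-memos']
```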
No fixed number of classes can be defined, and opinions vary widely about the efficacy and value of having greater or fewer classes. One thing seems certain: The effort to define data classes must be undertaken by enterprises individually. Despite initiatives from organizations such as the Storage Networking Industry Association to come up with a generic data-classification scheme, every enterprise has a unique set of business processes, so developing a one-size-fits-all approach is unrealistic.
Finally, users themselves--business managers and their staffs--must be involved in developing the file-naming scheme. Users know the business context of the files they produce. IT's role is to help implement data management based on that scheme.
Automating the Process
The troublesome aspect of data naming, as suggested above, is finding ways to apply the scheme to files at the time they're created by the user. Such strategies have a huge social-engineering component and don't guarantee that the proper class identifier will be applied consistently to all new data.
It can be helpful to arrange alternative solutions on a chart based on their disruptiveness--how much end-user involvement they require--and the degree to which their implementation depends on procedural, as opposed to technological, means (see "Data-Naming Techniques").
Manual naming schemes are the most procedural, disruptive techniques. Such strategies force users to adhere to a policy using a naming convention reference aid, such as the aforementioned binder. Similar strategies include compelling users--at the time the file is being saved--to complete the optional document-information screens found in many productivity applications.
These approaches have a poor track record. Where document-detail screens are turned on in corporate applications, for example, end users typically disable them. The only way to make such a system work is to allocate staff and time--two things most IT departments have in short supply--to review or spot-check adherence on an ongoing basis and resolve questionable categorizations. No formula or empirical data is available on the resource requirements for such labor-intensive monitoring.
Another approach is to capture user output into a set of directories appropriate to each file type. Global name-space system purveyors, such as NuView, have been pressing this alternative for some time. A global name space is a persistent display of directory folders that doesn't change when the back-end physical infrastructure is modified. In the course of setting up such a name space, you can create a permanent set of file folders for containing discrete types of data. The problem is that this strategy tends to capture files into directories where they don't really belong, because it treats all output from a given user or department as the same class of data. Still, you may be able to make this scheme work by combining it with intelligent text-searching capabilities--at least until something better comes along.
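A minimal sketch of such a capture scheme, pairing suffix-based folder routing with a crude text check; the folder layout, suffix map and keyword are assumptions, not NuView's implementation:

```python
from pathlib import Path

# Assumed global-name-space layout: fixed folders per file type, plus a
# naive content scan to catch files that routing by type alone misfiles.
NAMESPACE_ROOT = Path("/gns")
BY_TYPE = {".ppt": "presentations", ".doc": "documents", ".xls": "spreadsheets"}

def destination(f: Path) -> Path:
    folder = BY_TYPE.get(f.suffix, "unsorted")
    try:
        # Crude keyword check standing in for "intelligent text search".
        if "confidential" in f.read_text(errors="ignore").lower():
            folder = "restricted"
    except OSError:
        pass                         # unreadable or missing: route by type
    return NAMESPACE_ROOT / folder / f.name

print(destination(Path("q3-results.ppt")))  # /gns/presentations/q3-results.ppt
```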
System Solutions
File-system replacement is yet another potential technological solution. Over the past 10 years, IBM, Microsoft and Oracle have been talking about replacing conventional file systems with a database. This would make it possible to organize data better, search it faster and associate specific attributes with specific objects so they could be selected and managed in groups. If the replacement is transparent to users, it might let us organize files to the same degree we can now arrange structured data.
In the case of Microsoft's recent efforts to replace its file system with a SQL-based WinFS in its next-generation OS, pushback from pundits and consumers has made Redmond rethink its implementation timetable. If WinFS appears at all, it will not happen until at least 2008. Concerns include SQL Server's ability to scale with growing file repositories and the inherent problems of incorporating all file data (including files not produced by Microsoft applications) into the resulting repository.
Similarly, suppliers of document-management software, which is now recast as ECM (enterprise content management), are offering their preferred modality as the solution to file-management woes. Spokespersons from Documentum, FileNet and others suggest they could cross-reference files to an external content-management or file-metadata catalog, forestalling the need to come up with an approach to make files self-describing. With such a cross-referencing system in place, we could just replace those pesky office applications with predefined workflow screens and incorporate all those recalcitrant user files into a well-run document-management system that could offer many of the features found in mainframe HSM, including an access-frequency counter.
This outboard strategy is worth considering, but pay attention to setup, management and supporting-infrastructure requirements. It's critical to ensure, for example, that the metadata reference catalog doesn't become a choke point in normal file access or create a single point of failure for the overall data-naming scheme.

Still other technological solutions include modifying file systems to carry additional attribute bits--like the ones we use to make files "read-only," "archived" or "hidden"--but referenced to data classes. Microsoft operating systems, for example, have recently added bit attributes for including certain files in fast indexing schemes, or for encrypting or compressing files that have those bits turned on. Extending the number and type of attribute bits to include data classes might have value, as long as users don't have to turn the bits on or off themselves.
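As a sketch of what class-referenced attribute bits might look like alongside the familiar ones (the class names and their semantics are invented):

```python
from enum import Flag, auto

# Hypothetical extension of the familiar file-attribute bits with
# class bits that a management tool, not the user, would set and test.
class FileAttr(Flag):
    READ_ONLY       = auto()
    HIDDEN          = auto()
    ARCHIVED        = auto()
    CLASS_CRITICAL  = auto()   # include in priority backup and restore
    CLASS_REFERENCE = auto()   # read-often, write-rarely data
    CLASS_EXPIRING  = auto()   # subject to scheduled decommissioning

attrs = FileAttr.READ_ONLY | FileAttr.CLASS_CRITICAL

if FileAttr.CLASS_CRITICAL in attrs:
    print("back up nightly; restore first after an outage")
```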
Data management through bit attributes is one of many deep-blue-math solutions that pop up regularly in industry discussions. As of this writing, none have moved from the whiteboard into engineering specifications. For now, the best approach is to monitor the expanding range of options: Select an approach that gets the job done at least minimally, but plan to add more granular data definition and segregation capabilities as the technology matures.
In the final analysis, data management comes down to getting users to exercise the responsibility that comes with the authority to name their files anything they want. That's data democracy at its best.
Jon William Toigo is CEO of storage consultancy Toigo Partners International and founder and chairman of the Data Management Institute. Write to him at [email protected].
In a data democracy, where individuals create files and name them any way they see fit, organizing data stores and complying with regulations that necessitate perpetual storage are nearly impossible. Developing a consistent data-naming scheme is essential for organizing files logically, in a way that separates critical from nonessential, business from junk, publicly accessible from private.
Getting to this point requires management to support IT efforts and IT to work with users to develop appropriate naming criteria. Until technological solutions, such as global name-space systems, content-addressable storage and SQL-based file systems become commonplace, getting users to comply remains a major obstacle.