Network Computing (InformationWeek Business Technology Network)


The Data Shuffle

Need to organize your data? Here's a personal productivity tool for managing lists of information

By Dr. Rebecca Thomas

If there is one thing that the information age has created, it's gobs of data, often in unwieldy chunks. The key to staying organized is being able to extract the information you need in a format that you can use.

Tom Baker provides a Korn shell script implementation of a tool that helps manage line-oriented textual information. The user constructs a rule file that tells the script how to process the data files in the working directory.

Deal Me In

Dear Dr. Thomas:

My shuffle Korn shell program [Part A of Listing 1] is a rule-directed list processor designed to organize files containing lists. It is especially good for lists that undergo continual growth and revision, such as calendars, phone directories, event logs, and lists of things to do.

The name ``shuffle'' is based on a playing card metaphor. When cards are shuffled, they are swept together, mixed, and dealt back out into random hands. This script sweeps a set of list files together into one big file, then--under direction of regular expressions contained in rules defined by the user--deals the data back out into a new set of list files and, when directed, sorts them.

My lists contain one-line items of information: in principle, anything that can be expressed within a line or sortable sequence of lines [see Part B]. Some of the data are structured by an organizing principle, such as date, name, or priority.

These organizing principles are expressed in an editable set of rules [see Part C]. Minimally, each rule contains a search key, which is used with egrep to extract a line from a source file into a target file. Optionally, the rule also specifies a sort command for the target file.
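
To make the idea of a rule concrete, here is a minimal sketch of applying a single rule. It is not taken from Listing 1: the function name apply_rule and its argument order are assumptions made for illustration, and it uses grep -E, the POSIX spelling of egrep.

```sh
# Hypothetical helper, not the code from Listing 1.
# usage: apply_rule KEY SOURCE TARGET [SORTCMD]
# Moves every line of SOURCE that matches the extended regular
# expression KEY into TARGET, then optionally sorts TARGET.
apply_rule() {
    key=$1 source=$2 target=$3 sortcmd=${4:-}
    [ -f "$source" ] || return 0      # nothing to extract from

    # Append the matching lines to the target. grep exits with 1 when
    # nothing matches, which is not an error here.
    grep -E -e "$key" "$source" >> "$target" || :

    # Keep only the non-matching lines in the source.
    grep -E -v -e "$key" "$source" > "$source.tmp" || :
    mv "$source.tmp" "$source"

    # Sort the target if the rule supplies a sort command, e.g. "sort -u".
    if [ -n "$sortcmd" ]; then
        $sortcmd "$target" > "$target.tmp" && mv "$target.tmp" "$target"
    fi
    return 0
}
```

For example, apply_rule '^- ' all log sort would move every line that starts with a hyphen and a space from a file named all into a file named log and then sort log.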

When shuffle is run, it first concatenates all of the files specified as arguments into one big file, named by the Allfiles variable. After making a safety backup, it erases the originals, thus wiping the slate clean for their reconstitution. From this one aggregate file, shuffle extracts an entirely new set of lists.
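
The sweep phase described above might look roughly like the following sketch. The function name gather and the directory name backup are assumptions; the real script may order or name things differently. The sketch also assumes the aggregate file is not itself one of the arguments.

```sh
# Hypothetical sketch of the sweep phase, not the code from Listing 1.
# usage: gather ALLFILES file...
gather() {
    allfiles=$1; shift
    mkdir -p backup || return 1       # assumed name for the backup directory
    : > "$allfiles"                   # start with an empty aggregate file

    for f in "$@"; do
        [ -f "$f" ] || continue
        cp "$f" backup/ || return 1   # safety copy before anything is erased
        cat "$f" >> "$allfiles" || return 1
    done

    # Erase the originals only after every file was copied and appended.
    for f in "$@"; do
        if [ -f "$f" ]; then
            rm -f "$f"
        fi
    done
    return 0
}
```

Removing the originals in a second loop means a failed copy leaves every list file in place.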

Figure 1 shows a typical flow diagram. The first rule extracts every line (specified by the ``.'' pattern) from the source file, named by the Allfiles variable, into the target file--here called phone--effectively renaming combined.dat to phone, and then sorts it.

The second rule moves all lines that begin with a hyphen followed by a space character (``- '') from phone into 1993 and sorts that file by year. The third rule moves all lines that start with the ``- 1994 '' pattern from 1993 into 1994. After all nine rules in our example have been applied, any lines that remain are left in the file named phone.
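
The first three rules can be traced with a small, self-contained sketch. The helper move_lines, the function demo_cascade, and the four sample lines are all made up for illustration; only the three patterns and the file names come from the description above.

```sh
# Hypothetical helper: move lines matching KEY from SOURCE to TARGET,
# then sort TARGET in place.
# usage: move_lines KEY SOURCE TARGET
move_lines() {
    grep -E -e "$1" "$2" >> "$3" || :       # exit status 1 means no match
    grep -E -v -e "$1" "$2" > "$2.tmp" || :
    mv "$2.tmp" "$2"
    sort -o "$3" "$3"
}

# Run the three rules on made-up data in a scratch directory.
demo_cascade() {
    dir=$(mktemp -d) || return 1
    (
        cd "$dir" || exit 1
        printf '%s\n' \
            '- 1993 1105 Met Smith about budget' \
            'Doe, Jane 555-0100' \
            '- 1994 0112 Conference' \
            'Clothes Shoes size 43' > combined.dat

        move_lines '.'        combined.dat phone   # rule 1: everything
        move_lines '^- '      phone        1993    # rule 2: dated events
        move_lines '^- 1994 ' 1993         1994    # rule 3: 1994 events

        for f in phone 1993 1994; do
            echo "== $f"
            cat "$f"
        done
    )
    rm -rf "$dir"
}
```

Calling demo_cascade prints the contents of the three resulting files: the two undated lines stay in phone, the 1993 event lands in 1993, and the 1994 event passes through 1993 into 1994.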

If you edit a data line to match a different rule, you mark that line for export to a different list. For example, I might expand the information from the line shown [in Part D] into the lines shown [in Part E]. Then when I run shuffle, the event lines will be moved into the 1994 log, Joachim Mann will go to the phone directory, and the article on SGML will end up in a list of things to do later.

When you edit the rules, older lists are merged or new ones created to meet new needs. For instance, the rule shown in Part F creates a separate list of things I need to follow up on, such as the two items from the Smith meeting.

Use of line-oriented data files means that I can use simple grep searching commands to locate items that meet certain criteria, for instance, ``show me everything I have on Smith'' (grep Smith) or ``what is my shoe size?'' (grep -i shoe) or ``when is the music library open?'' (grep musikbuecherei).

Furthermore, I often organize the elements of my data lines from general to specific, reading from left to right. This approach means that related items will be grouped together when sorted: lines referring to ``Clothes Shoes'' will remain near the ``Clothes Pants'' and ``Clothes Shirts'' in the residual phone file. This general-to-specific arrangement means that if a search doesn't tell me what I want to know because it was too specific (``Gap'' or ``Bean''), I can search for a more general category (``pants'').
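
A short sketch shows why the general-to-specific layout pays off. The function sample_lines and its four lines are invented for illustration and are not taken from the author's data.

```sh
# Hypothetical sample data: category first, details last.
sample_lines() {
    printf '%s\n' \
        'Clothes Shoes size 43' \
        'Books Library opening hours' \
        'Clothes Pants Gap 32x34' \
        'Clothes Shirts size L'
}

# Sorting groups the three Clothes lines together.
sample_lines | sort

# A general, case-insensitive search finds the line even if the
# specific brand name is forgotten.
sample_lines | grep -i 'pants'
```

After sorting, the three Clothes lines sit next to each other, and searching for the category word finds the pants entry without knowing the brand.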

I find that the rule file evolves as I edit the data. And because the rule file is just another list--albeit a special one--stored along with the files to which it refers, the set of lists is largely self-documenting.

Tom Baker / Bonn, Germany

Configuration Notes: The shuffle program was developed under the MKS Toolkit Korn shell running under DOS 3.3 and ported to Korn shell Version 11/16/88d running under System V Release 4.0.3. It has been tested under the environments mentioned in the ``acknowledgments'' paragraph near the end of this column.

The configuration section (lines 12-23) was written to support both the DOS-based MKS Toolkit Korn shell and Unix-based Korn shell versions, as indicated by the comments. For instance, the DOS MKS Toolkit has no equivalent of the Unix ``bit-bucket'' file /dev/null, so a temporary file is used instead (line 21 instead of line 15).

Under MKS Toolkit, the rule file is named ``rules'' whereas Unix users can use ``.rules''. The latter usage lets one invoke the script using the asterisk wild card, as in shuffle *, without fear of shuffling the rule file. Also, MKS Toolkit does not have a command named nawk, but one can either copy awk.exe to nawk.exe or edit the script to invoke awk. By now, many implementations use awk as the name of the ``new'' awk program, instead of nawk, a name that was used when the new version was first introduced.
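
Rather than copying or renaming the awk binary, a script can pick whichever name is installed. The following is only a sketch of that alternative, not part of shuffle; the function name pick_awk and the variable AWK are assumptions.

```sh
# Hypothetical helper: print the name of an available awk,
# preferring nawk when it is installed.
pick_awk() {
    if command -v nawk >/dev/null 2>&1; then
        echo nawk
    else
        echo awk
    fi
}

# Store the choice once and call "$AWK" wherever the script needs awk.
AWK=$(pick_awk)
```

A script written this way would invoke "$AWK" in place of a hard-coded nawk.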

Usage Note: The shuffle program is designed to process data files under direction of a rule file all in the same directory. A backup subdirectory is created when shuffle runs.

Tester's Comments: It's a nice and useful script, but I was able to change it to handle multiline text to shuffle mail files or Usenet news articles. By employing the public-domain agrep program--which is record, not just line, oriented--and using ``^From '' as the field delimiter, I could extract data from our electronic mail support database. The same idea holds for our news archives, although I had to modify shuffle so it wouldn't combine all input into a single file, which could be many megabytes in size. Additionally, I would like to see shuffle allow read-only data files and allow sharing of files with my coworkers. The latter means I would need to remove the restriction to use files in a single directory. Also, there is no lock mechanism to prevent two instances of the program from running at the same time in the same directory.--Kees Hendrikse
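
One common way to add the missing lock is to use a lock directory, since mkdir either creates the directory or fails in a single step. The sketch below is an assumption about how that could be bolted on; the name with_lock and the lock directory name are hypothetical, and nothing like it appears in the script as described.

```sh
# Hypothetical wrapper: run a command while holding a per-directory lock.
# usage: with_lock LOCKDIR command [args...]
with_lock() {
    lockdir=$1; shift
    # mkdir fails if the directory already exists, so only one
    # caller at a time can obtain the lock.
    if ! mkdir "$lockdir" 2>/dev/null; then
        echo "another instance appears to be running ($lockdir exists)" >&2
        return 1
    fi
    status=0
    "$@" || status=$?       # remember the command's exit status
    rmdir "$lockdir"        # release the lock
    return "$status"
}
```

A call such as with_lock .shuffle.lock shuffle * would then refuse to start a second run in the same directory. A run that is killed before it reaches rmdir leaves a stale lock that has to be removed by hand.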

The script runs unmodified under Unixware. It doesn't run on my BSD 386 system--which uses Bash instead of the Korn shell--unless you replace the print statements by equivalent echo statements. I also had to replace the nawk script by one written in Perl, which I obtained by translating the script using the a2p conversion utility provided by the Perl distribution. My guess would be that shuffle could be improved further by translating it completely to Perl.--Endre Bálint Nagy

For AIX 3.2 I had to rename awk to nawk, but both AIX and Ultrix 4.3 required that I not use the unsupported -M sort option in the ``rule'' file.--Steve Wright

This script worked fine under ISC 3.2.2, but had to be changed significantly to run with Coherent (version 4.2.05). [See Part G for the Coherent port of shuffle, which, by the way, should also work with System V Release 2 and later Bourne shells and the old awk.]--Gábor Zahemszky

Wanted: Rewrite Shuffle in Perl

I'm looking for a Perl version of the shuffle program discussed here. We'll pay you US$100 for your trouble. You're welcome to enhance or improve, as long as you coordinate with me.

Acknowledgments

I wish to thank the following readers for their help with testing this month's contributions: Gábor Zahemszky, CoDe Ltd., Budapest, Hungary (ISC 3.2.2 Unix and Coherent 4.2); Kees Hendrikse, Echelon Consultancy, Enschede, The Netherlands (current SCO Unix and Xenix versions); Endre Bálint Nagy, Walton Networking Ltd., Budapest, Hungary (Unixware Application Server 1.0); and Steve Wright, Computer Science Dept., University of South Carolina, Columbia, S.C. (AIX 3.2).

Copyright © 2009 United Business Media LLC