
The Data Shuffle: Listings

Listing 1: The shuffle script concatenates line-oriented data files, then moves selected lines into the target files named in a rule file, optionally sorting each target.

A. Listing of the shuffle Korn shell script:

  1  #!/usr/bin/ksh
  2  # @(#) shuffle Version 5  A rule-based list processor
  3  # Author: Thomas Baker <tbaker@unix.amherst.edu>
  4  # Modified by: Becca Thomas, February 1994
  5  $DBG_SH                         # Dormant debugging directive
  6  
  7  trap 'rm -f $Tmpfile $Targetfilenames >|$Devnull 2>&1; \
  8      exit $Stat' 0
  9  trap 'print -u2 "$(basename $0): Interrupted!"; exit' 1 2 3 15
 10  
 11  # CONFIGURATION
 12  Allfiles=combined.dat               # File for all catenated input files
 13  Bkupdir=.backup                     # Unix input-files backup directory
 14  #Bkupdir=backup                     # MKS  input-files backup directory
 15  Devnull="/dev/null"                 # Unix bit-bucket file
 16  Rulefile=.rules                     # Unix rule file
 17  #Rulefile=rules                     # MKS  rule file
 18  Usage="Usage: $(basename $0) datafile [datafile ...]" # Correct usage
 19  # Temporary directory-dependent variables:
 20  Tmpdir=/tmp                         # MKS/Unix temporary directory
 21  #Devnull=$Tmpdir/null               # MKS  bit-bucket file
 22  Targetfilenames=$Tmpdir/sht$$.tmp   # MKS/Unix target-names file
 23  Tmpfile=$Tmpdir/shf$$.tmp           # MKS/Unix temporary work file
 24  
 25  # FUNCTION DEFINITIONS:
 26  function usage_exit {
 27      print -u2 "$Usage"; Stat=1 ; exit
 28  }
 29  function movelines { # Args: $Searchkey $Source $Target $Sortcmd
 30      print -n "Lines with [$1] moved from \""$2"\" to \""$3"\""
 31      egrep "$1" $2 >>$3; egrep -v "$1" $2 >|$Tmpfile; mv $Tmpfile $2
 32      [ "$4" ] && print ", ${4}." || print "." # Print sort command
 33      [ "$4" ] && { eval $4 -o $3 $3 ||
 34          { print "\aBad rule-file sort command: $4"; Stat=2; exit;};}
 35  }
 36  
 37  # PROCESS COMMAND-LINE ARGUMENTS:
 38  case $# in      # User must specify at least one file-name argument
 39      0)  usage_exit ;;
 40  esac
 41  
 42  # SANITY CHECK: Rule file:
 43  [ -r $Rulefile ] ||
 44      { print -u2 "\aCannot read \"$Rulefile\" file!"; Stat=4; exit;}
 45  sed 's/#.*$//' $Rulefile |          # Remove comments.
 46  egrep -v '^$' |                     # Remove blank lines.
 47  nawk -F\| '                         # Rules separated by vertical bar
 48  NR == 1 && ($1 != "." || $2 != "$Allfiles") {   # Check first rule
 49      print $0, ": rule 1 is illegal!" }
 50  NF != 3 && NF != 4 {                # All rules have 3 or 4 fields.
 51      print $0, ": must have 3 or 4 fields!" }
 52  $2 == $3 {                          # Source different from target.
 53      print $0, ": source cannot equal target!" }
 54  $4 != "" && $4 !~ /^sort/ {         # Field 4 is for sort commands.
 55      print $0, ": field 4 is only for sort!" }
 56  $1 == "" || $2 == "" || $3 == "" {  # First three fields are non-empty.
 57      print $0, ": 1 of first 3 fields is empty!" }
 58  { target[$3] = 1 }                  # Note names of target files
 59  NR > 1 {                            # For all lines after the first
 60      if ($2 in target)               # If source file is also a target
 61          next;                       # No problem, fetch next input line
 62      else print $0, ": ", $2, "has no precedent!"
 63  }' >| $Tmpfile                      # Save unique lines and display
 64  [ -s $Tmpfile ] &&
 65      { print -u2 "Bad rule format:\n$(cat $Tmpfile)"; Stat=5; exit;}
 66  
 67  # SANITY CHECKS: Current directory, combined data, backup directory:
 68  [ -w "." ] ||                       # Current (data) directory
 69      { print -u2 "\aCannot write to current directory!"; Stat=6; exit;}
 70  [ -f $Allfiles ] &&                 # Combined data file
 71      { print -u2 "\a\"$Allfiles\" should not yet exist!"; Stat=7; exit;}
 72  [ -d $Bkupdir ] || mkdir $Bkupdir 2>|$Devnull ||
 73      { print -u2 "\aCannot make directory \"$Bkupdir\"!"; Stat=8; exit;}
 74  [ "$(ls $Bkupdir)" ] && {           # if there are files in backup dir
 75  print -n "Okay to erase files in $Bkupdir (y*|Y*/n)? "; read ans
 76  case $ans in
 77      y*|Y*)  rm -f $Bkupdir/* >|$Devnull 2>&1 ;; # Remove old backups
 78      *)      print "Exiting, check $Bkupdir directory."; Stat=0; exit ;;
 79  esac;}
 80  
 81  # CHECK DATA FILES, BACK UP, THEN COMBINE INTO A COMMON FILE:
 82  for File in "$@"; do
 83      [ -d $File ] && continue                # Ignore directories.
 84      [ "$File" = "$Rulefile" ] && continue   # Ignore rules (just data).
 85      [ "$(dirname $File)" = "." ] || [ "$(dirname $File)" = "$PWD" ] ||
 86          { print -u2 "\aData files must be in current directory!"
 87          Stat=9; exit;}
 88      [ -r $File ] ||
 89          { print -u2 "\a\"$File\" file not readable."; Stat=10; exit;}
 90      { file $File | egrep 'text|empty' >|$Devnull 2>&1;} ||
 91          { print -u2 "\a\"$File\" not text nor empty."; Stat=11; exit;}
 92      egrep '^[   ]*$' $File >|$Devnull 2>&1 &&
 93          { print -u2 "\a\"$File\" has blank lines!"; Stat=12; exit;}
 94      cp $File $Bkupdir ||    # Copy to backup directory.
 95          { print -u2 "\aCannot back up $File!"; Stat=13; exit;}
 96      cat $File >> $Allfiles; rm $File   # Combine into common file.
 97  done
 98  
 99  # CHECK COMBINED DATA FILE:
100  [ -s $Allfiles ] || { print -u2 "\aNo data to process!"; Stat=14; exit;}
101  Beforesize=$(wc -c <$Allfiles | awk '{ print $1 }') # Data size before
102  print "Data backed up to \"$Bkupdir\", concatenated in \"$Allfiles\"."
103  
104  # PROCESS DATA FILES under direction of rule file:
105  OldIFS="$IFS"               # Save old internal field separator char(s)
106  IFS="|"                     # Rule-file field separator for "read"
107  sed 's/#.*$//' $Rulefile |          # Remove rule-file comments
108  egrep -v '^$' |                     # Remove blank lines
109  while read Searchkey From To Sortcmd ; do   # put fields into variables
110      eval Source=$From; eval Target=$To      # interpolate these var.
111      movelines $Searchkey $Source $Target $Sortcmd # Do the shuffle
112      print -u3 "$Target"             # Output goes to fd 3.
113  done 3>| $Targetfilenames           # Store fd3 output in a file.
114  IFS="$OldIFS"                       # Restore original IFS values.
115  Targetnames=$(sort -u $Targetfilenames) # Place unique list in variable.
116  
117  # CONCLUSION: Cleanup and exit message:
118  for File in $Targetnames $Allfiles; do
119      [ -s $File ] || rm $File        # Erase data files if empty
120  done
121  if [ $Beforesize -ne $(cat $Targetnames 2>|$Devnull | wc -c) ]; then
122      print -u2 "Warning: data may have been lost--use backup!\a\a\a"
123  else
124      print -u2 "Done: data shuffled and intact!"
125  fi
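
The core of the script is the movelines function: lines that match a search key are appended to the target file and removed from the source file. A minimal POSIX sh sketch of that single step follows. The name move_lines, the use of mktemp, and the scratch-directory demonstration are assumptions made for illustration; they are not part of Listing 1.

```shell
#!/bin/sh
# Sketch only: move_lines is a hypothetical stand-in for the listing's
# movelines function, written for a POSIX shell instead of ksh.
# Usage: move_lines PATTERN SOURCE TARGET
move_lines() {
    pattern=$1 src=$2 dst=$3
    tmp=$(mktemp) || return 2          # mktemp is assumed to be available
    # Append matching lines to the target (grep exits 1 when nothing matches).
    grep -E -e "$pattern" "$src" >>"$dst" || [ $? -eq 1 ] ||
        { rm -f "$tmp"; return 2; }
    # Collect the non-matching lines, which are all the source should keep.
    grep -E -v -e "$pattern" "$src" >"$tmp" || [ $? -eq 1 ] ||
        { rm -f "$tmp"; return 2; }
    cat "$tmp" >"$src"                 # Rewrite in place, keeping permissions
    rm -f "$tmp"
}

# Demonstration in a scratch directory with two lines from the sample data.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' 'NOW Renew passport!' 'Feb 10 BDAY Sarah (1956)' >phone
move_lines '^NOW ' phone now
echo "now:   $(cat now)"
echo "phone: $(cat phone)"
```

Unlike the listing, this sketch quotes every file name and treats a grep exit status above 1 as an error, so a bad pattern leaves the source file untouched.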

B. A sample data file:

- 1994 Feb 23 Smith 01 John Lunch at Panda East.
- 1994 Jan 23 Smith 02 Not coming to session, but writing paper.
- 1994 Feb 23 Smith 03 FOLLOWUP Read Sep 1993 SCILS article on SGML
Smith John 432 E43rd St, New York NY 01002 212-555-5555, fax 666-6666
Feb 10 BDAY Sarah (1956)
LATER Read SCILS article on SGML.
NOW Renew passport!
Beans stock and info 800-221-4221, customer service 800-341-4341
Clothes Shoes Timberland "Blucher" size W12
Convert US Ounces to Grams: 1 oz = 28.35 gm
Wallet [07 Sep 93] NY Drivers' # A01234 56789 123456 78, exp 7/96
- 1993 Dec 20 10am Called John Smith, set appt and faxed letter.
Wallet [07 Sep 93] Visa 1234-5678-1234-5678, lost: 1-800-423-3823
Fastback differential backup of C: c:/fastback/fb ')c)b)d)s))'
Clothes Shoes Adidas Marath.Train.II 1CA, size 12.5(D) 48(F) 13(USA)

C. A sample rule file:

# Rule file for "Shuffle: a rule-based list processor"
# 1. Rules contain: searchkey|source|target|optional_sort_command
# 2. First rule must have "." in first field, "$Allfiles" in second.
# 3. Common sort types:
#    sort                        Straight alphabetic.
#    sort +0M -1 +1n -2          Data format: Jun 25
#    sort +1n -2 +2M -3 +3n -4   Data format: - 1992 Jun 25
.|$Allfiles|phone|sort
^- |phone|1993|sort +1n -2 +2M -3 +3n -4
^- 1994 |1993|1994|sort +1n -2 +2M -3 +3n -4
^Jan |phone|calendar
^Feb |phone|calendar
^Dec |phone|calendar|sort +0M -1 +1n -2
BDAY|calendar|bday|sort +0M -1 +1n -2
^NOW |phone|now|sort
^LATER |phone|later|sort
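
The sort commands in the rule file use the old zero-based +POS -POS key syntax, which POSIX has since dropped in favor of -k and which not every current sort accepts. As a sketch, assuming a sort that supports -k keys and the widespread (but non-POSIX) M month modifier, the two date-oriented sort types could be written as shown here. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: -k equivalents of the rule file's old-style sort keys.
# The M (month-name) modifier is a common extension, not part of POSIX.

# Data format "Jun 25"; old form: sort +0M -1 +1n -2
sort_month_day() {
    LC_ALL=C sort -k 1M,1 -k 2n,2
}

# Data format "- 1992 Jun 25"; old form: sort +1n -2 +2M -3 +3n -4
sort_year_month_day() {
    LC_ALL=C sort -k 2n,2 -k 3M,3 -k 4n,4
}

# Three lines from the sample data file, sorted by year, month, then day.
printf '%s\n' \
    '- 1994 Feb 23 Smith 01 John Lunch at Panda East.' \
    '- 1993 Dec 20 10am Called John Smith, set appt and faxed letter.' \
    '- 1994 Jan 23 Smith 02 Not coming to session, but writing paper.' |
sort_year_month_day
```

The old form counts fields from zero and names the field after the key as the end point, so +1n -2 and -k 2n,2 both select the second field alone.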

D. Another example of a data-file line:

Jan 23 Smith John Lunch at Panda East.

E. Some transformations of the data-file line shown above in Part D:

- 1994 Jan 23 Smith 01 John Lunch at Panda East.
- 1994 Jan 23 Smith 02 Not coming to session, but writing paper.
- 1994 Jan 23 Smith 03 FOLLOWUP Sep 1993 SCILS article on SGML
- 1994 Jan 23 Smith 04 FOLLOWUP Call Joachim Mann 321-4567
Mann Joachim, tel 321-4567
LATER Read Sep 1993 SCILS article on SGML.

F. Another example of a rule-file line:

FOLLOWUP|1994|followup|sort +1n -2 +2M -3 +3n -4
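
A rule line such as this one splits on the vertical bars into a search key, a source, a target, and an optional sort command. The following POSIX sh sketch shows the split in isolation; parse_rule is a hypothetical helper, not part of the shuffle script.

```shell
#!/bin/sh
# Sketch only: split one rule line into its four fields.
# parse_rule is a hypothetical helper, not part of the shuffle script.
parse_rule() {
    printf '%s\n' "$1" | {
        # IFS is set for this read alone, so no save and restore is needed.
        IFS='|' read -r key src dst sortcmd
        printf 'key=[%s] source=[%s] target=[%s] sort=[%s]\n' \
            "$key" "$src" "$dst" "$sortcmd"
    }
}

parse_rule 'FOLLOWUP|1994|followup|sort +1n -2 +2M -3 +3n -4'
parse_rule '^Jan |phone|calendar'
```

Because only the vertical bar is a separator, the trailing blank in a key such as "^Jan " survives the split, and a three-field rule leaves the sort field empty. The sketch keeps the fields literal; the script itself goes one step further and evaluates the source and target fields so that $Allfiles expands to the combined file name.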

G. A version of shuffle written for Coherent that runs under the Bourne shell with the "old" awk:

  1  #!/usr/bin/sh
  2  # @(#) shuffle Version 5  A rule-based list processor
  3  # Author: Thomas Baker <tbaker@unix.amherst.edu>
  4  # Modified by: Becca Thomas, February 1994
  5  # Modified by: Ga'bor Zahemszky, March 1994 to use sh and "old" awk
  6  $DBG_SH                             # Dormant debugging directive
  7  
  8  trap 'rm -f $Tmpfile $Targetfilenames >$Devnull 2>&1; exit $Stat' 0
  9  trap 'echo "`basename $0`: Interrupted!" >&2 ; exit' 1 2 3 15
 10  
 11  # CONFIGURATION
 12  Allfiles=combined.dat               # File for all catenated input files
 13  Bkupdir=.backup                     # Unix input-files backup directory
 14  #Bkupdir=backup                     # MKS  input-files backup directory
 15  Devnull="/dev/null"                 # Unix bit-bucket file
 16  Rulefile=.rules                     # Unix rule file
 17  #Rulefile=rules                     # MKS  rule file
 18  Usage="Usage: `basename $0` datafile [datafile ...]" # Correct usage
 19  # Temporary directory-dependent variables:
 20  Tmpdir=/tmp                         # MKS/Unix temporary directory
 21  #Devnull=$Tmpdir/null               # MKS  bit-bucket file
 22  Targetfilenames=$Tmpdir/sht$$.tmp   # MKS/Unix target-names file
 23  Tmpfile=$Tmpdir/shf$$.tmp           # MKS/Unix temporary work file
 24  
 25  # FUNCTION DEFINITIONS:
 26  usage_exit() {
 27      echo "$Usage" >&2 ; Stat=1 ; exit
 28  }
 29  movelines() { # Args: $Searchkey $Source $Target $Sortcmd
 30      echo "Lines with [$1] moved from \""$2"\" to \""$3"\"\c"
 31      egrep "$1" $2 >>$3; egrep -v "$1" $2 >$Tmpfile; mv $Tmpfile $2
 32      [ "$4" ] && echo ", ${4}." || echo "." # Print sort command
 33      [ "$4" ] && { eval $4 -o $3 $3 ||
 34          { echo "\007Bad rule-file sort command: $4"; Stat=2; exit;};}
 35  }
 36  
 37  # PROCESS COMMAND-LINE ARGUMENTS:
 38  case $# in      # User must specify at least one file-name argument
 39      0)  usage_exit ;;
 40  esac
 41  
 42  # SANITY CHECK: Rule file:
 43  [ -r $Rulefile ] ||
 44      { echo "\007Cannot read \"$Rulefile\" file!" >&2 ; Stat=4; exit;}
 45  sed 's/#.*$//' $Rulefile |          # Remove comments.
 46  egrep -v '^$' |                     # Remove blank lines.
 47  oawk -F\| '                         # Rules separated by vertical bar
 48  NR == 1 && ($1 != "." || $2 != "$Allfiles") {   # Check first rule
 49      print $0, ": rule 1 is illegal!" }
 50  NF != 3 && NF != 4 {                # All rules have 3 or 4 fields.
 51      print $0, ": must have 3 or 4 fields!" }
 52  $2 == $3 {                          # Source different from target.
 53      print $0, ": source cannot equal target!" }
 54  $4 != "" && $4 !~ /^sort/ {         # Field 4 is for sort commands.
 55      print $0, ": field 4 is only for sort!" }
 56  $1 == "" || $2 == "" || $3 == "" {  # First three fields are non-empty.
 57      print $0, ": 1 of first 3 fields is empty!" }
 58  { target[$3] = 1 }                  # Note names of target files
 59  NR > 1 {                            # For all lines after the first
 60      ZGvar2 = 0
 61      for (ZGvar1 in target) {
 62          if (ZGvar1 == $2) {
 63              next
 64          } else {
 65              ZGvar2 = 1
 66          }
 67       }
 68      if (ZGvar2 == 1) {
 69          print $0, ": ", $2, "has no precedent!"
 70      }
 71  }' > $Tmpfile                       # Save unique lines and display
 72  [ -s $Tmpfile ] &&
 73      { echo "Bad rule format:\n`cat $Tmpfile`" >&2 ; Stat=5; exit;}
 74  
 75  # SANITY CHECKS: Current directory, combined data, backup directory:
 76  [ -w "." ] ||                       # Current (data) directory
 77      { echo "\007Cannot write to current directory!" >&2 ; Stat=6; exit;}
 78  [ -f $Allfiles ] &&                 # Combined data file
 79      { echo "\007\"$Allfiles\" shouldn't exist!" >&2 ; Stat=7; exit;}
 80  [ -d $Bkupdir ] || mkdir $Bkupdir 2>$Devnull ||
 81      { echo "\007Can't make directory \"$Bkupdir\"!" >&2 ; Stat=8; exit;}
 82  [ "`ls $Bkupdir`" ] && {            # if there are files in backup dir
 83  echo "Okay to erase files in $Bkupdir (y*|Y*/n)? \c"; read ans
 84  case $ans in
 85      y*|Y*)  rm -f $Bkupdir/* >$Devnull 2>&1 ;;  # Remove old backups
 86      *)      echo "Exiting, check $Bkupdir directory."; Stat=0; exit ;;
 87  esac;}
 88  
 89  # CHECK DATA FILES, BACK UP, THEN COMBINE INTO A COMMON FILE:
 90  for File in $*; do
 91      [ -d $File ] && continue                # Ignore directories.
 92      [ "$File" = "$Rulefile" ] && continue   # Ignore rules (just data).
 93      [ "`dirname $File`" = "." ] || [ "`dirname $File`" = "`pwd`" ] ||
 94          { echo "\007Data files must be in current directory!" >&2
 95          Stat=9; exit;}
 96      [ -r $File ] ||
 97          { echo "\007\"$File\" file not readable." >&2 ; Stat=10; exit;}
 98      { file $File | egrep 'text|empty' >$Devnull 2>&1;} ||
 99          { echo "\007\"$File\" not text nor empty." >&2 ; Stat=11; exit;}
100      egrep '^[   ]*$' $File >$Devnull 2>&1 &&
101          { echo "\007\"$File\" has blank lines!" >&2 ; Stat=12; exit;}
102      cp $File $Bkupdir ||    # Copy to backup directory.
103          { echo "\007Cannot back up $File!" >&2 ; Stat=13; exit;}
104      cat $File >> $Allfiles; rm $File   # Combine into common file.
105  done
106  
107  # CHECK COMBINED DATA FILE:
108  [ -s $Allfiles ] || { echo "\007No data to process!">&2; Stat=14; exit;}
109  Beforesize=`wc -c <$Allfiles | oawk '{ print $1 }'` # Data size before
110  echo "Data backed up to \"$Bkupdir\", concatenated in \"$Allfiles\"."
111  
112  # PROCESS DATA FILES under direction of rule file:
113  OldIFS="$IFS"               # Save old internal field separator char(s)
114  IFS="|"                     # Rule-file field separator for "read"
115  sed 's/#.*$//' $Rulefile |          # Remove rule-file comments
116  egrep -v '^$' |                     # Remove blank lines
117  while read Searchkey From To Sortcmd ; do   # put fields into variables
118      eval Source=$From; eval Target=$To      # interpolate these var.
119      movelines $Searchkey $Source $Target $Sortcmd # Do the shuffle
120      echo "$Target" >&3              # Output goes to fd 3.
121  done 3> $Targetfilenames            # Store fd3 output in a file.
122  IFS="$OldIFS"                       # Restore original IFS values.
123  Targetnames=`sort -u $Targetfilenames`  # Place unique list in variable.
124  
125  # CONCLUSION: Cleanup and exit message:
126  for File in $Targetnames $Allfiles; do
127      [ -s $File ] || rm $File        # Erase data files if empty
128  done
129  if [ $Beforesize -ne `cat $Targetnames 2>$Devnull | wc -c` ]; then
130      echo "Warning: data may have been lost--use backup!\007" >&2
131  else
132      echo "Done: data shuffled and intact!" >&2
133  fi

Figure 1: A data-flow diagram for the example discussed in Tom Baker's introductory letter.

$Allfiles
   |
   V                                      Sorted by year:
 phone ---> 1993 [^- ] -----\-----------> 1994 [^- 1994 ]
   |                         \----------> 1993 (everything else) 
   |
   V                                      Sorted by month:
 phone ---> calendar [^Jan,^Feb..] \----> bday [BDAY]
   |                                \---> calendar (everything else)
   |
   V        Sorted alphabetically:
 phone ---> now [^NOW ]
   |
 phone ---> later [^LATER ]
   |
   \------> phone (everything else)