July 26, 1999
The next step is low-risk testing. This second deployment should be on production servers that you could relinquish for a short time. These servers often have configurations similar to your big production servers, so if you don't catch a problem with the fix in the labs, you should catch it here. Our site uses an IT-department-specific server for our low-risk deployment. If something goes awry, only the IT department is affected.
Naturally, high-risk servers are the last place to deploy your fixes. If you've succeeded with the no-risk and low-risk fixes, you'll most likely have luck with the high-risk fix. Keep in mind, though, that load-dependent problems can only be caught on your low-risk servers if you adequately stress them during testing.
Don't forget to deploy the pack to your low-risk and high-risk servers in exactly the same way. For example, when we tried to implement SP4 via a CD-ROM server share, we discovered that our older CD-ROM server's inability to support long file names was a problem. Although LFN (long file name) support didn't matter at all in SP3, SP4 generated a "file not found" error on "appserver.class" and mandated a reboot; our target server rebooted to a crash dump screen. Fortunately, because we planned to install to our production servers from the CD-ROM server, we tried that method first during our low-risk deployment. We used an ERD (Emergency Repair Disk) to recover and made other arrangements for patching the remaining servers.
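The staged approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the server names and the deploy function are hypothetical stand-ins, and a real rollout would call whatever installer you actually use. The point it shows is that every stage goes through the same deploy routine, and that a failure in an early stage stops the pack from reaching later ones.

```python
from typing import Callable, Dict, List, Tuple

# Stage names follow the article; the server names are hypothetical.
STAGES: List[Tuple[str, List[str]]] = [
    ("no-risk", ["lab-01"]),
    ("low-risk", ["it-dept-01"]),
    ("high-risk", ["prod-01", "prod-02"]),
]


def staged_rollout(
    stages: List[Tuple[str, List[str]]],
    deploy: Callable[[str], bool],
) -> Dict[str, str]:
    """Deploy in stage order and stop at the first failure.

    `deploy` is a caller-supplied (hypothetical) function that patches one
    server and returns True on success. The same function is used for every
    stage, so the install method is identical everywhere.
    """
    results: Dict[str, str] = {}
    halted = False
    for _stage, servers in stages:
        for server in servers:
            if halted:
                # An earlier server failed, so leave this one untouched.
                results[server] = "skipped"
            elif deploy(server):
                results[server] = "ok"
            else:
                results[server] = "failed"
                halted = True
    return results


if __name__ == "__main__":
    # Stand-in installer that fails on the low-risk server.
    def fake_deploy(server: str) -> bool:
        return server != "it-dept-01"

    for name, status in staged_rollout(STAGES, fake_deploy).items():
        print(name, status)
```

With the stand-in installer above, the low-risk failure leaves both high-risk servers marked as skipped, which is the outcome our CD-ROM share mishap illustrates.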
Insist on Consistency
It's very important to maintain the same level of patch within your enterprise, though the consequences if you don't may not be immediately obvious. We've run into problems with WINS that did not manifest directly after patching, but were quickly resolved when the servers involved were updated with the same patch levels. It's not always possible to patch all servers quickly at the same level; when you can't, it's a good idea to patch functional groups together by domain or trust relationships.
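One way to keep an eye on consistency is to audit patch levels by functional group. The following is a minimal sketch, assuming you already keep an inventory of servers with their domain and patch level; the Server record, its field names and the sample inventory are all hypothetical. It reports every domain in which more than one patch level is in use.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set


class Server(NamedTuple):
    name: str
    domain: str       # functional group; a trust group would work as well
    patch_level: str  # for example "SP3" or "SP4"


def mixed_patch_groups(servers: List[Server]) -> Dict[str, Set[str]]:
    """Return each domain whose servers are not all at one patch level."""
    levels: Dict[str, Set[str]] = defaultdict(set)
    for s in servers:
        levels[s.domain].add(s.patch_level)
    return {d: lv for d, lv in levels.items() if len(lv) > 1}


# Hypothetical inventory, for illustration only.
inventory = [
    Server("wins-01", "HQ", "SP4"),
    Server("wins-02", "HQ", "SP3"),
    Server("file-01", "BRANCH", "SP4"),
    Server("file-02", "BRANCH", "SP4"),
]

for domain, found in mixed_patch_groups(inventory).items():
    print(domain, "has mixed patch levels:", sorted(found))
```

In this sample, only the HQ domain is flagged, because its two WINS servers sit at different service pack levels.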
Combinatorial mathematics and the glut of potential third-party software and hardware mean that neither Microsoft nor any mortal agency can predict all possible reactions to a hot fix or patch. Certainly, standard hardware and software combinations, along with solid rollout and testing strategies, will minimize the risk of disaster. Still, you may very well end up with a batch of bad patch. By thinking ahead about how to deal with a catastrophe, you can give yourself a head start if you encounter such bad luck.
Relying on the ERD is not a surefire means of protection, especially if you've been lax about updating it after changes or upgrades. If you're missing files after a patch, or if files--particularly non-Microsoft files--are corrupted, or you must revert to a different patch level, there are several ways to recover if you get blue-screened.
If everyone used FAT (File Allocation Table)-based system partitions, post-disaster file-level repairs would be easy. You'd simply boot a DOS disk, and replace any necessary files by copying them from install media. If you required a large number of files, available network and removable media DOS drivers would make file copy recoveries effortless.
But the reality is that most security-conscious shops use NTFS instead, which complicates your options; you can't just boot to DOS and arbitrarily read or write to an NTFS partition. For help, see Microsoft Knowledge Base Article Q164471, "Replacing System Files Using a Modified Emergency Repair Disk," which details how to modify an ERD so it will copy user-specified files back to the system drive, whether or not it's NTFS.
The Microsoft Knowledge Base also specifies ways you can install a parallel NT system into a different folder on the hard drive, return to a GUI and repair your main system. While these processes are interesting, they're also time-consuming and painful to implement. What's a systems administrator to do? Fortunately, some vendors have modified NT setup disks to enable reading and writing from an NTFS partition. These tools are well worth the couple of hundred bucks they'll cost you, given the savings in time and toil they provide.
One example is Stac's Replica disaster-recovery software, which modifies the NT setup disks to load your server's disk and tape drivers. It gives you the option of fully recovering your system from the last backup tape. Though this type of plug-and-play disaster recovery can relieve the agony of a very bad day, a full restore might not be what you're after.
If you're simply trying to disable a troublesome third-party driver by renaming it, or you need to copy a different version to your NTFS partition, try one of Winternals' solutions. The vendor (formerly SysInternals) offers products such as ERD Commander, NTRecover and NTFSDOS, which can help you talk to a crashed system's NTFS partition.
ERD Commander works much like Stac's Replica, using modified NT setup disks to load character-mode NTFS drivers and launch a modified CMD.EXE (see "Third Party Utilities Infect NT Setup Disks"). You can then read or write files to your heart's content.
NTFSDOS allows very limited access from a DOS boot disk. It can't handle fault-tolerant system drives, which makes it problematic for enterprise servers. But in our tests, NTFSDOS let us load DOS-mode network drivers in conjunction with the NTFS driver--something you can't do with any other tool we tested. The ability to access other servers at network speeds from a crashed server can only be a plus.
NTRecover lets you take a working NT server or workstation, connect a null modem cable to a crashed NT server, boot a floppy on it and mount its drives on the working server. The product can handle a number of files and treats the drive as a "real" NT drive. We tested NTRecover by running CHKDSK on the target system's partition, and it worked smoothly. But NTRecover's maximum line speed is 115 Kbps, so transferring large amounts of data over that link isn't necessarily a wise move.
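To put that line speed in perspective, here is a rough best-case estimate in Python. It is a sketch under assumptions that do not come from the product: it takes 115 Kbps to mean the standard 115,200 bps serial rate and assumes 10 bits on the wire per byte (8 data bits plus a start and a stop bit), with no protocol overhead or retries. Real transfers would be slower.

```python
def serial_transfer_seconds(size_bytes: int,
                            line_bps: int = 115_200,
                            bits_per_byte: int = 10) -> float:
    """Best-case seconds to move size_bytes over a serial link.

    The defaults are assumptions: 115,200 bps line rate and 8N1 framing
    (10 bits sent per data byte). Protocol overhead is ignored.
    """
    return size_bytes * bits_per_byte / line_bps


MB = 1024 ** 2
for label, size in [("1 MB driver", 1 * MB), ("100 MB of files", 100 * MB)]:
    minutes = serial_transfer_seconds(size) / 60
    print(f"{label}: about {minutes:.1f} minutes")
```

Under these assumptions a single 1 MB driver file moves in about a minute and a half, while 100 MB would tie up the link for roughly two and a half hours. That is fine for swapping a driver, and a poor fit for bulk restores.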
Jonathan Feldman is technical systems manager for the Chatham County Government in Savannah, Ga., and author of SAMS Teach Yourself Network Troubleshooting in 24 Hours. Send your comments on this article to him at jf@feldman.org.