Surviving the Windows Server 2003 Cluster Bomb, Part II

You've done the courageous thing and decided not to quit your Windows Server admin job. But, you still have a mess to fix. Here's how to do it.

August 18, 2005


The nightmare scenario is now playing itself out. You have pinched yourself a few times and still you do not wake from the horror. The cluster service is dead. Exchange 2003 and SQL Server are no longer operational. In a few hours, your users are going to be logging into work. Right now they cannot get their email and the company web users cannot place orders or track their shipments. Are you getting ready to change careers, or are you up to the huge task of rebuilding a totally destroyed cluster?

In Part I of this article, we described how the cluster database was unrecoverable and that there were no recent backups to be used to restore the cluster. We also told you that the only option was to rebuild the cluster from scratch and reinstall Exchange 2003 and SQL Server from the beginning. We also advised you to do a few critical tasks:

  • Back up the Exchange stores and databases on the shared disks.

  • Back up at least one of the Exchange nodes.

  • Back up the SQL Server databases (including the system databases such as Master).

Let’s be clear about one thing: If the cluster service, cluster databases and configuration data (quorums) are completely corrupt and you do not have backups, there is no other way to recover your cluster. However, as long as you back up your nodes and databases as described in the steps above (especially if you do not have recent backups), you will be able to rebuild the cluster and reinstall your applications.

Do not try to run Exchange on any of the nodes. The services will not run, and you’ll risk damaging the installations. Remember, the key to recovering an Exchange installation that was running in a cluster is to leave the Exchange binaries and databases alone.

So let’s begin. Cleaning up each node of the cluster is easy. Simply run the following command in the system’s Cluster directory:

C:\WINDOWS\Cluster>cluster node CLNODE01 /forcecleanup

The label CLNODE01 is the only variable: it is the NetBIOS name of the cluster node. The cleanup takes half a minute or less to complete. Upon completion, the cluster database is restored to its original state, before any cluster resources were installed. Repeat this exercise on all the nodes. When you are done, you’ll be able to start a cluster again and will soon have the shared disks, network names, IP addresses and quorum resource running.
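If you have more than a couple of nodes, it can help to write the cleanup commands out ahead of time so that none gets skipped. The following is a minimal Python sketch that only builds the command strings; it does not run anything. The node name CLNODE02 is a hypothetical example, and the function name is mine, not part of any Microsoft tool.

```python
def force_cleanup_commands(node_names):
    """Build one 'cluster node <name> /forcecleanup' command per node.

    The command syntax is the one shown in the article; the node names
    passed in are whatever NetBIOS names your cluster nodes use.
    """
    commands = []
    for name in node_names:
        name = name.strip()
        if not name:
            raise ValueError("node name must not be empty")
        commands.append(f"cluster node {name} /forcecleanup")
    return commands


if __name__ == "__main__":
    # CLNODE02 is a made-up second node, used only for illustration.
    for command in force_cleanup_commands(["CLNODE01", "CLNODE02"]):
        print(command)
```

Each printed line is meant to be run on the node it names, from that node’s Cluster directory.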

With all nodes clean, you can install the base cluster services and make sure the cluster is running normally again. Set up the cluster as if the nodes have never supported a cluster. Fail over the cluster to each node and check that the cluster service is sound. With the cluster service ready for Exchange resources, you can now begin to restore Exchange to the cluster. If you have both Exchange and SQL Server to recover, start with Exchange. It’s easier and quicker to recover, and at least you won’t have users hounding you because they can’t get to their email.

Step 1: Obtain the Exchange 2003 Enterprise CD and begin the installation on the first cluster node. The installation should go quickly because the Active Directory schema has already been extended for Exchange, and the Exchange installation will simply use the existing data stores on the shared disks of the cluster. At this point you can open Cluster Administrator and confirm that Exchange resources are now available to the cluster. Don’t do anything with them yet.

Step 2: Upgrade the installation of Exchange to the latest service pack. If the previous installation was never upgraded to the latest service pack, you can skip the upgrade and proceed to the setup of Exchange resources on the cluster, or you can use the opportunity to upgrade now. If, however, you upgraded Exchange to its latest service pack during the original installation (before the nightmare began), then you must run the service pack installation again. This matters because you technically only need to install Exchange on one node to get the cluster resources back. Once the resources are back in the cluster, every node is again able to host Exchange for the cluster, so you don’t want one node at an earlier level of Exchange and the others at a later service pack.

After Exchange service pack installation is complete, you can now proceed to install the Exchange Virtual Server in exactly the same way you first set it up. As soon as the EVS and other Exchange resources are running (such as the SMTP service), Exchange will be back and your users will be able to log back in to the mailbox stores. Trust me, you will be overcome with relief and joy.

Before you move on to SQL Server (or go home if you did not lose a SQL Server in the cluster bomb), fail over Exchange to make sure the cluster is operating normally. Check the logs to make sure there are no critical errors.

Fixing SQL Server

Now, on to SQL Server, which, as mentioned in Part I of this article, is an entirely different matter. With your databases backed up and copied to another folder, take the following action:

Step 1: Rename the SQL Server folders from the original installation. If you have backed them up or copied the folders, you can delete the entire folder hierarchy instead. Do not bother trying to uninstall SQL Server from the Add/Remove Programs facility in Control Panel; it is not possible to remove a broken clustered SQL Server that way. You must also delete the Full Text Search folder.

Step 2: Open the registry and remove all SQL Server keys from the operating system. The keys are as follows:

HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\MSSQLSERVER
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\MSSCNTRS
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\MSSEARCH
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\MSSGATHERER
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\MSSGTHRSVC
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\MSSINDEX

Also remove the following keys:

HKEY_LOCAL_MACHINE\Software\Microsoft\Search\Install\Applications\SQLServer

or

HKEY_LOCAL_MACHINE\Software\Microsoft\Search\Install\Applications\SQLServer$

The second option will remove a named instance from the registry.

Finally, remove the following key as well:

HKEY_LOCAL_MACHINE\Software\Microsoft\MSSQLServer

(This is a large application hive, and you can simply remove the whole thing, up to and including the MSSQLServer key.)

Repeat these cleanup steps on all the cluster nodes. Once this is done, SQL Server is gone from your cluster nodes. Remember, the only remnants you keep are the databases and log files, all of them.
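Because the same keys have to be removed on every node, a checklist generated from one list is less error-prone than retyping paths in the registry editor. The Python sketch below only prints `reg delete` commands for review; it does not touch the registry. The key paths are my reading of the keys listed above (with the backslashes restored and HKEY_LOCAL_MACHINE abbreviated to HKLM), and the `search_app_key` parameter is a hypothetical name for whichever Search application key (SQLServer or the named-instance variant) actually exists on your node.

```python
SERVICES_ROOT = r"HKLM\System\CurrentControlSet\Services"

# Service key names as listed in the article.
SERVICE_NAMES = [
    "MSSQLSERVER",
    "MSSCNTRS",
    "MSSEARCH",
    "MSSGATHERER",
    "MSSGTHRSVC",
    "MSSINDEX",
]

SEARCH_APPS_ROOT = r"HKLM\Software\Microsoft\Search\Install\Applications"
SQL_SOFTWARE_KEY = r"HKLM\Software\Microsoft\MSSQLServer"


def keys_to_remove(search_app_key="SQLServer"):
    """Return the registry keys to delete on one node.

    search_app_key is the key name found under the Search applications
    key on that node; check the registry rather than guessing it.
    """
    keys = [SERVICES_ROOT + "\\" + name for name in SERVICE_NAMES]
    keys.append(SEARCH_APPS_ROOT + "\\" + search_app_key)
    keys.append(SQL_SOFTWARE_KEY)
    return keys


def reg_delete_commands(search_app_key="SQLServer"):
    """Build (not run) one 'reg delete ... /f' command per key."""
    return [f'reg delete "{key}" /f' for key in keys_to_remove(search_app_key)]


if __name__ == "__main__":
    for command in reg_delete_commands():
        print(command)
```

Review the printed commands against what you actually see in the registry before running any of them, and export the keys first if you want a way back.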

At this point you can begin installing SQL Server again to the cluster. If the cluster is sharing Exchange with SQL Server, make sure to install on a node that does not own the Exchange EVS. Install SQL Server; the installation will automatically create the SQL Server resources in the cluster. Once SQL Server is installed, upgrade the node to the latest service pack.

Are you done? Not a chance. Unlike the first time you installed SQL Server, you now need to repeat the installation on all nodes of the cluster. You need to again install both the SQL Server binaries and the service pack binaries to each node.

Now here’s where some nasty stuff comes in with the SQL Server recovery. There is a good chance (50/50) that SQL Server will, for some reason, crash in the cluster at the point where the installation tries to create the Full Text Search (FTS) resource. I don’t know why this happens: There is no documentation on the issue, and Microsoft support cannot shed light on the reason for the failure either.

Here is how Microsoft expects you to get around the problem. If the SQL Server installation fails, you have to go back into the node and clean out the registry again. You also need to delete the new installation folders that were created when setup installed the SQL Server binaries to the server (what a pain). You also need to restart the server, which is why you needed to move any live resources to other nodes.

Now, reinstall SQL Server, but minimize the installation progress screens and make sure you can click in Cluster Administrator. Carefully watch SQL Server creating the resources. The FTS resource is typically created at the very end of the installation. As soon as you see the FTS resource attempting to come online, right-click it quickly (you need to be fast on the draw with the mouse button) and select Delete to whack the resource.

Your timing needs to be perfect. You have to click the resource as soon as it is created and as soon as the cluster service tries to start it up, and you have about ten seconds to do this. If SQL Server cannot start the resource, it simply fails the installation and every resource successfully created is backed out and removed. If you manage to delete the resource, SQL Server will move on and complete the installation, sans the FTS.

As soon as you have SQL Server resources running, you can then proceed to install the binaries on the other nodes. You have to do this because the installation on the first recovered node will not copy binaries to the other nodes as it did when you first installed SQL Server (why this is I don’t know). But you can start to recover the databases now and leave the installation of the FTS service for a later time, at your leisure if you are not using it.

Once all nodes have the SQL Server binaries and service packs, check the integrity of the installation by failing over the resources. To recover the databases from the old cluster, you need to stop all of the SQL Server resources (which stops the services). Copy the newly installed system databases and store them in a backup folder. Copy all of the old databases, including Master (all .mdf and .ldf files), to the new installation. Restart the SQL Server resources (in other words, bring them online). SQL Server will attach to the old Master and your original user databases will be recovered. At this point, SQL Server is back up and can start servicing clients again.
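The file shuffle in the middle of that procedure (save the fresh system databases, then drop the old .mdf and .ldf files in their place) is easy to get backwards at four in the morning. Here is a minimal Python sketch of just that copy step, assuming all database files sit directly in one folder each. The folder arguments and the function name are hypothetical, and the SQL Server resources must already be offline before anything like this is run.

```python
import shutil
from pathlib import Path

DB_SUFFIXES = {".mdf", ".ldf"}


def swap_in_old_databases(old_dir, new_data_dir, backup_dir):
    """Back up the freshly installed database files, then copy the old ones in.

    old_dir:      folder holding the databases saved from the dead cluster
    new_data_dir: data folder of the new SQL Server installation
    backup_dir:   where the newly installed system databases are preserved
    Returns the names of the old files that were copied in.
    """
    old_dir, new_data_dir, backup_dir = Path(old_dir), Path(new_data_dir), Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    # 1. Preserve the newly installed system databases.
    for item in sorted(new_data_dir.iterdir()):
        if item.is_file() and item.suffix.lower() in DB_SUFFIXES:
            shutil.copy2(item, backup_dir / item.name)

    # 2. Copy every old .mdf and .ldf file, Master included, over the new ones.
    copied = []
    for item in sorted(old_dir.iterdir()):
        if item.is_file() and item.suffix.lower() in DB_SUFFIXES:
            shutil.copy2(item, new_data_dir / item.name)
            copied.append(item.name)
    return copied
```

The sketch copies rather than moves, so the saved originals stay untouched if you need a second attempt.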

As you can see, this process for SQL Server can take a long time. If recovery is going to run into daylight and you have web sites that need access to the data, install a stand-alone SQL Server on a non-clustered server and start it on your old databases as described earlier. This will allow you to service your apps while you work on the restoration of the cluster. It takes no longer than one hour to install SQL Server on a single server, attach the original databases and change your web application connection strings to query the new server name.

To finally recover the FTS resource, you need to do the following:

On the primary node from which you first reinstalled SQL Server to the cluster, locate the latest copy of the sqlstp(N).log file (N is the latest version of the log file). Search it for searchstp.exe and ftsetup.exe. Copy and paste the following commands from this file into a Notepad file, and then name the file something akin to "installsearch.cmd." You are going to do manually what setup was trying to do as part of its complete unattended install.

The commands in the file should be set up as follows (D is the path to your SQL Server CD):

D:\x86\FullText\MSSearch\Search\SearchStp.exe /s /a:SQLServer
D:\x86\FullText\ftsetup.exe SQLServer SQLVS01Instance 1 0 1 0
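Hunting through a long setup log by eye is tedious, so here is a minimal Python sketch of the search step. It simply returns every line of the log text that mentions searchstp.exe or ftsetup.exe, ignoring case. It assumes nothing about the log layout beyond it being plain text, and the sample lines in the usage note are invented for illustration, not copied from a real sqlstp log.

```python
def find_search_setup_lines(log_text):
    """Return the log lines that mention searchstp.exe or ftsetup.exe."""
    targets = ("searchstp.exe", "ftsetup.exe")
    matches = []
    for line in log_text.splitlines():
        lowered = line.lower()
        if any(target in lowered for target in targets):
            matches.append(line.strip())
    return matches


if __name__ == "__main__":
    # Invented sample text; a real log will look different.
    sample = "\n".join([
        "some unrelated setup line",
        r"D:\x86\FullText\MSSearch\Search\SearchStp.exe /s /a:SQLServer",
        "another unrelated line",
        r"D:\x86\FullText\ftsetup.exe SQLServer SQLVS01Instance 1 0 1 0",
    ])
    for line in find_search_setup_lines(sample):
        print(line)
```

To use it on a real log, read the file into a string first and pass that in; you may need to try a different text encoding if the lines come back garbled.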

Execute the "installsearch.cmd" file using the following command from the command prompt:

C:\>installsearch.cmd > report.out

Make sure that report.out does not show any errors (0X0 indicates success). As soon as this is done, you can add the new FTS resource to the SQL Server group from Cluster Administrator. In the resource’s Properties, clear the "Affect the group" check box.
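If report.out is long, a quick scan for anything other than 0X0 saves some squinting. The Python sketch below pulls out every hexadecimal token in the text and returns the ones that are not zero. The exact layout of report.out is an assumption on my part (I am only relying on result codes appearing as 0x-prefixed hex), so treat an empty result as a hint, not proof, and still read the file.

```python
import re

# Matches tokens such as 0X0, 0x0 or 0x80004005.
HEX_CODE = re.compile(r"0x[0-9a-f]+", re.IGNORECASE)


def nonzero_codes(report_text):
    """Return every hex code in the report whose value is not zero."""
    return [token for token in HEX_CODE.findall(report_text) if int(token, 16) != 0]
```

An empty list means no non-zero hex code was found; anything else is worth looking up before you add the FTS resource.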

Now open the registry on the node you are working on and go to HKEY_LOCAL_MACHINE\Cluster\Resources. Find the full-text search resource, and then click its Parameters key. Create two string values under this key:

ApplicationName: SQLServer (default instance) or SQLServer$ (a named instance)
ApplicationPath: Full path to FTDATA.

Now fail over to the second node and repeat the above steps there. Ensure that during this time the second instance of SQL Server is moved to the other node. Once you have installed FTS on each node, you can try to bring the FTS resource online. (If it fails to come online on node 2, try failing over to the first node and failing back; this seems to do the trick.)

Finally, you need to install the service pack for FTS. Take FTS offline and execute the following on each node after moving the SQL Server resource to that node. You can get the command from the sqlstp(N).log file.

D:\sql2ksp3\x86\FullText\MSSearch\Search\SearchStp.exe /s /a:SQLServer

Bring FTS online and test fail-over and fail-back.

If you make it to this point you will have recovered your cluster running both Exchange 2003 and SQL Server 2000, the whole shebang. Before we leave, let’s take some time to prevent having to do this again in the future by backing up the cluster properly.

Prevention Is Better Than A Cure
As you can see, recovering from a total cluster failure is not a pleasant experience, especially when SQL Server is involved. To avoid going through this again, make sure you take good, reliable backups on a regular basis. The smartest way to back up your cluster is to use the built-in Backup or Restore Wizard (a.k.a. NTBackup) provided with the operating system. NTBackup is cluster aware.

In case you missed this in Part I of this article, there are four groups of data that a cluster needs to operate properly. These are as follows:

  • Cluster disk signatures and partitions (use ASR and NTBackup)

  • Quorum data (use NTBackup)

  • Data on each cluster disk (this can be backed up using a third-party application)

  • Data on the individual cluster node (use NTBackup)

As soon as the cluster is operational, and before you begin to back up any data on the server cluster nodes, back up the cluster disk signatures and partitions. This is achieved using Automated System Recovery in the Backup Wizard. You will later need to restore the signature of the quorum disk, which is essential to recovering from a complete cluster failure.

Next, make sure you regularly back up the cluster quorum. The cluster quorum contains the current cluster configuration, application registry checkpoints, and the cluster recovery log. Most cluster system failures occur because the quorum data is lost. Use the Backup Wizard to back up the cluster quorum data. A System State backup from any node backs up the quorum.

You only need to back up System State from one node. It is not necessary to back up the quorum on the remaining cluster nodes. It does no harm, however, to back up the clustering software, cluster administrative software, the system state, and the application data on the remaining nodes.
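The wizard is fine for a one-off, but a regular quorum backup is more likely to happen if it is scheduled from the command line. The Python sketch below builds an `ntbackup backup systemstate` command with the /J (job name) and /F (backup file) switches. The job name and the .bkf path in the example are hypothetical, and the sketch only produces the command string; scheduling and running it are left to you.

```python
def system_state_backup_command(job_name, backup_file):
    """Build an NTBackup command line for a System State backup to a file.

    job_name and backup_file are supplied by the caller; double quotes
    are rejected because they would break the quoting of the command.
    """
    for value in (job_name, backup_file):
        if not value or '"' in value:
            raise ValueError("values must be non-empty and contain no double quotes")
    return f'ntbackup backup systemstate /J "{job_name}" /F "{backup_file}"'


if __name__ == "__main__":
    # Hypothetical job name and target path, for illustration only.
    print(system_state_backup_command(
        "Cluster quorum", r"E:\Backups\clnode01-systemstate.bkf"))
```

Whatever you schedule, restore from it at least once on a test machine; a backup you have never restored is only a hope.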

Now, you are done. Wipe your brow, grab a bite to eat and call the office to tell them you are going home to get some sleep. And cancel that trip to the Alaskan Crab employment agency. Working with Windows Server 2003 is still more rewarding, as long as you don’t have a contract out on your life.

Server Pipeline columnist Jeffrey R. Shapiro is the co-author of Windows Server 2003 Bible (Wiley) and is an infrastructure architect who manages a large Windows Server network for an insurance firm.