Surviving the Windows Server 2003 Cluster Bomb

A complete cluster crash is a server admin's worst nightmare. Our intrepid columnist describes his, and how he dealt with it -- and how you can avoid it.

July 19, 2005

NetworkComputing

Here's a nightmare scenario: You bring down your seemingly infallible Windows Server 2003 cluster nodes for maintenance and boom, the cluster explodes into a million pieces. Do you leave the country, commit seppuku, check for one-way passage on an Alaskan crab boat, or call in the medics for urgent resuscitation? No need to go into exile; just save this two-part article for the day when you hear that eerie whistle of a cluster bomb about to land.

While my wounds are still swollen, let me tell you about a cluster bomb that fell earlier this month. A client called me to help move an entire rack of servers from one data center to another. Humming along in this rack without a care in the world was an active-active cluster setup with nodes running Exchange 2003 and SQL Server 2000. This was a single cluster in which any one of the nodes could carry either Exchange 2003 or SQL Server 2000 should one of the hosts crash. On this fateful day the entire cluster crashed. (The fallout would still be hanging over downtown Miami, if not for Hurricane Dennis.)

Here’s what happened. Step 1 at 08:00 hours, Sunday July 3, 2005: We shut down the servers. Access to the cluster was stopped at the DMZ so clients no longer had access to email or databases. This had been planned, with a notice of scheduled maintenance sent well in advance. There was no traffic to the servers. Each node was carefully shut down in turn until the last node, carrying all the resources, was also powered off. After the cluster was silent, the domain controllers, storage and other infrastructure servers were shut down. When the entire rack was dark we moved all the servers to the new location and racked them in their new home. Step 1 was complete on schedule, in time for breakfast.

Step 2 at 10:00 hours. The servers are now racked in their new home, all network cables in, switches installed and powered up, rack console glowing but waiting for the first servers to come to life. First we brought back the domain controllers (DMZ still inaccessible). Storage and all infrastructure servers came back fine and checked into the domain. The noise in the rack was now deafening (Dell servers emit a howling scream when they first start, and that sound can send a chill down the spine). Finally all the resources the cluster depends on were ready. The last step was to bring up the nodes.

Step 3 at 11:00 hours. Ka . . . . . . boom; and we still don’t know what happened. We only know that after the dust had settled on the first node we found no sign of life coming from the cluster service. The server was quickly given last rites and we turned to the other nodes. To spare you the painful details, the entire cluster was dead. Not one of the nodes would start independently. The cluster database on each node was completely destroyed. Such an event can fry the brain of a green server administrator. This is when the meek MCSEs are separated from the battle-hardened pros. Those who cannot stomach the event need to be sedated quickly and airlifted out of the data center.

With the cluster completely gone we took a deep breath, gathered up the collective brain power in the conference room and considered all options to get operational as soon as possible.

The first thing that needs to be said is that you cannot panic; it only makes matters worse. It is critical to remain calm even if, as in our case, you are only about 20 hours away from the dawn of the new business day and the CEO of the company is a seven-foot-tall Irish rugby player who benches 250 lbs before breakfast each day. It is not hard to imagine what life will be like the next morning when he gets into work and can’t access his email.

On a whiteboard we wrote up all the options we had for recovering the cluster before sunrise. Here’s what we listed.

1. Using cluster utilities, attempt to recover the cluster. If that does not work, try option 2.
2. Restore the cluster database from the backups. If that does not work, try option 3.
3. Clean up the nodes and rebuild the cluster. Reinstall Exchange and SQL Server on the nodes. This is the last resort. If this does not work, start calling the Alaskan crab boat employment agencies.

Each of the above options should be allocated a fixed amount of time for completion, especially if you are working against a deadline. If the cluster is recoverable using any available utility, option 1 will take no longer than 30 minutes. In other words, within half an hour you will either recover the cluster or realize that what you have is DOA and that you need to move on to the next option.

Option 2 is most likely as far as you will need to go, but you should dedicate one to two hours to completing this task and restarting the cluster, depending on where the backups are located. That is, if you have reliable, recent backups of your cluster database, recovery is likely within the hour if the backups are nearby, or perhaps two hours if the backups are on tape at another location (more about the cluster database backups later).

If your backups are also insufficient or “missing” then option 3 is your last resort. Dedicate about an hour to reinstall the cluster (of course you need to be a trained cluster administrator to reinstall a cluster). Then dedicate about three hours to recover Exchange on a cluster and about ten hours to recover SQL Server. I will discuss why SQL Server takes so long to recover in a few minutes.
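The time boxes above can be tallied in a few lines. The following Python sketch is only an illustration: the step labels and the `cumulative_hours` helper are hypothetical, the hour figures are the upper bounds quoted above, and it assumes the options are attempted strictly in sequence, each one using its full allotment before the next begins.

```python
# Upper-bound time boxes, in hours, taken from the estimates in the text.
# The labels and this helper are illustrative, not part of any cluster tool.
TIME_BOXES = [
    ("option 1: recovery utilities", 0.5),
    ("option 2: restore cluster database", 2.0),
    ("option 3: reinstall cluster", 1.0),
    ("option 3: recover Exchange", 3.0),
    ("option 3: recover SQL Server", 10.0),
]


def cumulative_hours(time_boxes):
    """Return (label, elapsed) pairs, assuming each step uses its full box."""
    elapsed = 0.0
    timeline = []
    for label, hours in time_boxes:
        elapsed += hours
        timeline.append((label, elapsed))
    return timeline


if __name__ == "__main__":
    deadline = 20.0  # hours until the next business day, as in our case
    for label, elapsed in cumulative_hours(TIME_BOXES):
        status = "ok" if elapsed <= deadline else "past deadline"
        print(f"{elapsed:5.1f} h  {label}  ({status})")
```

Under these assumptions the worst case adds up to 16.5 hours, which leaves only about three and a half hours of slack against a 20-hour deadline. That is why it pays to abandon an option the moment its time box expires.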

So what we have on our whiteboard now is this. If the cluster is salvageable, it will take up to three hours to recover it using recovery tools or backups. If the cluster has to be rebuilt, it will take anywhere from four hours to a day and a half to recover the cluster and get all services back up and running. So the total amount of time to recover from a total cluster failure is anywhere from three hours to 16 hours (without rest, food, or relief of any kind for the people involved).

Now, why have I noted that SQL Server can take as long as ten hours to recover? It should not be this way. As easy as it is to install SQL Server on a virgin cluster (easier than Exchange 2003), the opposite is the case if you have to rebuild the cluster and reinstall SQL Server on the same nodes (even after cluster cleanup). You cannot simply install SQL Server again. You first have to remove all remnants of SQL Server from the nodes before you can reinstall the service, and even after this SQL Server may not make a full recovery. Let’s first look at the initial options.

The cluster utilities available to recover a brain-dead cluster are few and far between, and not one is the holy grail of cluster repair kits. The first tool you can use to detect any sign of life from the cluster is the Cluster Diagnostics Tool (clusdiag). This tool can help troubleshoot cluster problems. If the cluster is not starting because some dependent resource (such as storage) is down, clusdiag will report this. However, if the cluster database is damaged, you will not get much help from clusdiag to restore it.

The cluster utility (cluster.exe) offers limited help recovering from failure. It is only useful to clean up a dead cluster database and restore the cluster database (clusdb) to its virgin state. Again, if the cluster database is corrupt you don’t have much choice but to restore it. A bad clusdb cannot be fixed with packing, reindexing or something similar. It’s not an Access database or something like a config file.

The cluster database and the content of associated files (like the quorum logs) are distributed. Each node gets an identical copy of the database so that it can operate in the cluster when it is called into failover duty. Thus, if the cluster database is dead on one node it is almost certainly dead on all the nodes. You might be in luck if a cluster node was taken out of the cluster (through shutdown rather than eviction) before the corruption, but the chance of that fortune befalling you is slim.

It is thus critical to keep daily or very recent backups of your cluster database, because a cluster lost to corruption of the cluster database can only be recovered through a restore. However, you can’t simply back up the cluster database like any other file. Many backup tools that simply back up files do not back up the cluster database. Here’s why.

The Windows Server 2003 cluster database contains the cluster state data that is replicated among nodes of a server cluster. This ensures that all nodes have a consistent configuration. However, the clusdb database is actually the cluster hive of the registry. So because of the distributed nature of clusters, backing up a local cluster database contained in the clusdb hive is not sufficient to ensure that the full cluster state has been saved. The data is stored in a number of places in the quorum resource and the database and you need the proper backup tool to obtain reliable backup data.

There are four groups of data that are critical to the proper operation of Windows Server 2003 clusters. These groups are as follows:

1. The cluster disk signatures and their partitions
2. The cluster quorum data
3. The actual data on the shared cluster disks
4. The data on the individual cluster nodes

Before you can back up any data on the server cluster nodes, you need to make sure that you back up the cluster disk signatures and partitions. This is achieved using Automated System Recovery (ASR) in the Backup Wizard. Backing up this data is mandatory if you later need to restore the signature of the quorum disk. You will lose the quorum disk signature if, for example, you are caught up in a complete system implosion as described earlier. Inevitably, the signature of the quorum disk will have changed since you last backed up.

You should always activate ASR. It will be invaluable in helping you recover a dead cluster. ASR is a two-part recovery process consisting of ASR backup and ASR restore. These tools are accessed in the Backup or Restore Wizard in Advanced mode. We will return to backup and restore of clusdb in Part II of this article, where I will provide a detailed discussion of backing up and restoring the cluster data.
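One way to keep the four data groups honest is to audit them routinely. The Python sketch below is a hypothetical illustration, not a feature of any backup product: the group labels, the `stale_groups` helper and the one-day threshold are all assumptions, and it presumes you record somewhere the time of the last good backup for each group.

```python
from datetime import datetime, timedelta

# The four data groups named in the text. The keys are illustrative labels.
REQUIRED_GROUPS = (
    "disk signatures and partitions (ASR)",
    "cluster quorum data",
    "data on shared cluster disks",
    "data on individual nodes",
)


def stale_groups(last_backup, now, max_age=timedelta(days=1)):
    """Return the groups with no backup or one older than max_age.

    last_backup maps a group label to the datetime of its latest backup.
    """
    problems = []
    for group in REQUIRED_GROUPS:
        taken = last_backup.get(group)
        if taken is None or now - taken > max_age:
            problems.append(group)
    return problems


if __name__ == "__main__":
    now = datetime(2005, 7, 3, 8, 0)
    # Hypothetical inventory: node data is backed up nightly, nothing else is.
    inventory = {"data on individual nodes": datetime(2005, 7, 2, 23, 0)}
    for group in stale_groups(inventory, now):
        print("no recent backup:", group)
```

A group that is missing from the inventory is reported just like one that is out of date, so a backup job that was never configured cannot hide behind one that merely ran late.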

Now back to cluster triage. Unfortunately our case here took a turn for the worse. Running diagnostics on the server nodes, we quickly discovered that the cluster data was corrupt and could not be recovered with any utility. We turned to the restore-from-backup option, only to be dealt another death blow. The technician backing up the clusters was not backing up the cluster data (system state) as required. He was simply backing up the servers using our standard server backup system. We thus did not have reliable cluster data with which to recover our cluster.

We thus found ourselves stuck with option 3. The old cluster was given its last rites and steps were taken to salvage services to be donated to the new cluster. At this point it must be noted with extreme emphasis that while the cluster is dead, the resources and applications that relied on it are not. In the case of Exchange 2003, the mailbox stores, the Exchange configuration data in the registry, and the Exchange databases and logs are all intact. Make sure you back up the active node and the data disks that were intact at the time of the cluster failure.

If you have a failed cluster carrying Exchange 2003 or SQL Server 2000 or both, then take note of what I am about to say. DO NOT DO ANYTHING to your Exchange binaries, databases, logs and so on. In short, leave Exchange on the failed cluster nodes completely alone and do nothing but back up the node and the Exchange data on the shared cluster disks.

Now SQL Server is a little different. Your old SQL Server 2000 system cannot be resurrected. All, however, is not lost with SQL Server. You still have access to all your SQL Server databases. So make a copy NOW of all databases that were attached at the time of the failure. You are going to need them for the restore of SQL Server. These databases include all system databases such as Master and TempDB.

Also make copies of your databases’ transaction logs. In short, back up the entire data directory of your SQL Server installation on the shared disk resource. Once you have completed the backup, rename that data directory so that nothing can overwrite the data in it. You should rename the entire SQL Server installation folder on the shared disk. (Typically it will be something like S:\Program Files\SQL Server\Data. Now change it to S:\Program Files\Old SQL Server Data or something similar.)
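The copy-then-rename step can be scripted so that nothing is skipped at three in the morning. What follows is a minimal Python sketch, not a prescribed procedure: the `preserve_data_dir` helper, the `" (old)"` suffix and the paths in the usage example are hypothetical, and it assumes the SQL Server service is stopped (so no file is locked) and that the backup destination sits on a different disk from the shared cluster disk.

```python
import shutil
from pathlib import Path


def preserve_data_dir(data_dir, backup_root, suffix=" (old)"):
    """Copy data_dir under backup_root, then rename the original aside.

    Returns (backup_copy, renamed_original). Raises if either target
    already exists, so an earlier copy is never overwritten.
    """
    data_dir = Path(data_dir)
    backup_copy = Path(backup_root) / data_dir.name
    renamed = data_dir.with_name(data_dir.name + suffix)
    if renamed.exists():
        raise FileExistsError(renamed)
    # copytree refuses to write into an existing destination by default.
    shutil.copytree(data_dir, backup_copy)
    # Rename only after the copy has completed without error.
    data_dir.rename(renamed)
    return backup_copy, renamed


if __name__ == "__main__":
    # Hypothetical paths; substitute the real shared-disk and backup locations.
    copy, old = preserve_data_dir(r"S:\SQLData", r"E:\Rescue")
    print("backup copy:", copy)
    print("original renamed to:", old)
```

The ordering is deliberate: the original directory is renamed only after the copy has finished, so a failed copy leaves the data exactly where it was.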

You are now ready to reinstall your cluster and reattach Exchange 2003 and SQL Server 2000 to it. This process will be discussed in Part II. The night will be long and arduous. In the end you will recover your cluster, Exchange and SQL Server. In my case we had three thousand users who were going to connect to Exchange in less than 10 hours, and the SQL Server databases were going to be needed to service more than two thousand customers. The smell of the ocean and Alaskan crab bait was unmistakable in my mind as I considered my fate should we fail.
