Here's a nightmare scenario: You bring down your seemingly infallible Windows Server 2003 cluster nodes for maintenance and boom, the cluster explodes into a million pieces. Do you leave the country, commit seppuku, book one-way passage on an Alaskan crab boat, or call in the medics for urgent resuscitation? No need to go into exile; just save this two-part article for the day when you hear the eerie whistle of a cluster bomb about to land.
Let me tell you about a cluster bomb that fell earlier this month, while my wounds are still swollen. A client called me to help move an entire rack of servers from one data center to another. Humming along in this rack without a care in the world was an active-active cluster with nodes running Exchange 2003 and SQL Server 2000. This was a single cluster in which any one of the nodes could carry either Exchange 2003 or SQL Server 2000 should one of the hosts crash. On this fateful day the entire cluster crashed. (The fallout would still be hanging over downtown Miami, if not for Hurricane Dennis.)
Here’s what happened. Step 1 at 08:00 hours, Sunday July 3, 2005: We shut down the servers. Access to the cluster was stopped at the DMZ, so clients no longer had access to email or databases. This had been planned, with a notice of scheduled maintenance sent well in advance. There was no traffic to the servers. Each node was carefully shut down in turn until the last node, which by then carried all the resources, was also powered off. After the cluster was silent, the domain controllers, storage and other infrastructure servers were shut down. When the entire rack was dark we moved all the servers to the new location and racked them in their new home. Step 1 was complete on schedule, in time for breakfast.
Step 2 at 10:00 hours. The servers are now racked in their new home, all network cables in, switches installed and powered up, rack console glowing and waiting for the first servers to come to life. First we brought back the domain controllers (DMZ still inaccessible). Storage and all infrastructure servers came back fine and checked into the domain. The noise in the rack was now deafening (Dell servers emit a howling scream when they first start, and that sound can send a chill down the spine). Finally, all the resources the cluster depends on were ready. The last step was to bring up the nodes.
Step 3 at 11:00 hours. Ka . . . boom, and we still don’t know what happened. We only know that after the dust had settled on the first node we found no sign of life coming from the cluster service. The server was quickly given last rites and we turned to the other nodes. To spare you the painful details: the entire cluster was dead. Not one of the nodes would start independently. The cluster database on each node was completely destroyed. Such an event can fry the brain of a green server administrator. This is when the meek MCSEs are separated from the battle-hardened pros. Those who cannot stomach the event need to be sedated quickly and airlifted out of the data center.