Simple changes, big headaches


Mike Fratto

January 16, 2007


Despite the best intentions, simple changes to infrastructure can turn into a Chevy Chase slapstick schtick. You know the routine: Chevy reaches for a glass on a table, knocks a candlestick, tries to catch it, knocks something else over, and so on. A few weeks ago, I decided to make a few changes to our network topology. I set aside two hours for the changes, and that included troubleshooting time. I thought I was being liberal. I was wrong. We were renumbering some of our hosts, segmenting our network, and consolidating our critical production equipment into a single rack. About a week before the BIG DAY, I set our DNS TTL to 5 minutes. I had a plan written down detailing the steps: shut down critical services; change the IP addresses; shut down the servers; move our equipment into different racks; re-cable the network; power up and check. Simple, right?

What started out as 2 hours of planned downtime became 9 hours, with Mike DeMaria and me trotting between consoles and the machine room. Luckily this happened between Christmas and New Year's, so the impact was minimal, but still.

First, I forgot a critical step: manually tracing and labeling all the cables we were moving so that, in the worst case, we could quickly restore the network. OK, my bad. We pull the firewalls, taps, and other hardware from their locations and rack them up. We pull the cables from the existing equipment and start to re-cable them to a switch in the rack. DeMaria suggests that we rip out all the cables and just re-run them new. It's already noon, when we should be back up. Nah, let's just get the cables connected and move on. We can clean up later.
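The skipped step amounts to writing down both ends of every cable before unplugging it. A minimal sketch of such a record in Python follows; the `Link` type, the helper names, and the device and port names are all hypothetical, invented for illustration rather than taken from the actual network.

```python
from typing import Dict, Iterable, List, NamedTuple


class Link(NamedTuple):
    """One labeled cable and the two ports it connects."""
    label: str
    device_a: str
    port_a: str
    device_b: str
    port_b: str


def record_links(links: Iterable[Link]) -> Dict[str, Link]:
    """Index cables by label, refusing duplicate labels."""
    table: Dict[str, Link] = {}
    for link in links:
        if link.label in table:
            raise ValueError(f"duplicate cable label: {link.label}")
        table[link.label] = link
    return table


def restore_plan(table: Dict[str, Link]) -> List[str]:
    """One line per cable, sorted by label, describing where it was plugged in."""
    return [
        f"{l.label}: {l.device_a} {l.port_a} <-> {l.device_b} {l.port_b}"
        for l in sorted(table.values(), key=lambda l: l.label)
    ]


if __name__ == "__main__":
    # Hypothetical inventory taken before anything is moved.
    before = record_links([
        Link("C2", "firewall-1", "eth1", "switch-1", "port 2"),
        Link("C1", "firewall-1", "eth0", "switch-1", "port 1"),
    ])
    for line in restore_plan(before):
        print(line)
```

Printed and taped to the rack, a list like this is the worst-case fallback: each labeled cable goes back where the record says it was.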

There goes lunch. Next, we stopped the critical servers and changed the IP addresses. I had made sure that the licenses were not tied to the IP address, so no problems there. But we did run into a major issue with our mail server. We use Communigate Pro on Microsoft Windows 2000 Advanced Server and Microsoft Cluster Services. That migration did not go well at all. First off, we changed the IP address on the external NICs, leaving the internal NICs untouched. Cluster services stopped and wouldn't start. So what does any MS admin do? Reboot. When the clusters came up, they couldn't find the Compaq disk array. Well, that's bad.

So I drop both clusters and bring up one. The disk array appears! OK, so I go into the disk controller and there are some disk errors. The suggestion is to rebuild the array. Fine, let's do that. While that is happening, I check that mail is running (it is) and that we are receiving mail (we aren't). So this is a good time to reconfigure the firewall. I changed the IP addresses on the host objects and added some inbound NAT rules to translate the old IP addresses to the new ones. Yes, after a week with the zone TTL set to 5 minutes, we were still getting connections to the old IPs even though they had been changed hours ago. So we verify that the servers are visible by telnetting to the ports from an external shell account. It works. Yay.

Back to the cluster. The array is re-indexed and looks good. We have a failed drive, but that's OK with me. So let's see if we can get the cluster up. It's about 3:00 pm now. We were supposed to be done by 12:00. Did I tell you my phone was ringing by this time? It's just those pesky users wondering what the heck is going on. It's too late to revert to the way things were. So DeMaria and I plod on.
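The lingering connections to the old IPs are easier to appreciate with the TTL arithmetic written out. The Python sketch below computes two times: when caches that honor TTLs have let go of a record fetched under the old TTL, and the latest moment such a cache could still hand out the old address after the change. Only the 5-minute TTL comes from the story; the one-day old TTL, the dates, and the function names are assumptions for illustration.

```python
from datetime import datetime, timedelta


def earliest_safe_cutover(ttl_lowered_at: datetime, old_ttl: timedelta) -> datetime:
    """Time after which TTL-honoring caches no longer hold a record
    that was fetched with the old, longer TTL."""
    return ttl_lowered_at + old_ttl


def latest_stale_answer(ip_changed_at: datetime, new_ttl: timedelta) -> datetime:
    """Latest time a TTL-honoring cache may still return the old address
    once the record itself has been changed."""
    return ip_changed_at + new_ttl


if __name__ == "__main__":
    # Hypothetical timeline: TTL lowered about a week before the move.
    ttl_lowered_at = datetime(2006, 12, 20, 10, 0)
    ip_changed_at = datetime(2006, 12, 27, 10, 0)
    old_ttl = timedelta(days=1)       # assumed; not stated in the article
    new_ttl = timedelta(minutes=5)    # from the article

    print("Old TTL flushed by:  ", earliest_safe_cutover(ttl_lowered_at, old_ttl))
    print("Stale answers end by:", latest_stale_answer(ip_changed_at, new_ttl))
```

Under these assumptions, a well-behaved resolver would have stopped returning the old addresses within minutes of the change. Connections arriving hours later suggest clients or caches that hold addresses longer than the TTL allows, though the story does not confirm the cause. The inbound NAT rules cover that case either way.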

OK, we try to bring the cluster back online: no joy. The drive array is showing up, but the cluster service is dying and nothing is starting. It's moving past 4:00 pm. Cryptic messages are filling the event log. Decision time: 1) continue to troubleshoot the cluster server issue, or 2) just install Communigate Pro on a single server and be done with it. We decide on option 2. We can troubleshoot Cluster services later. Luckily, we are only using 4 GB for our mail and that will fit onto our server disks.

So I copy the files from the RAID array to drive C:, which takes f-o-r-e-v-e-r. I also copy the files to our storage server, because at this point my priority is to restore service and fix it later. If I have the files centrally located, then I can install anywhere. I install Communigate Pro, start the service, and it fails. I run "netstat -anp tcp" on the mail server and no mail ports are listening (25, 143, etc.). OK, I give up. It's heading for 6:00. My worst fear is an all-nighter. I am not leaving until it's working.
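Both port checks in this story, the telnet test from an outside shell account and the netstat look on the server itself, ask roughly the same question: does anything accept TCP connections on the mail ports? A minimal Python sketch of that check follows, using only the standard library. The function names are made up for illustration, and `127.0.0.1` stands in for whatever host is being tested.

```python
import socket
from typing import Dict, Iterable


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_ports(host: str, ports: Iterable[int], timeout: float = 3.0) -> Dict[int, bool]:
    """Check several ports on one host and report each result."""
    return {port: port_is_open(host, port, timeout) for port in ports}


if __name__ == "__main__":
    # 25 = SMTP, 143 = IMAP: the mail ports named in the article.
    for port, is_open in check_ports("127.0.0.1", [25, 143], timeout=1.0).items():
        print(f"port {port}: {'listening' if is_open else 'not reachable'}")
```

Run on the mail server against its own address, this approximates the netstat check. Run from an outside machine against the public address, it stands in for the telnet test and also exercises the firewall and NAT rules along the way.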

I call Communigate tech support and get right through. They troubleshoot the problem in under a minute: it was a configuration problem. I make the changes and bingo! Thanks and see ya later. Now let's try to connect via IMAP or the web UI. No joy. I call back: "Hi, it's me again." They tell me what other changes to make. They also want me to know they are having a training session in Dallas, if I want to attend. Intriguing, but I will probably have to pass. Thanks again, mail is up. All that is left is to reconfigure our Barracuda Spam filter. As DeMaria put it when he checked mail, "I got spam, so mail is working." By 7:00, I am making some final tweaks to our network and documenting the changes. I won't remember tomorrow.

Lessons learned? Obviously I didn't do my homework properly, or I would have foreseen some of the problems we ran into. After the dust settled, I did find some KB articles on Microsoft.com about troubleshooting changing IP addresses on a Cluster box. I will have to try those later. I also probably need to get more familiar with Communigate Pro. Left untouched, it pretty much runs itself, so that is good. But I wouldn't be able to do much troubleshooting on my own just yet.

Dave Greenfield, Network Computing's Editor, asked the following week, "So what was the point of making those changes?" Good question. Mostly I wanted to segment our critical servers from the rest of the production network in order to protect the critical network and servers from changes to the rest of our network. We have accomplished that. We will also be better able to change our production network in the future. The rest of our changes are less radical and should go more smoothly. At the very least, they won't impact our critical services.

About the Author(s)

Mike Fratto

Former Network Computing Editor
