Despite the best intentions, simple changes to infrastructure can turn into a Chevy Chase slapstick schtick. You know the routine: Chevy reaches for a glass on a table, knocks a candlestick, tries to catch it, knocks something else over, and so on. A few weeks ago, I decided to make a few changes to our network topology. I set aside two hours for the changes to take place and that included troubleshooting time. I thought I was being liberal. I was wrong.
We were renumbering some of our hosts, segmenting our network, and consolidating our critical production equipment into a single rack. About a week before the BIG DAY, I set our DNS TTL to 5 minutes. I had a plan written down detailing the steps. Shut down critical services; change the IP addresses; shut down the servers; move our equipment into different racks; re-cable the network; power up and check. Simple, right?
What started out as 2 hours of planned downtime became 9 hours, with Mike DeMaria and me trotting between consoles and the machine room. Luckily this happened between Christmas and New Year's, so the impact was minimal, but still.
First, I forgot a critical step, which was to manually trace and label all the cables we were moving so that in the worst case, we could quickly restore the network. OK, my bad. We pull the firewalls, taps, and other hardware from their location and rack them up. We pull the cables from the existing equipment and start to re-cable them to a switch in the rack. DeMaria suggests that we rip out all the cables and just re-run them new. It's already noon, when we should be back up. Nah, let's just get the cables connected and move on. We can clean up later.
There goes lunch. Next, we stopped the critical servers and changed the IP addresses. I made sure that the licenses were not tied to the IP address, so no problems there. But we did run into a major issue with our mail server. We use CommuniGate Pro on Microsoft Windows 2000 Advanced Server and Microsoft Cluster Services. That migration did not go well at all. First off, we changed the IP address on the external NICs, leaving the internal NICs untouched. Cluster services stopped and wouldn't start. So what does any MS admin do? Reboot. When the clusters came up, they couldn't find the Compaq disk array. Well, that's bad.
So I drop both clusters and bring up one. The disk array appears! OK, so I go into the disk controller and there are some disk errors. The suggestion is to rebuild the array. Fine, let's do that. While that is happening, I check that mail is running (it is) and that we are receiving mail (we aren't). So this is a good time to reconfigure the firewall. I changed the IP addresses on the host objects and added some inbound NAT rules to translate the old IP addresses to the new ones. Yes, a week after setting the zone TTL to 5 minutes, we were still getting connections to the old IPs even though the records had been changed hours ago. So we verify that the servers are visible by telnetting to the ports from an external shell account. It works. Yay.
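We did that last check by hand with telnet, but it is easy to script. Here is a minimal sketch in Python of the same idea, to be run from a machine outside the firewall. The host name and port list in the example are placeholders I made up for illustration, not our actual servers, and the `check_ports` helper is hypothetical. It only confirms that a TCP connection can be opened; it says nothing about whether the service behind the port is healthy.

```python
import socket


def check_ports(host, ports, timeout=3.0):
    """Try a TCP connect to each port; return {port: True/False}."""
    results = {}
    for port in ports:
        try:
            # create_connection resolves the name and completes the TCP handshake.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts, and name-resolution failures.
            results[port] = False
    return results


if __name__ == "__main__":
    # Hypothetical host and ports (SMTP, POP3, IMAP); substitute your own.
    for port, is_open in check_ports("mail.example.com", [25, 110, 143]).items():
        print(f"port {port}: {'reachable' if is_open else 'NOT reachable'}")
```

Pointing it at both the old and the new addresses would also show whether the inbound NAT rules are doing their job.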