The router glitch that grounded hundreds of United Airlines flights for a couple of hours last month led some network pros to share their war stories on Reddit. One confessed that he cut the wrong cable, halting an estimated $3 billion in stock trades in Chicago. Incredibly enough, while he didn't get fired, his boss did.
We all like to tell and listen to these stories about networks going belly up, but you don't want to be the person causing an outage. In this blog, I'll offer some tips for successful network configuration and change management.
Don’t blindly follow orders. Don’t expect managers to be technically savvy or to know the network well enough to assess how big an impact a change will have.
For example, a manager might say upgrading and rebooting a core switch isn't a problem because there's a redundant switch. But upgrading a core switch during work hours can be very dangerous even if there is redundancy. It’s not uncommon to hear stories where rebooting the primary switch has triggered a bug that took the secondary switch offline or put the network into a brittle state. These kinds of upgrades should be performed off hours.
Meanwhile, the networking pro in Chicago could probably have avoided the outage if he'd done a little research and consulted the documentation before clipping the cable.
Make a plan. A change plan doesn't always have to be written down for a small job, but at least build a plan in your head for the work order needed to execute the change. Even for small changes, think through the steps to take, how to verify that the change was successful, and how to roll back in case there is a problem. Even just adding a VLAN can cause unexpected events. For more complex changes, make a written plan of all the required steps, commands, and the exact rollback process. Define how much time each step should take and delineate a breaking point where things have to work or rollback will be started.
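The elements above -- steps, verification, time budgets, and a breaking point -- can be captured as data rather than a wall of text. Here's a minimal sketch; the VLAN example, step names, and time budgets are made up for illustration:

```python
# A change plan as a data structure: steps, verification, rollback,
# and a breaking point. All device commands here are hypothetical examples.
from dataclasses import dataclass


@dataclass
class Step:
    description: str
    command: str          # the exact command to run
    verify: str           # how to confirm the step worked
    minutes_budget: int   # time allotted to this step


@dataclass
class ChangePlan:
    title: str
    steps: list
    rollback: list          # exact commands to undo the change
    breaking_point_min: int  # elapsed minutes at which rollback starts

    def total_budget(self) -> int:
        return sum(s.minutes_budget for s in self.steps)

    def fits_breaking_point(self) -> bool:
        # Sanity check: all steps must fit before the rollback deadline.
        return self.total_budget() <= self.breaking_point_min


plan = ChangePlan(
    title="Add VLAN 42 to distribution switches",
    steps=[
        Step("Create VLAN", "vlan 42", "show vlan brief | include 42", 5),
        Step("Trunk uplinks", "switchport trunk allowed vlan add 42",
             "show interfaces trunk", 10),
    ],
    rollback=["no vlan 42"],
    breaking_point_min=30,
)

print(plan.total_budget())        # 15
print(plan.fits_breaking_point())  # True
```

Writing the plan this way forces you to state a verification command and a rollback for every step before you touch the network.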
Peer review. Get a second pair of eyes on your plan; it’s always good to get a second opinion. That does not mean that you should always second guess your work, but another person may have some experience that will help avoid problems. This person might have already run into some bug/issue or learned something that you haven’t yet. There’s no shame in getting a second pair of eyes on your plan!
Change advisory board. The dreaded CAB! We engineers don’t really like ITIL, which includes the CAB concept, because we think it slows us down and it can be an administrative nightmare to go through the process. However, in some networks, some form of CAB is necessary.
I work a lot with civil aviation companies and their networks, where shooting from the hip is not an option. Mistakes could put people’s lives at risk, add stress for flight planners, and potentially lose a lot of money. Any change to the network has to be assessed for the impact it will have on all the systems that ride on it. You don’t have to implement a full-scale CAB; it could be enough to have a policy saying that all changes need to be reviewed by another team member.
Know where you are coming from. If you don’t know what the network looked like before, how can you tell if it is working after the change? You can’t! Before implementing the change, look at link utilization: What does the MAC address table look like? What does the routing table look like? The more data you have, the more confident you can be that you haven’t accidentally caused the network to lose connectivity after your change.
Keep backups. Sometimes things just break; it may not even be something you did wrong. You rebooted the switch and it came up blank; no configuration! How fast can you recover? You do keep a copy of the configuration for all network devices, right? Do you also back up the software of the devices? How well documented is your network? Could you recreate the configuration from documentation if you had to? Getting bitten once is usually enough to make sure you never skip a backup again.
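A backup job doesn't have to be elaborate; even a small script that saves a dated copy of every device's config beats having nothing. In this sketch, `fetch_config()` is a placeholder that returns canned text so the example runs on its own; a real version would pull the running config over SSH (e.g., with a library like Netmiko):

```python
# Save one timestamped config file per device. fetch_config() is a
# hypothetical stand-in for an SSH/API call to the device.
import os
import tempfile
from datetime import date


def fetch_config(host: str) -> str:
    # Placeholder: a real version would retrieve the running config.
    return f"hostname {host}\n! rest of running-config ...\n"


def backup_all(hosts, dest_dir):
    """Write one dated .cfg file per device; return the saved paths."""
    paths = []
    for host in hosts:
        path = os.path.join(dest_dir,
                            f"{host}-{date.today().isoformat()}.cfg")
        with open(path, "w") as f:
            f.write(fetch_config(host))
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as d:
    saved = backup_all(["core-sw1", "core-sw2"], d)
    print(len(saved))  # 2
```

Run it from cron or a scheduler, keep the output in version control, and recovering from a blank switch becomes a copy-paste job instead of a crisis.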
Automate. Most outages are caused by humans -- not bugs, power issues or anything else. We make mistakes; it’s human. What can we do to prevent the mistakes from happening? Automate! Create a script or program that will configure the changes for you and verify them; this is almost a must for any large-scale change. Test it first, though. The so-called blast radius is a lot bigger when automating changes. I have heard about people accidentally erasing the configuration of 200 switches before detecting the problem. Not fun at all!
To automate something you really need to think through all the steps to take and how to verify them. This can be a fruitful exercise. It also can be helpful to save output from commands and logs in a database so that you can do analysis if something goes wrong while executing the script. Try it on one or a few less-critical devices at first, if possible, then move on to the other parts of the network.
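The canary approach above -- touch one less-critical device, verify, then continue -- can be sketched in a few lines. The `apply_change` and `verify` functions here are hypothetical stand-ins for whatever your automation actually does:

```python
# Canary rollout: apply the change to one low-criticality device first,
# verify it, and abort the run before touching the rest if it fails.
# Device names and the apply/verify logic are made-up placeholders.

def apply_change(device: str, applied: list) -> None:
    applied.append(device)  # stand-in for pushing config to the device


def verify(device: str) -> bool:
    # Stand-in check; a real one might confirm neighbors, routes, traffic.
    return not device.startswith("broken")


def rollout(devices: list, applied: list) -> bool:
    """Canary first; stop before the fleet if the canary fails."""
    canary, rest = devices[0], devices[1:]
    apply_change(canary, applied)
    if not verify(canary):
        return False  # blast radius: one device, not 200 switches
    for d in rest:
        apply_change(d, applied)
        if not verify(d):
            return False  # stop at the first failing device
    return True


done = []
print(rollout(["lab-sw1", "edge-sw1", "edge-sw2"], done), len(done))

bad = []
print(rollout(["broken-sw1", "edge-sw1"], bad), len(bad))
# The second run fails on the canary and never touches edge-sw1.
```

Keeping the verification step between every device is what shrinks the blast radius: a bad change stops after one box instead of propagating through the whole network.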
Sharing stories about network debacles is fun, but not if they're about you. How a person handles network changes can differentiate a good engineer from a great engineer.