In my last post, I suggested that we change the way we configure the network by emulating the application development model of continuous integration or deployment. While I covered many of the prerequisites for making a move in this direction, I didn’t talk about implementation or the pipeline for deploying network changes.
While we can borrow many things from the application space in terms of deploying changes, we can’t easily replicate application-style testing on the network. The reasons for this are numerous, but they mostly center on scope, dependencies, and environments.
In the network space, defining the scope of a change or a test is hard because of the dynamic nature of most networks. Couple that with the fact that we’re dealing with lots of different platforms and protocols, and the dependencies for a test can be hard to determine. Moreover, most of us don’t have access to development networks that mimic the production network. Even if we did, we likely don’t have real-life traffic running across the development network to simulate production conditions.
This is not to say that running tests on a network is something we don’t do today. Developing a test to validate a network configuration change isn’t hard. Most of us making large-scale changes have detailed checkout plans that involve verifying the state of devices throughout the change window. Developing the test plan can often become more work than planning the actual change. The problem is that these checkouts involve manually touching many devices and inspecting them for specific details. In other words, these manual methods lack any sort of automation and are prone to missing issues that occur as a byproduct of the change. Changes would be much easier if we could find ways to dynamically audit the state of the network throughout the change.
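The kind of automated audit described above boils down to snapshotting key metrics before the change, snapshotting them again afterwards, and diffing the two. Here is a minimal sketch in Python; the metric names and values are hypothetical stand-ins for whatever your collection method (SNMP, CLI scraping, or a vendor API) would actually return:

```python
def diff_state(pre, post):
    """Return {metric: (before, after)} for every metric that changed."""
    keys = set(pre) | set(post)
    return {k: (pre.get(k), post.get(k)) for k in keys if pre.get(k) != post.get(k)}

# Hypothetical snapshots of one device taken before and after a change.
pre = {"bgp_peers_established": 4, "ospf_neighbors": 2, "prefixes_learned": 63}
post = {"bgp_peers_established": 4, "ospf_neighbors": 1, "prefixes_learned": 64}

changed = diff_state(pre, post)
# changed == {"ospf_neighbors": (2, 1), "prefixes_learned": (63, 64)}
```

Run across every device in the change scope, a diff like this immediately surfaces side effects — here, a dropped OSPF neighbor — that a manual checkout could easily miss.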
Many might say that we do this today in the form of SNMP alerts and syslog messages. If the NOC gets an alert during our change window, we know we broke something. I’d argue that in today’s network, these are no longer valid approaches for end-to-end network testing and validation. Not only are these alerts often misleading, they are also prone to being missed or mischaracterized as smaller problems.
Monitoring often focuses on application-level probes. However, we have to consider that while the two are related, the health of the application is not a reliable proxy for the health of the network. Validating that the application is healthy has very little to do with validating that the network is healthy. The hardest issues to catch are the ones that degrade the network but don’t fully cripple it. These issues, when introduced during off-peak hours, often go unnoticed until the next day when traffic ramps up during peak hours.
What we really need to focus on is device and network state pre- and post-change. For instance, consider a simple change in which you advertise a new prefix into the WAN. You advertise the prefix, log into a couple of remote routers, and make sure they see the new route. The next day, you get a call because one of your WAN sites is dropping massive amounts of traffic. After looking into the issue, you realize that only one of the site routers accepted the prefix, causing it to become saturated as it was taking all of the traffic rather than spreading the load across the others.
This sort of problem could have been avoided by identifying key metrics and checking the state before and after the change. Before the change, you may have had 63 prefixes on each remote site router, learned from an eBGP peer. Afterwards, you would expect to have 64. If a site reports back a state that doesn’t match the expected outcome of the change, you know where to start looking. This sort of automated testing methodology would catch many of the “day after” issues incurred after a change.
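The expected-state check from the prefix example above can be sketched in a few lines of Python. The router names and prefix counts are illustrative; in practice the observed counts would come from querying each site router:

```python
def verify_prefix_count(observed, expected):
    """Return the routers whose learned-prefix count doesn't match expectation."""
    return [router for router, count in observed.items() if count != expected]

# After advertising one new prefix, every site router that held 63 prefixes
# should now report 64. Hypothetical observed values:
observed = {"site1-rtr1": 64, "site1-rtr2": 63}  # rtr2 never accepted the prefix
failed = verify_prefix_count(observed, expected=64)
# failed == ["site1-rtr2"] -- the router to start troubleshooting
```

A check this simple, run automatically at the end of the change window, would have flagged the saturated router the night of the change rather than the morning after.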
While the state verification I’ve described above sounds like a great idea, we’re pretty far from that being a reality. The majority of networking vendors don’t offer easy ways to query device state. In addition, most networks are made up of many different types of devices from potentially different vendors. Some devices in the data center space offer APIs to make querying device state rather straightforward. And while I’m not aware of any specific tools designed to perform state checking, there are many of us in the network space trying to solve this problem. So while there isn't a silver bullet for network testing, we are making progress and I expect this area to get continued focus.