While internet connectivity as a whole is resilient, the fabric that our users and applications traverse is susceptible to outages.
At SolarWinds, we cover our tradeshow booths with IT geek buttons and stickers that lament, celebrate and cheekily commiserate over the day-to-day frustrations of IT admins. A perennial favorite is a laptop sticker that correctly tells users, “No, the internet is not DOWN.” The internet by design can’t be down -- obvious insider jokes are hilarious. But is that really true? What if we each have our own unique internet that can most certainly go down?
Mud and fiber
The likelihood of an actual hard failure of primary internet connectivity has become remarkably low. High-bandwidth stand-by redundancy and automated failover have become commonplace for even medium-size businesses. True, the occasional construction crew can get us called to the CIO’s office when a backhoe severs multiple fibers and repairs take many hours in a muddy ditch. That’s about the only way to knock us truly offline, however, and not the sort of outage we generally worry about. Instead, we face silent cuts, sometimes many times a day, without knowing it.
The greatest cause of transmission failure and performance issues is the very robustness of the internet itself. At any given time, even basic transmission paths between endpoints in the same metro area may have thousands of potential routing permutations, determined one packet and one next hop at a time. It’s an indeterminate routing table chess game, played by all the participating systems in the path. And where the proliferation of hop options was once limited by the cost of reconfiguring carrier networks, that’s no longer the case. Carriers now programmatically manage their internal networks, creating and destroying multiple virtual routes during the day in response to specific demand.
And when the basic fabric that we increasingly depend on for critical service traffic is so mutable, it’s too easy for a single errant link to create real end-user headaches that are difficult to troubleshoot, especially outside your firewall. For example, if your primary ISP maintains active-active load balancing across four multi-homed paths between backbone hubs, you might feel reassured. But if just one of those links starts dropping packets due to congestion, users could experience up to 25% packet loss, silently.
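The arithmetic behind the four-path example can be sketched as follows. This is a minimal illustration, assuming traffic is spread across the paths per packet in fixed proportions and that each path's loss rate is independent of the others; the function name `effective_loss` and the sample loss rates are hypothetical, not taken from any monitoring product.

```python
def effective_loss(per_path_loss, weights=None):
    """Expected end-to-end loss when traffic is spread across parallel paths.

    per_path_loss: loss rate of each path, as a fraction between 0 and 1.
    weights: share of traffic sent down each path (assumed equal if omitted).
    """
    if weights is None:
        weights = [1 / len(per_path_loss)] * len(per_path_loss)
    if len(weights) != len(per_path_loss):
        raise ValueError("need exactly one weight per path")
    # Each path contributes its own loss rate, scaled by its share of traffic.
    return sum(w * p for w, p in zip(weights, per_path_loss))


# Four equally weighted paths, one of them dropping everything it carries.
worst_case = effective_loss([0.0, 0.0, 0.0, 1.0])

# Same four paths, but the congested one drops only 20% of its packets.
congested = effective_loss([0.0, 0.0, 0.0, 0.2])

print(f"worst case: {worst_case:.0%}, congested: {congested:.0%}")
```

Under these assumptions, 25% is the ceiling, reached only when the bad link drops everything it carries. A partially congested link produces a smaller overall loss figure, which is exactly why it can go unnoticed.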
Old tools don’t help
Traceroute does what it can, bless its heart. But it reports only the next-hop decisions made at the time of execution. A typical first moment of real-world confusion for new admins is seeing a different route reported when they run the test again only minutes later. Ping, in all its simple wonder, is actually better at demonstrating asymmetry when it briefly reveals telltale blips of an affected multi-homed link: 5ms, 6ms, 5ms, 120ms, 4ms, 7ms, 9ms, 3ms, 154ms, 3ms, 5ms, 8ms, 261ms. Can you see it?
The daily tools many engineers rely on simply weren’t designed for the modern, essentially unknowable internet, where a single transient issue in thousands of route permutations may drive calls to your helpdesk. It’s too easy to feel that external routing is out of our control, to believe we’ve done the best we can, and to hope for protection and safe passage from an ISP-associated saint. (I know an admin with a St. Isidore of Seville figurine buried upside down in front of his NOC.)
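For readers who want to see it programmatically, here is a rough sketch that flags the blips in the ping series quoted above. The approach (comparing each round-trip time against a multiple of the median) and the factor of 5 are illustrative assumptions, not a standard threshold, and `find_spikes` is a hypothetical helper.

```python
from statistics import median


def find_spikes(rtts_ms, factor=5.0):
    """Return (index, rtt) pairs whose RTT exceeds factor times the median RTT."""
    baseline = median(rtts_ms)
    return [(i, rtt) for i, rtt in enumerate(rtts_ms) if rtt > factor * baseline]


# The round-trip times quoted in the text, in milliseconds.
samples = [5, 6, 5, 120, 4, 7, 9, 3, 154, 3, 5, 8, 261]

spikes = find_spikes(samples)
share = len(spikes) / len(samples)

print(spikes)            # positions and values of the slow replies
print(f"{share:.0%}")    # fraction of replies that were slow
```

In this sample, three of the 13 replies stand out against a median of 6 ms. A recurring minority of slow replies like this is the kind of pattern one degraded path among several load-balanced ones might produce.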
Your internet snowflake
Your internet is a relatively infinitesimal collection of possible internet routes that your applications and end-users actually traverse. With modern network monitoring technology, you can discover and visualize not just your internal, more deterministic routes, but also the superset of possible links outside your firewall. With a little discovery, you can walk the curious trails your packets follow, over miscreant ISP overflow shunts and around dead virtual links that the ISP may not even know are down.
Over time, it’s amazing what happens when you seek the enlightenment of your internet. It’s more predictable than you might think, as the regular ebb and flow of route changes begins to follow a pattern in each overall route. Your systems can learn to issue smarter alerts -- not just when total latency exceeds a threshold, but when something about the nature of the delay is novel. You may even find that actively monitoring beyond your firewall gets you invited to an ISP insiders' club. Ever start a call to your ISP support desk with the exact name of the router causing packet loss? They stammer a bit, then ask a lot of questions about your methods, and finally you hear the awe. It’s glorious.
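As a rough sketch of what a "smarter alert" might look like, the snippet below pairs a plain latency threshold with a novelty check. It treats a hop that has never been seen before as a stand-in for "novel", which is a simplifying assumption; the `PathMonitor` class, the hop names and the 100 ms threshold are all hypothetical, and a real monitoring system would learn far richer patterns.

```python
from dataclasses import dataclass, field


@dataclass
class PathMonitor:
    """Tracks hops seen on a route and flags thresholds and novelty."""

    latency_threshold_ms: float
    seen_hops: set = field(default_factory=set)

    def observe(self, hops, total_latency_ms):
        """Record one path measurement and return a list of alert strings."""
        alerts = []
        # Classic alert: total latency crosses a fixed threshold.
        if total_latency_ms > self.latency_threshold_ms:
            alerts.append(f"latency {total_latency_ms} ms exceeds threshold")
        # Novelty alert: a hop appears that was never seen on this route.
        # The first observation only seeds the baseline and raises nothing.
        new_hops = [h for h in hops if h not in self.seen_hops]
        if self.seen_hops and new_hops:
            alerts.append("novel hops: " + ", ".join(new_hops))
        self.seen_hops.update(hops)
        return alerts


monitor = PathMonitor(latency_threshold_ms=100)
print(monitor.observe(["gw", "isp-1", "isp-2", "dest"], 20))   # baseline
print(monitor.observe(["gw", "isp-1", "isp-3", "dest"], 25))   # new hop
print(monitor.observe(["gw", "isp-1", "isp-3", "dest"], 140))  # slow
```

The second measurement is well under the latency threshold yet still raises an alert, because the route now passes through a hop the monitor has not seen before.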