In the first six months of 2023, tracking Internet, cloud, and network-based incidents, something interesting happened.
Raw outage numbers continued to rise, just as they’ve done over the past few years, but two trends in those numbers stood out.
First, the ratio of Internet Service Provider to Cloud Service Provider (ISP:CSP) outages changed slightly. Though ISPs still account for most outages, the ratio has shifted from 89:11 over the same period in 2022 to 83:17. Part of this increase in CSP outages may reflect the coincident growth in deployed cloud infrastructure. Infrastructure growth in many parts of the world is exponential, particularly as demand for compute and data transmission capacity skyrockets with the AI boom.
Second, and more importantly, the nature of outages in the first half of the year also changed.
The big blast radius outages seen in recent years, where a network or cloud disruption had a cascading impact on dozens of other high-profile apps or services, all but evaporated for the first few months of this year. They were instead supplanted by a large number of smaller, more contained outages and disruptions.
With that comes a fundamental change in how organizations - such as application owners - will have to approach detection and incident response.
Into the unknown world of smaller outages
The anatomy of big outages has become something of a known quantity. They’ve become a tolerated, if not accepted, consequence of doing business on the Internet.
The impact of these incidents can be so large that they have a material, reportable effect on uptime for a significant subsection of users - potentially the entire user base. These incidents often cause quite a bit of collateral damage, and as a result they're frequently covered by news outlets and analyzed in depth by the responsible party.
App makers caught up in these incidents are able to point to the upstream cause, and the rest is more or less out of their hands. Typically, there's general recognition of the issue at hand and a degree of patience in waiting for a resolution (or rollback) to filter back down.
But in a world with fewer big outages and greater numbers of small-scale disruptions, this ‘luxury’ no longer exists. In the absence of an easy explanation - a common denominator or obvious upstream cause - organizations and app makers now have to do more legwork to pinpoint the root of a degradation or disruption that starts to impact their operations.
In a world of smaller, highly contained outages, the question, 'Does the root cause rest with me or someone else?' still needs to be answered, but with greater urgency, because end users want to know. End users tend to accept the odd outage, but more readily if there's transparency about where the fault lies and they're kept informed of what's going on. Information is power: it can be used to determine an alternate course of action, such as switching to a backup service for a few hours until the disrupted service recovers.
To give end users this level of insight, many organizations will need an improved ability to correlate what they're seeing - to understand where the fault domain lies and the magnitude of the impact: whether it's affecting a "small subset" of an upstream service provider's infrastructure or users, or just them.
A greater degree of vigilance is also required because, odds are, they are - or will be - experiencing more small outages, more often, that are so contained and localized in impact that they might not register on an upstream provider's status page or warrant a post-incident analysis of what went wrong.
When organizations can no longer rely on notifications and analyses from upstream providers to explain an outage, the onus is firmly on them to undertake their own independent monitoring, diagnosis, and assessment.
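To make the idea of independent fault-domain localization concrete, here is a minimal Python sketch - not any vendor's actual tooling - that compares two independent probes (one against your own service, one against an upstream dependency) to roughly classify where a disruption sits. The names `ProbeResult` and `classify_fault_domain` are hypothetical illustrations.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """Outcome of one synthetic probe (hypothetical structure)."""
    target: str        # e.g. "our-app" or "upstream-api"
    ok: bool           # did the request succeed?
    latency_ms: float  # observed response time


def classify_fault_domain(own: ProbeResult, upstream: ProbeResult,
                          latency_slo_ms: float = 500.0) -> str:
    """Roughly localize a disruption from two independent probes.

    Returns "healthy", "upstream", "internal", or "widespread".
    """
    own_bad = (not own.ok) or own.latency_ms > latency_slo_ms
    up_bad = (not upstream.ok) or upstream.latency_ms > latency_slo_ms
    if not own_bad and not up_bad:
        return "healthy"
    if own_bad and up_bad:
        return "widespread"  # both degraded: likely a shared upstream cause
    if up_bad:
        return "upstream"    # dependency degraded; our own stack looks fine
    return "internal"        # only our own service is degraded
```

In practice the probes would come from real synthetic tests run from multiple vantage points, but even this toy decision logic illustrates the point: with independent measurements of both your service and its dependencies, "is it me or them?" becomes answerable without waiting for a provider's status page.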
Containing small outages
So, we know outages are smaller and more frequent, but why? One theory is that cloud service providers, Internet service providers, and network operators have all become much more adept at managing and updating their infrastructure in a software-defined way. This has had a few effects, notably that infrastructure changes tend to be smaller and more frequent because providers can effectively model how an update will land before deploying it.
Predictable change management increases confidence in rapid code deployments. It also means that in the event that a change has an anomalous or unforeseen impact, the deployment can be rolled back to a prior stable version relatively efficiently. Often, this detection and rollback is automated, contributing to further containment of an issue. While an outage still occurs, the magnitude is smaller, and the blast radius is contained. There’s a much lower chance of a failed deployment in one service or area having a domino effect on others.
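The automated detect-and-rollback pattern described above can be sketched in a few lines of Python. This is purely illustrative - `deploy`, `get_error_rate`, and `rollback` are hypothetical hooks into a deployment system, not any provider's real API - but it captures the logic: compare the new version's error rate against the stable baseline, and revert automatically if it regresses beyond a tolerance.

```python
def should_roll_back(baseline_error_rate: float,
                     canary_error_rate: float,
                     tolerance: float = 0.01) -> bool:
    """Flag a regression if the new version's error rate exceeds
    the stable baseline by more than the tolerance."""
    return canary_error_rate > baseline_error_rate + tolerance


def run_deployment(deploy, get_error_rate, rollback,
                   new_version: str, stable_version: str) -> str:
    """Deploy, observe, and auto-revert on regression.

    The three callables are hypothetical hooks into a deployment
    system; they are injected here so the logic stays testable.
    """
    deploy(new_version)
    baseline = get_error_rate(stable_version)
    canary = get_error_rate(new_version)
    if should_roll_back(baseline, canary):
        rollback(stable_version)
        return "rolled-back"
    return "promoted"
```

Because the rollback decision is automated and scoped to one deployment, a bad change is caught and reverted before it can cascade - which is consistent with the shift toward smaller, better-contained outages described above.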
For organizations that purchase cloud and data transmission services, the need to independently understand this constantly changing infrastructure landscape has never been more acute. Visibility across the breadth of their environment - spanning native and third-party services and dependencies - is crucial to correlating with certainty where things are going wrong and what recourse the organization has to make things right.
Mike Hicks is Principal Solutions Analyst at Cisco ThousandEyes.