Last month's global disruption of Microsoft cloud services, including Azure, Teams, and Outlook, was the latest in an all-too-common run of cloud outages. In this case, the cause was an innocent WAN router update gone wrong. But it highlights the point we've repeatedly made about the fragility of the world's communications infrastructure.
In this latest incident, which lasted about two and a half hours, millions of users started experiencing network connectivity issues when trying to access the Microsoft cloud-hosted services. In a post-mortem explaining what happened, Microsoft noted: “a network engineer was performing an operational task to add network capacity to the global Wide Area Network (WAN) in Madrid. The task included steps to modify the IP address for each new router, and integration into the IGP (Interior Gateway Protocol, a protocol used for connecting all the routers within Microsoft’s WAN) and BGP (Border Gateway Protocol, a protocol used for distributing Internet routing information into Microsoft’s WAN) routing domains.”
It further noted that the company has an SOP (standard operating procedure) when making such changes. The SOP details a four-step process that includes testing the change in a network emulator; testing the change in a lab setting; a review documenting these first two steps, as well as roll-out and roll-back plans; and a safe deployment approach that only allows access to one device at a time to limit impact if there are any issues once an update is started.
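The fourth step, touching only one device at a time, can be illustrated with a minimal Python sketch. It is not Microsoft's tooling: the function name and the three callbacks (apply_change, post_check, roll_back) are hypothetical stand-ins for whatever automation an operator actually uses.

```python
from typing import Callable, Iterable, List


def staged_rollout(
    devices: Iterable[str],
    apply_change: Callable[[str], None],
    post_check: Callable[[str], bool],
    roll_back: Callable[[str], None],
) -> List[str]:
    """Apply a change to one device at a time, stopping at the first failure.

    Returns the devices that were changed and passed their post-check.
    All three callbacks are hypothetical hooks supplied by the caller.
    """
    done: List[str] = []
    for device in devices:
        apply_change(device)
        if not post_check(device):
            # Undo the change on the failing device and halt the rollout,
            # so the remaining devices are never touched.
            roll_back(device)
            break
        done.append(device)
    return done
```

The point of the design is that a bad change is caught by the post-check on the first device it reaches, which limits the impact to that one device rather than the whole fleet.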
Unfortunately, the SOP was changed before the scheduled update. Microsoft noted: “Critically, our process was not followed as the change was not re-tested and did not include proper post-checks per steps one through four. This unqualified change led to a chain of events that culminated in the widespread impact of this incident.”
What happened? The change added a command to purge the IGP database – however, Microsoft noted that the command operates differently for different router manufacturers. “Routers from two of our manufacturers limit execution to the local router, while those from a third manufacturer execute across all IGP joined routers, ordering them all to recompute their IGP topology databases.”
The change resulted in two cascading events. First, routers within the Microsoft global network started recomputing IP connectivity throughout the entire internal network. Second, because of the first event, BGP routers started to readvertise and validate prefixes received from the Internet. Due to the scale of the network, it took approximately 1 hour and 40 minutes for the network to restore connectivity to every prefix, according to Microsoft.
Actions taken to avoid a similar outage
Configuration changes and DNS issues have been the source of multiple major outages over the last two years. And more are almost certainly on the way.
“What the recent failures from Internet giants demonstrate is that the question of the next outage is not if, but when,” says Dritan Suljoti, Chief Product and Technology Officer of Catchpoint. “Moreover, the downstream effect of major outages to essential Internet infrastructure, such as cloud platforms, CDNs, or DNS providers, means that no company is immune, no matter how well prepared they think they are.” (Suljoti's comments came in a press release about the company's new report on "Preventing Outages in 2023: What we Can Learn From Recent Failures.")
So, what are the cloud providers doing to address the problem? The most recent outage offers some insight into their strategies.
First, problem detection is crucial. The sooner a cloud or service provider knows there is an issue, the faster it can troubleshoot and resolve the problem. With the recent outage, Microsoft said monitoring systems detected DNS and WAN-related troubles seven minutes after they began.
Second, methods and best practices must be developed and followed to avoid outages outright. Again, with the latest outage, Microsoft outlined several actions it is taking to prevent a repeat of the problem.
One issue that contributed to the outage was a change to a standard operating procedure. That change was not properly revalidated and left the procedure containing an error. To address this issue, Microsoft will audit all SOPs still pending qualification, and it will try to improve the process by conducting regular, mandatory operational training and by confirming that all SOPs are followed.
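One way to catch a procedure that was edited after it was qualified is to tie the qualification to the exact text that was tested. The Python sketch below is an illustration of that idea only, not a description of Microsoft's process; the Qualification record and both function names are hypothetical.

```python
import hashlib
from dataclasses import dataclass


def fingerprint(procedure_text: str) -> str:
    """Return a SHA-256 hex digest of the procedure text."""
    return hashlib.sha256(procedure_text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Qualification:
    """Hypothetical record that one version of a procedure passed its tests."""
    sop_id: str
    qualified_fingerprint: str


def is_qualified(procedure_text: str, record: Qualification) -> bool:
    """True only if the text matches the version that was qualified."""
    return fingerprint(procedure_text) == record.qualified_fingerprint
```

Under this scheme, any edit to the procedure, however small, changes the fingerprint and forces the SOP back through testing before it can be executed.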
Another issue was that a standard command with different behaviors on different router models was issued outside of standard procedures. That caused all WAN routers in the IGP domain to recompute reachability. Going forward, Microsoft will audit and block similar commands that can widely impact all three vendors’ WAN routers.
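A guard of the kind described here could be sketched in Python as a lookup of audited command behavior per vendor. Everything in the sketch is an assumption made for illustration: the vendor labels, the command string "purge-igp-database", and the default of blocking anything unaudited are placeholders, not real router syntax or Microsoft's actual policy.

```python
from enum import Enum
from typing import Dict, Tuple


class Scope(Enum):
    LOCAL = "local"          # affects only the router it runs on
    DOMAIN_WIDE = "domain"   # propagates to every router in the IGP domain


# Hypothetical audit results: (vendor, command) -> observed scope.
# The vendor names and command string are placeholders.
AUDITED_SCOPE: Dict[Tuple[str, str], Scope] = {
    ("vendor_a", "purge-igp-database"): Scope.LOCAL,
    ("vendor_b", "purge-igp-database"): Scope.LOCAL,
    ("vendor_c", "purge-igp-database"): Scope.DOMAIN_WIDE,
}


def is_allowed(vendor: str, command: str) -> bool:
    """Allow a command only if it was audited as local on this vendor.

    Unaudited (vendor, command) pairs are blocked by default.
    """
    return AUDITED_SCOPE.get((vendor, command)) is Scope.LOCAL
```

In practice such a check would apply only to a curated list of high-risk commands, since blocking every unaudited command would halt routine work.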