The cloud has become an important component of almost every organization's business strategy. Companies rely on cloud computing to power modern applications and on SaaS-based services to enable critical business functions. While most cloud providers have invested heavily in resilient operations, outages still occur from time to time, and for several reasons. The cause might lie with the cloud provider, but it could also be a network issue, a DNS problem, a security event, or something else.
Since outages are inevitable, the important thing is to understand why they occurred and what can be learned from them. Recently, ThousandEyes, a Cisco company, hosted a webinar focused on major cloud outages that disrupted business operations in 2022. The network intelligence company recorded more than 15,000 outage events that year using its network-agnostic data, which provides insights across the Internet and into the cloud. Having collected the insights, ThousandEyes put together a recap to help companies plan proactively and mitigate downtime in the future.
For those who do not know ThousandEyes, the company offers a complete view of the cloud ecosystem, including SaaS services, IaaS platforms, network connectivity, content delivery network (CDN) performance, WAN visibility, and more. This provides a view of the Internet and cloud through the lens of the user, enabling businesses to optimize user experience, a critical point of differentiation today.
“For those of us who rely on cloud-hosted services, we realize that the infrastructure has been chopped up, further distributed, decentralized, and split into smaller components to handle compute, storage, security, networking—all the pieces that rely on APIs to talk to one another. So, when problems arise, it can impact everything,” said Chris Villemez, senior technical marketing engineer at ThousandEyes.
Here are the top five outages ThousandEyes identified in 2022, based on impact:
- Twitter (March 28, 2022): Some users couldn’t reach Twitter after RTComm, a Russian internet and satellite communications provider, announced one of Twitter’s prefixes and blackholed the traffic it attracted. While most Twitter users outside of Europe experienced no disruption during this Border Gateway Protocol (BGP) hijack, affected users couldn’t reach Twitter for approximately 45 minutes, until RTComm withdrew the erroneous route.
ThousandEyes lesson learned: Though your company might have RPKI (Resource Public Key Infrastructure) implemented to fend off BGP threats, your telco might not. That is something to consider when selecting ISPs.
- Rogers Communications (July 8, 2022): An internal routing error caused the Canadian communications company to withdraw its prefixes, causing the provider to be unreachable across the Internet for almost 24 hours. The outage impacted millions of users across Canada.
ThousandEyes lesson learned: No provider is immune to outages, no matter how large. So, for crucial services like hospitals and banking, plan for a backup network provider that can reduce the length and scope of an outage.
- Google (August 9, 2022): An outage made Google Search and Google Maps unavailable to users worldwide. Apps that depend on Google's software functions also stopped working during the outage, which lasted approximately 60 minutes. During that time, Google web servers returned HTTP 500 Internal Server Error messages, 502 Bad Gateway errors, and timeouts.
ThousandEyes lesson learned: It is important to monitor not just your application frontends but also the performance-critical dependencies that power your app.
- British Airways (February 25, 2022): An outage caused hundreds of flight cancellations and operational interruptions at the airline's London Heathrow hub, one of the busiest international airports. The incident began when application servers became unresponsive. The root cause was likely a central backend repository that multiple front-facing services rely on.
ThousandEyes lesson learned: Architecting backends that avoid single points of failure can reduce the likelihood of a chain of events, like the one experienced by British Airways, that can ground your entire fleet.
- Zoom (September 15, 2022): A brief outage impacted global users who couldn't log in or join meetings. The HTTP errors indicated potential CDN issues. The root cause appeared to be in Zoom's backend systems around resolving, routing, or redistributing traffic.
ThousandEyes lesson learned: It may be that the app itself is causing issues rather than the network. Visibility into which one is at fault can prevent confusion and finger-pointing during root cause analysis.
Additionally, ThousandEyes detected several other outages involving Amazon Web Services (power failure and packet loss); Atlassian (service unavailable/data loss); Zscaler (network traffic loss); and WhatsApp (failure to send/receive messages).
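The Twitter incident turns on the check that RPKI makes possible: route origin validation, in which a BGP announcement is compared against signed records of which network may originate which prefix. The following is a minimal sketch of that classification logic, following the valid / invalid / not-found states described in RFC 6811. The names `Roa` and `validate_origin` are hypothetical, and the prefixes and AS numbers are reserved documentation values, not the ones involved in the outage. It is an illustration only, not how ThousandEyes or any router implements the check.

```python
from dataclasses import dataclass
from ipaddress import ip_network
from typing import Sequence


@dataclass(frozen=True)
class Roa:
    """A simplified route origin authorization (hypothetical structure)."""
    prefix: str      # address block the holder has authorized
    max_length: int  # longest prefix length allowed inside that block
    asn: int         # AS number permitted to originate the route


def validate_origin(route_prefix: str, origin_asn: int, roas: Sequence[Roa]) -> str:
    """Classify an announcement as 'valid', 'invalid', or 'not-found'."""
    route = ip_network(route_prefix)
    covered = False
    for roa in roas:
        roa_net = ip_network(roa.prefix)
        # Skip ROAs of the other IP version or that do not cover the route.
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue
        covered = True
        # A covering ROA matches only if the origin AS and length both agree.
        if roa.asn == origin_asn and route.prefixlen <= roa.max_length:
            return "valid"
    # Covered but unmatched means the wrong network is announcing the prefix.
    return "invalid" if covered else "not-found"


# Illustrative data only: documentation prefix and documentation AS numbers.
roas = [Roa(prefix="192.0.2.0/24", max_length=24, asn=64496)]

legitimate = validate_origin("192.0.2.0/24", 64496, roas)
hijacked = validate_origin("192.0.2.0/24", 64511, roas)
unknown = validate_origin("198.51.100.0/24", 64496, roas)
```

A provider that drops announcements classified as invalid would discard the second route here instead of forwarding traffic toward it, which is why it matters whether your ISPs perform this validation and not only whether your own prefixes are signed.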
The webinar wrapped up with a summary of key lessons and takeaways. These were:
- BGP powers the Internet but can also be misused and abused. Visibility and planning are needed to protect the network.
- Public cloud is ubiquitous and generally reliable, but outages still happen. Ensure you are monitoring all cloud dependencies.
- Avoid single points of failure. Apps are only as resilient as the architecture.
- Security is essential, but it can add great complexity that requires continuous end-to-end visibility.
- Whenever the infrastructure is touched, failures can occur. Visibility is critical before and after each network change to avoid impacts.
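Two of these lessons, monitoring every dependency and telling application faults from network faults, can be illustrated with a small probe. The sketch below assumes each dependency exposes an HTTP endpoint; the function names `probe`, `classify`, and `summarize`, the dependency names, and the sample results are all hypothetical. It is a deliberately coarse classification, not a substitute for a monitoring product.

```python
import urllib.error
import urllib.request
from typing import Dict, Optional


def probe(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code   # the server answered, but with an error status
    except OSError:
        return None       # timeout, DNS failure, refused or reset connection


def classify(status: Optional[int]) -> str:
    """Map a probe result to a coarse fault domain."""
    if status is None:
        return "network"      # nothing came back, so look at the path first
    if status >= 500:
        return "application"  # reachable but failing, e.g. 500 or 502
    return "ok"               # assumption: any non-5xx response counts as reachable


def summarize(results: Dict[str, Optional[int]]) -> Dict[str, str]:
    """Classify every dependency, not just the frontend."""
    return {name: classify(status) for name, status in results.items()}


# Hypothetical probe results: the frontend looks fine while two backends fail.
sample = {"frontend": 200, "auth-api": 502, "maps-backend": None}
report = summarize(sample)
```

In this sample the frontend alone would suggest everything is healthy, while the report points to one dependency failing at the application layer and another that cannot be reached at all. In practice, `probe` would be called on a schedule from several vantage points to populate the results.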
Prior to being an analyst, I spent many years as a network engineer, and I couldn't agree more on the point of visibility, as one can't manage what one can't see. It's important to understand that visibility needs to be end-to-end, which means from the user device to the cloud and everything in between. Many monitoring tools claim to be a "single pane of glass" but monitor only part of the network. The result is what's known as "swivel chair management," where each dashboard covers one part of the network and the engineer must correlate the information manually. This is why the majority of application problems, 74% according to my numbers, are reported by the end user before IT is aware of them.
The cloud is here to stay, and outages will happen. IT pros need to choose their cloud providers carefully but also ensure network resiliency and visibility are in place so they can recover from problems as quickly as possible.
Zeus Kerravala is the founder and principal analyst with ZK Research.
(Read his other Network Computing articles here.)