What’s Causing Cloud Outages? A Network Managers’ Guide

From fat-finger errors to fishing boats, here are the leading reasons cloud outages at AWS, Microsoft, and others are a growing network resilience challenge.

As enterprises rely more and more on cloud services to meet their network infrastructure, compute, data storage, and security needs, cloud computing outages have a significant impact on operations.

Many believed (or hoped) that moving services to the cloud would eliminate some issues. After all, you would assume cloud providers make use of the latest technologies, have staff with expertise in these technologies, and build in lots of redundancy.

Unfortunately, cloud outages have a lot in common with their data center counterparts. Many occur due to human error, power outages, malicious acts, Mother Nature, or plain bad luck.

What’s causing cloud outages?

There are several common culprits causing cloud outages. Over the last few years, we have seen examples of each. All have had a significant impact on the enterprises using the services. Here are some of the top problems that keep recurring.

Configuration mistakes

We’re in the age of graphical user interfaces (GUIs) and automation. Yet many critical IT chores, like deploying a new server, provisioning storage for an application, or setting up new routing tables, are still done manually via command line interfaces (CLIs). As one would expect, that can lead to configuration mistakes.

That is often the case with cloud outages. One such mistake caused a six-hour outage of Facebook, Instagram, Messenger, WhatsApp, and OculusVR due to a routing protocol configuration issue. As we wrote at that time: “The outage was the result of a misconfiguration of Facebook’s server computers, preventing external computers and mobile devices from connecting to the Domain Name System (DNS) and finding Facebook, Instagram, and WhatsApp.”

Essentially, the routes to Facebook’s networks stopped being advertised over BGP, so traffic destined for those networks could not be routed properly. Resolving the problem was more challenging than normal because not only was communication between routers interrupted, but so, too, were DNS traffic and all applications.

The problem here was that everything ran over the same network. As a result, IT staff could not remotely correct the problem because they could not access the impacted systems. Making matters worse, IT staff were locked out of facilities because their access control system also ran over the same network.
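
To make that shared-fate problem concrete, here is a minimal sketch of the kind of check a team might run against its own inventory. The network names, the two example designs, and the function `stranded_recovery_paths` are all hypothetical and chosen for illustration; they do not describe Facebook's actual architecture.

```python
from typing import Dict, List, Set

# Hypothetical inventory: the networks each recovery path can be reached over.
SHARED_DESIGN: Dict[str, Set[str]] = {
    "remote_management": {"production"},
    "building_access_control": {"production"},
}

# Hypothetical alternative in which recovery paths have their own networks.
SEPARATED_DESIGN: Dict[str, Set[str]] = {
    "remote_management": {"production", "out_of_band"},
    "building_access_control": {"facilities_lan"},
}


def stranded_recovery_paths(
    failed_networks: Set[str], reachable_over: Dict[str, Set[str]]
) -> List[str]:
    """Return the recovery paths whose every usable network has failed."""
    return sorted(
        path
        for path, networks in reachable_over.items()
        if networks <= failed_networks  # subset test: no surviving network left
    )


if __name__ == "__main__":
    print(stranded_recovery_paths({"production"}, SHARED_DESIGN))
    print(stranded_recovery_paths({"production"}, SEPARATED_DESIGN))
```

In the shared design, losing the production network strands both recovery paths at once. In the separated design, neither is stranded, which is the property an out-of-band management network is meant to provide.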

Unexpected or unknown system behavior

Obviously, an incorrect configuration change can cause outages. But in one recent case, a correct change still led to a major outage. The reason, unknown to the IT staff at the time, was that the same command operates differently on routers from different vendors. That was the case in an extensive Microsoft outage.

In that event, a network engineer was performing relatively common tasks to add routers and capacity to the company’s wide-area network (WAN). The work involved modifying the IP addresses for the new routers and integrating them into the IGP (Interior Gateway Protocol, used to connect all the routers within Microsoft’s WAN) and BGP (Border Gateway Protocol, used to distribute Internet routing information into Microsoft’s WAN) routing domains.

As we wrote at that time: “The change added a command to purge the IGP database – however, Microsoft noted that the command operated differently for different router manufacturers. Routers from two manufacturers limit execution to the local router, while those from a third manufacturer execute the change across all IGP joined routers, ordering them all to recompute their IGP topology databases.” Due to the scale of the network, that work took everything offline for about two and a half hours while the routing tables were recalculated and updated.
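
One way to guard against this kind of surprise is to record, per vendor, how far a command reaches and to estimate the blast radius before running it. The sketch below is illustrative only: the vendor labels, the `PURGE_SCOPE` table, and the `blast_radius` function are hypothetical and are not taken from Microsoft's tooling or from any vendor's documentation.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set


@dataclass(frozen=True)
class Router:
    name: str
    vendor: str


# Hypothetical scope table: does the purge command stay on the local router
# ("local") or propagate to every router in the IGP domain ("domain")?
PURGE_SCOPE: Dict[str, str] = {
    "vendor_a": "local",
    "vendor_b": "local",
    "vendor_c": "domain",
}


def blast_radius(
    targets: Iterable[Router],
    igp_domain: Iterable[Router],
    scope_by_vendor: Dict[str, str],
) -> Set[str]:
    """Return the names of routers that would recompute their IGP database."""
    affected: Set[str] = set()
    for router in targets:
        # A vendor missing from the table is treated as the worst case.
        scope = scope_by_vendor.get(router.vendor, "domain")
        if scope == "domain":
            return {r.name for r in igp_domain}
        affected.add(router.name)
    return affected


if __name__ == "__main__":
    domain = [
        Router("r1", "vendor_a"),
        Router("r2", "vendor_b"),
        Router("r3", "vendor_c"),
        Router("r4", "vendor_a"),
    ]
    print(blast_radius([domain[0]], domain, PURGE_SCOPE))  # only r1
    print(blast_radius([domain[2]], domain, PURGE_SCOPE))  # every router
```

A change-review step could refuse to proceed, or require extra sign-off, whenever the computed blast radius is larger than the set of routers the engineer intended to touch.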

Power issues

One of the main factors when selecting sites for cloud data centers is the availability of abundant and low-cost electricity. Why? Data centers of any type, whether in the enterprise or for a cloud provider, consume 10 to 50 times more electricity per square foot than a typical commercial building, according to the U.S. Department of Energy. As a result, the major cloud providers have clustered their data centers in regions like the Pacific Northwest (known for its low-cost hydroelectric power), Arizona, Virginia, and other comparable places.

Even with such an abundance of power, power-related outages account for 43% of all data center outages, according to the Uptime Institute. Naturally, cloud data centers, like their enterprise data center counterparts, have backup power capabilities in the event of an outage to the electrical network. Unfortunately, that may not be enough. One of the longest cloud service disruptions, a 12-hour outage of Microsoft's Virginia data center, was due to a problem with a provider's redundant power system.

East Coast companies served by that data center were unable to access any of their Microsoft services. As we noted in a roundup article about cloud outages, the source of the problem was that the facility’s redundant power system created unexpected electrical transients. Air handling units designed to cool the center detected the oscillation and shut themselves down to prevent damage. Once the source of the problem was identified, the units had to be manually reset to restore services at the facility.

Physical damage

In the days and years after the breakup of AT&T, when carriers were expanding their networks, there were numerous stories of major outages due to backhoes cutting cables. Those types of outages have greatly diminished in recent years due to a greater awareness of the issue, better mapping of underground cables, and more.

Lately, it has been Mother Nature doing the damage. In January 2022, a volcanic eruption took out the only connection between Tonga and the outside world. The blast cut the submarine cable linking the island with Fiji.

The story brought attention to the issue, noting that 95 percent of intercontinental global data traffic travels over undersea cables that run across the ocean floor. Worse, many of the most concentrated terminations of such cables are in areas subject to earthquakes, volcanic eruptions, and flooding.

That latter point was an issue in 2012 when post-tropical cyclone Sandy’s landfall points and associated tidal surges along the New York and New Jersey coasts aligned with the termination points of 25 submarine cable systems. The storm cut 11 of the 12 high-capacity cables that connected the US and Europe.

Accidents and malicious actions

As noted, the fragility of the network of undersea cables is of great concern. Beyond the acts of nature mentioned above, the cables, and especially their concentrated termination points, are ripe targets for terrorist or state-sponsored attacks.

But a more common problem is accidental cable cuts at sea. Ships, particularly fishing vessels, will anchor at sea during severe weather. In some cases, the ships are displaced by strong winds or currents. That drags their anchors across the ocean floor, resulting in damage to a cable.

Secondary or unintended impacts

Most cloud service outages directly impact access to an application or suite of applications or services. For example, a Microsoft center outage might mean enterprises cannot access their Outlook, SharePoint, and Teams apps. Or a Facebook outage might also cut off access to Instagram, Messenger, and WhatsApp.

But things are getting more complicated as many cloud services now depend on other services. That was the case when an Amazon outage interfered with the invocation of AWS Lambda functions. As we wrote at the time, that was a major problem because many AWS services and enterprises make use of AWS Lambda's serverless capabilities. The problems with Lambda cascaded, taking more than 100 AWS services offline.
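
The cascade effect is easy to underestimate until the dependencies are written down. As a rough sketch, the example below walks a dependency map to find everything that is transitively affected when one service fails. The service names and the `DEPENDS_ON` map are hypothetical placeholders, not a description of how AWS services actually depend on one another.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical map: each service and the services it directly depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "functions": [],
    "api_gateway": ["functions"],
    "notifications": ["functions"],
    "support_console": ["api_gateway"],
    "object_storage": [],
}


def impacted_services(failed: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return every service that directly or indirectly depends on `failed`."""
    # Invert the map so we can walk from a service to its dependents.
    dependents: Dict[str, List[str]] = {}
    for service, dependencies in depends_on.items():
        for dependency in dependencies:
            dependents.setdefault(dependency, []).append(service)

    # Breadth-first walk; `seen` also protects against dependency cycles.
    seen: Set[str] = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen - {failed}


if __name__ == "__main__":
    print(sorted(impacted_services("functions", DEPENDS_ON)))
```

In this toy map, a failure of the hypothetical `functions` service reaches `support_console` even though the console never calls it directly. Running the same walk against a real service inventory shows which seemingly unrelated applications share a hidden dependency.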

How to protect against cloud outages

There are several ways for cloud service providers to minimize the chances of an outage and for enterprises to minimize their impact.

Cloud providers are taking a number of steps. Many are trying to improve outage detection. Most are clarifying methods, developing best practices, and implementing standard operating procedures (SOPs) for things like router configuration changes or adding equipment to scale their services.

More advanced providers are continuously auditing those SOPs. They want to be sure they are being carried out and that the procedures are still correct, given the dynamic nature of cloud environments.

Additionally, the larger cloud providers make everything redundant. They use multiple circuits and cables to carry traffic between centers and for users to reach their centers. They have hot backups standing by to take over and run applications and services if there is an outage. They also make use of different power supplies, including traditional line-delivered power, on-site generation, and on-site uninterruptible power systems.
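
The appeal of redundancy can be shown with some simple arithmetic. The sketch below uses the standard formula for components in parallel, where the combined availability is one minus the product of the individual unavailabilities. It assumes the components fail independently, which is a strong assumption: the outages described above show that redundant systems often share a network, a power system, or a configuration process. The function names and the example figures are ours, chosen only for illustration.

```python
import math
from typing import Iterable


def parallel_availability(availabilities: Iterable[float]) -> float:
    """Availability of redundant components, assuming independent failures."""
    values = list(availabilities)
    if not values or any(not 0.0 <= a <= 1.0 for a in values):
        raise ValueError("provide at least one availability between 0 and 1")
    # The service is down only when every redundant component is down.
    unavailability = math.prod(1.0 - a for a in values)
    return 1.0 - unavailability


def downtime_hours_per_year(availability: float, hours_per_year: float = 8760.0) -> float:
    """Expected hours of downtime in a 365-day year at a given availability."""
    return (1.0 - availability) * hours_per_year


if __name__ == "__main__":
    single = 0.99  # illustrative figure, not a measured value
    paired = parallel_availability([single, single])
    print(f"single: {downtime_hours_per_year(single):.1f} hours/year")
    print(f"paired: {downtime_hours_per_year(paired):.2f} hours/year")
```

Under the independence assumption, pairing two 99 percent components yields roughly 99.99 percent availability. Any shared dependency between the two pulls the real figure back toward that of a single component.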

From an enterprise perspective, though, network managers’ defenses against cloud outages remain limited.

Enterprise network managers’ first step is to see what their cloud service providers are doing in these areas to make their services resilient. Other tactics to take include:

  • Using multiple providers for similar services

  • Paying for premium services that guarantee higher availability or that can automatically route workloads from one center to another in case their primary center has problems

  • Using monitoring and observability tools and services to better understand how an outage will impact them
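
The first and third tactics can be combined in a simple way: probe each provider's health endpoint and send work to the first one that responds. The following is a minimal sketch under stated assumptions, not a production failover design. The URLs are hypothetical placeholders, and `first_healthy` and `http_probe` are names we made up for this example; only Python's standard `urllib` module is used.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, Optional


def http_probe(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Covers DNS failures, timeouts, HTTP errors, and malformed URLs.
        return False


def first_healthy(
    endpoints: Iterable[str], probe: Callable[[str], bool] = http_probe
) -> Optional[str]:
    """Return the first endpoint, in priority order, whose probe succeeds."""
    for endpoint in endpoints:
        try:
            if probe(endpoint):
                return endpoint
        except Exception:
            # A probe that crashes is treated the same as an unhealthy endpoint.
            continue
    return None


if __name__ == "__main__":
    # Hypothetical health endpoints for two providers offering a similar service.
    providers = [
        "https://primary.example.com/health",
        "https://secondary.example.com/health",
    ]
    print(first_healthy(providers) or "no healthy provider found")
```

Passing the probe in as a parameter keeps the selection logic testable without network access. A real deployment would also need to consider where the probe itself runs, since a check that shares a network with the primary provider can fail for the same reason the provider does.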

Cloud outages key headlines

  • Tonga Volcano Highlights Global Undersea Cable Network Fragility – The Tonga communications disruption caused by a volcanic eruption got the world’s attention. It highlighted the fragility of the global undersea cable network, which carries 95 percent of intercontinental data traffic, and can easily go offline due to accidental cuts, malicious damage, and damage caused by natural disasters like hurricanes, tsunamis, and other incidents.

  • A Deep Dive into the Recent Microsoft Cloud Outage – Configuration changes and DNS issues have been the source of multiple major outages in recent years, including a major Microsoft Cloud outage. In fact, major failures from the Internet giants demonstrate that the question of the next outage is not if but when. And sadly, these outages have significant downstream effects on essential Internet infrastructure, such as cloud platforms, CDNs, or DNS providers.

  • Lessons Learned from Recent Major Outages – The nature of today’s more interconnected business world makes cloud infrastructure and service disruptions more damaging. The main thing enterprises can do to minimize the impact of outages is to better understand the work providers and organizations like ICANN are doing to reduce outages in the future.

  • How to Avoid Network Outages: Go Back to Basics – While there's a lot of hype about hacking and DDoS attacks, the reality is most network outages are caused by an organization’s own people. Following best practices can go a long way toward preventing unplanned downtime caused by internal errors as well as external attacks.

  • BGP Config Change, Not Cyber Attack, Brought Down Facebook – A six-hour outage of Facebook, Instagram, Messenger, WhatsApp, and OculusVR resulted from a routing protocol configuration issue, not from a cyberattack. Enterprise IT takeaways from the outage: Tread carefully when making BGP config changes and avoid putting everything (DNS, apps, and more) on one network.

  • 10 Reasons Data Centers Fail – Operators sometimes make common mistakes that can lead to data center outages. Whether the root cause is a hardware failure, software bug, or human error, most failures can be prevented. With the high level of redundancy built into today's data center architectures, prevention is very much possible.

  • Lessons Learned From the Top Cloud Outages of 2022 – The cloud has become an important component of almost every organization's business strategy. Yet, cloud outages happen all the time. To reduce the impact, IT pros need to choose their cloud providers carefully but also ensure network resiliency and visibility are in place to recover from the problem as quickly as possible.

  • Geopolitics and Climate Change Heighten Undersea Cable Concerns – “The cloud is not in the sky; it is under the sea.” That was a comment from an author of a government study to assess new potential disruptions of undersea communications cables. The report found that global political unrest and climate change are bringing new attention to the fragility of the undersea cable networks that carry about 95% of international digital traffic.

  • 2019 in Review: The Biggest Internet Outages of the Year – Enterprises are increasingly relying on Internet transport to connect their sites and reach business-critical applications and services. Over the last few years, several large-scale outages had ripple effects across the global Internet, impacting enterprises and consumers alike. Here are some of the most disruptive outages over the last few years and what can be learned from them.

  • Delta Outages Reveal Flawed Disaster Recovery Plans – Outages at Delta, United, and Southwest drew attention to the patchwork and often outdated nature of IT systems that power many airlines and businesses in other industries, which will no doubt contribute to future failures. While occasional mishaps are unavoidable, a little planning and investment in infrastructure can help companies sidestep or at least more quickly recover from similar IT challenges.

  • What Can Network Managers Do About Cloud Outages? (Not Much) – Over the last year or so, major outages at cloud, Internet, and content delivery network providers significantly disrupted operations at businesses of all sizes. Better observability tools can help net managers maintain some resilience to cloud service outages, but provider misconfigurations and DNS infrastructure issues are out of their control.

  • Ensuring Resilient Connectivity During the Holiday Rush – Black Friday, Cyber Monday, and the holiday season are always critical for retailers' bottom lines. Unfortunately, the holiday period poses a significant risk of outages that could result in lost revenue. As retailers prepare for the holiday rush, here are a few ways to improve resilience, mitigate the impact of potential outages, and ensure customers have optimal e-retail experiences.

  • The Scourge of Global Internet Outages Continues – Over the last few years, it seemed that nobody escaped the onslaught of outages. Making matters worse, many companies, as well as most of the top SaaS providers, don’t have a fallback DNS option. A single outage could completely take their businesses offline.

About the Author

Salvatore Salamone, Managing Editor, Network Computing

Salvatore Salamone is the managing editor of Network Computing. He has worked as a writer and editor covering business, technology, and science. He has written three business technology books and served as an editor at IT industry publications including Network World, Byte, Bio-IT World, Data Communications, LAN Times, and InternetWeek.
