An AWS outage tied to AWS Lambda function invocations impacted more than 100 services yesterday. The impact spanned administrative, management, and functional services, including Amazon Relational Database Service, AWS Single Sign-On, AWS Identity and Access Management, AWS Certificate Manager, and more.
When and where did the outage occur?
The incident was first noted around 3 p.m. ET and resolved by 6:30 p.m. ET. It centered on the Northern Virginia region and impacted numerous businesses on the East Coast served by that facility. According to AWS, "We experienced increased error rates and latencies for multiple AWS Services in the US-EAST-1 Region."
AWS narrowed down the root cause to an issue with a subsystem responsible for capacity management for AWS Lambda, which caused errors directly for customers (including those using API Gateway) and indirectly through other AWS services. Additionally, some users experienced authentication or sign-in errors when using the AWS Management Console or when trying to authenticate through Cognito or IAM STS. (Compounding matters, some customers also experienced issues when attempting to initiate a call or chat to AWS Support.)
By about 4:40 p.m. ET, the underlying issue with the subsystem responsible for AWS Lambda was resolved. It then took several hours to process the backlog of asynchronous Lambda invocations that accumulated during the event.
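The backlog AWS describes follows from how asynchronous invocation works: requests are accepted and queued immediately, then processed later, so an outage lets unprocessed events pile up. The toy model below sketches that queue-and-drain behavior (the class and handler here are illustrative, not AWS code; real asynchronous invokes go through Lambda's invoke API with the "Event" invocation type):

```python
from collections import deque

# Toy model of asynchronous invocation: events are queued on arrival
# and processed later, so a stalled subsystem accumulates a backlog.
class AsyncInvoker:
    def __init__(self, handler):
        self.handler = handler
        self.backlog = deque()

    def invoke_async(self, event):
        # Asynchronous invoke: accept and queue the event, return immediately.
        self.backlog.append(event)

    def drain(self):
        # After recovery, work through the accumulated backlog in order.
        results = []
        while self.backlog:
            results.append(self.handler(self.backlog.popleft()))
        return results

invoker = AsyncInvoker(lambda e: e["id"] * 2)
for i in range(3):
    invoker.invoke_async({"id": i})  # queued during the "outage"
processed = invoker.drain()          # processed once service recovers
```

This is why resolution of the underlying subsystem at 4:40 p.m. did not mean instant recovery: the queued events still had to be drained, which took several more hours.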
Pervasive AWS Lambda use made outage extensive
Multiple businesses and organizations, such as the Boston Globe and New York’s MTA, reported problems via Twitter.
Why the far-reaching impact across so many AWS services? Serverless computing, such as that offered by AWS Lambda, is on the rise as organizations move to the cloud or modernize their applications by adopting cloud-native architectures.
Specifically, AWS Lambda is a serverless, event-driven compute service that lets enterprises run code for virtually any type of application or backend service without provisioning or managing servers. Companies can trigger Lambda from over 200 AWS services and software-as-a-service (SaaS) applications and pay only for what they use.
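In practice, the code a company runs is just a handler function that Lambda invokes with an event payload from whatever service triggered it. A minimal Python handler looks like the sketch below (the event's "name" field is a made-up example, not part of any AWS event schema):

```python
import json

# A minimal AWS Lambda handler. Lambda calls this function with an
# event payload (e.g., from API Gateway) and a runtime context object;
# the company writes only this function, not the server around it.
def lambda_handler(event, context):
    # "name" is a hypothetical field on the incoming event payload.
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }
```

Because the surrounding compute is fully managed, any fault in Lambda's own capacity-management machinery ripples out to every service and application wired to handlers like this one.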
As such, it is widely used. In fact, two in three companies are adopting serverless Lambda functions, according to Steve Dietz, field CTO at Sumo Logic, in an online talk. So the outage and degraded performance delivered a double whammy: more companies are using serverless functions, and many of the cloud services they incorporate into their applications and infrastructure are themselves built on serverless capabilities.
A post-outage analysis
Unlike many of the cloud outages of the past year, this incident does not appear to have been caused by a configuration error. Past events have been traced to a faulty configuration change (related to Border Gateway Protocol) on backbone routers, a configuration change that impacted a provider's load-balancing systems, and, in some cases, power failures.
In this case, it may have been an issue of limited capacity or excessive usage. AWS reported that it was experiencing increased error rates and latencies for multiple AWS services, identifying the root cause as an issue with services invoking AWS Lambda.