Amazon S3 Crash Raises Doubts Among Cloud Customers

The company's Simple Storage Service suffered a similar outage lasting about two hours in February, an incident that at the time led many to question its dependability.

Thomas Claburn

July 21, 2008


Amazon Web Services' Simple Storage Service (S3) suffered a service failure lasting about eight hours on Sunday, causing outages at online companies that depend on S3 for file storage.

S3 suffered a similar outage lasting about two hours in February, an incident that led many to question the dependability of the increasingly fashionable cloud computing model.

In a statement, Amazon said it is proud of S3's operational performance over the past two-plus years and that customers generally have been pleased. "But any downtime is unacceptable and we won't be satisfied until it is perfect," the statement said.


"As a distributed system, the different components of S3 need to be aware of the state of each other," Amazon said in its statement. "For example, this awareness makes it possible for the system to decide which redundant physical storage server to route a request to. We experienced a problem with those internal system communications, leaving the components unable to interact properly, and customers unable to successfully process requests. After exploring several alternatives, the team determined it had to take the service offline to restore proper communication and then bring service online again. These are sophisticated systems and it generally takes a while to get to root cause in such a situation -- we will be providing our customers with more information when we've fully investigated the incident."

Since S3 was launched in March 2006, a variety of companies have outsourced at least some of their storage infrastructure to AWS, including 37signals, YouOS, SmugMug, ElephantDrive, and Jungle Disk.

Don MacAskill, CEO of SmugMug, which uses S3 to store its customers' photos, was quick to defend AWS. "Amazon's S3 service, SmugMug's primary storage provider, is currently experiencing problems," he wrote in a blog post on Sunday. "As a result, a large portion of the photos and videos stored on SmugMug are currently offline. Historically, Amazon has been very stable. We've seen three of these in our entire [two-plus year] history with Amazon, including this one. I expect, like the last two, that service will be restored shortly."

MacAskill stressed that his faith in AWS hasn't been shaken and that such outages are "few and far between, short, and handled properly."

MacAskill has been a strong supporter of AWS since its inception, which may explain why Amazon has pointed to SmugMug as a customer reference over the past two years and continues to feature the company on one of its Customer Case Studies pages.

According to an Amazon official, there's no marketing relationship or quid pro quo between AWS and SmugMug.

In the AWS S3 forum, other S3 users were more critical.

"I've been using S3 to serve up Flash files for about six months, and am now relying more and more on my own infrastructure because of the issues that I have had over the last two months," wrote someone posting under the name "David Campano."

"I was under the apparently false impression that S3 was a high-availability service," wrote someone posting under the name "iehiapk." "We may have to evaluate other services now. This makes us look like a bunch of amateurs."

"I am going to stick with them more than likely, but I will be implementing another service similar to theirs that we can fall back on in cases like this," said someone posting under the name "M. McQuade." "We already have an internal backup, but that will be harder to scale when our storage needs increase. We'll need to have two 'S3s.'"

Other S3 users complained about the S3 Service Level Agreement (SLA), which requires customers who want a service credit to request it by e-mail. "We should not have to e-mail Amazon to receive SLA credit," said someone posting under the name "Sam Beckett." "It should be automated for everyone."

The S3 SLA specifies that for monthly uptime of less than 99%, customers are entitled to a service credit of 25%. Amazon couldn't provide a single monthly uptime figure because the number varies for each customer. With 744 hours in July (31 days of 24 hours), any outage longer than 7.44 hours, or 1% of the month, means monthly uptime of less than 99%.
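The uptime arithmetic above can be checked with a short Python sketch. It follows the simplified model used in this article (uptime as the share of the month's hours without an outage), assumes a customer was affected for the entire outage and had no other downtime that month, and models only the single credit tier the article reports (25% below 99%). The function names are illustrative and not part of any Amazon tool.

```python
import calendar


def hours_in_month(year: int, month: int) -> int:
    """Return the number of hours in the given calendar month."""
    days = calendar.monthrange(year, month)[1]  # (first weekday, number of days)
    return days * 24


def monthly_uptime_percent(outage_hours: float, year: int, month: int) -> float:
    """Simplified uptime: percentage of the month's hours not lost to an outage.

    Assumes the customer saw the whole outage and no other downtime.
    """
    total = hours_in_month(year, month)
    return 100.0 * (total - outage_hours) / total


def credit_percent(uptime_percent: float) -> int:
    """Only the tier reported in the article: below 99% uptime earns 25%.

    Any other tiers in the actual SLA are not modeled here.
    """
    return 25 if uptime_percent < 99.0 else 0


if __name__ == "__main__":
    total = hours_in_month(2008, 7)
    print(f"July 2008: {total} hours, 1% = {total * 0.01:.2f} hours")
    # Hypothetical outage durations, both assumed to fall in July.
    for outage in (2.0, 8.0):
        uptime = monthly_uptime_percent(outage, 2008, 7)
        print(f"{outage:.1f} h outage -> {uptime:.2f}% uptime, "
              f"{credit_percent(uptime)}% credit")
```

Under these assumptions, an eight-hour outage in July leaves about 98.92% uptime and crosses the 99% threshold, while a two-hour outage in the same month (about 99.73% uptime) would not.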

According to the S3 status page, all is well at the moment.
