Top 10 Cloud Fiascos

(Source: Jo Naylor licensed under CC BY 2.0)
There aren't many cases where major cloud vendors have actually lost a significant amount of customer data (though sometimes they've needed to restore from tape), but in 2009, Microsoft's attempt to upgrade a SAN without backing up resulted in lost data for T-Mobile customers using the Danger Sidekick's cloud storage.

Honorable mention in the significant data loss category goes to the billing service provider Recurly, where a cascading hardware failure lost a large amount of billing information. This was especially painful because the company's core business proposition is storing that information so that customers don't have to do it themselves.
Perhaps the poster child of cloud outages is the Amazon Web Services failure on April 21, 2011, when much of its US-East region went down. Contrary to the company's vision and marketing about the lack of interdependence among datacenters, the failure was not restricted to a single availability zone within the region. It rendered servers across many different zones useless. Even companies that had based their architectures on AWS recommendations for high availability found themselves down. AWS has continued to improve its service to remove interdependence, but issues across availability zones have continued to surface, even recently.
In perhaps the largest cloud security issue ever, Dropbox password authentication was accidentally disabled for four hours, allowing anyone to log into any Dropbox account with any password. According to Dropbox, "less than 1 percent" of its 45 million users logged in during that time. But even without significant breaches, the fact that such a thing could happen strikes fear into the heart of even the most ardent cloud supporter.

(Source: Drugice licensed under CC BY 3.0)
Some people think platform-as-a-service (PaaS) is the future, because it lets developers deploy their code to the cloud without worrying about things like installing software and optimizing performance. They simply buy more compute resources, and applications scale beautifully. That's certainly what Rap Genius, a well-funded Silicon Valley startup, thought it was getting with Heroku, one of the first and largest PaaS providers. Unfortunately, in early 2013, Rap Genius discovered that Heroku wasn't handling an increased volume of requests properly, resulting in slow page response times and a significant overpayment for services. Rap Genius has since moved to AWS. Heroku has changed its documentation -- but not how it handles increasing volumes of requests.
Dropbox may be the best representative of what is both great and terrifying about the cloud. It's easy and powerful, and it's not surprising that adoption has been viral. However, the security practices of its users have often left much to be desired. IT managers know users are likely to reuse weak passwords across sites and not give much thought to whether sharing a particular file in a particular way exposes it to the wrong people. Still, it was surprising to learn that a Dropbox employee not only reused a password across sites, but also put a sensitive file with Dropbox customer email addresses into a Dropbox account, where spammers nabbed it. This has hopefully given Dropbox a better idea of why so many IT departments are skeptical of rolling it out to their users.

A note about Dropbox: In case you think I am singling it out, there is much more I could note about the company's outages, automatically reading customer files, or other security issues that scare the hell out of chief security officers. But I don't want to bore you, and there are plenty of other debacles to share.

(Source: FutUndBeidl licensed under CC BY 2.0)
Every major cloud provider seems to deliver an outage once it gets enough customers. It's happened to Google App Engine, Azure (actually, it's happened to Azure more than once), Rackspace, SoftLayer, GoGrid... you get the picture. Individually, these are not fiascoes, but collectively, the fact that we simply cannot get away from unexpected outages in the cloud certainly qualifies. These outages were wide-ranging enough to affect even some customers who had tried to create architectures that would function in the event of a failure.

(Source: Francesco Ugolini licensed under CC BY 2.0)
Christmas Eve 2012: Unfortunately, even the Chaos Monkeys at Netflix appear to have been asleep when it came to testing Amazon's Elastic Load Balancer. Many services running on AWS in the US-East region, including Netflix servers, were unavailable for more than 14 hours from Christmas Eve through Christmas morning. This was a fiasco for several reasons. It showed more cross-availability-zone interdependence than is supposed to exist. It was the result of a developer -- a developer -- deleting data from a production system. (Who's letting the developers near the production systems?) And Netflix has one of the most lauded and most admired cloud infrastructures (lot of good that did here). Fortunately, Amazon and Netflix have worked to make sure this particular outage doesn't happen again. But what other danger lurks inside the heart of US-East?

(Source: swanksalot licensed under CC BY 2.0)
Let's be clear: Gmail goes down more than users would like. And Office 365 -- a much newer offering -- is trying to catch up. Gmail promises 99% availability (i.e., it's OK to be down more than seven hours per month), whereas Office 365 promises 99.9% (i.e., it's OK to be down around 45 minutes per month). The fiascoes here are user-related. If you're depending on Gmail for mission-critical, time-is-of-the-essence business purposes, you're playing with fire, and you've probably already been burned.
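
For readers who want to check the arithmetic behind those availability promises, here is a minimal sketch. It assumes a 30-day month and treats the percentages quoted above as the only inputs; the function name `allowed_downtime_minutes` is purely illustrative and does not come from any vendor's SLA or tooling.

```python
def allowed_downtime_minutes(availability_percent, days=30):
    """Downtime, in minutes, that an availability target permits over `days` days.

    Assumption: the period is measured in whole days (30 by default),
    which is a simplification of how real SLAs define a billing month.
    """
    total_minutes = days * 24 * 60
    # The unavailable share is whatever the target leaves out of 100 percent.
    return total_minutes * (100 - availability_percent) / 100


# The two availability figures quoted in the text above.
for target in (99.0, 99.9):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability allows {minutes:.1f} minutes "
          f"({minutes / 60:.1f} hours) of downtime per 30-day month")
```

Under that 30-day assumption, a 99% target works out to 432 minutes (a little over seven hours) of permitted downtime, and a 99.9% target to roughly 43 minutes, which lines up with the figures in the text.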

(Source: Remko van Dokkum licensed under CC BY 2.0)
Many consumers love the cloud but don't like to be told they must connect to it every day, as if they were under house arrest with an ankle bracelet. Unfortunately, no one at Microsoft got this memo. The original version of the new Xbox One console included a phone-home (to Microsoft) requirement. The furious backlash caused a quick reversal by Microsoft, making users of the next-generation console able to unplug for days at a time without concern.

(Source: R. Pollard licensed under CC BY 2.0)
