Facebook Blames Outage On Database Failure

Programming chief says lengthy downtime was traced to an automated system designed to detect and fix error conditions.

Facebook officials said the lengthy outage that hit the site Thursday was the result of a glitch in the social media network's database software.

"The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition," said Robert Johnson, Facebook's director of software engineering, in a blog post.

Johnson said software that's designed to detect and fix such errors backfired, compounding the original problem. "An automated system for verifying configuration values ended up causing much more damage than it fixed," said Johnson.

"The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values form the persistent store," Johnson continued. "This works well for a transient problem with the cache, but it doesn’t work when the persistent store is invalid," he said.

"We made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second," said Johnson.

"As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn't allow the databases to recover," said Johnson.

The glitch brought the world's most popular social networking site to a standstill for two and a half hours Thursday. Johnson called it "the worst outage we've had in over four years."

Johnson did not state whether the company planned to offer credits or other compensation to advertisers whose campaigns were offline during the outage.

In just a few years, Facebook has grown from founder Mark Zuckerberg's college project into a multibillion-dollar global software empire that is beginning to rival Google and Microsoft on some fronts. As Facebook enters the tech industry's big league, customers and users have become less forgiving of outages and other glitches.

Johnson seemed to recognize that in his blog post. "We apologize again for the site outage, and we want you to know that we take the performance and reliability of Facebook very seriously," he said.
