"The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition," said Robert Johnson, Facebook's director of software engineering, in a blog post.
Johnson said software that's designed to detect and fix such errors backfired, compounding the original problem. "An automated system for verifying configuration values ended up causing much more damage than it fixed," said Johnson.
"The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store," Johnson continued. "This works well for a transient problem with the cache, but it doesn’t work when the persistent store is invalid," he said.
"We made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second," said Johnson.
"As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn't allow the databases to recover," said Johnson.
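The failure mode Johnson describes can be illustrated with a toy model. The sketch below is not Facebook's code: the validity rule, the `Store` class, the `limit` key, and the client counts are all hypothetical, chosen only to show why a check-and-repair routine is cheap when the cache is wrong but expensive when the persistent store is wrong.

```python
def is_valid(value):
    # Hypothetical validity rule: a config value must be a positive integer.
    return isinstance(value, int) and value > 0


class Store:
    """Stand-in for the persistent store; counts the queries it receives."""

    def __init__(self, value):
        self.value = value
        self.queries = 0

    def read(self):
        self.queries += 1
        return self.value


def read_config(cache, store, key):
    """Return the cached value, repairing it from the store if it looks invalid."""
    value = cache.get(key)
    if not is_valid(value):
        value = store.read()  # the "fix" costs one database query
        cache[key] = value    # may write the same invalid value straight back
    return value


def simulate(store_value, cached_value, clients):
    """Count store queries when `clients` readers hit the cache in turn."""
    store = Store(store_value)
    cache = {"limit": cached_value}
    for _ in range(clients):
        read_config(cache, store, "limit")
    return store.queries


# Transient cache problem: the first reader repairs it, the rest hit the cache.
print(simulate(store_value=100, cached_value=-1, clients=1000))
# Invalid persistent value: the repair never sticks, so every reader queries.
print(simulate(store_value=-1, cached_value=-1, clients=1000))
```

In this model the first case costs a single query, while the second costs one query per client. That is the pattern behind the overload Johnson describes, before any retries on failed requests are added on top.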
The glitch brought the world's most popular social networking site to a standstill for two and a half hours on Thursday. Johnson called it "the worst outage we've had in over four years."
Johnson did not say whether the company planned to offer credits or other compensation to advertisers whose campaigns were offline during the outage.
In just a few years, Facebook has grown from founder Mark Zuckerberg's college project into a multibillion-dollar global software empire that's beginning to rival Google and Microsoft on some fronts. As a result, customers and users have become less forgiving of outages and other glitches as Facebook enters the tech industry's big league.
Johnson seemed to recognize that in his blog post. "We apologize again for the site outage, and we want you to know that we take the performance and reliability of Facebook very seriously," he said.