Friday 24 September 2010

"Worst Outage in The History of Facebook" - Facebook Engineer Explains

Facebook Software Engineering Director Robert Johnson was kind enough to explain to a curious public exactly why Facebook went down earlier today, calling the mishap “the worst outage we’ve had in over four years.”
In a brief blog post, Johnson discussed today’s downtime, which began around 11:30 a.m. PST. The site wasn’t functioning again for most users until around 3 p.m. PST.
Today’s outage was unrelated to another period of downtime yesterday, when issues with a third-party networking provider caused problems for some users trying to connect to Facebook.
Johnson said the downtime today was caused by “an unfortunate handling of an error condition” involving an automated system designed to verify configuration values in the cache and replace invalid values with updated values from the persistent store.

“Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second.
“To make matters worse, every time a client got an error attempting to query one of the databases it interpreted it as an invalid value, and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued.”
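To make the feedback loop concrete, here is a toy simulation in Python. It is only a sketch under stated assumptions, not Facebook's code: the `Database` class, the `is_valid` rule, the client count and the capacity figure are all invented for illustration. The `delete_on_error` flag switches between the behavior Johnson describes (a database error is treated like an invalid value and the cache key is deleted) and a variant that leaves the cache alone when the database errors out.

```python
KEY = "config"


class OverloadedError(Exception):
    """Raised by the stand-in database once it is past capacity."""


class Database:
    """Toy persistent store that can answer only `capacity` queries per round."""

    def __init__(self, value, capacity):
        self.value = value
        self.capacity = capacity
        self.queries = 0

    def query(self):
        self.queries += 1
        if self.queries > self.capacity:
            raise OverloadedError("cluster overwhelmed")
        return self.value


def is_valid(value):
    # Hypothetical rule: a valid configuration value is a positive integer.
    return isinstance(value, int) and value > 0


def run_round(cache, db, clients, delete_on_error):
    """All clients see the same cache snapshot, then query at the same time."""
    db.queries = 0
    if is_valid(cache.get(KEY)):
        return 0  # everyone hits the cache; the database sees no traffic
    for _ in range(clients):
        try:
            value = db.query()
        except OverloadedError:
            if delete_on_error:
                cache.pop(KEY, None)  # error treated like an invalid value
            continue
        if is_valid(value):
            cache[KEY] = value  # a good value repairs the cache
        else:
            cache.pop(KEY, None)  # the store really does hold a bad value
    return db.queries


def simulate(delete_on_error, clients=1000, capacity=100, rounds=5, fix_at=2):
    """Return the number of database queries issued in each round."""
    cache = {KEY: -1}  # the bad value is already cached
    db = Database(value=-1, capacity=capacity)
    load = []
    for r in range(rounds):
        if r == fix_at:
            db.value = 42  # the original mistake is corrected
        load.append(run_round(cache, db, clients, delete_on_error))
    return load


print(simulate(delete_on_error=True))
print(simulate(delete_on_error=False))
```

With these made-up numbers, the first call prints `[1000, 1000, 1000, 1000, 1000]`: the bad value is corrected in the third round, but the clients that hit an overloaded database keep deleting the key, so the load never drops. The second call prints `[1000, 1000, 1000, 0, 0]`: once errors stop invalidating the cache, a single good read is enough to end the stampede.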
The automated system for correcting configuration values has been turned off for now, and Facebook is reportedly exploring more, ahem, “graceful” methods of handling this in the future.
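Johnson does not say what a more graceful approach would look like. One common pattern, offered here purely as an assumption on our part, is to leave the cache untouched when the persistent store fails and to wait a randomized, exponentially growing interval before trying again. In the Python sketch below, `StoreError`, `backoff_delay`, `refresh` and the `fetch` callback are hypothetical names chosen for the example.

```python
import random


class StoreError(Exception):
    """Hypothetical error raised when the persistent store cannot answer."""


def backoff_delay(attempt, base=0.1, cap=30.0):
    """Exponential backoff with jitter: a random wait in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def refresh(cache, fetch, is_valid, key, attempt):
    """Try to repair cache[key]; return the seconds to wait before the next attempt."""
    try:
        fresh = fetch(key)
    except StoreError:
        # An error says nothing about the cached value, so keep the cache
        # as it is and retry later rather than immediately.
        return backoff_delay(attempt)
    if is_valid(fresh):
        cache[key] = fresh
        return 0.0
    # The store itself holds a bad value; hammering it will not help.
    return backoff_delay(attempt)
```

A caller would increase `attempt` after each failed refresh and reset it to zero after a success. The random jitter spreads retries out so that clients do not all come back at the same instant.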
Johnson also notes that getting the feedback loop to stop was “quite painful,” saying that the entire site had to be turned off to stop traffic to a particular database cluster.
We don’t envy Facebook the at-scale disaster the site has just survived; 500 million users and a feedback loop add up to some nasty business however you slice it. And Facebook’s downtime problems aren’t nearly as persistent or severe as those of other social media staples out there.

