The six-hour outage of Facebook, Instagram, Messenger, WhatsApp, and Oculus VR resulted from a routing protocol configuration error, not a cyberattack. The outage was Facebook’s largest since 2019, when the site was down for more than 24 hours.
After finding the source of the problem and restoring services, the company discussed the root cause of the problem in a blog, noting:
“Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt.”
Some of the key lessons learned for enterprise users from the incident include:
- BGP (Border Gateway Protocol) configuration changes are prone to mistakes.
- Tread carefully when making configuration changes to core routers.
- If possible, do not run all services and apps on one network.
That last point proved to be quite important. “The underlying cause of this outage also impacted many of the internal tools and systems we use in our day-to-day operations, complicating our attempts to quickly diagnose and resolve the problem,” said the company in its blog.
For example, there were reports that technical staff could not enter buildings where fixes were needed because the physical access control system was inaccessible.
A deeper look at the problem
The outage was the result of a misconfiguration on Facebook’s backbone routers that prevented external computers and mobile devices from reaching the Domain Name System (DNS) servers for Facebook, Instagram, and WhatsApp.
While BGP exchanges routing information between networks on the internet, DNS plays the central role in directing all internet and application traffic. Both were at the heart of the Facebook outage.
Technically, Facebook’s BGP route announcements were withdrawn, preventing traffic destined for Facebook’s networks from being routed properly, including traffic to its DNS servers hosted on those networks. This kind of misconfiguration is not uncommon. As in many networking environments, one way to reduce manual errors here is to automate configuration management and changes.
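One form such automation can take is a pre-change guardrail that rejects obviously dangerous route withdrawals before they are pushed to routers. The sketch below is purely illustrative, not Facebook’s tooling; all prefixes and addresses are hypothetical documentation values.

```python
import ipaddress

def validate_withdrawals(advertised, withdrawals, dns_server_ips):
    """Sanity-check a proposed batch of BGP prefix withdrawals.

    Rejects the change if it references unknown prefixes, would withdraw
    every advertised prefix, or would leave a DNS server's address with
    no covering prefix still advertised.
    """
    advertised = {ipaddress.ip_network(p) for p in advertised}
    withdrawals = {ipaddress.ip_network(p) for p in withdrawals}

    unknown = withdrawals - advertised
    if unknown:
        return False, f"unknown prefixes: {sorted(map(str, unknown))}"

    if withdrawals == advertised:
        return False, "change would withdraw ALL advertised prefixes"

    remaining = advertised - withdrawals
    for ip in map(ipaddress.ip_address, dns_server_ips):
        if not any(ip in net for net in remaining):
            return False, f"DNS server {ip} would become unreachable"

    return True, "ok"
```

For example, withdrawing one of two advertised prefixes while a DNS server remains covered by the other passes the check, while withdrawing the prefix that covers the DNS server is rejected:

```python
ok, msg = validate_withdrawals(
    ["192.0.2.0/24", "198.51.100.0/24"],  # currently advertised
    ["198.51.100.0/24"],                  # proposed withdrawal
    ["192.0.2.53"],                       # authoritative DNS server
)
# ok is True

ok, msg = validate_withdrawals(
    ["192.0.2.0/24", "198.51.100.0/24"],
    ["192.0.2.0/24"],
    ["192.0.2.53"],
)
# ok is False: the DNS server would lose its covering prefix
```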
Perhaps the biggest lesson enterprise IT managers should take away from this outage is for companies to “avoid putting all of their eggs into one basket,” said Chris Buijs, EMEA Field CTO at NS1. “In other words, they should not place everything, from DNS to all of their apps, on a single network.”
Additionally, companies should use a DNS solution that is independent of their cloud or data center. If the provider goes down, a company will still have a functioning DNS to direct users to other facilities, which builds resiliency into the entire application delivery stack.
Read additional Informa coverage of the outage:
5 Lessons from Facebook, Instagram, WhatsApp Outage