What you don’t know about Facebook’s Outage
Date: 4 February 2022
Facebook had an outage in October 2021. And everyone heard about it. Even those who were living under a rock. Because, chances are, regardless of your location, you use Facebook Messenger or WhatsApp to communicate.
The impact? We all had a very productive six hours, because six hours is how long Facebook, Instagram and WhatsApp were down.
Facebook attributed the outage to faulty configuration changes to the company’s routers. This raises the question - if something like this could happen to Facebook, how safe is an average business that doesn’t have Facebook’s technological or financial resources?
Amar Singh, CEO and Co-Founder of Cyber Management Alliance, and Dawid Kowalski, Senior Technical Director, EMEA, FireMon, recently put their heads together to answer this question and others around the Facebook outage.
In a webinar entitled, “What You Don’t Know About Facebook’s Outage?,” the two cybersecurity experts unpacked some interesting aspects of the incident that we can all learn from.
Key topics covered in the webinar:
- What really happened in the outage?
- Was it a ‘good’ or a ‘bad’ outage?
- The real threat behind such outages.
- How can such outages be prevented?
The crux of the discussion on the webinar was simple: Don’t underestimate the ability of weak cybersecurity foundations to completely wreck your business. It’s not always the advanced nation-state actors that will bring ruin to your reputation or bottom line. Sometimes, a simple human error or a faulty process can be just as damaging.
And damaging it was for Facebook. For organisations the size of Facebook, every second of downtime means millions of dollars lost.
This particular outage was almost unbelievable for most users. Even the ‘Login with Facebook’ service, which millions depend on for using other apps, was down. Businesses were not able to advertise on Facebook and Instagram, which meant the company lost an estimated $100 million in ad revenue. Facebook shares fell by 5%, which meant that $40 billion got wiped out in a matter of a few hours!
So what exactly was behind the Facebook outage?
As mentioned above, it was a faulty change management process.
Dawid explained in the webinar that every single organisation does and should have a change management process. These change processes, however, are sometimes just a tick box exercise. Very often, these processes don’t include any simulation or analysis of what will happen if that change is made. That’s probably where Facebook failed too.
The company said in its announcement that the outage “impacted many of the internal tools and systems”.
Several organisations today, like Facebook, try to build their own internal tools. Since they don’t specialise in building such tools, there are invariably some flaws and some areas that remain uncovered. When something breaks, the IT staff tries to move fast to bring it back up, and that’s when the problem starts - when processes aren’t followed properly.
This is exactly what happened in the case of Facebook - a simple maintenance/configuration change sent ripples across the globe and made headlines everywhere.
During the webinar, Dawid gave a brief overview of how he believes this happened. Facebook did plan for the change - they had scripts for assessing the risk of the change, but the scripts didn’t spot the problem. Many people speculate that they didn’t actually have scripts for that specific type of change because it was expected to have minimal impact - it only removed some of the network connectivity within the environment.
For the detailed explanation, tune into the webinar at the 23-minute mark.
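To make the idea of simulating a change before applying it more concrete, here is a minimal sketch of a pre-change connectivity check. It is not Facebook’s tooling or any vendor’s product: the topology, the node names and the functions `reachable` and `simulate_link_removal` are all hypothetical, and a real network would need a far richer model (routing protocols, DNS, capacity). The sketch only asks one question: if these links are removed, can the critical endpoints still reach each other?

```python
from collections import deque


def reachable(links, start):
    """Return the set of nodes reachable from `start` over undirected links."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def simulate_link_removal(links, to_remove, critical_pairs):
    """Return the critical (src, dst) pairs that would lose connectivity."""
    removed = {frozenset(link) for link in to_remove}
    remaining = [link for link in links if frozenset(link) not in removed]
    broken = []
    for src, dst in critical_pairs:
        if dst not in reachable(remaining, src):
            broken.append((src, dst))
    return broken


# Hypothetical topology: two data centres, a backbone and an edge site.
LINKS = [
    ("dc1", "backbone"),
    ("dc2", "backbone"),
    ("backbone", "edge"),
    ("dc1", "dc2"),
]
# Hypothetical pairs that must always be able to talk to each other.
CRITICAL = [("dc1", "edge"), ("dc2", "edge"), ("dc1", "dc2")]

# Removing one redundant link leaves every critical pair connected.
safe = simulate_link_removal(LINKS, [("dc1", "backbone")], CRITICAL)
# Removing the only link to the edge cuts both data centres off from it.
risky = simulate_link_removal(LINKS, [("backbone", "edge")], CRITICAL)

print("safe change breaks:", safe)
print("risky change breaks:", risky)
```

A check like this could run as a gate in the change process, so that a change which isolates a critical pair is rejected before anyone touches a router.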
What do we need to know about effective Change Management?
Configuration management/change management can appear to be a boring topic but it’s actually very important. If not managed and tested properly, it can be a big problem as was seen in the outage in question.
Almost all of today’s networks and environments started out much smaller. One switch later became multiple switches. One firewall evolved into multiple firewalls.
Any planned change in such an environment is, therefore, impossible for a human being to analyse. The environment is simply too large for a person to work through and correlate with the security requirements.
As the two experts highlighted in the webinar, the human element needs to be eliminated when trying to analyse a change. The human knows that a change needs to be made and that’s where their role should end. You then need automation that can handle the complexity of IT and cyber. If you don’t have automation, humans are going to fail.
The 4 key terms for effective change management then are complexity, visibility, the human and automation. These 4 terms clearly explain the problem and the solutions to the problem in a Facebook-like outage situation.
How to Protect your Organisation from Network Outages?
Misconfiguration is usually what causes network outages. Basically, traditional approaches to managing network security policy inhibit your company’s ability to innovate and adapt to change.
Here are the five steps the experts offered on the webinar for protecting your organisation from such a network outage:
- Conduct an assessment of your security policies & clean up
- Streamline and accelerate security with automation
- Gain visibility of your network across cloud and on-prem environments
- Visualise the impact of security policy changes before you apply them
- Integrate your security tools to maximise their performance
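As an illustration of the first step, the sketch below flags firewall rules that can never match because an earlier rule already covers the same traffic, which is one common target of a policy clean-up. It is a simplified assumption of how such an assessment might look, not the method of any particular product: the `Rule` fields, the rule names and the first-match evaluation order are all hypothetical.

```python
from dataclasses import dataclass
from ipaddress import ip_network
from typing import Optional


@dataclass(frozen=True)
class Rule:
    name: str
    source: str           # source network in CIDR notation
    dest: str             # destination network in CIDR notation
    port: Optional[int]   # None means "any port"
    action: str           # "allow" or "deny"


def covers(earlier, later):
    """True if every packet matching `later` also matches `earlier`."""
    src_ok = ip_network(later.source).subnet_of(ip_network(earlier.source))
    dst_ok = ip_network(later.dest).subnet_of(ip_network(earlier.dest))
    port_ok = earlier.port is None or earlier.port == later.port
    return src_ok and dst_ok and port_ok


def find_shadowed(rules):
    """Return (shadowed_rule, covering_rule) name pairs, assuming first match wins."""
    shadowed = []
    for index, later in enumerate(rules):
        for earlier in rules[:index]:
            if covers(earlier, later):
                shadowed.append((later.name, earlier.name))
                break
    return shadowed


# Hypothetical policy, evaluated top to bottom.
POLICY = [
    Rule("allow-web", "0.0.0.0/0", "10.0.1.0/24", 443, "allow"),
    Rule("deny-db", "0.0.0.0/0", "10.0.2.0/24", None, "deny"),
    Rule("allow-db-admin", "192.168.5.0/24", "10.0.2.10/32", 5432, "allow"),
    Rule("allow-web-dup", "10.9.0.0/16", "10.0.1.0/24", 443, "allow"),
]

findings = find_shadowed(POLICY)
for shadowed_name, covering_name in findings:
    print(f"{shadowed_name} is never reached because of {covering_name}")
```

In this example, `allow-db-admin` looks like it grants access but is silently overridden by `deny-db`, and `allow-web-dup` is simply redundant. Findings like these are hard to spot by eye once a policy has grown to thousands of rules.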
To conclude, Amar reiterated that integration of the technology stack is really critical. If you’re not able to automate, integrate and visualise, you are at the mercy of luck when it comes to your network security.
Visibility and scalability are key - the network security policy platform that you choose should be flexible and capable of securing your networks as they get larger and more complex, while maintaining desired workflows.
Is this type of outage perceived as a ‘good one’?
While no outage can be labelled as ‘good’, this one is not considered bad in technical terms because it wasn’t caused by malicious outsiders.
A bad outage would typically be a ransomware attack or any attack that leads to a data breach or loss of customer data.
The Facebook outage didn’t lead to data compromise and it was a good lesson for everyone even remotely invested in cybersecurity, making experts label it as a ‘good’ outage.
Watch the Webinar here.
Read more about this on the FireMon blog on the Facebook Outage.
To access similar high-quality and educational content, subscribe to the Cyber Management Alliance BrightTALK channel.