
Facebook Outage 4th October 2021.


wolff.mateus

As everybody knows, yesterday Facebook had the biggest outage ever registered in the company's history.

 

Just to be clear, it has been noted that the problem was related to DNS and the BGP protocol across the network. So it wasn't an attack. It wasn't ransomware. It wasn't a hardware problem.

 

I think it is important to talk about these big problems at IT companies, because it is very common to hear and see so much fake news about this kind of outage.

 

Here is a good post from Cloudflare about the problem:

 

https://blog.cloudflare.com/october-2021-facebook-outage/ 

 

 

5 comments

Mildur
  • Influencer
  • 1035 comments
  • October 5, 2021

Thanks for sharing the link. It gives good insight into what happened yesterday evening.


MicoolPaul
  • 2358 comments
  • October 5, 2021

There will inevitably need to be some serious questions asked about how a core config change can impact the whole organisation in this way! They certainly found a single point of failure within their platform!

 

Interestingly, all of Facebook’s LAN traffic was impacted by this as well, hence staff couldn’t gain access to the required locations with their badges etc. It seems their external and internal routing is all driven by the same BGP configuration, which is a painful fault domain for them to have.

 

I know it’s a lot easier to sit this side of the disaster and make comments, which isn’t my intent, but I hope they don’t blame the person that managed to make the change and instead collectively prevent that scenario from being possible in the future!


BertrandFR
  • Influencer
  • 527 comments
  • October 6, 2021

https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/

I was waiting for the outage post-mortem from Facebook, for science purposes :grin:


wolff.mateus
  • Author
  • Veeam Vanguard
  • 534 comments
  • October 6, 2021

I just really hope that the person who caused the problem doesn't get fired. As we can see in the link that @BertrandFR shared with us, the problem was apparently caused by a single person. But I consider this trouble to be the fault of many people or teams.
 


vAdmin
  • Influencer
  • 166 comments
  • October 7, 2021

Don’t they have change management procedures, or at least redundancy in the network links?

