Facebook Outage 4th October 2021.


Userlevel 7
Badge +11

As everybody knows, yesterday Facebook had the biggest outage ever registered in the company's history.

 

Just to be clear, it has been reported that the problem was related to DNS and the BGP protocol across the network. So it wasn't an attack, it wasn't ransomware, and it wasn't a hardware problem.

 

I think it is important to talk about these big problems at IT companies, because it is very common to hear and see a lot of fake news about this kind of outage.

 

Here is a good post from Cloudflare about the problem:

 

https://blog.cloudflare.com/october-2021-facebook-outage/ 

 

 


5 comments

Userlevel 7
Badge +12

Thanks for sharing the link. It gives good insight into what happened yesterday evening.

Userlevel 7
Badge +22

There will inevitably need to be some serious questions asked that a core config change can impact the whole organisation in this way! They certainly found a single point of failure within their platform!

 

Interestingly, all of Facebook’s LAN traffic was impacted by this as well, hence staff couldn’t gain access to the required locations with their badges etc. It seems their external and internal routing is all driven by the same BGP configuration, which is a painful fault domain for them to have.

 

I know it’s a lot easier to sit this side of the disaster and make comments, which isn’t my intent, but I hope they don’t blame the person that managed to make the change and instead collectively prevent that scenario from being possible in the future!

Userlevel 7
Badge +8

https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/

I was waiting for Facebook's post-mortem of the outage, for science purposes :grin:

Userlevel 7
Badge +11

I just really hope that the person who made the change doesn't get fired. As we can see in the link that @BertrandFR shared with us, the problem was apparently caused by a single person. But I consider that this trouble was the fault of many people or teams.
 

Userlevel 7
Badge +2

Don’t they have change management procedures, or at least redundancy in the network links?
