Facebook Outage 4th October 2021 | Veeam Community Resource Hub

wolff.mateus · 2021-10-05T21:21:29+00:00

As everybody knows, yesterday Facebook had the biggest outage ever recorded at the company. Just to be clear, it has been reported that the problem was related to DNS and the BGP protocol across their network. So, it wasn't an attack. It wasn't ransomware. It wasn't a hardware problem. I think it is important to talk about these big problems at IT companies, because it is very common to hear and see so much fake news about this kind of outage. Here is a good post from Cloudflare about the problem: https://blog.cloudflare.com/october-2021-facebook-outage/

Mildur
Veeam Product Management
October 5, 2021

Thanks for sharing the link. It gives good insight into what happened yesterday evening.

Senior Analyst, Product Management @ Veeam Software

MicoolPaul
October 5, 2021

Some serious questions will inevitably need to be asked about how a core config change can impact the whole organisation in this way! They certainly found a single point of failure within their platform!

Interestingly, all of Facebook’s LAN traffic was impacted by this as well, hence staff couldn’t gain access to the required locations with their badges etc. It seems their external and internal routing is all driven by the same BGP configuration, which is a painful fault domain for them to have.
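
To make the shared fault domain point concrete, here is a minimal sketch using a toy dependency map. The service and component names are hypothetical illustrations and are not taken from Facebook's post-mortem. It lists the components that every service depends on (candidate single points of failure) and the services affected when one component fails.

```python
# Toy model: each service maps to the set of components it depends on.
# All names are hypothetical illustrations, not Facebook's real topology.
DEPENDENCIES = {
    "public_website": {"dns", "shared_bgp_config"},
    "internal_tools": {"dns", "shared_bgp_config"},
    "badge_access": {"shared_bgp_config"},
}


def single_points_of_failure(dependencies):
    """Return the components that every service depends on."""
    component_sets = list(dependencies.values())
    if not component_sets:
        return set()
    return set.intersection(*component_sets)


def impacted_services(dependencies, failed_component):
    """Return the sorted names of services that depend on failed_component."""
    return sorted(
        service
        for service, components in dependencies.items()
        if failed_component in components
    )


if __name__ == "__main__":
    print(single_points_of_failure(DEPENDENCIES))
    print(impacted_services(DEPENDENCIES, "shared_bgp_config"))
```

In this toy map, the one component shared by external, internal, and physical-access services is what turns a single bad change into an organisation-wide outage. Splitting it into separate fault domains would shrink the intersection.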

I know it’s a lot easier to sit on this side of the disaster and make comments, which isn’t my intent, but I hope they don’t blame the person who made the change and instead collectively prevent that scenario from being possible in the future!

Michael Paul - Opinions are my own and do not necessarily reflect the opinion of Veeam | https://micoolpaul.com | Mastodon: @micoolpaul@masto.nu | Bluesky: @micoolpaul.com

BertrandFR
Influencer
October 6, 2021

https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/

I was waiting for Facebook's post-mortem of the outage, for science purposes :grin:

wolff.mateus
Author
Veeam Vanguard
October 6, 2021

I just really hope that the person who triggered the problem doesn't get fired. As we can see in the link that @BertrandFR shared with us, the problem was apparently caused by a single person. But I consider this failure to be the fault of many people or teams.

IT Lover | Veeam Vanguard | VUG Leader | Blogger

vAdmin
Influencer
October 7, 2021

Don’t they have change management procedures, or at least redundancy in the network links?

Microsoft Azure Solutions Architect Expert, VMware Certified Professional - DCV 7, Citrix Certified Professional Virtualization (CCP-V), ITIL Practitioner, Certified Cybersecurity
