Thanks for sharing the link. It gives good insight into what happened yesterday evening.
There will inevitably need to be some serious questions asked about how a core config change can impact the whole organisation in this way! They certainly found a single point of failure within their platform!
Interestingly, all of Facebook’s LAN traffic was impacted by this as well, hence staff couldn’t gain access to the required locations with their badges etc. It seems their external and internal routing is all driven by the same BGP configuration, which is a painful fault domain for them to have.
I know it’s a lot easier to sit on this side of the disaster and make comments, which isn’t my intent, but I hope they don’t blame the person that managed to make the change and instead work collectively to prevent that scenario from being possible in the future!
I just really hope that the person who caused the problem doesn't get fired. As we can see in the link that @BertrandFR shared with us, the problem was apparently caused by a single person. But I consider that this was the fault of many people or teams.
Don’t they have change management procedures, or at least redundancy in the network links?