This was messaged to me by a colleague of mine in the UK on Monday 4th October 2021. It only occurred to me later that he was referring to the Facebook outage that same day.
For me it was WhatsApp, and Downdetector was reporting tens of thousands of connectivity issues. It wasn't just WhatsApp; the entire Facebook network and its services were down.
How did this happen?
So after hours of lost business and drops in stock price and net worth, the Facebook network was back on track.
So the billion-dollar question is how did this happen?
There was initially some speculation of a cyberattack, but a tweet from Cloudflare's John Graham-Cumming said that it was most likely a BGP update issue, and even Facebook sent a message about it.
Other articles also said it was a BGP misconfiguration that caused the outage. So how can such a simple protocol do so much damage?
The Border Gateway Protocol (BGP)
I am not going into too many details of BGP here, as this article explains it nicely, as does this YouTube video (please read up on your subnetting to get a better idea). In short, BGP is how networks tell each other which IP address ranges they can reach. Large corporations and universities do run BGP routers on their premises and control the routes they announce. Of course, possession and control of such devices and their link to the ISP/Internet are to be managed very carefully. There are normally numerous processes and checks before a BGP route is changed, and these go through what you call change management procedures.
Sometimes mistakes do get past the checks and can cause disastrous results as they get replicated across the globe (not just within one country, but the entire world). And it seems that such mistakes happen more often than we would like.
As part of the lessons learnt, these mistakes are documented and the change management system is updated with these new checks.
Could They Have Solved It Faster?
Most of you would be asking: if it was a simple fix, why couldn't they have solved it sooner? The main problem is that the Facebook network, including its Software Defined Networks (SDN), relies on BGP routing. Everything is integrated, including their access control. This means their network engineers could not remote into the facilities, and the people working at the facilities did not have the clearance to override the access controls (think of smart card/biometric systems that cannot connect to another network to identify the person).
So in most cases, they would have to send their network engineering teams to the data centre sites with the manual override keys and devices to get access to the systems. They may literally have to plug into the equipment using their laptops and console cables to implement the fix.
All of this takes time. That is why you would want to plan all your configurations first before making any changes. And then you implement your changes on one device at a time, then test these changes using simple tools.
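The "one device at a time, then test" approach can be sketched roughly as below. This is only a minimal illustration, not how Facebook or any vendor tool does it: the device names, the `apply_change`, `health_check` and `rollback` functions, and the simulated `config` dictionary are all hypothetical stand-ins for whatever your own change process uses.

```python
from typing import Callable, List, Optional, Tuple


def staged_rollout(
    devices: List[str],
    apply_change: Callable[[str], None],
    health_check: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> Tuple[List[str], Optional[str]]:
    """Change one device at a time and stop at the first failed check.

    Returns the devices that were updated successfully and the device
    that failed (or None if every device passed its check).
    """
    updated: List[str] = []
    for device in devices:
        apply_change(device)
        if not health_check(device):
            # Undo the change on this device and leave the rest untouched.
            rollback(device)
            return updated, device
        updated.append(device)
    return updated, None


# Simulated network: every device starts on the "old" configuration.
# (Hypothetical names, for illustration only.)
config = {"edge-1": "old", "edge-2": "old", "edge-3": "old"}


def apply_change(device: str) -> None:
    config[device] = "new"


def rollback(device: str) -> None:
    config[device] = "old"


def health_check(device: str) -> bool:
    # Pretend the new configuration breaks connectivity on edge-2.
    return device != "edge-2"


updated, failed = staged_rollout(list(config), apply_change, health_check, rollback)
print("updated:", updated)
print("failed:", failed)
print("final state:", config)
```

In this simulated run the rollout stops at the second device, so the third device never receives the faulty change. The point is that the damage is limited to one device instead of being replicated everywhere at once.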
Efficiency is Brittle!
Back to the initial statement. Looking at the wider perspective, in the quest to make systems more efficient, we sacrifice redundancies thus creating what can be called a "critical path" in your business process. Should any component in that critical path fail (the weak link), your entire process fails, resulting in lost business value.
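A bit of arithmetic shows why a critical path is brittle. The sketch below assumes each component fails independently and uses a made-up availability of 99% per component; both are assumptions for illustration only, not measurements from any real system.

```python
from math import prod
from typing import Sequence


def series_availability(parts: Sequence[float]) -> float:
    """All parts must work: availabilities multiply along the critical path."""
    return prod(parts)


def parallel_availability(parts: Sequence[float]) -> float:
    """Only one part must work: the group fails only if every part fails."""
    return 1 - prod(1 - a for a in parts)


# A lean process with three components in a row, each 99% available
# (illustrative figure).
lean = series_availability([0.99, 0.99, 0.99])

# The same process with a backup added for the first component only.
first_with_backup = parallel_availability([0.99, 0.99])
hardened = series_availability([first_with_backup, 0.99, 0.99])

print(f"lean chain:     {lean:.4f}")
print(f"with one spare: {hardened:.4f}")
```

Three 99% components in a row give about 97% for the whole chain, and adding a spare for just one of them brings it back to about 98%. Note that the independence assumption is exactly what breaks down when everything depends on the same thing, as with the access control that relied on the same network it was meant to protect.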
So how do we harden our "critical path" business process from such failure?
There are multiple ways that can be implemented, but I am looking at the top three in ascending order of cost (time, money, resources):
1) Look at the likelihood of each part of your process failing, its impact, and how quickly it can be resolved. Develop independent processes that resolve this AND update these processes periodically.
This becomes part of your operational monitoring and control of the system.
2) Develop redundancies for each process, which involves a backup system or configuration protocol.
3) Develop another redundant path in the process. It's the most costly but gives the safest form of redundancy and resilience.
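The first option above (likelihood, impact, and time to resolve) can be turned into a simple ranking to decide where to spend effort first. This is a rough sketch under my own assumptions: the step names and all the numbers are invented, and multiplying the three factors together is just one simple scoring choice among many.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    likelihood: float        # chance of failure per year (illustrative)
    impact_per_hour: float   # cost of each hour of downtime (illustrative)
    hours_to_resolve: float  # typical time to fix (illustrative)

    @property
    def exposure(self) -> float:
        """Expected yearly loss under this simple scoring model."""
        return self.likelihood * self.impact_per_hour * self.hours_to_resolve


# Hypothetical steps in a business process.
steps = [
    Step("door access system", 0.02, 5_000, 4),
    Step("routing change", 0.05, 100_000, 6),
    Step("power supply", 0.10, 20_000, 2),
]

# Highest exposure first: these are the weak links to harden first.
ranked = sorted(steps, key=lambda s: s.exposure, reverse=True)

for step in ranked:
    print(f"{step.name:<20} {step.exposure:>10.0f}")
```

With these made-up figures the routing change comes out on top even though it is not the most likely failure, because it is expensive and slow to fix. Re-running the ranking periodically with updated numbers is one way to keep the process current.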
Takeaway
So what's the takeaway of all this? Always remember that everything is a process. There will always be a weak link that will fail first. You can harden this process or add redundancy/contingency to cater for failure.
But after all is said and done, everyone makes mistakes; it is your process of fixing them and learning from them that is important.
"You may not live long enough to learn from everyone's mistakes, but at least you can learn from yours..."
Naresh
2021/10/10