Facebook’s outage cost Mark Zuckerberg billions of dollars
Some data centers are big buildings that house huge numbers of computers that store data and do the heavy lifting to keep the network running. Others are smaller facilities that receive device owners’ requests for data and pass them along Facebook’s backbone network to the larger data centers. Those larger data centers are where the data that your app needs is found and sent back to your phone.
Routers determine where all of the incoming and outgoing data should be sent. Occasionally, Facebook engineers need to take part of the backbone offline for maintenance. Yesterday, a command was issued that was supposed to check the available capacity of Facebook’s backbone. Instead, it accidentally took down all of the connections in the backbone network, which disconnected Facebook’s data centers around the world.
Facebook has a system in place that is designed to audit commands so that an accidental outage like yesterday’s doesn’t take place. But the audit tool had a bug of its own that prevented it from stopping the command before it shut down the system.
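To make the idea of a command audit concrete, here is a minimal sketch in Python. It is not Facebook’s tool, whose internals have not been published; the names `Command`, `audit`, and `run_if_safe`, and the 10% threshold, are all hypothetical and chosen only for illustration. The sketch assumes each command can declare how many backbone links it would take offline, and it blocks any command that would affect too large a share of them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    """A hypothetical maintenance command (not a real Facebook structure)."""
    name: str
    links_affected: int  # assumed: how many backbone links it would take offline


def audit(command: Command, total_links: int, max_fraction: float = 0.1) -> bool:
    """Return True if the command touches no more than max_fraction of links."""
    if total_links <= 0:
        raise ValueError("total_links must be positive")
    return command.links_affected / total_links <= max_fraction


def run_if_safe(command: Command, total_links: int) -> str:
    """Gate execution on the audit; the command itself is only simulated."""
    if not audit(command, total_links):
        return "blocked"
    return "executed"
```

In this sketch, a capacity check that takes no links offline would pass, while a command that drops every link would be blocked. A bug in a function like `audit` is the kind of failure that would let a dangerous command through.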
Facebook says that it will learn from the outage so that it never happens again
Once Facebook restored its backbone network connectivity, everything came back up. But Facebook had another problem to consider: if it turned all of its services back on at once, the surge in traffic could cause the system to crash again. Thanks to the “storm drills” that Facebook has been practicing, it was well prepared to handle the incident.
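The reasoning behind bringing services back gradually can be illustrated with a short sketch. Facebook has not described its actual procedure in this detail, so the function name `ramp_schedule`, the linear ramp, and the numbers are assumptions for illustration only. The idea is to admit traffic in equal steps so that load rises gradually instead of arriving all at once.

```python
from typing import List


def ramp_schedule(full_load: float, steps: int) -> List[float]:
    """Return the load admitted at each step, rising linearly to full_load.

    Hypothetical helper: a linear ramp is one simple choice, not a
    description of how Facebook restored its services.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return [full_load * i / steps for i in range(1, steps + 1)]


# Example: restore traffic in four equal increments.
schedule = ramp_schedule(100.0, 4)
print(schedule)  # [25.0, 50.0, 75.0, 100.0]
```

With a single step the schedule is equivalent to turning everything on at once, which is the scenario the gradual approach is meant to avoid.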
The social media company says that it will learn from the outage so that it never happens again. “Every failure like this is an opportunity to learn and get better, and there’s plenty for us to learn from this one. After every issue, small and large, we do an extensive review process to understand how we can make our systems more resilient. That process is already underway.”