This is why it was such a big deal when AWS went down and took many websites down with it. The good news is that the outage is over and services appear to be back up and running. For those curious about what went wrong, Amazon has posted an explanation on its website. It is a lengthy and rather technical read, but the gist is that the outage was caused by human error, or more specifically, a typo.
According to Amazon, “Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems.” As for why it took Amazon so long to restore service, the company says the affected subsystems had to be fully restarted and run through safety checks, and that the index and placement subsystems in its larger regions had not been completely restarted in many years.
“While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years. S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected.”