Last week’s Amazon AWS outage impacted many large customers, including Apple. It appears someone fat-fingered a command meant to take a small set of servers offline while debugging a problem, and it brought down a much larger set of servers than intended. Of course, when something like this happens there tends to be a cascade of unintended consequences. In this case, there was widespread disruption to internet traffic across the U.S.
Amazon is known for designing its systems for resilience. People make mistakes, wrong commands get entered, bugs get introduced, and systems fail. The problem with this outage was that the subsystems that were taken down had not been restarted in years, and the system had grown massively in that time. Restarting those servers and running the required validation checks took hours.
Fail Fast
The outage illustrates that a long MTBF (Mean Time Between Failures) can actually be a warning sign and runs against the core DevOps principle of failing fast: it can signal hidden fragility in the system. The best defense against major unexpected failures is to fail often. While I’m sure Amazon designed for resiliency, proactively injecting failure and chaos into test environments can improve it further. The incident also shows that your test data, as well as your test environments, needs to mirror production as closely as possible. The constraint here was that the safety checks validating the integrity of the metadata took far longer than expected, and that problem only surfaced when a full system restart was actually required.
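To make the “fail often” idea concrete, here is a minimal sketch of the kind of failure injection a Chaos Monkey-style tool performs, written with boto3 against a test environment. The ChaosCandidate tag, the region, and the opt-in model are illustrative assumptions on my part, not anything Amazon has described.

```python
"""Minimal chaos-injection sketch (Chaos Monkey style), for a test environment only.

Assumption: instances eligible for termination are explicitly opted in via a
hypothetical tag ChaosCandidate=true.
"""
import random
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region for the sketch


def find_candidates():
    """Return IDs of running instances that opted in to chaos testing."""
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "tag:ChaosCandidate", "Values": ["true"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    return [
        inst["InstanceId"]
        for reservation in resp["Reservations"]
        for inst in reservation["Instances"]
    ]


def inject_failure():
    """Terminate one randomly chosen candidate to exercise recovery paths."""
    candidates = find_candidates()
    if not candidates:
        print("No chaos candidates found; nothing to do.")
        return
    victim = random.choice(candidates)
    print(f"Terminating {victim}; the system should recover without intervention.")
    ec2.terminate_instances(InstanceIds=[victim])


if __name__ == "__main__":
    inject_failure()
```

Run on a schedule against a production-like test environment, something this simple forces restart and recovery paths to be exercised regularly, so surprises like multi-hour validation checks show up long before a real incident.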
Find me on Twitter if you want to discuss further…