Latest Amazon AWS Outage: Where is Chaos Monkey When You Need Him?

Last week’s Amazon AWS outage impacted many large customers, including Apple. It looks like someone fat-fingered a command while trying to debug a problem and took down a larger set of servers than intended. Of course, when something like this happens there tends to be a cascade of unintended consequences; in this case, there was widespread disruption to internet traffic across the U.S.

Amazon is known for designing its systems for resilience. People make mistakes, wrong commands get entered, bugs get introduced, and systems fail. The problem with this outage was that the subsystems that were taken down had not been restarted in years, while the overall system had grown massively over that same period. Restarting those servers and running the required system validation checks took hours.

Fail Fast

The outage illustrates that a long MTBF (Mean Time Between Failures) can actually be a warning sign, and it cuts against a core DevOps principle: fail fast. A long stretch without failure can mask fragility in the system. The best defense against major unexpected failures is to fail often. While I’m sure Amazon designed for resiliency, proactively injecting failure and chaos into test environments improves resiliency further. The outage also shows that your test data, as well as your test environments, needs to mirror production as closely as possible. The constraint here was that the safety checks validating the integrity of the metadata took longer than expected, and that problem only surfaced when a full restart was required.
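To make the “fail often” idea concrete, here is a minimal, hypothetical Chaos-Monkey-style sketch in Python using boto3. The Environment=test tag, the 5% sample rate, and the dry-run default are my own assumptions for illustration, not anything Amazon or Netflix publish; the point is simply that regularly killing a few test instances forces recovery paths to be exercised long before a real outage does it for you.

```python
"""Minimal chaos-injection sketch: randomly terminate a small fraction of
instances tagged as part of a *test* environment, so failure and recovery
paths get exercised routinely instead of only during a real outage.

Assumptions: instances carry an "Environment=test" tag, AWS credentials are
configured for boto3, and dry_run defaults to True so nothing is terminated
by accident.
"""

import random

import boto3


def pick_victims(ec2, fraction=0.05, env_tag="test"):
    """Return a small random sample of running instance IDs from the test environment."""
    response = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Environment", "Values": [env_tag]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    instance_ids = [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]
    if not instance_ids:
        return []
    count = max(1, int(len(instance_ids) * fraction))
    return random.sample(instance_ids, count)


def inject_failure(dry_run=True):
    """Terminate the selected instances, or just report them when dry_run is True."""
    ec2 = boto3.client("ec2")
    victims = pick_victims(ec2)
    if not victims:
        print("No test instances found; nothing to do.")
        return
    if dry_run:
        print(f"[dry run] would terminate: {victims}")
    else:
        ec2.terminate_instances(InstanceIds=victims)
        print(f"terminated: {victims}")


if __name__ == "__main__":
    inject_failure(dry_run=True)
```

Run on a schedule against a test environment that mirrors production, a script like this surfaces slow restarts and long validation checks on your terms rather than during an incident.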

Systems are always in a dynamic state of change, and resiliency is the capacity of a system to absorb change and disturbance while still retaining essentially the same function. In this case, a combination of change and disturbance exposed a fault in the production environment and caused a lot of grief. The good news is that the Amazon team appears to have a good post-mortem process in place to learn from events like this and continuously improve, which will ultimately result in an even more resilient environment.

Find me on Twitter if you want to discuss further…
