
For the last couple of months I have been working with a lot of technologies that are supposed to be resilient against all kinds of failures, such as Cassandra, Akka, Apache Kafka, and Elasticsearch. However, how can one be sure that these inherently resilient technologies are configured correctly and that there are no hidden single points of failure?
Netflix had the same issue and uses Chaos Monkey to check for those failure points. Chaos Monkey is a tool that introduces random failures into your application infrastructure so one can test whether the application is resilient enough. In practice this boils down to the tool randomly rebooting, killing and re-imaging servers, pulling network plugs or killing processes, which at first sounds really scary. The idea is that if one anticipates failure, one can prepare for it, respond to it and combat it.
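To make that concrete, the "kill processes" part of the idea can be as small as the shell snippet below. This is only a sketch, not Netflix's implementation, and the service names are placeholders for whatever your own stack runs:

# Pick one of the application's own services at random and kill it.
# The service names are illustrative placeholders.
services=(cassandra kafka elasticsearch)
victim=${services[RANDOM % ${#services[@]}]}
pkill -f "$victim"

If the application keeps serving requests while something like this fires every few hours, that is already a useful signal about its resilience.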
As with any new technology it's good to start small and isolated. Because the Netflix version of Chaos Monkey only supports Amazon EC2, we first tried the tool on a test environment we had in EC2. Within a couple of minutes the application, which had functioned fine in previous tests, started to display corrupt data and eventually went belly up. Post-mortem analysis of the application indicated that a few software components couldn't synchronise fast enough to handle the repeated shutdown actions.
Over the following weeks we went back and forth, each time fixing another oversight or shortcut that became apparent. And while I admit that the things we fixed weren't exactly life-threatening, after two weeks we noticed something interesting: we had become (healthily) paranoid, distrusting the reliability of servers and infrastructure. We felt hunted by server and infrastructure failures and did our very best to evade or anticipate them. In the back of our minds failure was always an option. Even when discussing a use case, resilience to failure was already taken into account by heart. This mental change improved the resilience of our application on a completely different level.
Seeing the impact this approach had on the team, we tried applying it to the production platform, which runs on our private IaaS cloud. Because that cloud doesn't use EC2, one can use WAZ Monkey to perform the same actions on Azure and (on-premise) Windows Azure Pack. For us that was a little over the top, since we shared an internal subscription with another (more traditional) team. We solved this by starting small and adding the following command to one of our Linux server startup scripts:
sleep $((RANDOM + 300)) && reboot
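Since $RANDOM in bash tops out at 32767, this reboots the server somewhere between five minutes and roughly nine hours after the script runs. A slightly more verbose sketch of the same idea, with an illustrative log line (the log path is just an example), would be:

# Same idea written out: wait a random 5 minutes to ~9 hours, log, reboot.
delay=$((RANDOM + 300))
sleep "$delay"
echo "$(date) chaos reboot after ${delay}s" >> /var/log/chaos-reboot.log
reboot

The log line is mainly there so the reboots can later be correlated with whatever the monitoring and SLA reports show.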
We now had a small chaos monkey in production, rebooting the server at a random point in time. One unexpected thing we ran into is that many (private) IaaS cloud providers measure SLAs on image uptime regardless of user action. This means that if a user reboots an image, it is counted as offline, just as if the server had suffered a hosting failure. While one might argue that one should monitor the service provided by the servers instead, for most use cases this is quite sane: you want someone to restart the web frontend server if, for example, a Windows update told a server to shut down instead of reboot. However, if one starts to reboot servers at random times to test for reliability, one can imagine an IaaS manager getting a little nervous about this. Either annotating these reboots as maintenance in the SLA or monitoring only the hosting platform rather than the running server usually resolves the issue.
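One simple way to make that annotation easy is to let the random reboot fire only inside an agreed maintenance window; the 01:00-05:00 window in the sketch below is purely illustrative, not what we used:

# Only reboot if the random delay ends inside an agreed maintenance
# window (01:00-05:00 here, purely illustrative).
sleep $((RANDOM + 300))
hour=$((10#$(date +%H)))
if [ "$hour" -ge 1 ] && [ "$hour" -lt 5 ]; then
    reboot
fi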
In our experience Chaos Monkey is a useful tool that can help teams increase the reliability of their applications. Given its unconventional approach, it is a tool that not only the DevOps team, but also the rest of the organisation, needs to get used to.