Opening your dev-ops zoo to the chaos monkey

    By Rolf Huisman · Cloud · 8 years ago

    For the last couple of months I have been working with a lot of technologies that are supposed to be resilient against all kinds of failures, such as Cassandra, Akka, Apache Kafka, and Elasticsearch. However, how can one be sure that these resilient-by-nature technologies are correctly configured and that there aren’t any hidden single points of failure?

    Netflix had the same issue and uses Chaos Monkey to check for those failure points. Chaos Monkey is a tool that introduces random failures into your application infrastructure so one can test whether the application is resilient enough. This basically boils down to the tool randomly rebooting, killing and re-imaging servers, pulling network plugs or killing processes, which at first sounds really scary. The idea is that if one anticipates failure, one can prepare for it, respond to it and combat it.
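
    To make the idea concrete, here is a minimal sketch of the core of such a tool: pick one victim at random from a list and apply a failure action to it. This is not Netflix's implementation. The function names `pick_victim` and `unleash_monkey`, the host names, and the dry-run behaviour are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Illustrative sketch only, not the Netflix Chaos Monkey.
# All names below (pick_victim, unleash_monkey, the hosts) are hypothetical.

# Print one randomly chosen argument; fail if no arguments were given.
pick_victim() {
    [ "$#" -gt 0 ] || return 1
    # RANDOM yields 0..32767; the modulo maps it onto 0..($# - 1),
    # and shift drops that many arguments so the pick ends up in $1.
    shift "$((RANDOM % $#))"
    echo "$1"
}

# Choose a victim and report what would happen to it.
# A real tool would call a cloud API or ssh here instead of echo.
unleash_monkey() {
    local victim
    victim=$(pick_victim "$@") || { echo "no candidates" >&2; return 1; }
    echo "DRY RUN: would reboot ${victim}"
}

unleash_monkey web-01 web-02 db-01
```

    Running the script prints a single dry-run line naming one of the three hypothetical hosts. Keeping the destructive action behind a dry run makes it safe to experiment with the selection logic first.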

    As with any new technology, it’s good to start small and isolated. Because the Netflix version of Chaos Monkey is Amazon EC2 only, we first tried the tool on a test environment we had in EC2. Within a couple of minutes the application, which had functioned fine in previous tests, started to display corrupt data and eventually went belly up. Post-mortem analysis of the application indicated that a few software components couldn’t synchronise fast enough to handle the repeated shutdown actions.

    In the following weeks we went back and forth, each time fixing another oversight or shortcut that became apparent. And while I admit that the things we fixed were no longer that life-threatening, after two weeks we noticed something interesting: we had become (healthily) paranoid, distrusting the reliability of servers and infrastructure. We felt hunted by server and infrastructure failures and did our very best to evade or anticipate them. In the back of our minds failure was always an option. Even when discussing a use case, resilience to failure was already taken into account by heart. This change in mindset improved the resilience of our application on a completely different level.

    Seeing the impact this technology had on the team, we tried applying it to the production platform that runs on our private IaaS cloud. Because that cloud doesn’t use EC2, the Netflix tool was not an option there; one can use WAZ monkey instead to perform the same actions on Azure and (on-premises) Windows Azure Pack. For us, that was a little over the top, since we shared an internal subscription with another (more traditional) team. We solved this by starting small and adding the following command to one of our Linux server startup scripts:

    # Reboot after a random delay of 300 to 33067 seconds (roughly 5 minutes to 9 hours)
    sleep "$((RANDOM + 300))s" && shutdown -r now
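
    A slightly more defensive variant of this one-liner could look like the sketch below. It keeps the delay within an explicit window, runs in the background so it does not block the rest of the startup script, and only reboots when explicitly enabled. `random_delay`, `CHAOS_MIN_SECONDS`, `CHAOS_MAX_SECONDS` and `CHAOS_ENABLED` are hypothetical names, not part of our original setup.

```shell
#!/usr/bin/env bash
# Sketch under assumptions: all variable and function names are hypothetical.

CHAOS_MIN_SECONDS="${CHAOS_MIN_SECONDS:-300}"     # never reboot within the first 5 minutes
CHAOS_MAX_SECONDS="${CHAOS_MAX_SECONDS:-33067}"   # same upper bound as RANDOM + 300
CHAOS_ENABLED="${CHAOS_ENABLED:-0}"               # opt-in, so a copy-paste does nothing

# Print a delay in seconds between $1 and $2 (inclusive).
# RANDOM only covers 0..32767, so windows wider than 32768 seconds
# are not fully covered by this simple modulo approach.
random_delay() {
    local min=$1 max=$2
    echo "$(( min + RANDOM % (max - min + 1) ))"
}

delay=$(random_delay "$CHAOS_MIN_SECONDS" "$CHAOS_MAX_SECONDS")

if [ "$CHAOS_ENABLED" = "1" ]; then
    # Background the sleep so the startup script itself can finish.
    ( sleep "$delay" && shutdown -r now ) &
else
    echo "chaos disabled; would have rebooted in ${delay}s"
fi
```

    Whether a background job survives the end of the startup script depends on the init system in use, so this should be verified on the target server before relying on it.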

    We now had a small chaos monkey in production, rebooting the server at a random point in time. One unexpected thing we ran into is that many (private) IaaS cloud providers measure SLAs on image uptime regardless of user action. This means that if a user reboots an image, this counts as downtime, just as if the server had suffered a hosting failure. While one might argue that one should monitor the service provided by the servers rather than the servers themselves, for most use cases this is quite sane: you want someone to restart the web frontend server if, for example, a Windows update told the server to shut down instead of reboot. However, if one starts rebooting servers at random times to test for reliability, one can imagine an IaaS manager getting a little nervous. Either amending the SLA so that these reboots are annotated as maintenance, or monitoring only the hosting platform rather than the running server, usually resolves this issue.
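
    A back-of-the-envelope calculation shows why an IaaS manager might get nervous. The sketch below estimates the measured uptime when every reboot counts as downtime. The average run time follows from the command above (300 seconds plus the mean of RANDOM, about 16383 seconds). The 60 seconds of downtime per reboot is purely an assumed figure, and `measured_uptime_percent` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Back-of-the-envelope sketch; the downtime per reboot is an assumption.

# Print the measured uptime percentage, given the average number of seconds
# a server runs between reboots ($1) and the seconds one reboot takes ($2).
measured_uptime_percent() {
    LC_ALL=C awk -v up="$1" -v down="$2" \
        'BEGIN { printf "%.2f\n", 100 * up / (up + down) }'
}

# Average run time: 300s fixed delay + mean of RANDOM (0..32767), rounded down.
avg_run_seconds=$(( 300 + 32767 / 2 ))

measured_uptime_percent "$avg_run_seconds" 60
```

    Under these assumptions the script prints 99.64, which would fall short of, for example, a 99.9% uptime target even though no real hosting failure occurred.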

    In our experience, Chaos Monkey is a useful tool that can help teams increase the reliability of their applications. Given the unconventional approach, it is a tool that not only the dev-ops team but also the rest of the organisation needs to get used to.
