
Introduction


Resilience is not about reducing errors. It's about enhancing the positive capabilities of people and organizations
that allow them to adapt effectively and safely under pressure.

Chaos engineering is the discipline of experimenting on a software system in production in order to build confidence in the
system's capability to withstand turbulent and unexpected conditions.

Concept

In software development, a given software system's ability to tolerate failures while still ensuring adequate quality of
service—often generalized as resiliency—is typically specified as a requirement. However, development teams often fail to
meet this requirement due to factors such as short deadlines or lack of knowledge of the field. Chaos engineering is a
technique to meet the resilience requirement.

Chaos engineering can be used to achieve resilience against:

Infrastructure failures
Network failures
Application failures

What is it?
Chaos engineering is a discipline of experimenting on a system to build confidence in the system’s capability to withstand
turbulent conditions in production. With chaos engineering, we intentionally try to break our system under certain stresses
to determine potential outages, locate weaknesses, and improve resiliency.

Chaos engineering is different from conventional software testing or simple fault injection. Rather than checking a fixed
set of conditions, it probes how a system copes with unpredictable situations, including traffic spikes, race conditions, and more.


With chaos engineering, we are trying to learn how an entire system reacts when an individual component is failing.
For example, chaos engineering can help answer functionality questions like these:

What happens when a service becomes inaccessible, for whatever reason?


What is the result of an outage when an application receives too much traffic, or when it is simply unavailable?
Will we experience cascading errors when a single point of failure crashes an app?
What happens when our application goes down?
What happens when there is something wrong with networking?

Benefits?
Chaos engineering offers many benefits that other forms of software testing or failure testing cannot. Failure tests can
only examine a single condition in a binary breakdown. This doesn’t allow us to test a system under unprecedented or
unexpected stresses.

Chaos engineering, on the other hand, can account for complex, diverse, and real-world issues or outages. With chaos
engineering, we can fix issues and gain new insights about an application for future improvements.

Chaos experiments help to reduce failures and outages while improving our understanding of our system design. Chaos
engineering improves a service’s availability and durability, so customers are less disrupted by outages. Chaos engineering
can also help prevent revenue losses and lower maintenance costs at the business level.

Principles

Advances in large-scale, distributed software systems are changing the game for software engineering. As an industry, we are
quick to adopt practices that increase flexibility of development and velocity of deployment. An urgent question follows on
the heels of these benefits: How much confidence can we have in the complex systems that we put into production?

Even when all of the individual services in a distributed system are functioning properly, the interactions between those
services can cause unpredictable outcomes. Unpredictable outcomes, compounded by rare but disruptive real-world events that
affect production environments, make these distributed systems inherently chaotic.

We need to identify weaknesses before they manifest in system-wide, aberrant behaviors. Systemic weaknesses could take the
form of: improper fallback settings when a service is unavailable; retry storms from improperly tuned timeouts; outages when
a downstream dependency receives too much traffic; cascading failures when a single point of failure crashes; etc. We must
address the most significant weaknesses proactively, before they affect our customers in production. We need a way to manage
the chaos inherent in these systems, take advantage of increasing flexibility and velocity, and have confidence in our
production deployments despite the complexity that they represent.
An empirical, systems-based approach addresses the chaos in distributed systems at scale and builds confidence in the
ability of those systems to withstand realistic conditions. We learn about the behavior of a distributed system by observing
it during a controlled experiment. We call this Chaos Engineering.

Chaos In Practice


Learn how to destroy your systems productively.

To specifically address the uncertainty of distributed systems at scale, Chaos Engineering can be thought of as the
facilitation of experiments to uncover systemic weaknesses. These experiments follow four steps:

1. Start by defining ‘steady state’ as some measurable output of a system that indicates normal behavior.
2. Hypothesize that this steady state will continue in both the control group and the experimental group.
3. Introduce variables that reflect real world events like servers that crash, hard drives that malfunction, network
connections that are severed, etc.
4. Try to disprove the hypothesis by looking for a difference in steady state between the control group and the
experimental group.

The harder it is to disrupt the steady state, the more confidence we have in the behavior of the system. If a weakness is
uncovered, we now have a target for improvement before that behavior manifests in the system at large.
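
As a rough illustration of these four steps, the sketch below frames an experiment as a comparison between a control group and an experimental group. The steady-state threshold and the measure_error_rate and inject_fault helpers are hypothetical stand-ins for your own monitoring and fault-injection tooling, not part of any particular chaos framework.

```python
import random
import statistics

# Hypothetical stand-ins: a real experiment would query a monitoring system
# and drive real fault-injection tooling instead.
def measure_error_rate(group: str) -> float:
    """Return an observed error rate (0.0-1.0) for one traffic group."""
    return random.uniform(0.0, 0.02)

def inject_fault(group: str) -> None:
    """Introduce a real-world event (e.g. kill an instance) for one group only."""
    print(f"injecting fault into the {group} group")

# 1. Define 'steady state' as a measurable output: error rate below 1%.
STEADY_STATE_THRESHOLD = 0.01

def holds_steady_state(samples: list) -> bool:
    # 2. The hypothesis: this check passes for both groups.
    return statistics.mean(samples) < STEADY_STATE_THRESHOLD

# 3. Introduce a variable reflecting a real-world event, in the experimental group only.
inject_fault("experimental")

# 4. Try to disprove the hypothesis by comparing steady state across groups.
control = [measure_error_rate("control") for _ in range(30)]
experimental = [measure_error_rate("experimental") for _ in range(30)]

if holds_steady_state(control) and not holds_steady_state(experimental):
    print("Hypothesis disproved: the fault disturbed steady state -- weakness found")
else:
    print("Steady state held in both groups; confidence increased")
```
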
Build a Hypothesis around Steady State Behavior
Focus on the measurable output of a system, rather than internal attributes of the system. Measurements of that output over
a short period of time constitute a proxy for the system’s steady state. The overall system’s throughput, error rates,
latency percentiles, etc. could all be metrics of interest representing steady state behavior. By focusing on systemic
behavior patterns during experiments, Chaos verifies that the system does work, rather than trying to validate how it works.
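
For example, a steady-state definition might combine an error rate and a latency percentile, both measurable outputs rather than internal attributes. The thresholds and the in-memory samples below are purely illustrative; in practice the numbers would come from production telemetry.

```python
# Illustrative request samples as (latency in ms, succeeded?) pairs.
samples = [(120, True), (95, True), (430, False), (101, True), (88, True)]

def steady_state(samples, p99_budget_ms=500, max_error_rate=0.01):
    """Steady state expressed purely in terms of measurable system outputs."""
    latencies = sorted(latency for latency, _ in samples)
    p99 = latencies[int(0.99 * (len(latencies) - 1))]
    error_rate = sum(1 for _, ok in samples if not ok) / len(samples)
    return p99 <= p99_budget_ms and error_rate <= max_error_rate

print(steady_state(samples))  # False here: the error rate exceeds the budget
```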

Vary Real-world Events


Chaos variables reflect real-world events. Prioritize events either by potential impact or estimated frequency. Consider
events that correspond to hardware failures like servers dying, software failures like malformed responses, and non-failure
events like a spike in traffic or a scaling event. Any event capable of disrupting steady state is a potential variable in a
Chaos experiment.
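
One lightweight way to prioritize is to score candidate events by estimated frequency and potential impact and run the highest-scoring ones first. The catalog and scores below are invented for illustration.

```python
# Hypothetical event catalog, scored 1-5 for estimated frequency and potential impact.
events = [
    {"name": "instance terminated", "frequency": 4, "impact": 3},
    {"name": "malformed response",  "frequency": 3, "impact": 4},
    {"name": "traffic spike",       "frequency": 5, "impact": 4},
    {"name": "region failover",     "frequency": 1, "impact": 5},
]

# Schedule the highest-priority events first.
for event in sorted(events, key=lambda e: e["frequency"] * e["impact"], reverse=True):
    print(event["name"])
```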

Run Experiments in Production


Systems behave differently depending on environment and traffic patterns. Since the behavior of utilization can change at
any time, sampling real traffic is the only way to reliably capture the request path. To guarantee both authenticity of the
way in which the system is exercised and relevance to the current deployed system, Chaos strongly prefers to experiment
directly on production traffic.

Automate Experiments to Run Continuously


Running experiments manually is labor-intensive and ultimately unsustainable. Automate experiments and run them
continuously. Chaos Engineering builds automation into the system to drive both orchestration and analysis.
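
A minimal sketch of continuous, automated execution, assuming a hypothetical run_experiment() harness like the one sketched earlier: run experiments on a schedule (here, weekday business hours so the team is present) and log each result for later analysis.

```python
import datetime
import time

def run_experiment() -> bool:
    """Hypothetical harness from earlier; returns True if steady state held."""
    return True

while True:
    now = datetime.datetime.now()
    # Only experiment on weekdays during business hours, when people are around to respond.
    if now.weekday() < 5 and 9 <= now.hour < 17:
        held = run_experiment()
        print(f"{now.isoformat()} steady state held: {held}")
    time.sleep(3600)  # at most one experiment per hour
```
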
Minimize Blast Radius
Experimenting in production has the potential to cause unnecessary customer pain. While there must be an allowance for some
short-term negative impact, it is the responsibility and obligation of the Chaos Engineer to ensure the fallout from
experiments is minimized and contained.
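
One way to contain the blast radius, sketched below under assumed inject, revert, and steady_state callables supplied by your own tooling, is to target only a small fraction of hosts and revert the fault immediately if the steady-state check fails.

```python
import random

def guarded_experiment(hosts, inject, revert, steady_state, blast_fraction=0.05):
    """Run a fault against a small sample of hosts, reverting as soon as
    steady state is violated (and always on exit)."""
    sample = random.sample(hosts, max(1, int(len(hosts) * blast_fraction)))
    inject(sample)
    try:
        if not steady_state():
            print("steady state violated -- aborting experiment early")
    finally:
        revert(sample)  # always undo the fault, even on abort or error
```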

Process
Define a steady-state hypothesis: Start with an idea of what can go awry. Choose a failure to inject and predict the
outcome when it runs live.
Confirm the steady-state and simulate some real-world events: Perform tests using real-world scenarios to see how your
system behaves under particular stress conditions or circumstances.
Confirm the steady-state again: We need to confirm what changes occurred, so checking it again gives us insights into
system behavior.
Collect metrics and observe dashboards: You need to measure your system’s durability and availability. It is best
practice to use key performance metrics that correlate with customer success or usage. We want to measure the failure
against our hypothesis by looking at factors like impact on latency or requests per second.
Make changes and fix issues: After running an experiment, you should have a good idea of what is working and what needs
to be altered. Now we can identify what will lead to an outage, and we know exactly what breaks the system. So, go fix
it, and try again with a new experiment.

Perturbation Models
A chaos engineering tool implements a perturbation model. The perturbations, also called turbulences, are meant to mimic
rare or catastrophic events that can happen in production. To maximize the added value of chaos engineering, the
perturbations are expected to be realistic.

Server shutdowns
One perturbation model consists of randomly shutting down servers. Netflix's Chaos Monkey is an implementation of this
perturbation model.
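
A bare-bones version of this model, in the spirit of Chaos Monkey but not its actual implementation, might pick one instance at random from an opted-in group and terminate it via the cloud API. The chaos-group tag is an assumed convention for marking instances the experiment is allowed to touch.

```python
import random
import boto3  # AWS SDK for Python; any cloud provider's API would work similarly

ec2 = boto3.client("ec2")

# Only consider instances explicitly opted in to chaos experiments (assumed tag).
reservations = ec2.describe_instances(
    Filters=[{"Name": "tag:chaos-group", "Values": ["opt-in"]}]
)["Reservations"]
instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]

if instances:
    victim = random.choice(instances)
    print(f"terminating {victim}")
    ec2.terminate_instances(InstanceIds=[victim])
```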

Latency injections
This model introduces communication delays to simulate network degradation or outages. For example, Chaos Mesh supports the
injection of latency.
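
Chaos Mesh does this declaratively inside Kubernetes. A much cruder standalone sketch uses Linux traffic control (tc with the netem qdisc) to delay egress traffic on an interface; the interface name and delay values below are examples, and the commands require root.

```python
import subprocess

INTERFACE = "eth0"  # example interface name

def add_latency(delay_ms: int = 100, jitter_ms: int = 10) -> None:
    """Delay all egress traffic on the interface using tc/netem."""
    subprocess.run(
        ["tc", "qdisc", "add", "dev", INTERFACE, "root", "netem",
         "delay", f"{delay_ms}ms", f"{jitter_ms}ms"],
        check=True,
    )

def remove_latency() -> None:
    """Revert the perturbation by removing the netem qdisc."""
    subprocess.run(
        ["tc", "qdisc", "del", "dev", INTERFACE, "root", "netem"],
        check=True,
    )
```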

Resource exhaustion
This model eats up a given resource. For instance, Gremlin can fill up the disk.
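
A toy equivalent of this model (not Gremlin's implementation) simply writes zero-filled chunks until roughly the requested amount of disk is consumed, and removes the file to revert; the path and sizes are arbitrary examples.

```python
import os

def fill_disk(path="/tmp/chaos_filler", gigabytes=1, chunk_mb=64):
    """Consume roughly `gigabytes` of disk space by writing zero-filled chunks."""
    chunk = b"\0" * (chunk_mb * 1024 * 1024)
    written = 0
    with open(path, "wb") as f:
        while written < gigabytes * 1024 ** 3:
            f.write(chunk)
            written += len(chunk)

def clean_up(path="/tmp/chaos_filler"):
    os.remove(path)  # always revert the perturbation when the experiment ends
```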

Game Days & Chaos Engineering

The central idea behind a Game Day is to prepare systems and teams to be resilient and rehearsed. Readiness for systems is
always inclusive of things like delivery automation, circuit breakers, and independent scaling; readiness for teams must
include practical exercises in breaking and fixing things. In all other Incident Management professions, practitioners spend
focused time planning and rehearsing responses to different scenarios. Why not DevOps?

“You Want to Do What?!”


“Amazon and Netflix and Google and Facebook are truly incredible engineering organizations,” people say, “but we’re just 25
people trying to get by. How do we carve off time and focus to execute something like this, without proverbially shooting
our foot off?”

Game Days are a risky and tough proposition for teams growing their Incident Management practice, no question.

An important first step is determining where you are going to simulate failure. Ideally, a non-production environment can be
used as your team starts practicing how to practice. Staging and testing environments provide a low-risk proxy learning
environment for teams to experiment. If this is your first go at a Game Day, it’s probably best to start somewhere with a
containable blast radius. As you get used to both simulating disaster and working together to resolve issues, move your Game
Days into production.

Scenario Selection
Wherever you end up simulating some breakage, you should devote some time up-front to planning your event. I suggest
treating your early efforts here more like dress-rehearsals for your runbooks. I think for most teams, someone has penned a
“what to do if the database fails” document, even if the team has never actually seen that occur. Have a few of those in
your stable? Great options for your first Game Day exercise.

What I like about scenarios like this, whether someone has devoted time to writing a document or not, is that they are filled
with untested assumptions. Assumptions about how systems are going to behave, how long particular operations will take,
which metrics we should be mindful of. Testing and verifying those assumptions is a desired outcome of a Game Day. Along the
way, you get the added benefit of testing the procedure, or suggested actions present within the scenario. Left untested,
you will never know until it is way too late how many of those assumptions are true—or badly misleading.

At Netflix, Game Days run regularly, with a wide variety of chaotic simulations in play. For your first run, don’t fire up
Chaos Kong and expect things to go well. Look for some risky scenarios, things that are new or poorly tested. A good first
step would be to develop two, maybe three scenarios as candidates. Some criteria to consider for scenario selection:

Any candidate scenario should include clear triggers or methods to invoke the scenario. Wait before tackling a scenario
with a complicated set of pre-conditions. Stripe’s ‘kill-9’ exercise is a great example of learning with a simple
trigger.
Have a downtime estimate, then double it. If you can’t get comfortable with that number, look for other options.
Ensure there are some well thought out victory conditions in your scenario. “We know we successfully resolved the issue
because…” Game Days with lingering blast radius will not be repeated.

Bounded Chaos
Surviving your first Game Day (and having enough to show for it to justify a second) is all about containing the actual
amount of chaos you’re going to introduce. You can already imagine all the exciting ways something like this can go wrong;
your job is to ensure it doesn’t.

Spread the exercise over two sprints: preparation and execution. Execution is the relatively easy step; really all you’re
doing is blocking the day or afternoon as unavailable in that sprint. Preparation can and should take up a decent chunk of
an iteration.

Having developed a couple of candidate scenarios, work through those details. If there are pre-conditions to trigger the
incident, get those in place. If there are specific team members participating, ensure they will be scheduled for on-call
during the planned time. If you need assistance from other teams to ensure you’ve successfully recovered, get that lined up
for The Big Day.

Opinions vary, but I’d encourage a first-time team to discuss the candidate scenarios beforehand. Your goal here is to
create a small event with valuable learnings, not to get the team into trouble. You can retain surprise by varying which scenario
is enacted, and the precise timing. On the day of, invoke your favorite random number generator (dice/code/hat), pick a
scenario, and fire away!

Whatever your Game Day runtime, make sure to reserve some solid Post-Incident Review time for the group. Order pizza and
give everyone time to relax and review and discuss openly how the Game Day went. Unlike real live-fire exercises, this time
can be anticipated and accounted for in your planning, so make the most of it!

Measuring Success
The clearest metric we can all agree to in Incident Management is Time to Resolve. Certainly teams who practice Game Day
exercises should expect to see that metric decreasing as they practice increasingly complex failure scenarios. So too should
the beginning team expect to see reductions in the time they spend resolving issues, but I would expect this to be focused
in scenarios that are most like the ones you’ve exercised. So don’t look to Game Days as a simple way to reduce MTTR; it’s
a strategic play, one that takes time to impact all cases of Incident Management.

I think a different metric to watch, and expect good things from, is the expected vs. actual time spent resolving any
specific scenario. Teams just starting out in this practice tend to deeply underestimate the time they collectively spend in
Response and Remediation. Estimates for execution of infrequently used tools, like recovery systems, are always going to be
off by a wide margin.

Perhaps the most important outcome to watch for in your early efforts here is one that’s a little tough to measure:
teamwork. Collaboration is hard, as any DevOps team will tell you. Collaboration under stress and time constraints is really
hard—late nights with lots on the line tend to bring out the worst in humans. This is why teams need to practice before
things catch on fire.

Understanding your teammates’ approaches, biases, and methods is a key component of an effective Incident Management team. We
develop trust and respect for each other by working together, helping each other, and sharing efforts to solve problems.
This is the primary success metric to observe as you tackle Game Days: a foundation of collaboration and trust upon which to
grow your Incident Management team.

Further Knowledge

The Chaos Engineering Collection 📖

Game Day - AWS 📖

Just Run Game Day 📖

Securing Online Gaming Combine Chaos Engineering with DevOps Practices 📖

Service Ownership @Slack 📹

Rethinking How the Industry Approaches Chaos Engineering 📹

Mastering Chaos - A Netflix Guide to Microservices 📹

Testing In Production, The Netflix Way 📹

Chaos & Intuition Engineering at Netflix 📹
