Chaos Engineering Your Incident Management Process
Chaos engineering is an essential part of creating an effective incident management system and implementing processes that can help keep you in control when real chaos threatens your code.
By Robert Ross on 8/24/2021
Think of chaos engineering like a flu shot. You inject an outside element into your system to help build future immunity. This element, while somewhat dangerous, allows your body to adapt to something much more intense in the future. The immune system identifies its shortcomings in the face of something unfamiliar and dangerous, and in response builds a resistance.
Chaos engineering is designed to push a system to its limits in order to find the weak points, and ultimately reinforce them. It's an essential part of creating an effective incident management system and of implementing processes that can help keep you in control when real chaos threatens your code.
How to Perform Chaos Engineering on Your System
Chaos Experiments
For a better understanding of chaos engineering in action, here are some simple examples of chaos experiments you could run to produce some technical chaos.
1. Add latency to a Postgres instance
This one is really simple. One possible outcome is that requests to your app servers start queuing and your whole site topples over - something I’ve experienced more than a few times in my career.
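If you want to try this yourself, here’s a minimal sketch of one way to do it with Linux’s tc/netem, wrapped in Python and run from an app host so outbound traffic to Postgres gets delayed. The interface name, port, and delay are assumptions to adjust for your environment, and a purpose-built tool like Toxiproxy or Gremlin is usually a safer bet in practice.

```python
# Minimal sketch: add artificial latency to outbound Postgres traffic using
# Linux tc/netem. Assumes a Linux host, root access, and that "eth0" and the
# default Postgres port 5432 match your environment.
import subprocess

IFACE = "eth0"       # assumption: adjust to your network interface
PG_PORT = "5432"     # default Postgres port
DELAY = "300ms"      # how much latency to inject

def add_latency():
    # Create a prio qdisc, attach a netem delay to band 3, and steer only
    # traffic destined for the Postgres port into that delayed band.
    subprocess.run(["tc", "qdisc", "add", "dev", IFACE, "root", "handle", "1:", "prio"], check=True)
    subprocess.run(["tc", "qdisc", "add", "dev", IFACE, "parent", "1:3", "handle", "30:",
                    "netem", "delay", DELAY], check=True)
    subprocess.run(["tc", "filter", "add", "dev", IFACE, "protocol", "ip", "parent", "1:0",
                    "prio", "3", "u32", "match", "ip", "dport", PG_PORT, "0xffff",
                    "flowid", "1:3"], check=True)

def remove_latency():
    # Always clean up after the experiment window.
    subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root"], check=True)

if __name__ == "__main__":
    add_latency()
    input(f"Injecting {DELAY} latency toward port {PG_PORT}. Press Enter to stop...")
    remove_latency()
```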
2. Blackhole traffic to Redis
Intentionally blackholing your traffic to Redis might break caching, but your site might continue to operate - albeit a little more slowly.
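Here’s a similarly rough sketch of blackholing Redis traffic with iptables. It assumes a Linux host, root access, and the default Redis port, so treat it as an illustration rather than a ready-made experiment.

```python
# Minimal sketch: blackhole outbound Redis traffic by dropping packets to its
# port with iptables. Assumes a Linux host, root access, and the default
# Redis port 6379.
import subprocess

REDIS_PORT = "6379"  # default Redis port; adjust for your setup

def blackhole_redis():
    # Drop every outbound TCP packet headed for the Redis port.
    subprocess.run(["iptables", "-A", "OUTPUT", "-p", "tcp",
                    "--dport", REDIS_PORT, "-j", "DROP"], check=True)

def restore_redis():
    # Remove the exact rule we added so normal traffic resumes.
    subprocess.run(["iptables", "-D", "OUTPUT", "-p", "tcp",
                    "--dport", REDIS_PORT, "-j", "DROP"], check=True)

if __name__ == "__main__":
    blackhole_redis()
    input("Redis traffic is blackholed. Watch your dashboards, then press Enter to restore...")
    restore_redis()
```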
3. Kill processes randomly
Killing processes randomly means requests are left half-fulfilled, leaving your data in an inconsistent state.
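As a sketch, randomly killing a worker might look something like the snippet below. The process name is hypothetical, psutil is an extra dependency, and you should only point something like this at an environment you’re allowed to break.

```python
# Minimal sketch: randomly kill one worker process to see how the system
# handles half-finished requests. Assumes the psutil package and a
# hypothetical worker process name ("app-worker").
import random
import signal

import psutil

TARGET_NAME = "app-worker"  # hypothetical process name; change to your own

def kill_random_worker():
    candidates = [p for p in psutil.process_iter(["name"])
                  if p.info["name"] == TARGET_NAME]
    if not candidates:
        print(f"No '{TARGET_NAME}' processes found.")
        return
    victim = random.choice(candidates)
    print(f"Killing pid {victim.pid}...")
    victim.send_signal(signal.SIGKILL)  # no chance to clean up, by design

if __name__ == "__main__":
    kill_random_worker()
```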
All of these examples are super simple, but they capture the basic idea of chaos engineering: injecting something unknown into your system, seeing what happens, and improving your incident response and management by solving the problems that arise.
Process Experiments
Process experiments can be run in a very similar fashion to chaos experiments. While chaos experiments focus on computers and how they interact with each other when thrown into an unexpected situation, a process experiment has far more possible outcomes. I like process experiments because people will always solve problems in different ways. I might look for logs and find what I need in a completely different way than my colleague would. If there’s an outage and you ask someone to mitigate it, there are a bunch of ways they could do that - I might revert a merge, while someone else might use Spinnaker to roll back a deploy. There are so many ways we can change systems with only our keyboards, which makes running experiments on these processes an essential part of being prepared for an incident.
So how do you run an experiment on a process?
One way is to hold a surprise meeting. Last year, I put my co-founder Dylan in an empty conference room, and asked him to update our status page. I didn’t tell him I was going to do this; all it took was 30 seconds, and Dylan found out that he didn’t have the credentials to log in.
While we have our own status page product now, at the time this would have meant that an on-call engineer would have had no way to publicly disclose an incident to our customers. All it took was a three-minute experiment to identify a pretty significant gap in our process, and one that we’d now be able to handle easily in the future.
You can do this for any common operation you find yourself or your team performing during incident response. Take any of the following techniques - we’ll call them “technique bricks” - and put them under a microscope and see how your process could be improved or refined:
Rolling back a bad deploy
Adding storage capacity
Turning off a feature flag
Finding distributed traces
Updating replica config
Finding logs
Skipping test suites
Disabling an auto scaler
Purging caches
Escalating to another team
Merging a PR without approval
SSHing into a box
There are so many things we do when responding to incidents that it’s easy to look at them all as one large process. Instead, start to think of them as individual techniques - a single “brick” in a larger structure - that you can practice. By mastering the techniques that make up your incident response process, you’ll find that more than anything, people mitigate incidents - not processes. This is why at FireHydrant, we try to remove the heavy lifting that all too often hampers a process, like updating status pages and creating Jira tickets and Slack rooms. We institutionalize and automate that heavy lifting to let people focus on putting out the fire.
Guide, Don’t Prescribe
The other thing processes get wrong all the time is that they prescribe how to do something.
“Here’s how to add storage to the database.”
But what if that was a red herring? What if the real issue wasn’t that you were out of storage? Maybe an error was mislabeled, or there was another problem further up the chain. This is why you don’t want to prescribe processes. Instead, teach the techniques and test those instead. Take each of our “technique bricks” in the bulleted list above, and think of each as an individual Lego brick. What’s interesting about Lego bricks is that when you have a lot of them, you can create any shape, and then break them down and build something completely different with the same bricks.
If someone knows how to use a single brick - one of your techniques - in multiple situations, they can manage different incidents more effectively. If someone only knows how to build a single thing with their bricks - say, a spaceship (restarting a database under a very specific set of conditions) - they’re limited in their ability to resolve other incidents. This is why it’s essential to teach how to use the mitigation techniques individually, and then practice using them within different scenarios.
For example, let’s say I have a stale cache problem. I can take four of my bricks and arrange them as such:
Finding logs
SSHing into a box
Purging caches
Rolling back a bad deploy
And I’ll quickly solve my stale cache problem. None of these techniques were created specifically to handle an incident with a stale cache problem, but because we’re looking at these techniques as individual bricks, we can put these together as a set to resolve an incident quickly. When we practice each of our techniques individually and look at them as independent bricks, we can rearrange them on the fly and get creative with how we solve problems.
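To make the “arranging bricks” idea concrete, here’s a rough sketch of those four techniques strung together as a script. Every hostname, log path, and deployment name is a placeholder, and it assumes Redis for the cache and Kubernetes for deploys - your own bricks will look different.

```python
# A sketch of the four "bricks" wired together for a stale-cache incident.
# All hostnames, paths, and deployment names here are hypothetical; the point
# is that each step is a technique you already practice on its own.
import subprocess

APP_HOST = "app-01.internal"        # hypothetical box to SSH into
LOG_PATH = "/var/log/app/app.log"   # hypothetical log location
DEPLOYMENT = "web"                  # hypothetical Kubernetes deployment

# Bricks 1 + 2: SSH into a box and find the relevant logs.
subprocess.run(["ssh", APP_HOST, f"grep -i 'cache' {LOG_PATH} | tail -n 50"])

# Brick 3: purge the cache (assumes Redis and redis-cli on the box).
subprocess.run(["ssh", APP_HOST, "redis-cli FLUSHALL"])

# Brick 4: roll back the bad deploy (assumes the app runs on Kubernetes).
subprocess.run(["kubectl", "rollout", "undo", f"deployment/{DEPLOYMENT}"])
```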
The Bricks are Boring - but Important
The bricks can be boring - which is why we don’t normally practice them. You don’t often rehearse finding logs or rolling back a deploy; it’s just not something we do in our industry. But that’s not how anyone who wants to perform at a high level operates.
I did a lot of marching band when I was younger. I did Drum Corps International - and my co-founder did as well. We did things in marching band that were really boring. One thing we did all the time was stand in line, march forward, put our horns up, march forward eight steps, and put our horns down. We’d do just this for three hours straight.
But that was important - that was a brick. Marching in step, with the right technique, holding our horns correctly - these were all individual techniques we were required to master. The only reason we did that was so that when we took the field, we could do things like this.
If you look at each individual in the video, they’re marching and running in step, using the same techniques that they practiced for hours on end. They’re able to create different shapes at different speeds, and create amazing shows just like this.
Practicing & Improving Your Techniques Through Chaos Engineering
First, you need to identify the techniques your team understands least. Like our chaos engineering examples earlier, you need to find your weakest points first in order to strengthen them with practice. So, let’s inject some chaos into our environment.
This is a process I ran at Namely - we’d intentionally break an environment and watch how the team fixed it, documenting all the issues they ran into along the way. Here’s my process:
Pick someone on your team that will break an environment intentionally.
Decide how you’re going to break it with them, and keep it a secret.
Schedule time on the calendar for when you’re going to break the environment.
Observe how the team identifies and mitigates the issue.
Keep track of all the bottlenecks they hit.
Of course, we told the team we were going to break the environment - we just didn’t tell them how. When the time came, we broke the environment and watched our team identify the problem and attempt to mitigate it. Along the way, they hit walls and bottlenecks. They might have to take an alternate route to their solution, or they might come to a stop because they lack the permissions or credentials to do something. A runbook might not exist for the problem, or the capability might not even exist in the system itself to solve the problem. In breaking down your system, you’re also breaking down your process. With what you’ve found, you can easily identify which of your techniques need improvement or replacement.
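A little tooling helps with the last two steps. Here’s a minimal sketch of a note-taker that timestamps observations during the game day so the bottlenecks are easy to reconstruct in the retrospective; the output file name is just an example.

```python
# Minimal sketch of a game-day note-taker: timestamp every observation as the
# team works so bottlenecks are easy to reconstruct afterward.
from datetime import datetime, timezone

NOTES_FILE = "game-day-notes.md"  # hypothetical output file

print("Type observations as they happen; enter an empty line to finish.")
with open(NOTES_FILE, "a") as notes:
    notes.write(f"\n## Game day {datetime.now(timezone.utc).date()}\n")
    while True:
        observation = input("> ").strip()
        if not observation:
            break
        timestamp = datetime.now(timezone.utc).strftime("%H:%M:%S")
        notes.write(f"- {timestamp} UTC: {observation}\n")
```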
Creating Something New
Now that you’ve broken down your process, you can go back through your postmortems or retrospectives and find the individual techniques that people used to resolve the incident. Break these down - make them molecular. Something as simple as rolling back a deploy or finding a log can each be considered its own technique.
A great way to put your newfound techniques to the test is to find your newest teammate and ask them to perform one. Let’s use rolling back a deploy as an example.
Roll out a benign change to an environment.
Get on a Zoom call with a teammate and record it.
Ask the teammate to share their screen.
Tell them to roll back the deploy.
Watch what happens.
Step five is the most important. As I mentioned earlier, each person is going to have a completely different way of handling a situation - even something as simple as rolling back a bad deploy. One person might push a new image to a registry, while another might revert a commit on GitHub. You might even find that someone has an extremely efficient way of handling this specific situation that you didn’t know about. Now you can share that technique with the rest of the team and handle incidents with much greater efficiency.
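To make that concrete, here’s a rough sketch of two of those approaches side by side. The deployment name, the branch, and the assumption that CI redeploys on a push to main are all placeholders for whatever your own pipeline does.

```python
# A sketch of two different ways a teammate might roll back the same deploy.
# Both make assumptions that won't match every stack: a Kubernetes deployment
# named "web" for the first, and a Git-driven deploy pipeline for the second.
import subprocess

def rollback_with_kubectl(deployment: str = "web") -> None:
    # Approach 1: revert the running workload to its previous revision.
    subprocess.run(["kubectl", "rollout", "undo", f"deployment/{deployment}"], check=True)

def rollback_with_git_revert(bad_commit: str) -> None:
    # Approach 2: revert the offending commit and let CI redeploy the result.
    subprocess.run(["git", "revert", "--no-edit", bad_commit], check=True)
    subprocess.run(["git", "push", "origin", "main"], check=True)
```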
You can do this with any issue, whether it’s finding logs or purging caches. The goal is to identify all the different things that can slow you and your team down during incident response, and trim down the processes that hamper your response time.
Perfecting Your Incident Response with the Principles of Chaos Engineering
The end goal of all of these processes is to use chaos engineering to identify the core techniques people rely on during incident management, practice those techniques, and then perfect them. Typically, we think of chaos engineering as breaking the system and then observing the adverse effects. We rarely, however, use the same approach to test our processes. Once you find the techniques people are using during your chaos experiments, write them down - get them on paper, put them in whatever system you use - so you can institutionalize that knowledge. Then practice them regularly. Practicing the individual techniques all the time, just like I did in marching band, is what prepares you for the show - the real moments during incident management where every second matters. You’ll be able to shape your response on the fly with a massive library of techniques that you’ve perfected through constant practice, application, and chaos engineering.
Prefer to watch the video? Check out Bobby’s original presentation here