Too many of us are still playing whack-a-mole when it comes to incidents: an incident is declared, the on-call engineer is paged, the incident is resolved and then forgotten — until next time. It’s time to start thinking in terms of proactive incident management, not just reactive incident response.
As I wrote in a recent blog post, a thoughtful incident management strategy helps shift focus away from constant firefighting and back onto shipping great products, growing customers, and retaining engineers. The best part? It’s easy enough to get started using what you already have. Let’s dig into how.
Incident response vs incident management: what’s the difference?
I’ve heard these terms thrown around interchangeably. Before we talk about how to go from one to the other, let’s define what each one actually means.
Incident response is what happens during the incident itself, and it’s just one sliver of incident management. It’s the set of technical processes required to analyze, contain, and remediate an incident. Think of it like an EMT responding to a 911 call about an unconscious person: their goal is to get there fast, figure out what’s going on, and do what they can to get the patient stable.
Incident management is the entire lifecycle — before, during, and after. The patient is stable, the EMTs know the patient went into cardiac arrest, and now it’s time to figure out why and get a plan together for recovery (and hopefully prevention).
Incident management looks at the whole incident lifecycle and every system it impacts. It includes everything from planning before the incident, to communication and coordination during the incident, to retrospectives and learnings after the incident.
3 steps to go from incident response to incident management
You can see big gains from relatively small investments when it comes to maturing your incident management. And the fundamentals can be put in place without purchasing tools or hiring new staff (of course, if you want to really invest in your incident management strategy, think about automating where you can).
Set up a consistent process
The first thing you’ll want to do is set a process for what happens when an incident occurs. Once you have the basics of an incident management program in place, you can build on it.
To decide what goes into your company’s process, you’ll want to do some planning and scenario modeling. The first thing you need to do is get the right people involved. I recommend assigning one person to be in charge of forming a task force that includes a representative from each team that might be affected. This will vary from company to company but should include:
Engineering
Customer support
Legal and finance
Sales and/or account management
Marketing and communications
Then, you’ll want to consider:
What, exactly, do you alert on? Are you alerting on service level objective (SLO) violations only? Or every time a service goes down?
Who gets alerted and when? This will vary depending on the size of your company and the severity of the incident. For example, the marketing team probably doesn’t need to get involved in every incident, and legal might be a next-day alert.
How do you communicate? This includes not only getting the right people together at the same time in the same place during an incident — it also includes incident communication to the rest of the org and to your customers.
What’s the process for each team involved? The workflow for the engineering team will look very different from that of customer support.
What are the most likely scenarios? Of course, anything can happen (and will, judging by the past few years), but to keep this manageable, think about the most likely scenarios for your organization and model them out with a timeline and action items.
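To make the alerting and escalation questions above concrete, here’s a minimal sketch of how routing decisions might be encoded once you’ve answered them. Everything in it — the severity levels, team names, and who waits until the next day — is a hypothetical illustration, not a prescription:

```python
# Hypothetical alert-routing sketch: map incident severity to the teams
# that get paged immediately vs. notified the next business day.
# All severity levels and team names here are illustrative assumptions.
ROUTING = {
    "sev1": {"page_now": ["engineering", "customer_support", "comms"],
             "notify_later": ["legal", "finance"]},
    "sev2": {"page_now": ["engineering", "customer_support"],
             "notify_later": []},
    "sev3": {"page_now": ["engineering"],
             "notify_later": []},
}

def route_alert(severity: str) -> dict:
    """Return who to page immediately and who can wait until tomorrow."""
    if severity not in ROUTING:
        raise ValueError(f"Unknown severity: {severity}")
    return ROUTING[severity]

# A sev1 pages everyone who coordinates live; legal hears about it tomorrow.
print(route_alert("sev1")["page_now"])
```

Even a table this small forces the useful conversation: for each severity, who is woken up, and who can safely find out in the morning.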
This will take some forethought and probably more than one discussion. But — and I truly can’t stress this enough — it’s worth it. The fewer questions you’re seeking answers to during an emergency, the faster you can focus on putting out the fire and moving forward.
Of course, this plan is only as good as its documentation. Write it down. Get buy-in from stakeholders. Store it in a place that everyone has access to, like a company Wiki or an incident management tool. And then make a plan for who updates it and when.
Declare ownership around product areas
If you already have some incident response processes in place, you probably have a service dependency framework of some kind — this will further help you determine who should be in the room when things go down.
The ultimate goal here might be a fully fleshed out service catalog that illustrates dependencies, owners, and rollback plans for every service. To start though, keep it simple — declare ownership around product areas.
Each product area should have an engineering team associated with it. Set up your process so when an incident is declared, and you find out what’s broken, a member of the corresponding team is alerted.
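As a sketch of what product-area ownership can look like in practice, here’s a minimal Python example. The product areas, team names, and Slack channels are all hypothetical placeholders:

```python
# Hypothetical mapping from customer-facing product areas to owning teams.
# When an incident is declared against an area, alert that team's on-call.
OWNERS = {
    "login": {"team": "identity", "slack_channel": "#identity-oncall"},
    "checkout": {"team": "payments", "slack_channel": "#payments-oncall"},
    "search": {"team": "discovery", "slack_channel": "#discovery-oncall"},
}

def owner_for(product_area: str) -> dict:
    """Look up the team responsible for a broken product area."""
    try:
        return OWNERS[product_area]
    except KeyError:
        # Fall back to a catch-all triage rotation rather than drop the alert.
        return {"team": "platform", "slack_channel": "#incident-triage"}

print(owner_for("login")["team"])    # the identity team owns login
print(owner_for("billing")["team"])  # unmapped areas fall back to triage
```

The fallback is the important design choice: an incident against an area nobody has claimed yet should land with a default rotation, not disappear.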
One reason I like this approach is that it makes you think about the incident the way a customer might. We call this pitchfork alerting, and it helps put the focus on customer pain — when the influx of support messages comes in, customers don’t complain about a service being down, they say they can’t log in.
Run a drill
The absolute worst time to learn is when you’re dealing with a real incident. Once you have the basics in place, it’s time to practice. The more rehearsed your process is, the calmer and faster your team will be when a real emergency hits.
Force everyone to open a fake incident. Yes, everyone. Here’s the thing: your customer support team is already declaring incidents — they’re just currently doing it by popping into an engineering channel on Slack and telling you that something’s broken. Your CEO? They’re already declaring incidents too … sometimes by texting you directly at any time of day. Anyone should be able to call the fire department, and anyone should be able to declare an incident. Part of the challenge is just getting everyone confident enough to do it, and there’s no better time than during a drill.
After the drill, do a quick debrief. How’d it go? What did you learn about your systems or processes that you could improve for the future? It’s not a matter of if an incident will happen, it’s when. Make sure your plan is ready.
Start slow, iterate, improve
The key here is to bite off small chunks. Try taking an MVP approach to implementing these processes — get a skeleton plan together, then flesh it out and iterate on it over time.
If you’re early in this journey, don’t worry, you’re not alone. Many of the companies we talk to are right where you are, thinking about how to move from reactive incident response to a proactive incident management program that will pay dividends in employee morale, customer loyalty, cost savings, and product reliability.