I recently published a couple of blog posts: one about what happens when you invest in a thoughtful incident management strategy, and one about three first steps for getting there. What I’m getting at in these posts is that the software operations community needs a shift toward proactivity. I’d wager most of the world is responding to incidents as they happen, and nothing more.
But if you have a product that requires uptime — and don’t we all — you can’t afford to think in terms of reactivity alone. Incident response plays an important role within an incident management system, but incident response on its own cannot address the varied systemic factors that affect overall reliability. And too many companies are still operating in this whack-a-mole-style incident response world.
The first step in understanding how to shift from incident response to incident management is to define what those terms mean.
What is incident response?
Incident response focuses on the steps taken to contain and recover in the event of an incident. It’s a reactive way of operating. You get a page and you jump into action to remediate the incident.
Response is what happens during the incident itself — just a sliver of incident management. I like to think of it as the set of technical processes required to analyze, contain, and remediate an incident. Incident response is like an EMT responding to a 911 call about an unconscious person: their goal is to get there fast and do what they can to get the patient stable.
Incident response is often led by the team that’s on call when a page goes out. They’re charged with leading the response with a goal of remediation. An incident response process might include:
Concentrating mainly on remediation and response is really common when it comes to incidents, so if that’s where you are, don’t worry. We hear all the time from companies that have found this reactive way of operating just doesn’t work for long-term success, and they’re looking to move into incident management.
What is incident management?
Incident management is a proactive framework and strategy for anticipating, handling, containing, and preventing incidents. It’s the entire lifecycle — before, during, and after.
In our 911 metaphor from above, incident management isn’t just getting the patient stable. It’s determining what actually happened and why, then getting a plan together for recovery and long-term health.
Incident management looks at the whole incident lifecycle and every system it impacts. It includes everything from planning before the incident to communication and coordination during the incident, to retrospectives and learnings after the incident.
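To make the difference in scope concrete, here is a minimal sketch in Python. The phase names and activity labels are taken loosely from the descriptions above; the `Phase` enum, the `LIFECYCLE` mapping, and the `phases_covered` helper are all hypothetical names invented for illustration, not part of any real tool.

```python
from enum import Enum


class Phase(Enum):
    """The three lifecycle phases: before, during, and after an incident."""
    BEFORE = "before"
    DURING = "during"
    AFTER = "after"


# Illustrative assumption: activity labels loosely based on the prose above.
LIFECYCLE = {
    Phase.BEFORE: {"planning"},
    Phase.DURING: {"communication", "coordination", "analyze", "contain", "remediate"},
    Phase.AFTER: {"retrospectives", "learnings"},
}

# Incident response: the technical steps taken while the incident is happening.
RESPONSE_ACTIVITIES = {"analyze", "contain", "remediate"}

# Incident management: every activity in every phase.
MANAGEMENT_ACTIVITIES = set().union(*LIFECYCLE.values())


def phases_covered(activities):
    """Return the set of lifecycle phases that the given activities touch."""
    return {phase for phase, names in LIFECYCLE.items() if names & set(activities)}


if __name__ == "__main__":
    print("response covers:", sorted(p.value for p in phases_covered(RESPONSE_ACTIVITIES)))
    print("management covers:", sorted(p.value for p in phases_covered(MANAGEMENT_ACTIVITIES)))
```

In this sketch, response touches only the "during" phase, while management spans all three. Real programs will have far more activities in each phase, but the shape of the comparison is the same.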
In addition to an incident response component, incident management analyzes and maintains the systemic elements that enable incident prevention and response. These structural components include:
Analytics and retrospectives
As a holistic approach to complex systems, incident management has a wide breadth of stakeholders that can include the C-suite, engineers, and product managers, in addition to the on-call teams.
Depending on scale and complexity, a business may have any number of expert roles built into its incident management plan. Incident management can encompass roles across the organization, including DevOps and reliability engineering, but also legal, public relations, customer support, and others. Incident management involves a holistic team aimed at improving reliability for your product overall.
Why does it matter?
I’m hammering on these differences because they signify a strategic approach — or lack thereof — to how you think about your incidents. Both terms involve making decisions based on the information and resources you have at hand. But incident management works with a lot more data and time, which allows it to maintain the health of an entire system, rather than simply responding to emergencies.
Incident management is a long-term investment in the health and reliability of your product that puts the customer experience at the forefront. It’s a commitment to healthy systems, to learning from experiences, and to continuously improving.
Of course, as with any impactful strategy shift, greater investment often brings more complexity: more people, more processes, more documents. But the investment is worth it when you think about the improvements you’ll be making in customer trust, product reliability, and engineer quality of life.
Plus, the right incident management tool can address those challenges by automating communication, creating consistent processes, keeping track of who owns what services, and integrating with the tools you already use. If you’re ready to think in terms of incident management, we’d love to help.