Best practices for building an incident management plan and process
A thoughtful incident management plan can help you avoid repeating past incidents and cut down your incident response time drastically.
By Robert Ross on 4/5/2022
When incidents inevitably occur in your software stack, managing them well could be the difference between losing customers and building trust with them. In this article, we’ll give you and your team some best practices for preparing to manage incidents. It’s crucial to define service ownership and a declaration process, and then to practice all of it. With a little planning now, you'll be able to cut your incident response time drastically.
Building an incident management plan
Most teams have a plan for how to solve known technical problems in their systems (like scaling their Kubernetes clusters), but many teams haven’t taken the time to plan for how to respond to unknown problems. While it’s difficult to predict when and what incidents will occur, we can put in place the steps to take when an incident is declared. We’re planning so we can avoid ad-hoc management of our incidents, which oftentimes prolongs them. You’ll have a hard time shaking off a poorly managed incident, and you’ll remember it all too well.
We’re going to talk about a few necessities that enable exceptional incident management.
- Service ownership
- Incident roles
- The incident declaration process
- Running incident drills
Let’s get started.
Assign service ownership
You might think of your system as individual services and software, but your customers think about it as the functionality and features they need to get something done. Service ownership is the next generation of managing the complex systems that power software. Implementing service ownership means committing to these traits:
- The team owns the lifecycle of the service: development, CI/CD, operations, and incident management. “You build it, you run it” is oftentimes the mantra of service ownership.
- Each service has a team assigned as its owner. Third-party software (such as a managed database) also has ownership assigned.
- Each service owning team is on-call and responsible for any incidents their software encounters.
A critical step toward successful incident management is achieving a basic level of service ownership. For example, you may have a team called “core services” that provides functionality such as authentication and authorization. If your users begin to experience problems logging in to your website, the core services team should immediately know, without a doubt, that they own that incident. All product-owning teams should be trained and equipped to manage incidents.
An up-to-date service catalog becomes essential when assigning team ownership, as you can clearly outline large product surfaces, which services power that surface area, and which team owns each service in that graph. For example, at FireHydrant we have a team that owns each of our Incident Management, Service Catalog, and Integrations product areas, and they’re all assigned accordingly in our own tool.
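A service catalog doesn’t have to start as anything more than a mapping from service to owning team. Here’s a minimal sketch in Python; the service and team names are made up for illustration and don’t come from any real catalog:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Service:
    name: str
    owning_team: str   # the single team accountable for this service
    product_area: str  # the customer-facing surface it powers

# Hypothetical catalog entries; every service, including managed third-party
# dependencies, gets exactly one owning team.
CATALOG = [
    Service("auth-api", "core-services", "Login"),
    Service("billing-worker", "payments", "Billing"),
    Service("postgres-primary", "platform", "All product areas"),
]

def owner_of(service_name: str) -> str:
    """Answer "who owns this?" without a hallway conversation."""
    for service in CATALOG:
        if service.name == service_name:
            return service.owning_team
    raise LookupError(f"{service_name} has no owner -- fix the catalog first")

print(owner_of("auth-api"))  # -> core-services
```

However you store it, the important property is the one the lookup enforces: a service with no owner is treated as a bug in the catalog, not an open question during an incident.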
Clarify incident roles
Great incident management requires responders to have well-defined roles. An incident responder should clearly know what role they’ve been assigned and what their responsibilities are. Commonly, incident roles are adopted from FEMA’s incident command system (ICS).
We recommend the chart below as a guide to the roles you should establish in your incident management process and their general responsibilities.
Incident role | Responsibilities |
---|---|
Commander | Designated leader with a bias for action; the foreperson of the incident. Notes important events as they occur (rollbacks, resource increases, etc.) and assigns incident roles to other teammates. |
Mitigator | First and foremost responsible for mitigating the incident. Communicates clearly about what they are doing, and why. |
Planning | Observes the incident and creates follow-up action items. Gets responders what they need, from food to cleared calendars. |
Communication | Posts regular updates about the incident externally and/or internally, even if there’s nothing new to report. |
An incident should always have an Incident Commander assigned. It’s common for the person who declared the incident to be instantly enrolled as the incident commander. It’s also a role that should hand off the other three roles as the incident unfolds. For example, if I declare an incident, I’m also putting on the Incident Commander hat at the same time. For a critical incident (SEV1 or SEV2), I’ll be tasked with assigning the other roles (such as mitigator) as fast as possible.
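To keep those assignments explicit rather than implied, it can help to model the roles directly in whatever tooling opens your incidents. A rough sketch under that assumption; the role names mirror the table above, everything else is hypothetical:

```python
from dataclasses import dataclass, field
from enum import Enum

class Role(Enum):
    COMMANDER = "commander"
    MITIGATOR = "mitigator"
    PLANNING = "planning"
    COMMUNICATION = "communication"

@dataclass
class Incident:
    severity: str
    declared_by: str
    assignments: dict = field(default_factory=dict)  # Role -> person

    def assign(self, role: Role, person: str) -> None:
        self.assignments[role] = person

def declare(severity: str, declared_by: str) -> Incident:
    """Whoever declares the incident starts as commander and hands the
    other roles off to responders as the incident unfolds."""
    incident = Incident(severity=severity, declared_by=declared_by)
    incident.assign(Role.COMMANDER, declared_by)
    return incident

incident = declare("SEV1", "robert")
incident.assign(Role.MITIGATOR, "first responder to join")
```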
Document the declaration process
An incident declaration has wide-reaching implications for an organization. The same way you call the fire department any time you smell smoke, we must have a well-known first reaction to incidents in our systems.
Who declares incidents?
In an organization with a modern incident management practice and culture of reliability, every employee should have the ability to declare an incident. In a culture where incident management is seen as a valued part of the software development lifecycle, democratizing incident declaration results in faster response times, improved communication, and greater accountability. Additionally, a fast-moving internal culture of incident declaration lessens the likelihood that surfacing outages or service degradation falls on your customers.
How declarations happen
Incidents are unpredictable and different each time, so declaring a new incident takes many different forms in the wild. In many organizations, a mix of all of these approaches is in use (a sketch of the automated path follows the list).
- Manual everything: Each step from a customer notifying your team about a new incident to creating a Slack channel for triaging is manual.
- Human-in-the-loop: Automatic alerting catches an incident and notifies a team about the new disruption, allowing the team to manually declare an incident and run a process.
- Automated everything: Automated alerting notifies the team about an incident while allowing them to declare an incident from the alert and run an automated declaration process.
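If you’re aiming for the human-in-the-loop or fully automated end of that spectrum, the glue is usually a small handler that turns an alert into a declared incident. The sketch below is purely illustrative: the alert payload shape and the helper functions are stand-ins for whatever chat and ticketing integrations you actually use, not any particular vendor’s API:

```python
# Stand-ins for real chat and ticketing integrations -- in practice these
# would call your chat and ticketing tools; here they just print.
def create_channel(name: str) -> str:
    print(f"[chat] created channel #{name}")
    return f"#{name}"

def create_ticket(summary: str) -> str:
    print(f"[tickets] opened ticket: {summary}")
    return "TICKET-123"

def notify(channel: str, message: str) -> None:
    print(f"[chat] {channel}: {message}")

def handle_alert(alert: dict) -> None:
    """Turn an alerting webhook (hypothetical payload) into a declared incident."""
    service = alert["service"]
    severity = alert.get("suggested_severity", "SEV3")

    channel = create_channel(f"incident-{service}-{severity.lower()}")
    ticket = create_ticket(summary=f"{severity}: {service} degraded")
    notify("#incidents",
           f"Incident declared for {service} ({severity}). "
           f"Join {channel}, track in {ticket}.")

handle_alert({"service": "auth-api", "suggested_severity": "SEV2"})
```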
Which incident process do you use?
The number of services you have in production is often a good starting point for determining the scope of your incident management process. Smaller organizations often require fewer processes for incident management. As organizations get larger or their software becomes more complex, we find it useful to look at the number of services running in production to gauge the complexity of the system. You can use this chart as a starting point for understanding where you may sit in the incident process journey:
Services in production (including databases, load balancers, etc) | Complexity | Process Summary |
---|---|---|
1-5 | Low | Singular process for all incidents |
6-30 | Medium | A process for “critical” and “non-critical” incidents |
30-100 | High | Every severity has a defined process |
100+ | Critical | Every severity and product area has a defined process |
Depending on the scale and focus of your company, you may need to split your incident management process up by severity, impacted functionality, or another method. We recommend erring on the side of minimizing process complexity. Creating several incident management processes when you’re running fewer than 5 services in production may confuse people more than it helps, because at that size a single affected service is likely to have a critical impact on your customers.
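If it helps remove debate about which process applies, you can even encode the chart as a rule of thumb. The cutoffs below simply mirror the table above; they aren’t a hard rule:

```python
def process_tier(services_in_production: int) -> str:
    """Map a rough service count to the process tiers in the chart above."""
    if services_in_production <= 5:
        return "a single process for all incidents"
    if services_in_production <= 30:
        return "separate critical and non-critical processes"
    if services_in_production <= 100:
        return "a defined process per severity"
    return "a defined process per severity and product area"
```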
Operating with a single process
If you have a system with only a few services, you can likely use a single process for all incidents. The most important action during an incident is to gather the right stakeholders together in a chat channel or conference bridge, diagnose, and mitigate. It’s common that one person will have multiple roles assigned to them, such as commander and mitigator.
Sample incident management runbook for a single-process organization:
Step | Purpose |
---|---|
Page the on-call engineer | If someone hasn’t already been paged, bring in an engineer to help mitigate the issue |
Open an incident channel | Triage and mitigation collaboration |
Create an incident ticket | Tracking purposes (commonly necessary for SOC and other compliance) |
Notify an internal audience | A communicative incident management process builds trust with the entire company |
Assign roles to incident responders | Clear alignment of responsibilities prevents crossed wires |
A simpler process is a good starting point; you can graduate to a more comprehensive process as your system complexity grows.
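Even a single process benefits from being written down as data rather than living as tribal knowledge. As a rough sketch, the runbook above could be captured as a checklist that gets posted into the incident channel on declaration (the rendering function here is hypothetical):

```python
# The single-process runbook from the table above, captured as data.
SINGLE_PROCESS_RUNBOOK = [
    ("Page the on-call engineer", "bring in someone to mitigate the issue"),
    ("Open an incident channel", "triage and mitigation collaboration"),
    ("Create an incident ticket", "tracking and compliance"),
    ("Notify an internal audience", "build trust with the entire company"),
    ("Assign roles to incident responders", "prevent crossed wires"),
]

def as_checklist(runbook):
    """Render the runbook as a checklist message for the incident channel."""
    return "\n".join(f"[ ] {step} -- {purpose}" for step, purpose in runbook)

print(as_checklist(SINGLE_PROCESS_RUNBOOK))
```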
Dividing by critical and non-critical
As you reach a medium amount of complexity, say 6-30 services in production, you’ll likely need to split processes into “critical” and “non-critical” incidents. Using the severity guide above, it’s common to deem SEV1 and SEV2 as critical incidents, and SEV3, SEV4, and SEV5 as non-critical incidents.
Critical incident declaration process
Step | Purpose |
---|---|
Page the on-call engineer | If someone hasn’t already been paged, bring in an engineer to help mitigate the issue |
Open an incident channel | Triage and mitigation collaboration |
Update the public status page | Communicating with customers early and often builds trust |
Create an incident ticket | Tracking purposes (commonly necessary for SOC and other compliance) |
Notify an internal audience | A communicative incident management process builds trust with the entire company |
Assign roles to incident responders | Clear alignment of responsibilities prevents crossed wires |
Assign a list of known tasks to responders | Task lists keep people focused and reduce cognitive load, which prevents mistakes |
Communicate regularly | Provide updates every 30 minutes about critical incidents |
Non-critical incident declaration process
The process for non-critical incidents is the same as the single-process runbook, which allows you to naturally graduate processes as your system becomes more complex (see the sketch after the table below).
Step | Purpose |
---|---|
Page the on-call engineer | If someone hasn’t already been paged, bring in an engineer to help mitigate the issue |
Open an incident channel | Triage and mitigation collaboration |
Create an incident ticket | Tracking purposes (commonly necessary for SOC and other compliance) |
Notify an internal audience | A communicative incident management process builds trust with the entire company |
Assign roles to incident responders | Clear alignment of responsibilities prevents crossed wires |
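At this tier the decision boils down to a single branch: SEV1 and SEV2 extend the base (non-critical) runbook with the extra critical steps, and everything else runs the base runbook as-is. A minimal sketch under that assumption:

```python
CRITICAL_EXTRAS = [
    "Update the public status page",
    "Assign a list of known tasks to responders",
    "Communicate updates every 30 minutes",
]

def runbook_for(severity, base_steps):
    """SEV1/SEV2 get the critical runbook; everything else stays on the base one."""
    if severity in ("SEV1", "SEV2"):
        return base_steps + CRITICAL_EXTRAS
    return base_steps
```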
Process by severity
A process per severity makes sense when you are operating at a substantial level of complexity (30+ services). The severity-specific processes will undoubtedly overlap, though. Refer to the lists below to get an idea of each severity and its declaration process; a sketch of how you might encode them follows the lists.
SEV1
- Open an incident channel
- Create an incident ticket (either in Jira, Shortcut, etc)
- Page the on-call engineer
- Notify the #incidents channel
- Post a public notification to customers on a public status page or via email list
- Assign a team to the incident
- Assign a list of known tasks to responders
- Remind the incident specific channel to post updates every 15 minutes
- Notify executive stakeholders
SEV2
- Open an incident channel
- Create an incident ticket (either in Jira, Shortcut, etc)
- Page the on-call engineer
- Notify the #incidents channel
- Post a public notification to customers on a public status page or via email list
- Assign a team to the incident
- Assign a list of known tasks to responders
- Remind the incident specific channel to post updates every 15 minutes
SEV3
- Open an incident channel
- Create an incident ticket (either in Jira, Shortcut, etc)
- Page the on-call engineer
- Notify the #incidents channel
- Assign a team to the incident
SEV4 and SEV5
- Open an incident channel
- Create an incident ticket (either in Jira, Shortcut, etc)
- Assign an individual to the incident
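Once every severity has its own process, it can help to compose the runbooks from shared building blocks so the steps stay consistent as severity increases. A sketch of that idea, with step names taken from the lists above and everything else assumed:

```python
BASE = ["Open an incident channel", "Create an incident ticket"]
TRIAGE = [
    "Page the on-call engineer",
    "Notify the #incidents channel",
    "Assign a team to the incident",
]
CRITICAL = [
    "Post a public notification to customers",
    "Assign a list of known tasks to responders",
    "Remind the incident channel to post updates every 15 minutes",
]

SEVERITY_RUNBOOKS = {
    "SEV1": BASE + TRIAGE + CRITICAL + ["Notify executive stakeholders"],
    "SEV2": BASE + TRIAGE + CRITICAL,
    "SEV3": BASE + TRIAGE,
    "SEV4": BASE + ["Assign an individual to the incident"],
    "SEV5": BASE + ["Assign an individual to the incident"],
}

def steps_for(severity):
    """Look up the declaration steps for a given severity."""
    return SEVERITY_RUNBOOKS.get(severity, SEVERITY_RUNBOOKS["SEV3"])
```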
Running incident management drills: planned vs unplanned
All of the effort to set up service ownership, incident roles, and your declaration process is wasted if you don’t actively practice that process. Teams that practice and optimize their incident management process during low-pressure moments are likely to run smoother and more efficient incidents when the stakes are high. You and your team should make time on your calendars quarterly, if not monthly, to practice managing a fake incident.
One important note about running drills: it is not necessary to actually break a system in order to see how your teams react. It is perfectly fine, if not better, to simply practice as if there were an actual incident. Here are a few reasons why:
- The stress of inducing a real incident, either by Chaos Engineering, or simply turning off nginx, will prevent the people you are training from absorbing the muscle memory needed.
- An induced incident on staging will inevitably cause a disruption in your engineering org, and a team that is unaware of the drill may unintentionally kick off its own, independent incident management process.
How to run an incident management drill
There are two ways we recommend you perform incident management drills: planned and unplanned. During a planned drill, the entire engineering organization knows there is a drill at a specific time and what it will include. During an unplanned drill, only a few members of the team know what’s going to happen. Those team members instigate the incident, typically on staging. For unplanned drills, we recommend communicating that _something_ will happen during a given time window, and no more.
Unplanned incident management drills
Unplanned incident drills are difficult to execute and often cause a great deal of stress. Teams should use caution when executing unplanned drills as they can damage the culture of trust and confidence necessary for effective incident management.
If you do need to run an unplanned incident drill, we recommend spending time proactively communicating to team members why unplanned drills are part of your process and what outcomes you hope to achieve. It's extremely important that team members feel as though they are allowed to safely fail during an unplanned incident drill in order to learn and grow in their role and be better prepared for the real thing.
Note: Unplanned incident drills are less detailed in their expected outcomes because they are less structured.
Planned incident management drill
Before you start your planned incident, align your team on the critical details. Here’s a chart you can use to guide the alignment:
Question | Plan |
---|---|
What are we going to break, and why? | Ex: User login on staging |
Who is responsible for responding to our planned incident? | Ex: Core Services team |
What day and time are we planning on breaking the system? | Ex: March 29th, 9am EDT |
How are we going to break the system? | Ex: Dropping the oauth2 database that heimdall uses for token generation and storage |
Why are we running this drill? | Ex: To train Core Services on how to declare an incident and mitigate effectively together while taking scrupulous notes. |
Who is responsible for causing the planned incident? | Ex: Bobby Tables |
Once you’ve filled in the details of the planned incident drill, distribute the document to all of engineering and any relevant stakeholders who may be impacted by the blast radius at least a week beforehand.
Private stakeholder plan
Below is the plan that is shared among only a few people, such as the person causing the planned incident, the manager of the responding team, etc.
Question | Plan |
---|---|
What are we going to break, and why? | Ex: User login on staging |
What team are we expecting to respond to the incident? | Ex: Core Services team |
Who are we expecting to be assigned the commander role? | Ex: Nick Narh |
Who are we expecting to be assigned the mitigator role? | Ex: Shray Kumar |
What day and time are we planning on breaking the system? | Ex: March 29th, 10:30am EDT |
How are we going to break the system? | Ex: Dropping the oauth2 database that heimdall uses for token generation and storage |
Why are we running this drill? | Ex: To exercise the response time and management of Core Services |
Who is responsible for causing the planned incident? | Ex: Bobby Tables |
Public stakeholder plan
Below is a plan that is shared with the relevant teams to notify them that an incident will occur on a certain date between 9am and 3pm.
Question | Plan |
---|---|
What time range are we planning on breaking the system? | March 29th, 9am-3pm EDT |
Why are we running this drill? | Ex: To exercise the response time and management of the incident |
What are we going to break, and why? | Top Secret |
What team are we expecting to respond to the incident? | Top Secret |
Who are we expecting to be assigned the commander role? | Top Secret |
Who are we expecting to be assigned the mitigator role? | Top Secret |
How are we going to break the system? | Top Secret |
Who is responsible for causing the planned incident? | Top Secret |
Incorporate incident retrospectives: learning
After you’ve concluded a drill, run an incident retrospective as if you experienced a real incident! You’ll likely learn a lot about how people experienced the incident by holding an hour-long meeting with all of the stakeholders. In addition, holding a meeting afterward is a great way to shift your organization towards quick retrospectives, which we recommend all companies do. Here are a few questions you can use with your team after a drill:
- What went well?
- What did we not expect?
- What didn’t go so well?
- What changes are we going to make to do better next time?
Next steps
Incident management and service ownership are essential practices that every organization can master with the right combination of planning and communication. By instituting service ownership and strong processes that are rehearsed with regular drills, your company and customers will have more trust in everything you do.