Best practices for building an incident management plan and process

A thoughtful incident management plan can help you avoid future security incidents and cut down your incident response time drastically.

Robert Ross

When incidents inevitably occur in your software stack, managing them well could be the difference between losing customers and building trust with them. In this article, we’ll give you and your team some best practices for preparing to manage incidents. It’s crucial to define service ownership and a declaration process, and to practice all of it. With a little planning now, you’ll be able to cut your incident response time drastically.

Building an incident management plan

Most teams have a plan for how to solve known technical problems in their systems (like scaling their Kubernetes clusters), but many haven’t taken the time to plan for how to respond to unknown problems. While it’s difficult to predict when and what incidents will occur, we can put in place the steps to take when an incident is declared. We plan so we can avoid ad-hoc incident management, which often prolongs incidents. You’ll have a hard time shaking off a poorly managed incident, and you’ll remember it all too well.

We’re going to talk about a few necessities that enable exceptional incident management.

  1. Service ownership
  2. Incident roles
  3. The incident declaration process
  4. Running incident drills

Let’s get started.

Assign service ownership

You might think of your system as individual services and software, but your customers think about it as functionality and features they need to get something done. Service ownership is the next generation of managing the complex systems that power software. Implementing service ownership requires adhering to these traits:

  1. The team owns the lifecycle of the service: development, CI/CD, operations, and incident management. “You build it, you run it” is often the mantra of service ownership.
  2. Each service has a team assigned as its owner. Third-party software (such as a managed database) also has ownership assigned.
  3. Each service-owning team is on-call and responsible for any incidents their software encounters.

A critical step to successful incident management is achieving a basic level of service ownership. For example, you may have a team called “core services” that provides functionality such as authentication and authorization. If your users begin to experience problems with logging in to your website, the core services team should immediately know without a doubt that they own that incident. All product-owning teams should be trained and enabled on managing incidents.

An up-to-date service catalog becomes essential when assigning team ownership, as you can clearly outline large product surfaces, which services power that surface area, and which team owns each service in that graph. For example, at FireHydrant we have a team that owns each of our Incident Management, Service Catalog, and Integrations product areas, and they’re all assigned accordingly in our own tool.
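To make the idea of a service catalog concrete, here is a minimal sketch in Python. The service names, team names, and product areas are hypothetical, and a real catalog would live in a dedicated tool rather than in code; the point is only that every service maps to exactly one owning team, so ownership of a product surface can be looked up rather than debated mid-incident.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    owner_team: str    # team that is on-call for this service
    product_area: str  # customer-facing surface the service powers


# Hypothetical catalog entries, for illustration only.
CATALOG = [
    Service("auth-api", "core-services", "Login"),
    Service("session-store", "core-services", "Login"),
    Service("billing-worker", "payments", "Billing"),
]


def owners_for_area(catalog, product_area):
    """Return the sorted list of teams owning services behind a product area."""
    return sorted({s.owner_team for s in catalog if s.product_area == product_area})


def unowned(catalog):
    """Return the names of services that have no owning team assigned."""
    return [s.name for s in catalog if not s.owner_team]
```

With a structure like this, a report of login problems resolves to the owning team through `owners_for_area(CATALOG, "Login")`, and `unowned(CATALOG)` flags gaps in ownership before an incident exposes them.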

Clarify incident roles

Great incident management necessitates responders having well defined roles. An incident responder should clearly know what role they’ve been assigned and what their responsibilities are. Commonly, incident roles are adopted from FEMA’s incident command system (ICS).

We recommend the chart below for a guide on what roles you should establish in your incident management process, and what the general responsibilities of those roles are.

Incident role | Responsibilities
Commander | Designated leader who should have a bias for action; the foreperson of the incident. Makes note of important events as they occur, such as rollbacks and resource increases. Assigns incident roles to other teammates.
Mitigator | First and foremost responsible for mitigating the incident. Is very communicative about what they are doing, and why.
Planning | Observes the incident and creates follow-up action items. Gets responders what they need, from food to cleared calendars.
Communication | Posts regular updates externally and/or internally about the incident, even if there’s nothing new to report.

An incident should always have an Incident Commander role assigned. It’s common to have the person that declared the incident instantly enrolled as the incident commander. It’s also a role that should decompose into the other three roles as an incident unfolds. For example, if I declare an incident, I’m also putting on the Incident Commander hat at the same time. For a critical incident (SEV1 or SEV2), I will be tasked with assigning other roles (such as mitigators) as fast as possible.
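The rule that the declarer starts as Incident Commander can be sketched as a small data model. This is an illustrative Python sketch, not how any particular tool implements it; the `Incident` class, the `declare` function, and the person names are assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    COMMANDER = "commander"
    MITIGATOR = "mitigator"
    PLANNING = "planning"
    COMMUNICATION = "communication"


@dataclass
class Incident:
    title: str
    severity: str
    roles: dict = field(default_factory=dict)  # maps Role -> person

    def assign(self, role, person):
        self.roles[role] = person

    def unfilled_roles(self):
        """Roles nobody holds yet, in the order they are defined above."""
        return [role for role in Role if role not in self.roles]


def declare(title, severity, declared_by):
    """Declare an incident; the declarer immediately becomes the commander."""
    incident = Incident(title, severity)
    incident.assign(Role.COMMANDER, declared_by)
    return incident
```

After `declare("Login failures", "SEV1", "robert")`, the commander can work through `unfilled_roles()` and hand off the mitigator, planning, and communication roles as responders join.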

Document the declaration process

An incident declaration has wide-reaching implications for an organization. Just as you call the fire department any time you smell smoke, we must also have a well-known first reaction to incidents in our systems.

Who declares incidents?

In an organization with a modern incident management practice and culture of reliability, every employee should have the ability to declare an incident. In a culture where incident management is seen as a valued part of the software development lifecycle, democratizing incident declaration results in faster response times, improved communication, and greater accountability. Additionally, a fast-moving internal culture of incident declaration lessens the likelihood that surfacing outages or service degradation falls on your customers.

How declarations happen

Incidents are unpredictable and different each time, so declaring a new incident takes many different forms in the wild. Many organizations use a mix of the following approaches:

  • Manual everything: Each step from a customer notifying your team about a new incident to creating a Slack channel for triaging is manual.
  • Human-in-the-loop: Automatic alerting catches an incident and notifies a team about the new disruption, allowing the team to manually declare an incident and run a process.
  • Automated everything: Automated alerting notifies the team about an incident while allowing them to declare an incident from the alert and run an automated declaration process.

Which incident process do you use?

The number of services you have in production is often a good starting point for determining the scope of your incident management process. Smaller organizations often require fewer processes for incident management. As organizations get larger, or if they’re dealing with more complex software, we find it useful to look at the number of services run in production to gauge the complexity of the system. You can use this chart as a starting point for understanding where you may sit in the incident process journey:

Services in production (including databases, load balancers, etc.) | Complexity | Process summary
1-5 | Low | Singular process for all incidents
6-30 | Medium | A process for “critical” and “non-critical” incidents
31-100 | High | Every severity has a defined process
100+ | Critical | Every severity and product area has a defined process

Depending on the scale and focus of your company, you may need to split your incident management process up by severity, impacted functionality, or another method. We recommend erring on the side of minimizing process complexity. Creating several processes for incident management when you’re running five or fewer services in production may confuse people more than it helps, because at that size it’s likely that a single affected service will have a critical impact on your customers.
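The chart above can be read as a simple lookup from service count to process scope. The Python sketch below is one way to express it; the function name is made up for illustration, and treating exactly 30 services as Medium and exactly 100 as High is an assumption about where the boundaries fall.

```python
# (upper bound on service count, complexity, process summary)
TIERS = [
    (5, "Low", "Singular process for all incidents"),
    (30, "Medium", "A process for critical and non-critical incidents"),
    (100, "High", "Every severity has a defined process"),
]


def process_tier(services_in_production):
    """Map a count of production services to a (complexity, process) pair."""
    if services_in_production < 1:
        raise ValueError("expected at least one service in production")
    for upper_bound, complexity, summary in TIERS:
        if services_in_production <= upper_bound:
            return complexity, summary
    # Anything above the last bound falls into the top tier.
    return "Critical", "Every severity and product area has a defined process"
```

Remember to count databases, load balancers, and similar infrastructure as services when calling `process_tier`, and treat the result as a starting point rather than a rule.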

Operating with a single process

If you have a system with only a few services, you can likely use a single process for all incidents. The most important action during an incident is to gather the right stakeholders together in a chat channel or conference bridge, diagnose, and mitigate. It’s common that one person will have multiple roles assigned to them, such as commander and mitigator.

Sample incident management runbook for a single-process organization:

Step | Purpose
Page the on-call engineer | If someone hasn’t already been paged, bring in an engineer to help mitigate the issue
Open an incident channel | Triage and mitigation collaboration
Create an incident ticket | Tracking purposes (commonly necessary for SOC and other compliance)
Notify an internal audience | A communicative incident management process builds trust with the entire company
Assign roles to incident responders | Clear alignment of responsibilities prevents crossed wires

A simpler process is a good starting point that you can expand into a more comprehensive process as your system complexity grows.

Dividing by critical and non-critical

As you reach a medium level of complexity, say 6-30 services in production, you’ll likely need to split processes into “critical” and “non-critical” incidents. On a five-level severity scale, it’s common to deem SEV1 and SEV2 critical incidents, and SEV3, SEV4, and SEV5 non-critical incidents.

Critical incident declaration process

Step | Purpose
Page the on-call engineer | If someone hasn’t already been paged, bring in an engineer to help mitigate the issue
Open an incident channel | Triage and mitigation collaboration
Update the public status page | Communicating with customers early and often builds trust
Create an incident ticket | Tracking purposes (commonly necessary for SOC and other compliance)
Notify an internal audience | A communicative incident management process builds trust with the entire company
Assign roles to incident responders | Clear alignment of responsibilities prevents crossed wires
Assign a list of known tasks to responders | Task lists help keep people focused and reduce cognitive load, reducing mistakes
Communicate regularly | Provide updates every 30 minutes about critical incidents

Non-critical incident declaration process

The process for non-critical incidents is the same as the single-process runbook. This allows you to naturally graduate processes as your system becomes more complex.

Step | Purpose
Page the on-call engineer | If someone hasn’t already been paged, bring in an engineer to help mitigate the issue
Open an incident channel | Triage and mitigation collaboration
Create an incident ticket | Tracking purposes (commonly necessary for SOC and other compliance)
Notify an internal audience | A communicative incident management process builds trust with the entire company
Assign roles to incident responders | Clear alignment of responsibilities prevents crossed wires
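Because the critical runbook is the non-critical one plus a few extra steps, the two can be expressed as one base list and an extension. The Python sketch below assumes a five-level severity scale with SEV1 and SEV2 as critical; the function and constant names are hypothetical.

```python
NON_CRITICAL_RUNBOOK = [
    "Page the on-call engineer",
    "Open an incident channel",
    "Create an incident ticket",
    "Notify an internal audience",
    "Assign roles to incident responders",
]

# The critical runbook adds a status page update after the channel is opened,
# then a task list and a regular update cadence at the end.
CRITICAL_RUNBOOK = (
    NON_CRITICAL_RUNBOOK[:2]
    + ["Update the public status page"]
    + NON_CRITICAL_RUNBOOK[2:]
    + [
        "Assign a list of known tasks to responders",
        "Communicate regularly (updates every 30 minutes)",
    ]
)

KNOWN_SEVERITIES = {"SEV1", "SEV2", "SEV3", "SEV4", "SEV5"}
CRITICAL_SEVERITIES = {"SEV1", "SEV2"}


def runbook_for(severity):
    """Return the ordered runbook steps for a severity label such as 'SEV2'."""
    label = severity.upper()
    if label not in KNOWN_SEVERITIES:
        raise ValueError(f"unknown severity: {severity!r}")
    if label in CRITICAL_SEVERITIES:
        return CRITICAL_RUNBOOK
    return NON_CRITICAL_RUNBOOK
```

Deriving one list from the other keeps the shared steps in a single place, so a change to the base process carries over to critical incidents automatically.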

Process by severity

A process by severity applies when you are operating at a substantial level of complexity (more than 30 services) and every severity has its own defined process. The processes will undoubtedly overlap, though. Refer to the lists below to get an idea of each severity and its declaration process.

SEV1

  1. Open an incident channel
  2. Create an incident ticket (in Jira, Shortcut, etc.)
  3. Page the on-call engineer
  4. Notify the #incidents channel
  5. Post a public notification to customers on a public status page or via email list
  6. Assign a team to the incident
  7. Assign a list of known tasks to responders
  8. Remind the incident-specific channel to post updates every 15 minutes
  9. Notify executive stakeholders

SEV2

  1. Open an incident channel
  2. Create an incident ticket (in Jira, Shortcut, etc.)
  3. Page the on-call engineer
  4. Notify the #incidents channel
  5. Post a public notification to customers on a public status page or via email list
  6. Assign a team to the incident
  7. Assign a list of known tasks to responders
  8. Remind the incident-specific channel to post updates every 15 minutes

SEV3

  1. Open an incident channel
  2. Create an incident ticket (in Jira, Shortcut, etc.)
  3. Page the on-call engineer
  4. Notify the #incidents channel
  5. Assign a team to the incident

SEV4 and SEV5

  1. Open an incident channel
  2. Create an incident ticket (in Jira, Shortcut, etc.)
  3. Assign an individual to the incident
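The per-severity lists above overlap heavily, which makes them a natural fit for a lookup table built from shared pieces. This Python sketch is one possible encoding; the variable and function names are illustrative, and the step text is shortened from the lists above.

```python
COMMON = ["Open an incident channel", "Create an incident ticket"]

SEV3 = COMMON + [
    "Page the on-call engineer",
    "Notify the #incidents channel",
    "Assign a team to the incident",
]

# SEV2 adds customer notification before team assignment, then tasks and reminders.
SEV2 = (
    SEV3[:4]
    + ["Post a public notification to customers"]
    + SEV3[4:]
    + [
        "Assign a list of known tasks to responders",
        "Remind the incident channel to post updates every 15 minutes",
    ]
)

# SEV1 is SEV2 plus an executive notification.
SEV1 = SEV2 + ["Notify executive stakeholders"]

SEV4_AND_SEV5 = COMMON + ["Assign an individual to the incident"]

DECLARATION_STEPS = {
    "SEV1": SEV1,
    "SEV2": SEV2,
    "SEV3": SEV3,
    "SEV4": SEV4_AND_SEV5,
    "SEV5": SEV4_AND_SEV5,
}


def declaration_steps(severity):
    """Return the ordered declaration steps for a severity label."""
    try:
        return DECLARATION_STEPS[severity.upper()]
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None
```

A table like this can drive automation (each step becomes a task or a bot action) or simply serve as the checklist a commander reads from.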

Running incident management drills: planned vs unplanned

All of the effort to set up service ownership, incident roles, and your declaration process is moot if you don’t actively practice your process. Teams that practice and optimize the incident management process during low-pressure moments are likely to run smoother and more efficient incidents when the stakes are high. You and your team should make time on your calendars quarterly, if not monthly, to practice managing a fake incident.

One important note about running drills is that it is not necessary to actually break a system in order to see how your teams react. It is perfectly fine, if not better, to practice as if there were an actual incident. Here are a few reasons why:

  1. The stress of inducing a real incident, either by Chaos Engineering, or simply turning off nginx, will prevent the people you are training from absorbing the muscle memory needed.
  2. An induced incident on staging will inevitably cause a disruption in your engineering org, and you may unintentionally start another independent incident management process by another team that is unaware of the drill.

How to run an incident management drill

There are two ways we recommend you perform incident management drills: planned and unplanned. During a planned drill, the entire engineering organization knows there is a drill at a specific time and what it will include. During an unplanned drill, only a few members of the team know what’s going to happen. Those team members instigate the incident, typically on staging. For unplanned drills, we recommend communicating that something will happen during a given time window and no more.

Unplanned incident management drills

Unplanned incident drills are difficult to execute and often cause a great deal of stress. Teams should use caution when executing unplanned drills as they can damage the culture of trust and confidence necessary for effective incident management.

If you do need to run an unplanned incident drill, we recommend spending time proactively communicating to team members why unplanned drills are part of your process and what outcomes you hope to achieve. It's extremely important that team members feel as though they are allowed to safely fail during an unplanned incident drill in order to learn and grow in their role and be better prepared for the real thing.

Note: Unplanned incident drills are less detailed in their expected outcome as they are less structured.

Planned incident management drill

Before you start your planned incident, align your team on the critical details. Here’s a chart you can use to guide the alignment:

Question | Plan
What are we going to break, and why? | Ex: User login on staging
Who is responsible for responding to our planned incident? | Ex: Core Services team
What day and time are we planning on breaking the system? | Ex: March 29th, 9am EDT
How are we going to break the system? | Ex: Dropping the oauth2 database that heimdall uses for token generation and storage
Why are we running this drill? | Ex: To train Core Services on how to declare an incident and mitigate effectively together while taking scrupulous notes.
Who is responsible for causing the planned incident? | Ex: Bobby Tables

Once you’ve filled in the details of the planned incident drill, distribute the document at least a week beforehand to all of engineering and any other stakeholders who may be affected by the blast radius.

Private stakeholder plan

Below is the plan that is shared only between a few people such as the person causing the planned incident, the manager of the responding team, etc.

Question | Plan
What are we going to break, and why? | Ex: User login on staging
What team are we expecting to respond to the incident? | Ex: Core Services team
Who are we expecting to be assigned the commander role? | Ex: Nick Narh
Who are we expecting to be assigned the mitigator role? | Ex: Shray Kumar
What day and time are we planning on breaking the system? | Ex: March 29th, 10:30am EDT
How are we going to break the system? | Ex: Dropping the oauth2 database that heimdall uses for token generation and storage
Why are we running this drill? | Ex: To exercise the response time and management of Core Services
Who is responsible for causing the planned incident? | Ex: Bobby Tables

Public stakeholder plan

Below is a plan that is shared with the relevant teams to notify them that an incident will occur on a certain date between 9am and 3pm.

Question | Plan
What time range are we planning on breaking the system? | March 29th, 9am-3pm EDT
Why are we running this drill? | Ex: To exercise the response time and management of the incident
What are we going to break, and why? | Top Secret
What team are we expecting to respond to the incident? | Top Secret
Who are we expecting to be assigned the commander role? | Top Secret
Who are we expecting to be assigned the mitigator role? | Top Secret
How are we going to break the system? | Top Secret
Who is responsible for causing the planned incident? | Top Secret
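The public plan is essentially the private plan with most answers redacted and the exact time widened to a range. A minimal Python sketch of that derivation follows; the class, field, and function names are assumptions made for illustration. The public purpose is passed in separately because the private wording can name the responding team.

```python
from dataclasses import dataclass, asdict


@dataclass
class PrivateDrillPlan:
    target: str
    responding_team: str
    expected_commander: str
    expected_mitigator: str
    exact_time: str
    method: str
    purpose: str
    instigator: str


def public_view(plan, time_window, public_purpose):
    """Build the plan shared with everyone: a time window, a purpose, no details."""
    public = {name: "Top Secret" for name in asdict(plan)}
    del public["exact_time"]            # the exact time is never shared
    public["time_window"] = time_window
    public["purpose"] = public_purpose  # reworded so it does not name the team
    return public


plan = PrivateDrillPlan(
    target="User login on staging",
    responding_team="Core Services team",
    expected_commander="Nick Narh",
    expected_mitigator="Shray Kumar",
    exact_time="March 29th, 10:30am EDT",
    method="Dropping the oauth2 database that heimdall uses",
    purpose="To exercise the response time and management of Core Services",
    instigator="Bobby Tables",
)

shared = public_view(
    plan,
    time_window="March 29th, 9am-3pm EDT",
    public_purpose="To exercise the response time and management of the incident",
)
```

Generating the public version from the private one starts every field as redacted, so a newly added private field stays hidden unless someone deliberately exposes it.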

Incorporate incident retrospectives: learning

After you’ve concluded a drill, run an incident retrospective as if you experienced a real incident! You’ll likely learn a lot about how people experienced the incident by holding an hour-long meeting with all of the stakeholders. In addition, holding a meeting afterward is a great way to shift your organization towards quick retrospectives, which we recommend all companies do. Here are a few questions you can use with your team after a drill:

  1. What went well?
  2. What did we not expect?
  3. What didn’t go so well?
  4. What changes are we going to make to do better next time?

Next steps

Incident management and service ownership are essential practices that every organization can master with the right combination of planning and communication. By instituting service ownership and strong processes that are rehearsed with regular drills, you’ll give your company and your customers more trust in everything you do.
