How to define roles for your incident response team
Assigned roles help teams spring into action after an incident is declared. This post explores what roles you should consider for your response plan, based on the size of your organization and your team’s incident response maturity level.
By Carissa Zukowski on 3/21/2023
Agility matters in incident response, and the easiest way to spring into action is by having a well-defined team in place ahead of time. The right people in the right roles will help you respond to and resolve incidents more quickly and efficiently. In fact, we found in the Incident Benchmark Report that incidents with roles assigned had a 42% lower mean time to resolution than those that didn’t.
But what roles do you need to fill? The answer depends on the size of your organization and your team’s incident response maturity level. Your team should be comprehensive but also the right size for your capabilities. Here are three approaches to defining incident response roles based on maturity level.
Foundational response team
If you’re just getting started with a formal incident management program or you don’t have many dedicated resources to put toward incident response, start with a foundational team that includes the most critical roles for executing an incident response. When it comes to resolving incidents swiftly, less is sometimes more. These roles include the incident manager, incident responders, and stakeholders. In a smaller team, these roles are usually filled by the engineers or engineering managers on-call and may vary from incident to incident.
Sometimes called the incident lead, the incident manager owns the entire incident response process. The incident manager is often a team lead or someone with a decent amount of institutional knowledge. At many companies, the person who declared the incident adopts this role by default, but the incident manager could be anyone on the team with sufficient knowledge to manage the task.
Incident manager primary duties may include:
Assemble the responding team, assign tasks, and track progress
Provide direction to mitigate the incident at hand
Communicate updates at large and work with others to make sure that the right stakeholders are brought into the loop
Determine how to handle anything else that comes up during the incident
Often, as organizations are just getting started, the incident manager is both a player and a coach as the primary incident responder. They are responsible for all tasks until they assign them to other people. They don’t need to be a domain expert, but they need to understand the context around the incident and know who to turn to for more resources.
Incident responder is an umbrella term for everyone who contributes to mitigating an incident. Often, the responding team is composed of members of whatever team owns the affected service. However, some companies default to senior or experienced team members, or to a centralized group of responders who are deployed during an incident, often as part of an on-call rotation to ensure someone is always available.
Incident responder primary duties may include:
Work on mitigating and resolving the incident (as opposed to organizing the team, like the incident manager does)
Communicate what they are doing and why to the incident manager and other members of the responding team
Complete other duties as assigned by their incident manager
Response teams can be tailored to the type of incident. For instance, if the incident is purely technical or confined to the staging environment, the responders would likely include only engineers and product managers. More significant incidents could also involve customer-facing roles and other team members who have context on how the incident impacts specific customers or features (we’ll discuss some of those roles below).
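As a rough illustration of tailoring the responding team, here is a minimal Python sketch. The `Incident` fields, the role names, and the selection rule are all hypothetical, chosen only to mirror the staging-versus-customer-impacting example above; your own criteria will likely differ.

```python
from dataclasses import dataclass

# Hypothetical role groups, loosely following the example in this post.
ENGINEERING_ROLES = ["engineer", "product manager"]
CUSTOMER_FACING_ROLES = ["customer success", "customer support"]


@dataclass
class Incident:
    environment: str          # assumed values: "staging" or "production"
    customer_impacting: bool  # does the incident affect customers or features they use?


def pick_responder_roles(incident: Incident) -> list[str]:
    """Return the roles to bring into the response for this incident."""
    roles = list(ENGINEERING_ROLES)
    # Pull in customer-facing roles only when customers are actually affected.
    if incident.environment == "production" and incident.customer_impacting:
        roles += CUSTOMER_FACING_ROLES
    return roles
```

Calling `pick_responder_roles(Incident("staging", False))` returns only the engineering roles, while a customer-impacting production incident adds the customer-facing ones.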
Stakeholder is a general term for people invested in the outcome of an incident who need to be kept informed. Your internal stakeholders will vary based on your company, the severity of the incident, and how much it impacts customers. However, common internal stakeholders include leadership team members, product managers, customer-facing teams, and engineering leaders.
Internal stakeholder primary duties may include:
Receive updates about the response efforts on a timely basis
Provide resources or assistance as requested
Provide relevant information to external stakeholders, like customers, in some cases
The incident manager should facilitate communication with these stakeholders or appoint someone to own those communications. As appealing as it might be to focus solely on mitigation and skip the updates, don’t. Good communication builds trust and buys grace during an incident.
Growing response team
As your process matures, you’ll gain a better understanding of your organization's specific needs for responding to incidents. That clarity will help you tailor your response roles based on the incident at hand versus taking a one-size-fits-all approach. Here are some roles you might consider adding as your approach to incident management matures.
The incident commander is a more robust version of the incident manager with more expansive duties. An incident commander is accountable for the entire process and duration of the incident. In contrast, an incident manager is primarily focused on mitigating the incident.
Incident commanders can either volunteer from within the organization or be appointed to the role. Depending on your organization’s process, this could be a domain expert or a general project/process manager. Either way, it is typically an ongoing additional responsibility that high performers take on to increase their exposure to stakeholders, often in the hope of moving into leadership.
Incident commander primary duties may include:
Get the right people in the room together and appoint other roles as needed
Remove blockers to mitigation and help the team work toward resolving the incident
Organize communication with stakeholders
Minimize risk across the company
Support incident resolution tasks, including documentation, post-incident review, and working with others accountable for process improvements
Some companies assign the role of incident commander to all engineering managers. In contrast, others invest in a full-time incident commander.
Incident management lead
While the incident commander owns real-time incident response, the incident management lead is responsible for preparing the organization for incidents ahead of time. They may designate roles and train individuals in those roles, set incident milestones, and document all processes.
Incident management lead primary duties may include:
Work across departments to ensure all relevant people receive training on the incident management and response process, roles and responsibilities, and expectations
Make sure SLAs and any process updates are understood and communicated so the team knows how best to respond to an incident
Coordinate training and gamedays to run through practice incidents
The incident management lead might be a domain owner, like someone in engineering or product leadership. Or, if your company has a designated risk mitigation team, platform engineering team, or site reliability engineering (SRE) team, a representative from that team may act as incident management lead.
Customer success lead
The customer success lead is a liaison between customer-facing teams and the incident response team. They act as a shield to keep the response team focused and as a resource for customer-facing stakeholders who need information.
Customer success lead primary duties may include:
Advise customer-facing teams on the situation and timeline for resolution
Pass relevant details from customers and customer-facing teams back to the response team
Handle communications with high-profile or heavily impacted customers
Typically, this person is a customer success team member with a good working relationship with engineering. In some cases, it might also be a member of the engineering team.
Robust response team
An experienced and well-provisioned incident response program may have room for more specialized roles. These roles may be needed for only some incidents, but they can be particularly useful for high-profile or public ones. Beware of scope bloat, though: build your response team to include only the people most relevant to the task.
The communications lead disseminates information to internal and external stakeholders. This role becomes increasingly essential with major incidents that require public communication.
Communications lead primary duties may include:
Post regular updates (externally and internally) about the incident
Create public-facing communications about the incident in conjunction with customer success, legal, etc.
Liaise with stakeholders and facilitate flow of information in both directions
The communications lead is typically someone with extensive experience in written communications, possibly from customer support or, if the company has one, the public relations department.
In some organizations, the communications lead role may be split in two when a large incident requires complex communication: an internal communications lead and a public company spokesperson.
The investigative lead collects and analyzes information from past incidents and makes recommendations to help prevent future ones. In some companies, they’re also called the problem manager or root cause analysis manager.
Investigative lead primary duties may include:
Collect and analyze data to find the root cause of the incident
Make recommendations for future incident prevention
Your investigative lead should have a firm understanding of operations related to the incidents at hand so they can ask the right questions and make knowledgeable recommendations. Sometimes the incident commander doubles in this role.
Occasionally, incidents affect sensitive information that could lead to potential legal repercussions for the company. The legal advisor is a legal expert who can advise on responsibilities or liabilities related to an incident.
Legal advisor primary duties may include:
Ensure legal compliance during incidents and surface any violations to the legal team
De-escalate events to prevent contractual or legal breaches
Work closely with the communications lead to ensure a proper amount of information is shared at the right time with the public, especially during a large incident
Usually, this is someone on the company’s legal team who has at least a basic understanding of the technical aspects of your software.
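The three tiers described above can be summarized as a simple lookup. This is a minimal Python sketch that assumes the tiers are cumulative, which is one reading of this post rather than a rule; in practice, for example, an incident commander may replace the incident manager instead of joining them. The function name `roles_for` is hypothetical.

```python
# Roles introduced at each maturity tier, in the order this post presents them.
TIERS = {
    "foundational": ["incident manager", "incident responders", "stakeholders"],
    "growing": ["incident commander", "incident management lead", "customer success lead"],
    "robust": ["communications lead", "investigative lead", "legal advisor"],
}


def roles_for(tier: str) -> list[str]:
    """Return every role available at `tier`, including roles from earlier tiers."""
    if tier not in TIERS:
        raise ValueError(f"unknown tier: {tier!r}")
    roles: list[str] = []
    for name, added in TIERS.items():  # dicts preserve insertion order
        roles.extend(added)
        if name == tier:
            break
    return roles
```

A lookup like this is mostly useful as a checklist: it shows at a glance which roles a team at a given stage might consider, not which ones every incident needs.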
Choose a winning incident response team
Knowing what incident response team roles you need to fill is half the battle. The other half is getting the right people on board. In choosing your team, consider team members' bandwidth, availability, physical location, communication skills, and technical prowess.
If a role is critical, designate backups so you know who to go to if someone is out. Plan how handoffs will work when responders are in different time zones or someone needs to step away. Additionally, ensure you're not relying on the same person for all incidents; use these roles as opportunities to train others in your incident management process and to give them experience working with other stakeholders.
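One way to combine backups with spreading the load is to pick, from whoever is available, the person who has held the role least often. The sketch below is a hypothetical Python helper (the `assign_role` name and its inputs are assumptions, not part of any particular tool) and ignores real-world details such as time zones and shift schedules.

```python
from collections import Counter


def assign_role(candidates, unavailable, history):
    """Pick the available candidate who has held the role least often.

    candidates:  names in preference order (primary first, then backups)
    unavailable: set of names who are out or off shift
    history:     names of past assignees, one entry per incident
    """
    counts = Counter(history)
    available = [name for name in candidates if name not in unavailable]
    if not available:
        raise LookupError("no one is available for this role; escalate")
    # min() returns the first minimum, so ties fall back to preference order.
    return min(available, key=lambda name: counts[name])
```

For example, with candidates `["avery", "blake", "casey"]`, `blake` unavailable, and a history of `["avery", "avery", "casey"]`, the helper picks `casey`, who has held the role fewer times than `avery`.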