Exploring distributed vs centralized incident command models
In this blog post, we’ll talk about two incident management structure models — distributed and centralized, including the pros and cons of each, and examples of what each structure looks like in our community.
By Robert Ross on 8/8/2023
Recently in our Better Incidents Slack channel, there’s been some chatter around how people structure dedicated incident commanders at their company: distributed or centralized.
The way I see it, there are two types of commanders. The first is the temporary, distributed role: a hat that an on-call engineer or an engineering manager puts on during an incident. The second is the centralized, full-time role, where someone is the designated incident commander (or one of a few) for all incidents.
Examining both approaches' advantages and drawbacks is crucial for understanding which strategy best aligns with your organization's unique needs and challenges. I think a combination of both approaches, distributed and centralized, is what most teams need.
What is distributed incident management?
In a distributed incident management model, anyone within a company can take on the role of an incident commander. This decentralized, distributed approach resembles how volunteer firefighters operate in small towns, where every citizen is responsible for fire mitigation.
Most companies employing this method have a system for determining who is the incident commander. In some companies, whoever is on call when the incident is detected is automatically the incident commander. The important thing is determining what fits for your team.
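To make the "whoever is on call is the commander" rule concrete, here is a minimal sketch in Python. The `Shift` type, the `pick_commander` function, and the sample names and dates are all hypothetical; they illustrate the rule and are not taken from any particular tool or company.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Shift:
    """One on-call shift (hypothetical structure for illustration)."""
    engineer: str
    start: datetime  # inclusive
    end: datetime    # exclusive


def pick_commander(shifts: List[Shift], detected_at: datetime) -> Optional[str]:
    """Return whoever is on call at detection time, or None if nobody is."""
    for shift in shifts:
        if shift.start <= detected_at < shift.end:
            return shift.engineer
    return None


# Made-up schedule: two back-to-back 24-hour shifts.
shifts = [
    Shift("alice", datetime(2023, 8, 7, 9), datetime(2023, 8, 8, 9)),
    Shift("bob", datetime(2023, 8, 8, 9), datetime(2023, 8, 9, 9)),
]

# An incident detected during bob's shift makes bob the commander.
commander = pick_commander(shifts, datetime(2023, 8, 8, 14, 30))
```

Returning `None` when no shift covers the detection time is a design choice in this sketch. A real rotation would probably escalate to a fallback responder instead.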
An example of an organization that uses this kind of structure is Snyk, a dev tool security company that uses FireHydrant to manage its incidents. Snyk’s company culture encourages ownership and independence, a philosophy that extends to incident management. Amir Mehler, SRE manager, calls their philosophy NoOps: “NoOps means you don’t have an ops team. You build it; you run it. Everyone carries a pager, and it’s part of their duty.”
What are the pros and cons of distributed incident management?
A major benefit of distributed incident management is its flexibility: it allows for a broader pool of responders beyond specific departments. This approach enhances knowledge and agility in tracking incidents by leveraging diverse expertise and promoting collaboration.
In an environment where anyone can step into the role of incident commander, you have a pool of comprehensive knowledge, skills, and perspectives to draw from in an incident. This gives an organization dynamic and adaptable incident response capabilities that enable informed decision-making and collaboration.
At the same time, this type of incident management has some drawbacks. For example, in a distributed environment, managing incidents is most often side work that pulls engineers away from their core job. Additionally, this type of incident response can include a lot of swivel-chairing, which can leave nobody able to give either incidents or their core role their full attention. Automating and standardizing service catalogs, runbooks, and retrospectives can help decrease some of the mental stress a distributed incident response framework can create.
What is centralized incident management?
A centralized approach to incident management, on the other hand, looks like pods of people you trust to handle your incidents, such as SRE, DevOps, or platform teams. In this structure, dealing with incidents is a big part of the daily job for incident commanders and response teams, from remediating them to refining incident response processes.
An example of a company that uses this structure is Recharge, a financial tech company that uses FireHydrant to streamline incident management. Recharge has two teams within its infrastructure team, DevOps and InfraOps, which work closely together and share the on-call burden. During business hours, the InfraOps team handles all triaging of problems and incidents, but after hours or on weekends, they use an on-call rotation. Whoever is on call during the incident is the incident commander. “We understand the process really, really well,” the program’s head, Ryan Kish, told us. “And our goal is to shepherd anyone else in the incident through our process.”
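A split like this one, with a team handling triage during business hours and a rotation covering nights and weekends, can be expressed as a small routing rule. The sketch below is an illustration only: the 09:00 to 17:00 weekday window, the `route_incident` function, and the target names are assumptions, not Recharge's actual schedule or tooling.

```python
from datetime import datetime, time

# Assumed business hours; the real window would be whatever the team agrees on.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def route_incident(detected_at: datetime) -> str:
    """Route to the daytime team on weekdays in hours, else to on-call."""
    is_weekday = detected_at.weekday() < 5  # Monday is 0, Sunday is 6
    in_hours = BUSINESS_START <= detected_at.time() < BUSINESS_END
    if is_weekday and in_hours:
        return "infraops-team"      # hypothetical target name
    return "on-call-rotation"       # hypothetical target name


# A Tuesday mid-morning incident goes to the daytime team.
daytime_target = route_incident(datetime(2023, 8, 8, 10, 0))
# The same day at 22:00 falls through to the on-call rotation.
night_target = route_incident(datetime(2023, 8, 8, 22, 0))
```

This sketch ignores time zones and holidays, both of which a real schedule would need to handle.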
Another example of this model is Avalara, where incident management sits in the Engineering Operations Center (EOC). The philosophy is that anyone can flag an issue to the EOC, which then serves as incident commander and can bring in other experts depending on the severity level and expertise needed.
What are the benefits and drawbacks of centralized incident management?
A centralized approach comes with a reality check: do you have enough incidents to justify the cost of a dedicated team? If the answer is yes, you can benefit from clear structure, accountability, and efficient decision-making during response efforts. By consolidating authority and coordination under a central command, you ensure effective communication, dedicated resource allocation, and consistent incident management practices.
However, this approach also has its drawbacks, starting with the fact that a centralized approach is, by nature, a siloed one. That can mean independent (rather than cross-department) tool budgets and headcount that are subject to scrutiny. And there’s the added challenge that any cross-functional position brings: communication, buy-in, and so on.
Is a distributed or centralized incident command model better?
Which model is better depends on your company’s current needs, resources, and maturity around incident management. I think that as a company grows and can centralize incident management, it should. By centralizing your program, you have more dedicated resources and domain expertise over incident management, which allows you to continuously improve your processes, hold engineering-wide training, and simply get better and faster because you have a headcount dedicated to doing so.
Of course, if you’re not there yet, there’s nothing wrong with a distributed model. We’re a fairly small team, so that’s what we use at FireHydrant. But even with a distributed model, we ensure through automation and runbooks that every responder follows the same process, that we make time to train everyone, and that we’re all invested in improving.
A service-driven approach is one that effectively melds these two. You have an incident commander and a service owner for each incident. The commander is the on-call responder when the incident is detected, and the first thing they do is pull in the service owner, who is the expert on the affected service or functionality.
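As a rough sketch of that pairing, the Python below assigns the on-call responder as commander and looks up the expert for the affected service in a small catalog. The catalog contents, the `assign_roles` function, and the fallback for unlisted services are all hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Assignment:
    commander: str
    service_owner: str


# Hypothetical service catalog mapping each service to its expert.
SERVICE_OWNERS: Dict[str, str] = {
    "checkout-api": "dana",
    "search": "lee",
}


def assign_roles(on_call: str, affected_service: str,
                 owners: Dict[str, str]) -> Assignment:
    """Pair the on-call commander with the expert for the affected service."""
    # Assumption: if the service has no listed owner, the commander
    # covers both roles until an expert is found.
    owner = owners.get(affected_service, on_call)
    return Assignment(commander=on_call, service_owner=owner)


assignment = assign_roles("alice", "checkout-api", SERVICE_OWNERS)
```

Keeping the owner lookup in a catalog rather than in people's heads is what lets any on-call responder act as commander without knowing every service in depth.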
In short, it’s less about how you structure your organization and more about what you do with that structure. If you’re following incident management best practices, you can be successful with either. Interested in learning more about what those best practices are? Join our community.