Two data-backed ways to resolve incidents faster

Incidents are expensive — and only getting more so. In fact, more than 98% of large companies and 47% of small- and medium-size companies say a single hour of downtime costs at least $100,000, according to the 11th annual Hourly Cost of Downtime Survey.

That number includes lost revenue, of course, but incidents also carry hidden costs (especially when they’re not managed well), like lost employee productivity and morale and damage to your reputation among customers. Clearly, there’s an advantage to reducing the time it takes to resolve incidents.

With this in mind, we analyzed 50,000 incidents that were resolved on the FireHydrant platform for our Incident Benchmark Report, with an eye toward uncovering behaviors with clear ties to faster incident resolution. While mean-time-to-resolution (MTTR) is a less-than-perfect metric, sometimes you have to work with what you’ve got. What we found were two clear ways to decrease MTTR:

Structure your response process around services
Clearly define roles during an incident

Structure your response process around services

When your incident response process is centered around a service catalog, responders can more quickly pinpoint the service or functionality that’s down, bring in the right team or experts, and get to solving the problem faster. Saving even a few minutes can meaningfully reduce the cost of incidents and outages, so having up-to-date service details at your fingertips can make all the difference. It’s no surprise, then, that the Incident Benchmark Report showed that incidents with services attached to them had a 36% lower MTTR, across severities, than those with no services attached.
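
To make the arithmetic behind a comparison like this concrete, here is a minimal sketch of how MTTR and a percentage decrease between two groups of incidents could be computed. The durations are made-up illustrative values, not data from the Incident Benchmark Report, and the function names are hypothetical.

```python
from statistics import mean


def mttr(durations_minutes):
    """Mean time to resolution: the average of per-incident resolution times."""
    return mean(durations_minutes)


def percent_decrease(baseline, improved):
    """How much lower `improved` is than `baseline`, as a percentage of baseline."""
    return (baseline - improved) / baseline * 100


# Illustrative (made-up) resolution times in minutes, NOT report data.
without_services = [90, 150, 60]
with_services = [45, 80, 55]

decrease = percent_decrease(mttr(without_services), mttr(with_services))
print(f"MTTR is {decrease:.0f}% lower for incidents with services attached")
```

Because MTTR is a plain average, a single very long incident can skew it, which is part of why it is an imperfect metric.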

At its most basic, a service catalog is simply a list of internal and external technical services (enterprise applications, task-specific tools, microservices, APIs, and so on) used by your organization, and relevant details like owner, code location, and operational dashboards. By documenting this information, you help knock down knowledge silos and ensure everyone has the information they need to respond to incidents confidently — a big deal when you’ve just been paged at 1 a.m.
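
As a rough illustration, a catalog entry with the details described above could be modeled as a small record keyed by service name. This is only a sketch: the field names, service names, and example.com URLs are hypothetical, and it does not represent the schema of any particular catalog product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Service:
    """One catalog entry: what the service is and where to find its details."""
    name: str
    owner: str          # team responsible for the service
    repo_url: str       # where the code lives
    dashboard_url: str  # operational dashboard to check first


# Hypothetical entries; a real catalog would be loaded from a file or an API.
CATALOG = {
    service.name: service
    for service in [
        Service("checkout-api", "payments-team",
                "https://example.com/git/checkout-api",
                "https://example.com/dash/checkout-api"),
        Service("search", "discovery-team",
                "https://example.com/git/search",
                "https://example.com/dash/search"),
    ]
}


def lookup(name: str) -> Optional[Service]:
    """Return the catalog entry for a service, or None if it is undocumented."""
    return CATALOG.get(name)


entry = lookup("checkout-api")
if entry is not None:
    print(f"{entry.name} is owned by {entry.owner}; dashboard: {entry.dashboard_url}")
```

Even a structure this simple answers the first questions a responder has at 1 a.m.: who owns the service, where the code is, and which dashboard to open.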

Depending on your needs and the maturity of your program, there are several ways to approach becoming more service driven, ranging from manual documentation to a full software catalog like Backstage or a service catalog specifically designed for incidents like ours. The more mature your program becomes, the more you can do with service catalogs, including introducing automation that further speeds up response efforts.

For example, Avalara mapped a runbook to each service the company monitors, which helps the team get the right people in the right place at the right time faster, as well as document service-specific nuances and processes for the many applications monitored. When a service goes down, the corresponding runbook is triggered, and everyone jumps into action.
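
The general pattern of mapping a runbook to each service can be sketched as a simple lookup. This is an assumption-laden illustration, not Avalara’s actual setup or any product’s API: the service names, runbook steps, and the fallback behavior are all hypothetical.

```python
# Hypothetical mapping from service name to an ordered list of runbook steps.
RUNBOOKS = {
    "checkout-api": [
        "Page the payments on-call engineer",
        "Open a dedicated incident channel",
        "Post the checkout-api dashboard link in the channel",
    ],
    "search": [
        "Page the discovery on-call engineer",
        "Check the most recent search index deployment",
    ],
}

# Assumed fallback for services that have no runbook mapped yet.
DEFAULT_STEPS = ["Escalate to the incident commander: no runbook is mapped"]


def trigger_runbook(service_name):
    """Return the steps to run when the given service goes down."""
    steps = RUNBOOKS.get(service_name, DEFAULT_STEPS)
    return list(steps)  # copy so callers cannot mutate the shared runbook


for number, step in enumerate(trigger_runbook("checkout-api"), start=1):
    print(f"{number}. {step}")
```

Keeping an explicit fallback means an unmapped service still produces a clear next action instead of silence.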

A robust service catalog is an essential tool in the overall incident management ecosystem and can significantly enhance your team’s productivity. It’s not surprising that our Benchmark Report found a 1640% increase in services created over the course of 2022.

Clearly define roles during an incident

Incidents can induce chaos and panic, taking time and energy away from swift incident response. This can lead to second-guessing, a higher likelihood of mistakes or duplicated work, and analysis paralysis. One way to limit this stress is by ensuring you have a documented plan for who needs to be in an incident and what they need to do.

We found in the Incident Benchmark Report that the incidents with the lowest MTTR had a magic number of responders: six. MTTR increased by 18% when even one more responder was added. But it’s not just about having the right number of responders; making sure they clearly understand their roles during the incident makes a difference too. Across incidents, we saw a 42% lower MTTR when roles were assigned.

How you structure roles and responsibilities will change based on things like team size and incident response maturity. Maybe you start with a set group of roles for all incidents and move to more specialization depending on affected area and severity level. Ultimately, you can tie those roles to preconfigured runbooks and task lists in an automation tool like ours.
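
One way to picture a documented role plan is a table of roles per severity level that gets filled in as responders join. The sketch below is hypothetical throughout: the severity labels, role names, and responder names are assumptions chosen for illustration, not a recommended or product-defined set.

```python
# Hypothetical role plan: more specialized roles at higher severity.
ROLES_BY_SEVERITY = {
    "SEV1": ["Incident Commander", "Communications Lead", "Ops Lead", "Scribe"],
    "SEV2": ["Incident Commander", "Ops Lead"],
    "SEV3": ["Incident Commander"],
}


def assign_roles(severity, responders):
    """Pair each required role with a responder, in the order they joined."""
    roles = ROLES_BY_SEVERITY[severity]
    if len(responders) < len(roles):
        missing = roles[len(responders):]
        raise ValueError(f"Unfilled roles for {severity}: {', '.join(missing)}")
    # zip stops at the shorter list, so extra responders stay unassigned.
    return dict(zip(roles, responders))


assignments = assign_roles("SEV2", ["Ana", "Ben", "Chris"])
for role, person in assignments.items():
    print(f"{role}: {person}")
```

Raising an error when a role goes unfilled makes the gap visible right away, rather than leaving responders to guess who is in charge.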

Take the next step

That 11th annual Hourly Cost of Downtime Survey we mentioned earlier? It also said that 87% of organizations now require a minimum of 99.99% availability. With user expectations like these, every second counts when you’re talking about incidents. Learn more about what teams are doing to improve their incident management practices in our new ebook: How to improve your incident management program in 2023.