The Incident Benchmark Report

What 50,000 incidents reveal about the state of incident management

Although we’ve been managing incidents for as long as we’ve been building software, incident management as a dedicated practice is still in its early stages. So regardless of company size, industry, or SRE headcount, a common question prevails…

What’s normal, anyway?

After analyzing data from 50,000 resolved incidents, we have an answer.

In the Report

Dive into the largest analysis of incidents to date and discover trends that can serve as a blueprint for improving your practice.

  1. In the details: The when and what of incidents

  2. Response ready: The who and how of incident response

  3. Trend spotlight: What can we expect in 2023?

Data Download

This report is based on 53,034 incidents resolved on the FireHydrant platform between 2019 and 2022.

Data points have been anonymized and adjusted to ensure that no one company or set of incidents skewed the results.


In the details

The when and what of incidents

Incidents by company size

Size matters when it comes to the average number of incidents. Small and medium-sized companies averaged 10 and 22 incidents per month, while large and enterprise companies averaged 49 and 37.

  • 10/month

    Small

    0-599 employees

  • 22/month

    Medium

    600-2499 employees

  • 49/month

    Large

    2500-6000 employees

  • 37/month

    Enterprise

    6000+ employees

Incidents by day and time

Most of the incidents we analyzed occurred mid-week (on Tuesdays, Wednesdays, and Thursdays) between 11 a.m. and 2 p.m. ET. The least likely times for an incident to occur were between 7 and 9 p.m. and on the weekends.

  • 1,845

    Sun

  • 9,225

    Mon

  • 9,931

    Tue

  • 11,170

    Wed

  • 10,127

    Thu

  • 8,490

    Fri

  • 2,246

    Sat

But one day and time stood out from the rest for the likelihood of an incident occurring: Wednesday at 1 p.m. ET.

Incidents per Hour
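
To put the daily counts listed above in proportion, here is a short Python sketch that computes what share of all incidents fell mid-week and on the weekend. The counts are taken from this report; the variable and function names are illustrative only.

```python
# Daily incident counts as listed in this report.
incidents_by_day = {
    "Sun": 1845,
    "Mon": 9225,
    "Tue": 9931,
    "Wed": 11170,
    "Thu": 10127,
    "Fri": 8490,
    "Sat": 2246,
}

# The daily counts add up to the 53,034 incidents in the dataset.
total = sum(incidents_by_day.values())


def share(days):
    """Return the percentage of all incidents that fell on the given days."""
    return 100 * sum(incidents_by_day[d] for d in days) / total


midweek_share = share(["Tue", "Wed", "Thu"])
weekend_share = share(["Sat", "Sun"])

print(f"Total incidents: {total}")
print(f"Tue-Thu share: {midweek_share:.1f}%")
print(f"Weekend share: {weekend_share:.1f}%")
```

Run as written, this shows that Tuesday through Thursday account for roughly 59% of incidents, while the weekend accounts for under 8%.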

Incidents by severity level

We found that low-severity incidents made up the largest share of incidents across company sizes.

  • Low: 42%

    Sev4 + Sev5

  • Medium: 31%

    Sev3 + Unset

  • High: 27%

    Sev1 + Sev2

The mean time to resolution (MTTR) across all incidents was just over 24 hours. We were surprised to find only a small difference in MTTR between high-severity and low-severity incidents: just 30 minutes.

24 hrs 05 mins

Average time to resolve incidents
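
If you want to compare your own numbers against this benchmark, the sketch below computes MTTR as the mean of resolved time minus start time across incidents. That definition is an assumption (the report does not spell out how start and resolution are measured), and the sample timestamps and the function name `mean_time_to_resolution` are hypothetical, not FireHydrant data or a FireHydrant API.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average (resolved - started) over a list of (started, resolved) pairs."""
    if not incidents:
        raise ValueError("need at least one resolved incident")
    durations = [resolved - started for started, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


# Hypothetical sample data, for illustration only.
sample = [
    (datetime(2022, 3, 1, 13, 0), datetime(2022, 3, 1, 15, 30)),  # 2 hrs 30 mins
    (datetime(2022, 3, 2, 11, 0), datetime(2022, 3, 4, 11, 0)),   # 48 hrs
    (datetime(2022, 3, 9, 9, 15), datetime(2022, 3, 9, 10, 0)),   # 45 mins
]

mttr = mean_time_to_resolution(sample)
hours, remainder = divmod(int(mttr.total_seconds()), 3600)
print(f"MTTR: {hours} hrs {remainder // 60:02d} mins")
```

In this made-up sample, one 48-hour incident pulls the average up to 17 hrs 05 mins even though the other two incidents were resolved in under three hours.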

Response ready

The who and how of incident response

Responders and roles

Although the average responder team size varied based on incident severity — with 8 responders on high-severity incidents and 5.75 on low-severity ones — we found that there’s a magic number when it comes to responders.

6

Responders

MTTR increased by 18% when even one more responder was added, and that held across all severities.

But it’s not enough just to have the right number of responders on the team — they need to understand their job during the incident. We found that assigning roles to responders during high-severity incidents made a sizable improvement in MTTR.

42%

decrease in MTTR when roles are assigned

Key takeaway? It’s not just about getting the correct number of people in the room, it’s about ensuring that they understand what’s expected of them during an incident. Document the roles and expectations for your incident response process, then make sure everyone understands the requirements before an incident occurs.

Service catalog

When teams use a service catalog, they’re able to more quickly bring in the subject matter expert or owner of the affected service during an incident. No surprise here — the incidents that had services attached saw a decrease in MTTR.

36%

decrease in MTTR when a service catalog is used

Key takeaway? As with roles, it’s not just about getting the right number of people in the room; it’s about getting the right level of expertise in the room. When you attach services to your incident response plan, you can bring in that expertise faster, ultimately making a noticeable difference in MTTR.

Communication preferences

We were surprised to see that across incidents of the same severity level, a conference bridge didn’t decrease MTTR or have a major effect on the number of chat messages sent.

Average number of chat messages

  • 61 messages

    Hi sev

    with bridge

  • 67 messages

    Hi sev

    without bridge

  • 30 messages

    Lo sev

    with bridge

  • 37 messages

    Lo sev

    without bridge

Incidents with a conference bridge attached vs. not

  • Hi sev, with bridge: 63% of incidents, 26 hrs 56 mins MTTR

  • Hi sev, without bridge: 37% of incidents, 24 hrs 10 mins MTTR

  • Lo sev, with bridge: 60% of incidents, 25 hrs 9 mins MTTR

  • Lo sev, without bridge: 40% of incidents, 22 hrs 52 mins MTTR
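
To make the comparison concrete, the sketch below subtracts the MTTR without a bridge from the MTTR with one, for each severity band. It assumes the figures above are read in the order they appear (with bridge first, then without); the helper names are illustrative.

```python
from datetime import timedelta

# MTTR figures as reported above, keyed by (severity, bridge attached?).
# Assumption: in each pair, the first figure is "with bridge".
mttr = {
    ("high", True): timedelta(hours=26, minutes=56),
    ("high", False): timedelta(hours=24, minutes=10),
    ("low", True): timedelta(hours=25, minutes=9),
    ("low", False): timedelta(hours=22, minutes=52),
}


def bridge_gap(severity):
    """How much longer incidents with a bridge took, for one severity band."""
    return mttr[(severity, True)] - mttr[(severity, False)]


def fmt(delta):
    """Format a timedelta as 'Hh MMm'."""
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60}h {minutes % 60:02d}m"


for severity in ("high", "low"):
    print(f"{severity}: +{fmt(bridge_gap(severity))} with a bridge")
```

Under that reading, incidents with a bridge took 2h 46m longer at high severity and 2h 17m longer at low severity. This is a difference in averages, not evidence that the bridge itself caused the delay.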

Key takeaway? Focus on chat during the incident. In fact, many teams choose to create a channel per incident and use it as an artifact for the retro. If you do choose to use a conference bridge, be selective about who you bring in and be clear about what is or is not happening during the response effort. Mid-incident isn’t the time to start talking about long-term improvements.

Retrospectives

When it came to how often retrospectives were held, there was some work to do. More teams held retros for high-severity incidents than lower ones, but even then, we see lots of room for improvement.

42%

high-severity incidents that completed a retro

29%

low-severity incidents that completed a retro

Key takeaway? We have a long way to go as an industry when it comes to regularly holding retros, but we think they’re a valuable tool in the quest for reliability. Holding retros is a surefire way to kickstart learnings from your incidents, which you eventually invest back in your systems.

Trend spotlight

What can we expect in 2023?

More lower-severity incidents

We saw a large increase in the number of incidents overall but an especially high increase when it comes to low-severity incidents. As incident management becomes about not just more quickly resolving incidents but also learning from them, more teams are being mindful of catching all of their incidents, not just the major ones.

  • 107% more high-severity incidents

  • 163% more low-severity incidents

Put it in practice: Lower-severity incidents can give you a temperature check on the health of your internal systems, helping you identify small problems before big ones occur. Consider creating a new “investigation” severity level that gives responders the space to document and research a low-impact issue without sounding all the alarms.

More services

We saw a very large increase over the course of 2022 in the number of services created. We think this reflects the rise of the “you build it, you own it” mentality, which bodes well for incident management. The faster you can get the right people in the room, the faster you can resolve an incident.

1640%

increase in the number of services created

Put it in practice: The ultimate goal here might be a fully fleshed out service catalog that includes dependencies, owners, and links to operation documentation. To start though, keep it simple — declare ownership around product areas. Each product area should have an engineering team associated with it, and those teams should be trained on your incident response process. Set up your process so when an incident is declared, and you find out what’s broken, a member of the corresponding team is alerted.
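
As a starting point, the "keep it simple" version described above can be as small as a lookup from product area to owning team. The sketch below is one possible shape, not a FireHydrant feature or API; every product area, team name, contact handle, and function name in it is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Owner:
    team: str
    on_call_contact: str  # for example, a chat handle or paging alias


# Hypothetical product areas and teams; replace with your own.
CATALOG = {
    "checkout": Owner(team="payments", on_call_contact="@payments-oncall"),
    "search": Owner(team="discovery", on_call_contact="@discovery-oncall"),
    "login": Owner(team="identity", on_call_contact="@identity-oncall"),
}

# Hypothetical catch-all so an unowned area still reaches someone.
FALLBACK = Owner(team="incident-commanders", on_call_contact="@ic-oncall")


def who_to_alert(product_area):
    """Return the owning team for a product area, or the fallback if unowned."""
    return CATALOG.get(product_area.strip().lower(), FALLBACK)


owner = who_to_alert("Checkout")
print(f"Alert {owner.on_call_contact} ({owner.team})")
```

The fallback entry is a design choice: it keeps an incident in an unmapped product area from going unanswered, and each time it fires you learn which area still needs an owner.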

More retros

We also spotted a big year-over-year jump in the number of retrospectives that teams average per month. We think that’s tied to an increase in awareness of the value of incidents as learning opportunities.

236%

year-over-year increase in average retros per month by company in 2022

Put it in practice: Contrary to popular belief, the retro isn’t only for high-priority incidents. By skipping the retro, you could be leaving insights about your systems, product, people, and processes on the table. Instead, consider right-sizing the retro for the incident. Incorporate lighter retros that can be done async or with a smaller team. And keep having them! A culture of learning isn’t built overnight.

More external updates

When you’re known for handling your tough moments well, you build trust among your customers. And based on the increase we saw in incidents with status pages attached and the number of updates posted to status pages, it looks like others are starting to feel the same way.

136%

increase in incidents with a status page attached

366%

increase in the number of updates posted to status pages

Put it in practice: For communication to be truly effective, it needs to be accounted for in your incident response plan. Get a status page if you don’t already have one, create communication templates, set a cadence you’ll stick to, document it all, and then set up reminders to send updates. It’s tempting to only concentrate on resolving the incident, but good communication buys you a lot of grace when things go wrong.
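
The "set a cadence and set up reminders" step can be sketched as a small helper that lists when the next status page updates are due. The 30-minute cadence, the timestamps, and the function name `update_reminders` are assumptions for illustration; pick a cadence your team will actually stick to.

```python
from datetime import datetime, timedelta


def update_reminders(declared_at, cadence_minutes, count):
    """Return the next `count` times a status page update is due."""
    step = timedelta(minutes=cadence_minutes)
    return [declared_at + step * i for i in range(1, count + 1)]


# Hypothetical incident declared at 1 p.m.
declared = datetime(2023, 1, 11, 13, 0)
for due in update_reminders(declared, cadence_minutes=30, count=3):
    print(due.strftime("%H:%M"), "post a status page update")
```

With these inputs, reminders fall at 13:30, 14:00, and 14:30.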

Go deeper

Watch the recorded webinar, Proving ROI: How to evaluate and improve how you manage incidents, to learn which metrics to monitor, discover common benchmarks, and see how to show improvements and prove ROI.
