The Incident Benchmark Report
What 50,000 incidents reveal about the state of incident management
Although we’ve been managing incidents for as long as we’ve been building software, incident management as a dedicated practice is still in its early stages. So regardless of company size, industry, or SRE headcount, a common question prevails…
What’s normal, anyway?
After analyzing data from 50,000 resolved incidents, we have an answer.
Dive into the largest analysis of incidents to date and discover trends that can serve as a blueprint for improving your practice.
This report is based on 53,034 incidents resolved on the FireHydrant platform between 2019 and 2022.
Data points have been anonymized and adjusted to ensure that no one company or set of incidents skewed the results.
In the details
The when and what of incidents
Incidents by company size
Size matters when it comes to the average number of incidents. We found a large difference in the number of incidents between small- and medium-sized companies and larger ones.
Incidents by day and time
Most of the incidents we analyzed occurred midweek (on Tuesdays, Wednesdays, and Thursdays) between 11 a.m. and 2 p.m. ET. The least likely times for an incident to occur were between 7 and 9 p.m. and on weekends.
But one day and time stood out from the rest as the most likely for an incident to occur: Wednesday at 1 p.m. ET.
Incidents per Hour
Incidents by severity level
We found that low-severity incidents made up the lion’s share of incidents across all company sizes.
Sev4 + Sev5
Sev3 + Unset
Sev1 + Sev2
The mean time to resolution (MTTR) across all incidents was just over 24 hours. We were surprised to find that the gap in MTTR between high-severity and low-severity incidents was small: just 30 minutes.
Average time to resolve incidents
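To run the same comparison on your own data, MTTR per severity group can be computed along these lines. This is a minimal sketch, not the method used for this report: the record layout, the sample timestamps, and the grouping of Sev1 and Sev2 as "high" and Sev4 and Sev5 as "low" are all assumptions made for illustration.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical incident records: (severity, opened_at, resolved_at).
# The values are made up for illustration and are not taken from the report.
incidents = [
    ("SEV1", datetime(2022, 3, 1, 9, 0), datetime(2022, 3, 2, 11, 0)),   # 26 hours
    ("SEV2", datetime(2022, 3, 3, 13, 0), datetime(2022, 3, 4, 13, 0)),  # 24 hours
    ("SEV4", datetime(2022, 3, 5, 11, 0), datetime(2022, 3, 6, 10, 0)),  # 23 hours
    ("SEV5", datetime(2022, 3, 7, 14, 0), datetime(2022, 3, 8, 15, 0)),  # 25 hours
]

# Assumed grouping of severity levels into "high" and "low".
HIGH = {"SEV1", "SEV2"}
LOW = {"SEV4", "SEV5"}


def mttr(records, severities):
    """Return the mean time to resolution for the given severities, or None."""
    durations = [
        (resolved - opened).total_seconds()
        for severity, opened, resolved in records
        if severity in severities
    ]
    if not durations:
        return None
    return timedelta(seconds=mean(durations))


high_mttr = mttr(incidents, HIGH)
low_mttr = mttr(incidents, LOW)
print("High-severity MTTR:", high_mttr)
print("Low-severity MTTR:", low_mttr)
print("Gap:", high_mttr - low_mttr)
```

Tracking the gap between the two groups over time, rather than a single overall MTTR, makes it easier to see whether severity levels actually change how quickly your team resolves incidents.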
The who and how of incident response
Responders and roles
Although the average responder team size varied based on incident severity — with 8 responders on high-severity incidents and 5.75 on low-severity ones — we found that there’s a magic number when it comes to responders.
MTTR increased by 18% when even one more responder was added, and that held across all severities.
But it’s not enough just to have the right number of responders on the team — they need to understand their job during the incident. We found that assigning roles to responders during high-severity incidents made a sizable improvement in MTTR.
decrease in MTTR when roles are assigned
Key takeaway? It’s not just about getting the correct number of people in the room, it’s about ensuring that they understand what’s expected of them during an incident. Document the roles and expectations for your incident response process, then make sure everyone understands the requirements before an incident occurs.
When teams use a service catalog, they’re able to more quickly bring in the subject matter expert or owner of the affected service during an incident. No surprise here — the incidents that had services attached saw a decrease in MTTR.
decrease in MTTR when a service catalog is used
Key takeaway? As with roles, it’s not just about getting the right number of people in the room; it’s about getting the right level of expertise in the room. When you attach services to your incident response plan, you can do this faster, ultimately making a noticeable difference in MTTR.
We were surprised to see that across incidents of the same severity level, a conference bridge didn’t decrease MTTR or have a major effect on the number of chat messages sent.
Figure: Average number of chat messages and MTTR for incidents with a conference bridge attached vs. not (MTTR values shown: 26 hrs 56 mins, 24 hrs 10 mins, 25 hrs 9 mins, 22 hrs 52 mins)
Key takeaway? Focus on chat during the incident. In fact, many teams choose to create a channel per incident and use it as an artifact for the retro. If you do choose to use a conference bridge, be selective about who you bring in and be clear about what is or is not happening during the response effort. Mid-incident isn’t the time to start talking about long-term improvements.
When it comes to how often retrospectives are held, there is still work to do. More teams held retros for high-severity incidents than for low-severity ones, but even there, we see lots of room for improvement.
high-severity incidents that completed a retro
low-severity incidents that completed a retro
Key takeaway? We have a long way to go as an industry when it comes to regularly holding retros, but we think they’re a valuable tool in the quest for reliability. Holding retros is a surefire way to kickstart learnings from your incidents, which you can then invest back into your systems.
What can we expect in 2023?
More lower-severity incidents
We saw a large increase in the number of incidents overall but an especially high increase when it comes to low-severity incidents. As incident management becomes about not just more quickly resolving incidents but also learning from them, more teams are being mindful of catching all of their incidents, not just the major ones.
107% more high-severity incidents
163% more low-severity incidents
Put it in practice: Lower-severity incidents can give you a temperature check on the health of your internal systems, helping you identify small problems before big ones occur. Consider creating a new “investigation” severity level that gives responders the space to document and research a low-impact issue without sounding all the alarms.
We saw a sharp increase over the course of 2022 in the number of services created. We think this reflects the rise of the “you build it, you own it” mentality, which bodes well for incident management. The faster you can get the right people in the room, the faster you can resolve.
increase in the number of services created
Put it in practice: The ultimate goal here might be a fully fleshed-out service catalog that includes dependencies, owners, and links to operational documentation. To start, though, keep it simple: declare ownership around product areas. Each product area should have an engineering team associated with it, and those teams should be trained on your incident response process. Set up your process so that when an incident is declared and you find out what’s broken, a member of the corresponding team is alerted.
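The "keep it simple" starting point can be as small as a lookup table from product area to owning team. The sketch below is one possible shape, under stated assumptions: the product areas, team names, contact addresses, and the `alert_owner` helper are all hypothetical, and the `notify` argument stands in for whatever paging or chat tool you actually use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Team:
    name: str
    on_call: str  # Where alerts for this team should go (hypothetical addresses).


# Hypothetical ownership map: product area -> owning engineering team.
OWNERS = {
    "checkout": Team("payments-eng", "payments-oncall@example.com"),
    "search": Team("discovery-eng", "discovery-oncall@example.com"),
}

# Assumed fallback when nobody owns the affected area yet.
FALLBACK = Team("incident-commanders", "ic-oncall@example.com")


def responder_for(product_area: str) -> Team:
    """Look up the owning team, ignoring case and surrounding whitespace."""
    return OWNERS.get(product_area.strip().lower(), FALLBACK)


def alert_owner(product_area: str, notify=print) -> Team:
    """Send one alert to the owning team through the supplied notify callable."""
    team = responder_for(product_area)
    notify(
        f"Incident affecting '{product_area}': "
        f"alerting {team.name} via {team.on_call}"
    )
    return team


alert_owner("Checkout")
alert_owner("billing")  # No owner declared, so the fallback team is alerted.
```

An explicit fallback keeps unowned areas from going unanswered, and every time it fires is a hint that another product area needs a declared owner.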
We also spotted a big year-over-year jump in the number of retrospectives that teams average per month. We think that’s tied to an increase in awareness of the value of incidents as learning opportunities.
year-over-year increase in average retros per month by company in 2022
Put it in practice: Contrary to popular belief, the retro isn’t only for high-priority incidents. By skipping the retro, you could be leaving insights about your systems, product, people, and processes on the table. Instead, consider right-sizing the retro for the incident. Incorporate lighter retros that can be done async or with a smaller team. And keep having them! A culture of learning isn’t built overnight.
More external updates
When you’re known for handling your tough moments well, you build trust among your customers. And based on the increase we saw in incidents with status pages attached and the number of updates posted to status pages, it looks like others are starting to feel the same way.
increase in incidents with a status page attached
increase in the number of updates posted to status pages
Put it in practice: For communication to be truly effective, it needs to be accounted for in your incident response plan. Get a status page if you don’t already have one, create communication templates, set a cadence you’ll stick to, document it all, and then set up reminders to send updates. It’s tempting to only concentrate on resolving the incident, but good communication buys you a lot of grace when things go wrong.