How to get started with incident management metrics

Tracking incident metrics can help you discover patterns in the causes and costs of incidents and help you understand brittle parts of your organization. We've seen them help teams zero in on things like:

Which teams respond to the most incidents and are at the greatest risk for burnout
Which services or functionalities are problem areas in your product that need attention
How effective your incident response process is at mitigating incidents efficiently

But it can be intimidating to get started. Do you really need metrics if you're a small team or just beginning to formalize your incident management program? I say yes. The key is to start with something manageable and grow. You'll get the benefit of that knowledge today while laying the groundwork for growth.

Start simple and grow

If you’re just getting started with a formalized incident management program or just beginning to implement metrics, you may not be able to track nuanced metrics. That’s OK. Even just tracking the number of incidents over time gives you a baseline for improving your practice.

Start by tracking the number of incidents you have per day, week, or month, whichever timeframe is meaningful to you. You'll need this number for most other metrics you may want to track in the future, but it can also give you insights now. Once you have that down, start thinking about adding things like the days and times when incidents occur, which teams staffed them, and which services were affected. For example:

Start noting when incidents most often occur and then make sure you’re staffed up during those hours. Even better, find out why they’re occurring at that time — is it your deployment schedule or something more?
Track which services are most often affected by incidents. Do you need to ensure you have extra hands on deck during deployments? Think about creating service-specific runbooks to help address those incidents even faster. Tracking affected services can also help you fast-track improvements needed in your product.
Get an idea of which employees are most often involved in incidents. Are these folks risking burnout or suffering from the Shaq effect?
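
The tallies described above can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed tool: the record fields (`started_at`, `service`, `responders`) and the sample incidents are hypothetical, and your own incident log will likely look different.

```python
from collections import Counter
from datetime import datetime

# Hypothetical incident log; field names and values are illustrative only.
incidents = [
    {"started_at": datetime(2023, 1, 2, 14, 5), "service": "checkout", "responders": ["ana", "raj"]},
    {"started_at": datetime(2023, 1, 4, 14, 40), "service": "search", "responders": ["ana"]},
    {"started_at": datetime(2023, 1, 11, 3, 15), "service": "checkout", "responders": ["ana", "li"]},
]

def tally(incidents):
    """Count incidents per ISO week, per hour of day, per service, and per responder."""
    by_week, by_hour = Counter(), Counter()
    by_service, by_responder = Counter(), Counter()
    for incident in incidents:
        year, week, _ = incident["started_at"].isocalendar()
        by_week[(year, week)] += 1
        by_hour[incident["started_at"].hour] += 1
        by_service[incident["service"]] += 1
        by_responder.update(incident["responders"])
    return by_week, by_hour, by_service, by_responder

by_week, by_hour, by_service, by_responder = tally(incidents)
print("Incidents per week:", dict(by_week))
print("Busiest hour of day:", by_hour.most_common(1))
print("Most affected service:", by_service.most_common(1))
print("Most involved responder:", by_responder.most_common(1))
```

Even with a spreadsheet export as the input, counts like these are enough to see when incidents cluster, which services they hit, and who keeps getting pulled in.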

Once you have established a metrics program for your incidents, you can move toward tracking MTTR and various other components of your cycle time to get a more nuanced picture of where you can make process improvements.

Level up with MTT* metrics

Once you start tracking how your incident response efforts perform, it’s hard to stop. Uncovering patterns in how you find, declare, and respond to incidents surfaces new areas for improvement that help you detect incidents and jump into action faster. And those time savings can add up to big bucks. According to the Uptime Institute’s 2022 Outage Analysis:

The number of organizations reporting failures resulting in at least $100,000 in total losses jumped from 39% in 2019 to 60% in 2022.
And companies are also going down for longer. Nearly 30% of major outages in 2021 lasted more than 24 hours, up from just 8% in 2017.

Mean time to detect

Mean time to detect (MTTD) is, on average, how long it takes from when things stop functioning as expected to when someone discovers that something’s amiss.

Start by making sure you’re documenting that time period for each incident. Then, calculate MTTD by adding all the incident detection times for a given period and dividing that number by the number of incidents. For instance, if your total time to detect for January is 580 minutes, and there were 10 incidents reported, your mean time to detect would be 58 minutes.
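
As a rough sketch of that arithmetic in Python: the ten per-incident detection times below are invented so that they total 580 minutes, and the helper name `mean_time` is an assumption for illustration.

```python
def mean_time(durations_minutes):
    """Return the mean of per-incident durations, in minutes."""
    if not durations_minutes:
        raise ValueError("need at least one incident to compute a mean")
    return sum(durations_minutes) / len(durations_minutes)

# Hypothetical detection times (in minutes) for ten January incidents.
january_detection_minutes = [12, 95, 40, 33, 120, 8, 61, 77, 104, 30]

mttd = mean_time(january_detection_minutes)
print(f"Total: {sum(january_detection_minutes)} min, MTTD: {mttd} min")
```

The same helper works for any of the mean-time metrics, as long as you feed it the matching per-incident durations.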

Tracking this metric is important because it can reveal gaps in monitoring or thresholds in service-level objectives that may need adjusting.

Mean time to assemble

Mean time to assemble (MTTA) is, on average, how long it takes from when an incident is declared and an alert goes out to when actual mitigation work begins. In many incident management programs, this includes the amount of time it takes to:

Assemble any necessary experts, like owners of affected services
Assign roles for the incident
Create communication methods like a Slack channel or Zoom bridge
Notify any internal or external stakeholders
Create any tickets, notes documents, or other reference pieces

Assuming you are documenting the amount of time all of this takes for each incident, you calculate MTTA by dividing the total assembly time in a specific period by the number of incidents over that time.
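
If you record timestamps rather than durations, the calculation might look like the sketch below. It assumes each incident has a `declared_at` and a `mitigation_started_at` timestamp; both field names and the sample data are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical records: when each incident was declared and when mitigation work began.
incidents = [
    {"declared_at": datetime(2023, 1, 3, 9, 0), "mitigation_started_at": datetime(2023, 1, 3, 9, 12)},
    {"declared_at": datetime(2023, 1, 9, 22, 30), "mitigation_started_at": datetime(2023, 1, 9, 22, 48)},
    {"declared_at": datetime(2023, 1, 20, 15, 5), "mitigation_started_at": datetime(2023, 1, 20, 15, 20)},
]

def mean_time_to_assemble(incidents):
    """Total assembly time for the period divided by the number of incidents."""
    if not incidents:
        raise ValueError("need at least one incident to compute a mean")
    total = sum(
        (i["mitigation_started_at"] - i["declared_at"] for i in incidents),
        timedelta(),  # start value so the timedeltas add up correctly
    )
    return total / len(incidents)

mtta = mean_time_to_assemble(incidents)
print("MTTA:", mtta)
```

Working from timestamps keeps the raw data reusable: the same records can feed other metrics later without anyone re-entering durations by hand.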

This metric is useful for tracking a team's responsiveness and the effectiveness of the alert system. It can also give you insight into where your process could benefit from automating mundane tasks like ticket and Slack channel creation, team assembly, and notifications (we can help you with that).

Mean time to mitigation

Mean time to mitigation (MTTM) is, on average, the amount of time between when an incident begins and when the system no longer exhibits problems to users. It’s the amount of time it takes to “stop the bleeding,” so to speak. In many cases, however, the team is still monitoring the situation at that point. For example, temporary fixes might be in place, like a hotfix or a disabled job, while the team is still working on a long-term fix.

Like the others, you calculate this by adding up the time to mitigation for all incidents in a given span, then dividing that total by the number of incidents in that span.

This metric helps your team understand product or system failures so you can prevent them in the future. It also shows you where the holes in your response process might be. For example, if you’re tracking both MTTA and MTTM, you might find that tracking down service owners and documentation is prolonging assembly time, which in turn prolongs MTTM. That might drive adoption of more robust service-driven response practices.
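
One way to look for that kind of pattern is to compare assembly time with time to mitigation for the same set of incidents. The sketch below assumes you record both durations per incident and that the assembly window falls inside the mitigation window; the field names and numbers are made up for illustration.

```python
# Hypothetical per-incident durations in minutes.
incidents = [
    {"id": "INC-1", "assemble_minutes": 30, "mitigate_minutes": 60},
    {"id": "INC-2", "assemble_minutes": 10, "mitigate_minutes": 100},
    {"id": "INC-3", "assemble_minutes": 20, "mitigate_minutes": 80},
]

def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)

mtta = mean(i["assemble_minutes"] for i in incidents)
mttm = mean(i["mitigate_minutes"] for i in incidents)

# Share of the average time to mitigation that goes to assembling the response.
assembly_share = mtta / mttm
print(f"MTTA {mtta:.0f} min, MTTM {mttm:.0f} min, assembly share {assembly_share:.0%}")
```

If the assembly share climbs over time, that is a hint to look at how quickly responders can find service owners and documentation.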

Mean time to resolve

Mean time to resolve (MTTR) is how long it takes, on average, between when an incident is detected and when the problem is completely resolved, meaning there’s a long-term fix in place. You probably get how it’s calculated by now.

MTTR can vary wildly depending on the incident, so we know it’s a controversial metric. But what it’s really measuring is how long it takes you to get back to stasis. If your program is advanced enough to have its own key measurements that work for your systems, that’s great. If not, though, we think this is a good place to start.
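
To see how much a single long incident can move the number, it can help to break MTTR down by severity and look at the median next to it. The sketch below is only an illustration: the severity labels and resolution times are hypothetical, and the median is an extra point of comparison added here, not part of the MTTR definition.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical resolution times in hours, labeled with a made-up severity scale.
resolutions = [
    ("SEV1", 2.0), ("SEV1", 4.0), ("SEV1", 30.0),
    ("SEV2", 10.0), ("SEV2", 14.0),
]

# Group resolution times by severity.
by_severity = defaultdict(list)
for severity, hours in resolutions:
    by_severity[severity].append(hours)

# For each severity, report the mean (MTTR) alongside the median.
summary = {
    severity: {"mttr": mean(hours), "median": median(hours)}
    for severity, hours in by_severity.items()
}
for severity, stats in summary.items():
    print(severity, stats)
```

In this made-up data both severities have the same MTTR of 12 hours, but the SEV1 median is only 4 hours because one 30-hour incident pulls the mean up.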

We have seen that good incident response processes correlate to lower MTTR. According to the Incident Benchmark Report, an analysis of more than 50,000 incidents on the FireHydrant platform:

Average MTTR across company sizes and severity levels was just over 24 hours.
When you keep the response effort to a tight team that understands their roles, you resolve incidents faster. Six was the magic number of responders: MTTR increased by 18% when one additional person was on the responding team. And when defined roles were used during an incident, MTTR went down by 42%.
Service-based incident management works. Incidents that had services attached to them saw a 36% drop in MTTR, compared to those that didn’t. And interest is rising. We saw a 1,640% increase in the number of services created over the course of 2022 (adjusted for company growth).

Next steps

There are ways to fudge any numbers, so no metric is perfect. The main goal here is to provide yourself with a benchmark you can use to measure how your incident response efforts improve over time. Once you know your starting point, you can start examining how better incident response practices help you find, declare, and resolve incidents faster. For more on this topic, check out our on-demand webinar, Proving ROI: How to evaluate and improve how you manage incidents.