Use incident cycle time to optimize your incident response process
Cycle time is how long it takes to move through each phase of an incident, from declaration all the way through resolution. Cycle time can help you evaluate the health of your incident response efforts and make improvements in the areas you can control.
By Jouhné Scott on 5/31/2023
Although the causes and solutions for incidents vary widely, most incidents follow a similar timeline from declaration to resolution. The time it takes to move from one phase or milestone of an incident to the next is what we call cycle time.
Before you roll your eyes at yet another incident metric, let me tell you how this is different. Cycle time isn’t about tracking something new; it’s about breaking down your incident lifecycle into distinct areas. Doing this can help you evaluate the health of your incident response efforts. I’m talking about the things during an incident that you can control, like how quickly you get service owners into the room, how you communicate with internal and external stakeholders, and even how you generate your retrospective.
By digging into cycle time, you can find out where you’re spending (or wasting) time on areas that could be streamlined, templated, or even automated completely. This all has the additional benefit of reducing the risk of burnout for responders. The longer the team is stuck in one phase, the more it wears on them. By finding ways to speed up your cycle time, you can keep everyone moving quickly, minimize rote work, and continuously improve.
Okay, let’s talk about how.
Start tracking cycle time
The first thing to do is define and document your phases and milestones — these will generally remain consistent throughout your incident response process, regardless of what the technical problem actually is.
Phases most simply describe the before, during, and after of an incident, while milestones are more specific events. We use the phases and milestones below in our incident response framework at FireHydrant. Each incident moves through them in the same order every time.
Impact started: The affected system began having problems.
Detected: A monitoring system or human noticed the system was having problems.
Acknowledged: The person responsible for responding to incidents within the affected system acknowledged the monitoring system's page.
Declared: The incident management process began.
Investigating: The first concrete step toward identifying a fix or remediating the problems with the affected system occurred.
Identified: The issue was understood, and work to mitigate the problem began.
Mitigated: The system stopped exhibiting customer-impacting issues and a solution was introduced. (Note: That solution may or may not be durable; additional engineering work may be required to resolve the issue.)
Resolved: The issue stopped impacting customers and the solution was durable.
Retrospective preparation: The team began preparing and gathering additional data about the incident.
Retrospective completed: The retrospective document was completed, any follow-on retrospectives or other meetings were held, and any follow-up items were identified.
Your phases and milestones may differ, but the important thing is to document the major moments of an incident. Once you’ve defined your phases and milestones, the next step is tracking the time spent moving from one milestone to the next. You can do this using time tracking software, with a timer on your phone and a Google sheet, or by using your incident management tool.
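If you are recording milestone timestamps yourself, the arithmetic is simple enough to sketch. The example below is a minimal illustration, not FireHydrant's implementation: the snake_case milestone keys, the cycle_times helper, and the sample timestamps are all hypothetical. It assumes milestones are listed in the order they normally occur and skips any that were not recorded.

```python
from datetime import datetime, timedelta

# Milestone names mirror the list above; the snake_case keys are hypothetical.
MILESTONES = [
    "impact_started",
    "detected",
    "acknowledged",
    "declared",
    "investigating",
    "identified",
    "mitigated",
    "resolved",
    "retrospective_preparation",
    "retrospective_completed",
]


def cycle_times(timestamps: dict[str, datetime]) -> dict[str, timedelta]:
    """Return the elapsed time between consecutive recorded milestones.

    Milestones missing from `timestamps` are skipped, so an interval runs
    from the previous recorded milestone to the next recorded one.
    """
    recorded = [(name, timestamps[name]) for name in MILESTONES if name in timestamps]
    result = {}
    for (prev_name, prev_ts), (next_name, next_ts) in zip(recorded, recorded[1:]):
        if next_ts < prev_ts:
            raise ValueError(f"{next_name} is earlier than {prev_name}")
        result[f"{prev_name} -> {next_name}"] = next_ts - prev_ts
    return result


# Made-up timestamps for a single incident.
incident = {
    "impact_started": datetime(2023, 5, 1, 9, 0),
    "detected": datetime(2023, 5, 1, 9, 4),
    "acknowledged": datetime(2023, 5, 1, 9, 6),
    "declared": datetime(2023, 5, 1, 9, 10),
    "mitigated": datetime(2023, 5, 1, 9, 55),
    "resolved": datetime(2023, 5, 1, 11, 10),
}

for transition, elapsed in cycle_times(incident).items():
    print(f"{transition}: {elapsed}")
```

Raising an error on out-of-order timestamps is a deliberate choice here: a negative cycle time usually means a milestone was logged late or mislabeled, and it is better to catch that than to average it in.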
For example, FireHydrant customers can export MTT* stats and incident timeline data to use in generating reports — the milestone timestamps within an incident can help you determine how long each phase lasted. Once you have enough data to determine averages for these, you can figure out the average length of each phase within your overall MTTR.
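As a rough sketch of that averaging step, the snippet below takes per-incident phase durations in minutes and reports each phase's average and its share of the total. The phase names and numbers are invented for illustration, and it assumes MTTR here means the average time across the four phases shown; your own definition may start or stop the clock at different milestones.

```python
from statistics import mean

# Hypothetical per-incident cycle times in minutes, for example derived from
# exported milestone timestamps. Phase names and values are made up.
incidents = [
    {"detect": 4, "declare": 6, "mitigate": 45, "resolve": 75},
    {"detect": 10, "declare": 2, "mitigate": 30, "resolve": 18},
    {"detect": 1, "declare": 7, "mitigate": 60, "resolve": 42},
]


def average_cycle_times(incidents):
    """Average each phase across incidents; assumes every incident has every phase."""
    phases = incidents[0].keys()
    return {phase: mean(incident[phase] for incident in incidents) for phase in phases}


averages = average_cycle_times(incidents)

# With every phase present in every incident, the sum of the phase averages
# equals the average end-to-end duration.
mttr = sum(averages.values())

for phase, minutes in averages.items():
    print(f"{phase}: {minutes:.1f} min ({minutes / mttr:.0%} of MTTR)")
```

Seeing each phase as a share of the whole makes it easier to spot where most of the time goes, which is not always the phase that feels slowest during the incident.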
Depending on how many steps are in your process, you may have many cycle time metrics to look at. You can decide which ones are the most important to you and start tracking just a few cycle times, like those between phases, or just open the firehose and start tracking everything. The important thing is to consistently gather data on how long each step of the response process takes.
Analyze and improve incident cycle time
After collecting data for a few weeks or months, you should know how long your team takes to complete each cycle. Compare your data against other teams in your organization, against historical data (if you have it), or against incident benchmarks. From there, you can identify the areas of the response process you want to focus on improving. Questions you might consider:
Do we have the right people? If assembly time is running long because you can’t find who owns what, consider creating a service catalog. You’ll be able to find and bring in the internal experts on the degraded functionality more quickly. If you automate your service catalog and tie monitoring integrations to your incident response tooling, you can pull in owners the second an incident is declared.
Do people know what they're supposed to do? If you want to improve the declaration phase of your response process, make sure roles and responsibilities are clearly assigned.
Is communication clear? During the response phase, communication can feel distracting. While you’re trying to focus on mitigation, you also have to keep stakeholders updated and make sure knowledge transfer happens smoothly in each stage of the process. The more you can automate your communication tasks, like creating a timeline or updating status pages, the more focused you can be on mitigation.
Do we have the right tools? Especially in the declaration and assembly phases of an incident, small improvements can add up to a lot of saved time. Make sure your tools are helping you automate incident response processes and not causing delays.
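To decide which of these questions to tackle first, it can help to rank cycles by how much they have slowed down relative to your historical data. The sketch below is one possible way to do that, under stated assumptions: rank_regressions is a hypothetical helper, the phase names and averages are invented, and relative change is only one of several reasonable ways to compare.

```python
def rank_regressions(current, baseline):
    """Return (phase, relative_change) pairs, largest slowdown first.

    A relative change of 0.5 means the phase now takes 50% longer than the
    baseline; negative values mean the phase got faster.
    """
    changes = []
    for phase, now in current.items():
        before = baseline.get(phase)
        if not before:  # skip phases with no baseline, or a baseline of zero
            continue
        changes.append((phase, (now - before) / before))
    return sorted(changes, key=lambda item: item[1], reverse=True)


# Made-up average cycle times in minutes.
baseline = {"detect": 5.0, "declare": 4.0, "mitigate": 50.0, "resolve": 40.0}
current = {"detect": 5.0, "declare": 6.0, "mitigate": 45.0, "resolve": 44.0}

ranked = rank_regressions(current, baseline)
for phase, change in ranked:
    print(f"{phase}: {change:+.0%}")
```

In this made-up data the declaration cycle is only two minutes slower, but that is the largest relative slowdown, which would point you toward the questions about roles and tooling before anything else.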
Control what you can
Cycle time helps you identify the areas of your incident response process that need extra work or could benefit from automation. By streamlining and speeding up everything you do control, you can get to the parts you have less control over sooner and give them your full attention.
Minimizing cycle time is only one way to improve incident response. For more ways to streamline your process, check out our ebook 3 ways to improve your incident management program in 2023.