Easy as 1, 2, 3: Ways to start learning from incidents today
Incidents provide an unparalleled opportunity to learn about your people, processes, and products under pressure. In this post, we’ll tell you how to ensure your team isn’t letting these opportunities for learning go to waste.
By Jouhné Scott on 3/2/2023
A mindset shift is necessary when it comes to incidents. Yes, they are inconvenient, and they can be cumbersome and disruptive. But they also provide an unparalleled opportunity to learn about your people, processes, and products under pressure, and that’s an opportunity you can’t pass up.
We know the appetite for learning is there. In our Incident Benchmark Report, an analysis of 50,000 incidents resolved on the FireHydrant platform, we found that only 42% of high-severity incidents and 29% of low-severity incidents had an associated retrospective, but those numbers are rising quickly: we saw a 236% year-over-year increase in average retros per month per company between 2021 and 2022.
So how do you ensure your team isn’t letting these (sometimes expensive) opportunities for learning go to waste? In this post, we’ll talk about three ways you can start on the path toward improvement today.
If you’re not already documenting timelines, notes, and thought processes during your incidents, now is the time to start. Time gets fuzzy during an incident, and we can’t rely on memory alone. Documentation provides an excellent way for the team to review what happened after an incident.
Documentation doesn’t need to be elaborate. The most basic format is simply a timeline of what happened during the incident. In our triage incident types, for example, we just use the incident Slack channel as a stream-of-consciousness notes document. I’ve also seen this done in shared docs, or you can use a tool like ours to automatically collect all of that information for you. The important thing is to ensure you’re capturing details like screenshots from observability tools, a summary of events, a timeline of milestones achieved, and any other relevant information.
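If you keep your notes in a script or a shared doc rather than a dedicated tool, a timeline can be as simple as a list of timestamped entries. The sketch below is one possible shape, not how FireHydrant stores timelines: the `TimelineEntry` fields and the `render_timeline` helper are hypothetical names chosen for illustration, and the sample entries are made up.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TimelineEntry:
    """One note captured during an incident (hypothetical structure)."""
    at: datetime                      # when the event happened
    author: str                       # who recorded it
    note: str                         # what happened, in plain language
    attachments: list[str] = field(default_factory=list)  # e.g. screenshot file names


def render_timeline(entries: list[TimelineEntry]) -> str:
    """Return the entries as plain text, oldest first."""
    lines = []
    for entry in sorted(entries, key=lambda e: e.at):
        stamp = entry.at.strftime("%H:%M")
        extra = f" [{', '.join(entry.attachments)}]" if entry.attachments else ""
        lines.append(f"{stamp} {entry.author}: {entry.note}{extra}")
    return "\n".join(lines)


# Made-up sample notes, deliberately entered out of order.
sample = [
    TimelineEntry(datetime(2023, 3, 2, 14, 20, tzinfo=timezone.utc), "sam", "Rolled back deploy"),
    TimelineEntry(datetime(2023, 3, 2, 14, 5, tzinfo=timezone.utc), "alex",
                  "Error rate spike", ["dashboard.png"]),
]
print(render_timeline(sample))
```

Sorting on the timestamp means responders can jot notes in any order during the incident and still get a clean chronological record to review afterward.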
It’s important to gather this data during the incident so it doesn’t get lost, but a good next step is to take the time to flesh out additional details after the incident is mitigated. As you review, you can also highlight important aspects to dig into during the retro. Ultimately, this documentation might serve as the base for a retrospective artifact, or incident retrospective document, which is typically created during the retro and lives in a place where team members can refer to and revisit it later.
Retrospectives should be psychologically safe spaces where responders reflect on the incident, capture valuable insight, and determine any action items needed. Even when things go well, there’s something to learn from the process of reflection, and if you’re in the habit of having retrospectives on a regular cadence — and in a blameless way — you will find iterative improvements to make over time. Creating a positive feedback loop and work environment will increase productivity and collaboration, and lead to an ideal outcome of an engaged and trusted team.
Some teams put off retros because they think they’re time-consuming. If you’re new to retros or are facing a cultural obstacle in getting buy-in, start small. As you mature your practices, you can continue to right-size your retro by scaling it up or down based on the severity of the incident, the number of people involved, and your goals.
So where do you start? Ideally, a retro should involve all key responders. Commit to holding a 20-minute retro, using an SRE retro format, after each incident for three months and see how it goes. If you find a dedicated meeting for each incident is too demanding, bundle them together or meet asynchronously through a shared doc.
If you see success, though, consider expanding the practice. For example, Snyk holds monthly meetings of a group called the Incident Response Guild. The meeting is open to anyone who wants to discuss, and learn from, incidents. Whatever cadence or format you ultimately adopt, the goal is to develop and document recommendations for maturing your incident response process and the reliability of your systems. Here’s an example incident retrospective FireHydrant created based on an incident we experienced in June 2022.
A third way to ensure you’re not letting key learnings go to waste, both about your response efforts themselves and about the reliability of your systems overall, is to measure data around incidents. You can’t determine how to improve if you don’t know your starting point, right? Measurement gives your team a benchmark for how well they are performing and can highlight areas for improvement and growth.
Again, this is an area where you can start small and mature over time. Even if you’re not ready to collect extensive analytics, measuring anything can be useful. For example, for our Incident Benchmark Report, we chose mean time to resolve (MTTR), which measures the length of time between when the incident was declared and when it was resolved, as our starting point metric.
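To make that definition concrete, here is a minimal sketch of the calculation, assuming you can export a declared time and a resolved time for each incident. The function name and the sample timestamps are made up for illustration; this is not the method used in the Incident Benchmark Report.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved - declared) across incidents.

    Each item is a (declared_at, resolved_at) pair.
    """
    if not incidents:
        raise ValueError("need at least one resolved incident")
    total = sum((resolved - declared for declared, resolved in incidents), timedelta())
    return total / len(incidents)


# Made-up sample data: one 30-minute incident and one 90-minute incident.
sample_incidents = [
    (datetime(2023, 1, 10, 10, 0), datetime(2023, 1, 10, 10, 30)),
    (datetime(2023, 1, 17, 9, 0), datetime(2023, 1, 17, 10, 30)),
]
print(mean_time_to_resolve(sample_incidents))  # 1:00:00
```

A mean is easy to compute but sensitive to one unusually long incident, so it can be worth looking at the individual durations alongside it.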
Another one to consider is mean time to respond (also MTTR 😅), the length of time from when the problem is identified to when the mitigation effort begins. This number can tell you a lot about whether you’re burning time and money on rote tasks and toil. And, since it’s an area that’s ripe for automation, it’s a great place to start eliminating costs. Whatever metrics you choose to focus on, have your team ask, “How can we improve this metric?” Discuss it frequently, particularly during retros, and track progress over time. This focus helps your team keep demonstrating growth and improvement.
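Tracking progress over time can be as lightweight as grouping the metric by month. The sketch below assumes you have, for each incident, the time the problem was identified and the time mitigation began; the function name and the sample data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def mean_time_to_respond_by_month(
    incidents: list[tuple[datetime, datetime]],
) -> dict[str, timedelta]:
    """Map 'YYYY-MM' to the mean (mitigation_started - identified) for that month.

    Each item is an (identified_at, mitigation_started_at) pair, and an
    incident is counted in the month it was identified.
    """
    buckets: dict[str, list[timedelta]] = defaultdict(list)
    for identified, mitigation_started in incidents:
        buckets[identified.strftime("%Y-%m")].append(mitigation_started - identified)
    return {
        month: sum(gaps, timedelta()) / len(gaps)
        for month, gaps in sorted(buckets.items())
    }


# Made-up sample data: two incidents in January, one in February.
sample_responses = [
    (datetime(2023, 1, 5, 8, 0), datetime(2023, 1, 5, 8, 10)),
    (datetime(2023, 1, 20, 13, 0), datetime(2023, 1, 20, 13, 20)),
    (datetime(2023, 2, 3, 22, 0), datetime(2023, 2, 3, 22, 6)),
]
for month, mean_gap in mean_time_to_respond_by_month(sample_responses).items():
    print(month, mean_gap)
```

Bringing a table like this to each retro gives the team a shared starting point for the "how can we improve this metric?" conversation.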
Learn together, improve together
As you mature your incident management process, use existing incidents as a springboard for learning how to be more effective as a team — and ultimately, how to build more reliable products.
And team is the key word here. Learning from incidents should always be a team activity. Singling out one responder for extra training can lead to blame or heroic behavior. Ultimately your goal is to be the most effective team possible, so keep focusing on that.