The importance of right-sizing your retro

Here’s my stance: the incident response process isn't complete until a retrospective occurs.

The retro is an essential step in the incident response process, usually held after resolution. Its goal is to give responders a space to process their experience, understand the incident’s causes, and improve both the response effort itself and the software we build.

Despite their usefulness, we’ve found that not all teams hold retros after every incident, perhaps because they think retros aren’t worth the time or effort, especially for lower-severity incidents. In our Incident Benchmark Report, an analysis of 50,000 incidents resolved on the FireHydrant platform, we found that, on average, retros are performed after about 29% of low-severity incidents and 42% of high-severity incidents. Interest is growing, though: we also saw a 236% increase in the monthly average number of retros per company between 2021 and 2022.

So what if we just made retros easier to complete? If your team skips retros, reframe your thinking and consider right-sizing them so the retro effort level is commensurate with the severity of the incident. Ditch the one-size-fits-all process so that this important step happens at the end of every incident. Here’s how to make it happen.

Why skipping the retro isn’t an option

First, let’s just get this out of the way: skipping the retro really shouldn’t be an option. Incidents provide a unique opportunity to discover valuable insights into how your product, processes, and people operate under pressure, and retros provide the space to solidify and disseminate this knowledge. They give teams the time to turn learnings into investments in reliability.

Retros typically generate action items for teams to implement, and holding the meeting promptly gives those changes a chance to land before another incident occurs. I’d actually go as far as saying, “A retro is not complete until all follow-ups have also been completed.” Incidents are often expensive, so it’s important that the learnings from them are not only captured but also implemented.

In addition, retros can increase transparency and build trust — both internally and externally. Many companies (including us) post public-facing incident summaries to show how they addressed an incident and highlight how they will prevent the same error from occurring again, which helps build confidence — and buy grace — among customers. Retros also give your internal non-engineering teammates an opportunity to see how your team approaches incidents, which can be a learning experience for everyone. In fact, Snyk holds a monthly meeting of their Incident Response Guild where they review incidents; the meeting often brings upwards of 100 attendees.

Ways to right-size your retros

So you know you need to have retros to invest in the resilience of your process, people, and products, but how do you walk the line between enough and too much? Retros can get costly — and maybe that’s okay for SEV1 incidents. In those situations, you might want to have a wide swath of your engineering team and leaders in the room (especially if you’re having high-severity incidents frequently). But when you’re taking high-cost employees away from other high-value tasks, you want to make sure it’s worth it.

Think about minimizing costs and the demands on people’s time by optimizing your retro process to better fit your team’s needs.

Have them when customers are impacted. If your customers felt pain, you should seek to understand the cause and impact.
Institute asynchronous retros. Some retros — like ones for severe incidents — might necessitate a big sit-down meeting, but on lower-severity incidents, maybe you can get the same input through a shared document. In a world where “it could’ve been an email,” think about whether or not a live conversation is necessary to get the input of your team.
Don’t bloat the invite list. The people who responded to the incident are the most important people in the retro — consider everyone else optional. Track who participated in an incident (or let a tool like FireHydrant do it for you) and use this information to build your retro invite list.
Don’t cancel the retro if someone can’t attend. Hard to get everyone’s schedules aligned for a retro? Work around them. If a critical member of the team can’t make it, ask them to share their input directly with you or async in the retro report doc. You can deliver their answers to the rest of the group.
Share your findings. Finally, make sure you share the retrospective report after the retro has occurred. Share the report in your incident response channel, and share it with the wider company as well. That way, everyone — including those who may have wanted to attend but could not make it — has access to the incident summary. This is a great way to help senior stakeholders stay in the loop without including them in every retro.
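
The invite-list tip above can be sketched in a few lines of Python. This is only an illustration: the `Incident` record, its field names, and the `build_invite_list` helper are hypothetical and are not part of FireHydrant or any other tool. It assumes you already track who responded to an incident and who merely wants to follow along.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    """Hypothetical incident record; the field names are illustrative only."""
    title: str
    responders: list[str]  # people who actually worked the incident
    stakeholders: list[str] = field(default_factory=list)  # interested non-responders


def build_invite_list(incident: Incident) -> dict[str, list[str]]:
    """Responders are required attendees; everyone else is optional."""
    required = sorted(set(incident.responders))
    # Anyone already required should not also appear as optional.
    optional = sorted(set(incident.stakeholders) - set(required))
    return {"required": required, "optional": optional}


incident = Incident(
    title="Checkout latency spike",
    responders=["dana", "lee", "dana"],
    stakeholders=["lee", "vp-eng"],
)
invites = build_invite_list(incident)
print(invites)
```

Keeping the required list limited to responders makes the default invite small, and anyone on the optional list can still catch up from the shared report.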

How FireHydrant runs retros

At FireHydrant, we schedule retrospectives for incidents SEV2 and above within 48 hours of resolution. We schedule those meetings for 45 minutes — any longer, and we risk hitting meeting fatigue. For lower-severity incidents, we ask all participants to add their thoughts async to the retro doc in our FireHydrant account within the same time frame.

Our timeline of 48 hours is intentional: We’ve found that people feel rushed when presented with less time to plan a retro. But if more time is offered, like extending the deadline to 72 hours after an incident, the retro is more likely to fall by the wayside. Enough time goes by that the incident no longer feels like a priority. Plus, engineers are human — no matter how important an incident felt at the time, people’s memories fade the more time passes.
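
If you want to encode a policy like the one described above, a minimal sketch might look like the following. The `RetroPlan` type and `plan_retro` function are hypothetical names invented for this example, and the code assumes severities are integers where a lower number is more severe (so “SEV2 and above” means SEV1 and SEV2). Adjust the constants to match your own process.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

RETRO_WINDOW = timedelta(hours=48)  # retro happens within 48 hours of resolution
MEETING_MINUTES = 45                # length of a live retro
LIVE_RETRO_MAX_SEV = 2              # assumption: SEV1 and SEV2 get a live meeting


@dataclass
class RetroPlan:
    """Hypothetical description of how a retro should be run."""
    format: str                       # "meeting" or "async"
    due_by: datetime                  # deadline for the meeting or the async input
    duration_minutes: Optional[int]   # None for async retros


def plan_retro(severity: int, resolved_at: datetime) -> RetroPlan:
    """Pick a retro format based on severity; the deadline is the same either way."""
    due_by = resolved_at + RETRO_WINDOW
    if severity <= LIVE_RETRO_MAX_SEV:
        return RetroPlan("meeting", due_by, MEETING_MINUTES)
    return RetroPlan("async", due_by, None)


resolved = datetime(2023, 1, 2, 9, 0)
print(plan_retro(1, resolved))
print(plan_retro(3, resolved))
```

Both branches share the same 48-hour deadline; only the format changes with severity, which mirrors the idea of right-sizing the effort rather than skipping the retro.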

Regardless of how we run the retro, we always include these questions:

What was the full timeline of the incident?
What was the impact on customers?
What went well?
What did we learn?
What can we improve?

Every company should determine its own set of preferred questions, but I do recommend ending by asking for areas of improvement. You can use the answers to that question to quickly file follow-up tickets, ensuring your retro leads to positive change.

Make your retros work for you

The goal of holding retros is to ensure your team captures the important takeaways that come out of every incident, not to check a box just to say you had a meeting. Treat retros as a designated time to allow responders to decompress, share lessons learned, and create action items to optimize your incident response plans and system health.

If you don’t currently schedule retros after every incident, try out some of our tips above and start small — hold one for incidents where customers are impacted. From there, you can continue to tailor your retros to meet the needs of your team.