An incident management lens on DORA 2023
The 2023 DORA report has two conclusions with big impacts on incident management: incremental steps matter, and good culture contributes to performance. Dig into both topics and explore ideas for how to start making incremental improvements of your own.
The latest Google DevOps Research and Assessment (DORA) Accelerate State of DevOps Report 2023 came out last week. There were two specific conclusions we were happy to see and that align closely with the lived experience of those of us who’ve carried the pager:
- Even small steps toward reliability improvements matter.
- Cultural investment is essential to high performance.
This ebook will dig into both topics and give you some ideas for how to start making incremental improvements of your own.
For the uninitiated, the DORA report outlines findings about the habits of teams that exhibit high operational performance. It’s a treasure chest of insights and know-how for DevOps and SRE. You get the lowdown on where you stand compared to others and strategies for how to level up your reliability game.
One of the most notable parts of this year’s DORA report is the shift in Google's perspective on the return of investing in reliability. In previous years, the report suggested that you see substantial improvements in operational performance only after investing in significant SRE practices. This year's report has a different message: even smaller steps toward better incident management practices can yield tangible benefits.
Google is saying what those of us in incident management already know: progress isn’t necessarily linear on the way to high reliability, and what you invest compounds over time. As we continue to see more data from future DORA research, we’ll no doubt see even more about how quick wins stack up in the contexts today’s teams are working.
Another key takeaway is that building a culture of reliability and continuous improvement is not just a byproduct but a core element of achieving high operational performance. This recognition of culture as a crucial factor is particularly relevant for those in incident management and SRE roles. Culture impacts how teams respond to incidents, learn from them, and work together to implement long-term improvements.
As DORA noted, culture sets the tone for how teams approach their responsibilities and keeps everyone centered on those little things that add up to high performance in the organization. At the center of it all is a commitment to continuous improvement; teams that are open-minded, willing to learn, and shore up each others’ contributions have a healthy culture and are positioned for high performance.
Consider this the green light to implement smaller, manageable changes within your organization. Don't underestimate their impact; they can add up and create a more robust and reliable system. After working with many organizations of various sizes and maturity levels, we can say that we’ve seen positive results by implementing incremental steps in the areas below.
The first few minutes are where we have the most control in incidents. We call this assembly time, and it includes all the steps it takes to declare an incident, kick off communication channels, and get the right people into them. That’s when the real problem-solving can begin.
Clearly, it makes sense to have a straightforward single source of truth your team can follow, and the first step to that kind of documentation is to align on a process. Start with a list of tasks that must be completed in each incident. Make sure your source of truth lives in a place everyone has access to, like a company Wiki or an incident management tool. And then make a plan for who updates it and when.
From this starting point, you can build out processes over time that fit the needs created by different severities or services. This will take some forethought and more than one discussion. But — and I truly can’t stress this enough — it’s worth it. The fewer questions you’re seeking answers to during an emergency, the faster you can focus on putting out the fire and moving forward.
For more recommendations on taking steps to introduce automation to your incident management program, check out “3 ways to improve your incident management posture today.”
As you start documenting your plan, you’ll encounter tasks that must be completed in every incident. For example, maybe your process calls for starting a Slack channel no matter the severity of the incident or notifying a select group of stakeholders when certain criteria are met. These are those often rote tasks that need to be accomplished during an incident while you’re also trying to actually mitigate the incident. And therein lies the trouble.
The more decisions we have to make or tasks we have to remember to do while we’re actually responding to an incident, the more likely we are to make a mistake or forget something. But by automating as many of those rote response tasks as possible, you can reduce decision fatigue for you and your team and help them get to actually solving the problem faster (and with less stress). That incremental step also has the cultural benefit of creating a less stressful working environment.
*For more recommendations on taking steps to introduce automation to your incident management program, read “A better way: 3 incident response areas prime for automation.” *
We’ve found that The right people in the right roles will help you respond to and resolve incidents more quickly and efficiently. In fact, according to the Incident Benchmark Report, incidents with roles assigned had a 42% lower mean time to resolution (MTTR) than those that didn’t.
But what roles do you need to fill? And how? This is another area where you can take incremental steps toward a major game changer.
If you’re just starting with a formal incident management program or don’t have many dedicated resources to put toward incident response, start with a foundational team that includes the most critical roles for executing an incident response. When it comes to resolving incidents swiftly, less is sometimes more. These roles include the incident manager, incident responders, and stakeholders. In a smaller team, these roles are usually filled by the engineers or engineering managers on-call and may vary from incident to incident.
As your team grows, consider implementing additional roles that allow more specialization, which will also reduce cognitive overhead and stress, again contributing to a better work culture.
For more advice on incremental steps toward implementing roles, read “How to define roles for your incident response team.”
When your incident response process is centered around a service catalog, responders are able to more quickly pinpoint the service or functionality that’s down, bring in the team or experts, and then get to solving the problem faster. Saving even a few minutes can greatly impact decreasing the costs around incidents and outages, so having up-to-date service details at your fingertips can make all the difference.
In fact, the Incident Benchmark Report found that incidents with services attached to them had a 36% decrease in MTTR (mean time to resolve) compared to those with no services attached.
Of course, you don’t have to conquer everything “service catalog” at once. You can take incremental steps to implement and then mature your approach to service-based incident response.
If you’re building a service catalog from scratch, start by documenting your services, their owning and responding teams, contact details, repositories, documentation, and monitoring dashboards. You can still use a service catalog if you’re managing a monolith instead of microservices. Break down any monoliths by module, components, or product surface area. Each product area should have an engineering team, and those teams should be trained on your incident response process.
Once you’ve got the simple service details and dependencies out of everyone’s head, you can add valuable layers to your catalog. Add functionalities, like login or checkout, to the services that power them. Why by functionality? Because that’s how your customers think. They’re not concerned with what service is broken, they’re concerned that they can’t log in. This has the added benefit that more people in your organization can be involved in incidents without knowing the technical details of your system, which fosters a more collaborative culture of ownership.
For more information on incremental steps toward implementing a service catalog, check out “Create a service catalog that grows with you.”
Incidents are invaluable opportunities to learn about your team and see how existing processes, people, and products function when the pressure is on. Passing on the opportunity to learn from these priceless — and sometimes painful — experiences isn’t an option, and retrospectives are your chance to do just that.
But how do you go from ruh-roh to retros for all? Right-sizing retros is a strategy that allows you to align the amount of attention you give the retro to the impact of the incident. Think async retros for minor outages and fully staffed, live ones for major outages.
You might also consider implementing a practice program to help your team learn when the heat isn’t on. Run an incident response game with your team to practice what you know. Training sessions give responders and stakeholders a chance to see how prepared your organization is for the real thing.
For more information on promoting a learning culture in your organization, check out “Easy as 1, 2, 3: ways to start learning from incidents today.”
Although there aren’t any shortcuts in the data on high-performance engineering teams, there’s a healthy reminder to renew our commitment to continuous improvement.
This philosophy is helping organizations quickly find what works and what doesn’t. Iteration works and still works in 2023. Great teams are knowledge-sharing, attentive to signs of burnout, and serious about what their users need.
When you look at the report, think about how your team compares right now. And consider what practical steps you can take to strengthen your incident response. If you’d like to keep the conversation going, you can join our Better Incidents Slack community. Or, book a demo to see how FireHydrant can help make implementing these steps much easier.
See FireHydrant in action
See how service catalog, incident management, and incident communications come together in a live demo.Get a demo