Beyond Downtime

A Practical Guide to Minimizing Incident Costs

Introduction

Incidents are costly, but perhaps their true cost is hidden. Downtime immediately comes to mind, but the full cost plays out every day in opportunity cost and cultural drain, long after the incident itself is mitigated. These hidden costs exact their toll on companies where the processes, planning, and tooling are unprepared for the onslaught.

Maybe your organization is paying some of these costs right now. Building repeatable, guided response processes can help your team kick off incident response the moment it’s needed while tackling explicit costs, opportunity costs, and cultural drain all at once.

This book is your guide to the explicit costs, opportunity costs, and cultural drain caused by incidents. We’ll give you actionable, proven tips for mitigating risks and reducing costs. Along the way, we’ll share facts and statistics that illuminate the problem and point toward a tangible solution.

A note on our research

It’s notoriously hard to track things like cultural drain and opportunity cost, which are two issues this ebook explores. Here’s how we did it: FireHydrant partnered with the analyst firm Enterprise Strategy Group (ESG) to create a report proving out the economic value of our tool. However, to find the economic advantage of adopting FireHydrant, ESG first had to determine what the total cost of incidents was. That data felt too valuable not to share.

ESG used market research, company interviews, and their own research base to create an analysis of the cost of incidents on a model company of 5,000 employees, with 1,500 engineers managing an average of 51 incidents per month as interrupt work. In other words, they are not dedicated incident commanders; they are engineers who “build it and own it.”

This model organization’s incident response tool is homegrown, and it takes 400 hours per year to manage, support, and maintain this developed solution. The company has an annual voluntary turnover rate for engineers of 10%, and it takes new engineers 48 weeks to be at full speed. During this ramp-up period, they are 65% as effective as a fully ramped engineer.

Now that we’re on the same page, let’s dig in.

Chapter 1: The explicit costs of incidents

When an incident occurs in an organization, the initial response can either pave the way for a smooth and efficient resolution or lead to chaos and wasted resources. Unfortunately, many organizations still grapple with outdated incident management processes and lackluster documentation, causing unnecessary delays during the first critical moments.

As you can see below, respondents to a recent survey said some of their biggest challenges with incident management include responders who don’t know what to do, manual and time-consuming data collection, and inconsistent or lacking communication.

[Chart: top incident management challenges reported by survey respondents]

In this chapter, we'll explore the significance of optimizing incident assembly and handling, backed by valuable data from a collaborative study conducted by FireHydrant and Enterprise Strategy Group (ESG). We aim to reveal the costs incurred by incidents and show how adopting best practices can decrease these costs for businesses.

Section 1: Examining the explicit costs of incidents

A cold fact of SaaS life is that you can’t make money when your product or website doesn’t work, and those lost dollars add up fast. Downtime, SLA breach paybacks, compliance fines, and other hard costs are, by definition, the easiest to quantify, and they’re what most people think of when they think about incidents. But it’s not just about SLA breaches and customers not being able to access your product.

For example, marketing campaigns and ad dollars that direct potential users to your site or product are wasted during an outage. Then there’s the headcount expense of actually managing and mitigating the incident itself. Once you count engineers, customer support, and/or account management (and in severe cases, executives, marketing, legal, and so on), the direct costs of an incident inflate quickly.

Below, we’ll break these costs down.

Assembly time costs

When an incident occurs, ad-hoc processes and poor documentation consume precious resources from the very first minutes. For most organizations, assembling the right engineers, information, and processes to properly declare an incident and begin remediation takes much longer than it should.

As a result, for our composite organization, with its manual, ad-hoc processes and poor documentation, assembling the right people, information, and context can:

  • Take anywhere from 12 minutes for less-complicated incidents to more than 90 minutes for a SEV0 or complicated incident.
  • Involve five responders for a low-severity incident and eight or more for a high-severity incident.

Based on these numbers, we estimate that slow assembly time annually consumes 252 person-hours for low-severity, 1,094 person-hours for medium-severity, and 2,016 person-hours for high-severity incidents. That’s a total of 3,362 person-hours per year, or the equivalent of 1.8 full-time engineers.

That means slow assembly time alone accounts for $236,000 in annual labor costs.
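
To make that math concrete, here is a minimal sketch of how the assembly figures roll up, written in Python. The 1,880 productive hours per engineer-year and the roughly $70 fully burdened hourly rate are our assumptions, back-calculated from the figures above rather than taken directly from the ESG report.

    # Rough roll-up of assembly time costs for the composite organization.
    # Assumed (not from the ESG report): ~1,880 productive hours per
    # engineer-year and a ~$70 fully burdened hourly labor rate.

    ASSEMBLY_HOURS = {"low": 252, "medium": 1_094, "high": 2_016}  # person-hours/year
    HOURS_PER_ENGINEER_YEAR = 1_880
    BURDENED_HOURLY_RATE = 70  # USD, assumed

    total_hours = sum(ASSEMBLY_HOURS.values())              # 3,362 person-hours
    ftes = total_hours / HOURS_PER_ENGINEER_YEAR            # ~1.8 FTEs
    annual_labor_cost = total_hours * BURDENED_HOURLY_RATE  # ~$235K, in line with the ~$236K above

    print(f"{total_hours:,} person-hours ~ {ftes:.1f} FTEs ~ ${annual_labor_cost:,.0f} in annual labor")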

[Chart: Assembly costs]

Inefficient response effort costs

Inadequate and inconsistent declaration leads to inefficiencies throughout the process, across engineering, customer support, and account management teams. More resources than needed are called to investigate and resolve the problem. Throughout the incident:

  • Step-by-step guidance is missing, leading to process missteps, blind alleys, gaps, and friction.
  • The incident and timeline are documented manually, a tedious consumption of engineering resources.
  • Hand-offs are manual and ad-hoc, slowing the process and wasting even more time.
  • Communication updates, including channel updates, stakeholder updates, and customer updates, are manual and inconsistent.
  • A lack of retrospectives and analysis means that, too often, the same functions cause issues, the same inefficient plays are run, mistakes are repeated, and inefficiencies remain.

This has a real annual cost for the composite organization. Each incident consumes an additional 2 person-hours per responder for low-severity incidents, up to 8 person-hours per responder for high-severity incidents. If we add that to our assembly costs, we find poor incident response practices cost our model organization 10,530 person-hours per year, the equivalent of 5.6 full-time engineers.

That’s a total of $740,000 in annual labor costs.

[Chart: Inefficient response costs]

MTTR and downtime costs

Friction and gaps in the incident assembly and declaration process mean that the right people don't begin addressing the incident as quickly as possible, leading to delayed response at the most critical time. A lack of step-by-step guidance during the incident management process introduces incident resolution gaps and friction, further slowing the time to resolution.

A lack of learning means that the process doesn’t improve and optimize over time – a rinse-and-repeat cycle of degradation and downtime impacts the business.

Outages and service degradations from these delays lead to an impact on operations, customers, and the brand:

  • When a back-office application is impacted, employee productivity can be affected.
  • When it’s a product or e-commerce site, customers are directly impacted, and brand damage can occur, leading to a loss of referrals and future sales beyond the immediate revenue loss.
  • When it’s the corporate website, brand perception can be impacted and sales indirectly lost.

Outages and degradations can also carry a direct cost: breaches of service level agreements (SLAs) with customers lead to compliance fines and penalties.

Based on the severity of the incident, the downtime impact for our composite organization is an aggregate across operations, customers, and the brand, estimated at an average cost per downtime hour of:

  • Low: Sev4 + Sev5 = $5,000 / hour
  • Medium: Sev3 + Unset = $25,000 / hour
  • High: Sev1 + Sev2 = $100,000 / hour

Of course, not every incident causes downtime, any downtime usually affects only a portion of the operation and customer base, and hot fixes are usually put in place to get operations and customers back up and running, despite the average MTTR being 24 hours.

For our model organization, we assume the effective downtime impact ranges from 0.0% for low-severity incidents, to 0.2% for medium-severity incidents, and 1.0% for high-severity incidents.

Factoring in these downtime scopes and timing impacts for our composite firm, this adds up to 51 hours of annual unplanned downtime.

That's $4.4M in annual impact to business operations, customers, and brand.
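
Here is a minimal sketch of that downtime model in Python. The cost-per-hour figures, 24-hour MTTR, and impact percentages come from the numbers above; the split of roughly 612 annual incidents across severity tiers is our own illustrative assumption, chosen so the output lands near the report’s ~51 hours and ~$4.4M.

    # Downtime cost model sketch. MTTR, impact fractions, and cost per
    # downtime hour are from the figures above; the incident counts per tier
    # are hypothetical, for illustration only.

    MTTR_HOURS = 24
    TIERS = {
        # tier:   (incidents/year [assumed], downtime impact, cost per downtime hour)
        "low":    (244, 0.000,   5_000),
        "medium": (194, 0.002,  25_000),
        "high":   (174, 0.010, 100_000),
    }

    downtime_hours = 0.0
    downtime_cost = 0.0
    for incidents, impact, cost_per_hour in TIERS.values():
        hours = incidents * MTTR_HOURS * impact
        downtime_hours += hours
        downtime_cost += hours * cost_per_hour

    # Prints roughly 51 hours and $4.4M, in line with the composite model.
    print(f"~{downtime_hours:.0f} downtime hours/year, ~${downtime_cost:,.0f} in annual impact")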

[Chart: MTTR and downtime costs]

Incident management tools

To help address some of these incident management challenges, our composite organization builds its own Slack bot and keeps manual documentation of service owners and processes.

For our composite organization, we assume a small homegrown application developed to help guide the incident process, plus shared spreadsheets and documents to maintain and support, at an annual cost of 400 person-hours and $22,500 in licensing and hosting.

That's $28,128 in annual maintenance and support labor costs.

[Chart: Support and maintenance costs]

The total explicit costs of incidents

Delayed incident responses clearly have severe implications for organizations. Once we add these costs together, we see that for our composite organization, the explicit cost of incidents nets out to around $5.4M.

Section 2: Ways to lower the explicit costs of incidents

There are a lot of costs associated with incidents that we can’t control. One thing we can control, however, is assembly time. With optimized incident management practices and platforms, organizations often see a significant reduction in assembly time. Faster assembly means you start fixing the problem sooner, which ultimately helps you stop the bleeding sooner.

By doing things like assigning roles during incidents, structuring response efforts around services or functionalities, and automating tasks like updating status pages and kicking off communication, companies told ESG that they were able to reduce assembly time to mere seconds. Quicker response times and the elimination of unnecessary involvement add up to substantial time and cost savings for organizations.

So how do you achieve these gains? To mitigate the costs and challenges associated with incidents, organizations can adopt a set of best practices to optimize incident assembly and handling like the ones below.

Well-defined incident response plan

Establishing a clear and structured incident response plan is essential. Consistent processes and established ownership are key. Your plan should outline roles, responsibilities, and escalation procedures, enabling quick identification of the right personnel during an incident. Implement a clear prioritization system for incidents based on their potential impact, enabling effective resource allocation and prompt resolution of high-impact incidents.

Defined severity levels are essential to your response plan. They quickly get responders and stakeholders on the same page on the impact of the incident, and set expectations for the level of response effort — both of which help you fix the problem faster.
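
As a purely illustrative sketch (the levels, criteria, and expectations below are ours, not a FireHydrant or ESG standard), severity definitions can live in a small config or code file where every responder can find them:

    # Hypothetical severity definitions; adapt the names, criteria, and
    # response expectations to your own organization.
    SEVERITIES = {
        "SEV1": {
            "criteria": "Customer-facing outage or data loss",
            "response": "Page on-call now, assign an incident commander, update the status page",
            "retro": "Required within 48 hours",
        },
        "SEV2": {
            "criteria": "Degraded customer-facing functionality with a workaround",
            "response": "Page the owning team, open an incident channel",
            "retro": "Required within 48 hours",
        },
        "SEV3": {
            "criteria": "Internal-only impact or minor degradation",
            "response": "Owning team handles during business hours",
            "retro": "Async notes in the retro doc",
        },
    }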

When the incident itself hits, sometimes responding feels like herding cats. Assigning roles and responsibilities during an incident is critical to minimizing confusion and ensuring that the right people are working on the appropriate tasks. An incident manager should own the entire incident response process and be prepared to coach your incident response team through it.

As your process matures, you’ll gain a better understanding of your organization's specific needs for responding to incidents. That clarity will help you tailor your response roles and incident plan based on the incident at hand instead of taking a one-size-fits-all approach.

Effective communication channels

Ensure open and efficient communication channels to promptly inform all stakeholders during incidents, reducing confusion and delays. Stakeholders in an incident include employees, customers, vendors, and the general public. Keeping stakeholders informed throughout the incident is crucial to maintaining their trust and confidence in your team and the organization.

This can be extended externally to customer status pages as well. Customer success should be considered a part of the response effort for any customer-facing incident. In high-severity incidents, you might have a CS representative as one of the assigned roles on your responder team.

Establishing consistent communication channels keeps everyone working on the incident informed, preventing delays caused by miscommunication. Effective cross-department communication is essential and can be particularly challenging during an incident. Developing communication protocols in advance ensures that everyone is on the same page.

Automation and tooling

Leverage incident management platforms with automation and streamlined workflows to expedite incident assembly and handling, automate rote incident tasks, and allow teams to focus on efficient resolution. Automation can be accomplished by bringing in a tool or building in-house IFTTT or bot technologies.

However, when thinking about building vs. buying, think about the full picture. Scope the process like you’re building a product, because you are. In many cases, there are components that go into building and running an incident management tool that aren’t considered. These are the three most common:

  • Are you ready to build a product? What might seem like a 10% effort project will end up being much more time-consuming than you think.
  • How much is your team learning from incidents? To truly move forward in the nebulous goal of “being more reliable,” you need two things: a streamlined incident response process that helps you resolve incidents more quickly, and the commitment to learning from incidents and using those learnings to improve your systems.
  • How much incident expertise exists in your company? Just like with your observability and monitoring tools, some experts are devoting their entire careers to building all of this for you. Opting for the “buy” option gives you the added expertise of people who are thinking about incidents all day long.

If all you’re looking to do is automate a few things and you have the resources for building your own tool, it might be worth a try. There are open-source tools and IFTTT-Slack integrations that can make automating a few simple processes easy. But anything beyond that and you get into expensive territory that takes precious development hours away from your core product.
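
If you do go the lightweight build route, the automation can start as small as a script that spins up a dedicated channel and posts the first update. Here is a minimal sketch assuming the slack_sdk Python package and a bot token with channel-creation and posting scopes; the channel naming scheme and message text are placeholders, not part of any particular tool.

    # Minimal homegrown incident kickoff sketch. Assumes slack_sdk is installed
    # and SLACK_BOT_TOKEN has the channels:manage and chat:write scopes.
    import os
    from datetime import datetime, timezone

    from slack_sdk import WebClient

    def open_incident_channel(title: str, severity: str) -> str:
        client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

        # One channel per incident, e.g. #inc-20240801-1542-checkout-errors
        stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M")
        name = f"inc-{stamp}-{title.lower().replace(' ', '-')[:40]}"

        channel_id = client.conversations_create(name=name)["channel"]["id"]
        client.chat_postMessage(
            channel=channel_id,
            text=f":rotating_light: {severity} declared: {title}. "
                 "Roles, status page, and stakeholder updates still need owners.",
        )
        return channel_id

    # open_incident_channel("Checkout errors spiking", "SEV2")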

Proactive documentation and step-by-step guidance

Maintain up-to-date documentation of systems and services to save valuable time during incidents, providing key contacts and procedures for swift assembly. By removing the question “What do I do next?”, you allow responders to move faster. This also makes for smoother handoffs between teams or shifts, ensuring the right information reaches the right people promptly. Consider building out response plans for each of your major functionalities or services to start.

If you’re building a service catalog from scratch, though, keep it simple. At its most basic, a service catalog is simply a list of internal and external technical services (enterprise applications, task-specific tools, microservices, APIs, and so on) used by your organization, and relevant details like owner, code location, and operational dashboards. By documenting this information, you help knock down knowledge silos and ensure everyone has the information they need to respond to incidents confidently — a big deal when you’ve just been paged at 1 a.m.

Start by listing all the services with their owning and responding teams, contact details, repositories, documentation, and monitoring dashboards. If you’re managing a monolith instead of microservices, you can still use a service catalog. Break down any monoliths by module, components, or product surface area. Each product area should have an engineering team associated with it, and those teams should be trained on your incident response process.
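
A starter catalog doesn’t need a dedicated tool. Even a checked-in data structure like the hypothetical sketch below (the service names and URLs are placeholders) captures the essentials called out above:

    # A minimal service catalog entry; every value here is a placeholder.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Service:
        name: str
        owning_team: str
        on_call_contact: str            # Slack channel, pager schedule, etc.
        repository: str
        runbook: str
        dashboards: List[str] = field(default_factory=list)

    CATALOG = [
        Service(
            name="checkout-api",
            owning_team="payments",
            on_call_contact="#payments-oncall",
            repository="https://github.com/example/checkout-api",
            runbook="https://wiki.example.com/runbooks/checkout-api",
            dashboards=["https://grafana.example.com/d/checkout"],
        ),
    ]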

Post-incident analysis

Your incident response process isn’t complete until a retrospective occurs. Conduct thorough post-incident retros to identify areas for improvement, learning from past incidents to enhance the overall incident management process.

Despite their usefulness, we’ve found that not all teams hold retros after every incident, perhaps thinking they’re not worth the time or effort, especially on lower-severity incidents. In fact, in our Incident Benchmark Report, an analysis of 50,000 incidents resolved on the FireHydrant platform, we found that on average, retros are performed after about 29% of low-severity incidents and 42% of high-severity incidents. Interest is growing, though: we also saw a 236% increase in the monthly average number of retros per company between 2021 and 2022.

At FireHydrant, we schedule retrospectives for incidents SEV2 and above within 48 hours of resolution. We schedule those meetings for 45 minutes — any longer, and we risk hitting meeting fatigue. For lower-severity incidents, we ask all participants to add their thoughts async to the retro doc in our FireHydrant account within the same time frame.

Our timeline of 48 hours is intentional: We’ve found that people feel rushed when presented with less time to plan a retro. But if more time is offered, like extending the deadline to 72 hours after an incident, the retro is more likely to fall by the wayside. Enough time goes by that the incident no longer feels like a priority. Plus, engineers are human — no matter how important an incident felt at the time, people’s memories fade the more time passes.

Regardless of how we run the retro, we always include these questions:

  • What was the full timeline of the incident?
  • What was the impact on customers?
  • What went well?
  • What did we learn?
  • What can we improve?

Embrace the benefits of better incident management

Efficient incident assembly and handling are critical for organizational resilience and productivity. By embracing modern incident management practices and following best practices, organizations can improve their incident response, minimize downtime costs, and foster a positive engineering culture.

The data-backed insights from FireHydrant and ESG's collaborative study highlight the tangible benefits of optimizing incident management practices. As businesses continue to navigate incidents, it is essential to prioritize efficient response processes, enabling teams to resolve incidents swiftly, reduce costs, and build more reliable and resilient systems.

Chapter 2: The opportunity costs of incidents

The cost of incidents goes beyond the time spent resolving them; it's also the lost opportunity to focus on developing the next big thing. This is where the concept of opportunity cost creeps in.

Incidents are an inherent part of the software development journey. But for engineers who are charged with maintaining and debugging their own software, incidents can mean getting thrust into a firefighting mode that diverts their attention from regular projects and revenue-generating tasks.

Section 1: Examining the opportunity costs of incidents

Opportunity costs mostly center on lost developer productivity. Instead of building core revenue-producing products, engineers are sidetracked by having to respond to incidents. The easier it is to respond, the faster they can get back to their core job. The opportunity cost is the loss of what would have been built with that time.

That tax on employees’ time is where a lot of implicit costs come in. There are rote, often manual, tasks that have to be completed. For example, most incident response processes involve declaration and assembly tasks that have to be done before actual problem-solving can even start. Then, while problem-solving, you have additional periodic housekeeping-type tasks, like communication updates and timeline documentation, that interrupt response efforts and can prolong the outage.

Think back to the last incident you had to handle. How much of your valuable time was consumed by managing the incident (sometimes more difficult than mitigating the incident!), leaving less time for what you truly love — building excellent features and pushing the boundaries of innovation? Poor incident response practices, such as involving too many people or holding endless incident meetings, only exacerbate this lost opportunity.

Lost developer time opportunity

When engineers are interrupted to respond to incidents, they are taken away from working on revenue-producing products and projects. Inefficiencies in incident response might include:

  • An all-hands-on-deck approach to incidents means too many people are joining each incident.
  • A lack of service ownership means it’s a scramble to find the point person and rollback plan for each service or functionality.
  • Rote activities, like communication updates, channel/bridge creation, retro assembly, etc. are all done by hand, wasting time that adds up.

All of these inefficiencies mean that engineers are diverted from their revenue-producing roles for longer than necessary on each incident. For our composite organization, the estimated opportunity cost of each hour an engineer spends on incidents instead of projects is $304, a 5x uplift over the engineering salary that represents the margin contribution to the business.

For our composite organization of 1,500 engineers, spending an estimated 0.4% of their time on incidents, the annual cost of the lost developer time opportunity adds up. They spend 13,892 person-hours — or 7.4 full-time equivalents — on inefficient incident response per year instead of development.

That puts our total annual opportunity cost at $4.2M.
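
Using the same assumption as the labor math in Chapter 1 (about 1,880 productive hours per engineer-year), the opportunity cost roll-up looks like this; the 13,892 person-hours and $304-per-hour figures come from the composite model above.

    # Opportunity cost roll-up for the composite organization. Person-hours
    # and the $304/hour margin figure are from the model above; the ~1,880
    # productive hours per engineer-year is our assumed FTE conversion.

    INCIDENT_PERSON_HOURS = 13_892
    OPPORTUNITY_COST_PER_HOUR = 304   # USD of lost margin contribution
    HOURS_PER_ENGINEER_YEAR = 1_880   # assumed

    ftes_lost = INCIDENT_PERSON_HOURS / HOURS_PER_ENGINEER_YEAR                  # ~7.4 FTEs
    annual_opportunity_cost = INCIDENT_PERSON_HOURS * OPPORTUNITY_COST_PER_HOUR  # ~$4.2M

    print(f"~{ftes_lost:.1f} FTEs diverted, ~${annual_opportunity_cost:,.0f} per year")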

Engineering ramp-up

When there is a lack of procedure documentation, guides, and playbooks, it is difficult for a new engineer to come up to speed in a reasonable time frame. And because DevOps turnover is relatively high at most organizations, getting engineers up to speed quickly is important but difficult in a reactive, ad-hoc, manual incident response environment. So this is another area of cost we must consider.

For our composite organization, the annual cost of ramp-up lost opportunity is:

  • 20% of engineers added or replaced each year (which can be much higher in some organizations), or 300 engineers for our composite organization
  • 48-week ramp-up duration on average for incident response, with 45% effectiveness during the ramp-up time
  • 0.5% of time spent on incident handling
  • Equivalent to 1,277 wasted person-hours, or 0.7 full-time engineers

That adds an additional $90k in lost opportunity cost from ramp-up inefficiencies each year.

The total opportunity cost of incidents

This puts our total estimate of the dollar amount attached to lost engineering time at $4.3M.

So, what can we do about it?

Section 2: Ways to lower the opportunity costs of incidents

As responders, we can proactively address incident management challenges and minimize the opportunity cost with a few practical steps. You can implement these best practices yourself, or consider purchasing a tool that uses these best practices to streamline incident response for you (like ours).

Effective communication and collaboration

Open lines of communication are vital during incidents. Ensure everyone is on the same page and incidents are communicated promptly. Streamline communication channels to avoid confusion and delays.

Most organizations have a handful of people or teams who swoop in and save the day when a technical crisis arises. When it comes to resolving incidents swiftly, less is sometimes more. But not everyone can be or should be what we call a Shaq.

Shaq is one of the most celebrated NBA players of all time. He played for six teams over his 19-year career and won countless awards — for good reason. When the team needed two points, they knew they could throw the ball to Shaq, and let him go to work.

So what’s the problem here? Games are getting won, incidents are getting solved, it’s true. But are you truly setting your organization or team up for success by “doing it for them” every time? Relying on the most dominant player to save the day isn't a scalable solution for an organization’s continued success.

Robust incident response planning and role definition

Work with your team to create a clear, structured incident response plan. Define roles, responsibilities, and escalation procedures so everyone knows what to do when an incident occurs. A well-thought-out plan saves time and helps us return to our revenue-producing tasks faster.

If you’re just getting started on building a foundational team, here are three core roles to define for incident response:

  • Incident manager: owns the entire process. Someone with a decent amount of institutional knowledge who can provide direction and communicate updates at large.
  • Incident responders: everyone who mitigates incidents.
  • Internal stakeholders: anyone invested in the outcome who must be kept informed.

In a smaller team, these roles are usually filled by the engineers or engineering managers on-call and may vary from incident to incident.

Leverage automation

Use incident management tools to automate repetitive tasks. Adopting an incident management automation tool can help businesses simplify and conquer incident response challenges to increase efficiency and customer satisfaction while decreasing burnout.

Automation removes the manual burden, reduces errors, and lets us focus on what matters most: writing code and building innovative products. ESG found that with optimized incident management, lost productivity time is reduced by 10%, recovering an estimated $422,000 of productive work per year. These automation strategies include:

  • Removing the “what do I do next” part of an incident by automating workflows from the start of an incident all the way through to the retro
  • Integrating the tools in your incident response tech stacks, like Slack, Zoom, Statuspage, GitHub, and Jira so they can speak to each other — and ideally so you can get a single dashboard look at your incidents
  • More easily identifying downstream impacts and automatically pulling in service owners and other stakeholders
  • Simplifying overall communication, both internally and externally with customers through status pages
  • Automatically importing data from the incident Slack channel into the retro doc to create a timeline that makes learning from the incident easier than ever
  • Providing a single source of truth to help unify engineers around incident management

Ultimately, these are the areas that are going to help you get off the incident treadmill. Incident management tools provide you with out-of-the-box automation around assembly and communication that help you get to actually fixing the problem faster.

Prioritize incidents wisely

Not all incidents are equal, right? This means severity matters. Severity levels set expectations and help your response teams respond effectively. Otherwise, without a workable way to prioritize incidents, you’re planning for failure.

Establish a transparent prioritization system based on the potential impact of incidents. Allocate resources according to the urgency and importance of the issue. Higher-severity incidents usually demand a more significant allocation of resources, such as experienced engineers, technical experts, or even involving cross-functional teams. Lower-severity incidents may require fewer resources, allowing teams to focus on higher-priority issues.

Of course, the more incidents directly impact your customers, the more severe they are. Your incident prioritization should account for customer impact.

Learn from incidents

Continuous learning is the key to optimizing incident response. Conduct a thorough post-incident analysis after resolving an incident to learn how well your people, processes, and products perform under pressure. Each incident provides a chance to implement improvements and prevent similar issues in the future.

If you’re not already documenting timelines, notes, and thought processes during your incidents, now is the time to start. Time gets fuzzy during an incident, and we can’t rely on memory alone. Documentation provides an excellent way for the team to review what happened after an incident.

As you gather data, don’t miss the opportunity to review and flesh out details as soon as the incident is under control and mitigated. Highlight important aspects that could benefit from further analysis; this documentation will be important to revisit later.

Invest in skill development

Enhance your team's skills in incident response. Proper training equips us to resolve incidents efficiently and effectively, minimizing unnecessary steps and delays.

A team benefits from players with different roles and specialties all working together under sound coaching and organizational direction. Give the other members of your team the opportunity to learn, expand their own skills, and bask in that hero’s glow. Provide guidance, give people the info they need (or better yet, help them figure out where to find it), and lend a hand when asked.

Furthermore, consider how you can help others on your team step up the next time an incident arises. There’s no better way to flag a weakness up the chain of command than by demonstrating what happens when you’re not there to fix things, and that’s a surefire way to direct some resources toward incident management.

Safeguard against opportunity costs

Proactively address the opportunity cost of poor incident management by understanding the risks and mitigating their toll on your organization. Guard your productivity and revenue generation with effective communication, robust incident response, automation, prioritization, learning, and skill development.

Chapter 3: The cultural drain of incidents

By now we know that when incidents strike, the consequences go beyond resolving the issue: they can be culturally draining as well. Engineers can quickly get burned out when every incident declaration turns into a fire drill, affecting their well-being and job satisfaction. The constant interruptions to their daily mandates and schedules lead to disengagement, making them feel detached from their core responsibilities and goals.

Section 1: Examining the cultural drain of incidents

The drain on our engineering culture caused by poor incident management practices can be overwhelming. Product and engineering teams are happiest when they’re consistently creating solutions that provide value to their customers. When they begin to constantly context switch between value-adding deployments and unexpected incidents, burnout can settle in fast.

It’s not just engineers who risk burnout, though; poor incident response efforts affect other teams as well. For example, customer support and account management teams are on the front lines when your company has an outage, too. If communication efforts are stunted and the incident status is unclear, they may create their own ad-hoc process and accidentally distract incident commanders by pestering for status updates, leading to an unpleasant experience for everyone. This is not only an expensive distraction, it’s also a morale killer.

When every incident is a fire drill and they see no improvements being put in place, engineers can quickly get burned out, leading to disengagement and higher-than-anticipated turnover. Not all turnover can be mitigated, but incidents can certainly add up and tax engineering patience.

In our composite organization with 1,500 engineers, turnover is high and each replacement is expensive. At a 10% annual voluntary turnover rate (team members leaving on their own for other opportunities), they replace 150 engineers each year. Estimating half of a fully burdened salary, or $57k, to replace each lost team member, that results in $8.6M in annual impact from turnover.

However, it's not all doom and gloom. We can change this narrative and create a more positive and productive incident management culture.

Section 2: Ways to lower the cultural drain of incidents

For starters, responding team members will feel more prepared if they know there is a concrete incident response plan. Once you have your response process defined, build your team’s confidence by boosting their operational knowledge and giving them space to make mistakes in a safe environment. During an incident, the most important goal is mitigation — the team should focus on decreasing customer impact, not on long-term resolution or root-cause analysis. Once systems are running, you can do a deeper investigation with less stress.

Proper preparation via practice, planning, and ensuring your team understands how to approach an incident in the heat of the moment will lead to a more efficient incident response and a much calmer, happier team. Here are some additional ways to improve.

Embrace a blameless post-incident review culture

Let's shift the focus from finger-pointing to learning and growth. By encouraging a blameless post-incident review culture, we create a safe space for engineers to share insights and lessons learned without fear of repercussions. This transparency fosters a culture of openness and continuous improvement.

To develop a blameless review culture, stick to the facts without accusations. Keep your review from becoming personal. Ultimately, a toxic culture does nothing to prevent future incidents. Retros should take a continuous-improvement approach that looks squarely at uncovering prevention strategies. Essentially, make your retros about the future; this encourages an open, honest culture more apt to problem-solve.

Invest in incident management training and skill development

Incident management is a specialized skill that requires training and ongoing development. Invest in incident management workshops, simulations, and resources to empower your teams with the knowledge and tools to handle incidents efficiently.

Incident training game days help build confidence among engineers by giving them an opportunity to familiarize themselves with the incident response process and tools before they’re in the throes of a high-pressure incident. It’s an opportunity to poke holes in your processes — What’s not accounted for? Where are there gaps? — which you can then account for outside of an incident.

Everyone, even seasoned engineers, can benefit from running through the processes live in a no-risk, simulated environment.

Celebrate incident response successes and efforts

Recognize and celebrate the hard work put into resolving incidents promptly and effectively. Celebrating incident response successes fosters a positive and resilient culture where engineers feel valued and motivated to tackle future challenges.

Focusing on incident response efforts that were full of mistakes is a natural tendency — and it’s important to know what failed and why — but your team also needs praise and positive reinforcement of successes. When responses go well, calling attention to what everyone did right helps your engineers and others know they’re appreciated and their work is recognized.

And rather than focusing on a single hero (even if one member played a significant role), looking for ways to empower the whole response team to step up helps ensure you’re prepared during future incidents. Give different responders chances to build their skillsets and actively contribute.

A brighter future with efficient incident management

By prioritizing incident management and nurturing a supportive engineering culture, we can overcome the hidden costs of poor incident management. Adopting modern incident management platforms and best practices can streamline incident assembly, improve incident handling, and reduce downtime costs.

Great incident management is a differentiator

For the composite organization, these costs include almost 30,000 engineering person-hours squandered across these categories each year, an impact of 15.7 FTEs. Not every one of these costs can be avoided, but there is a clear opportunity to replace manual, ad-hoc, reactive incident response efforts with a better approach, helping to substantially address this lost productivity and opportunity cost.

For the composite organization, these costs add up annually to:

  • Explicit costs of incidents: $5,407,303
  • Opportunity costs of incidents: $4,311,778
  • Cultural drain of incidents: $8,570,400

That’s $18,289,480 in total cost of incidents, inclusive of both time and dollars.

So many DevOps functions have been automated and transformed to help overcome challenges and reduce costs, with one final frontier remaining that hinders an organization’s ability to meet business and customer expectations: Incident management. With such a high annual cost of incidents, opportunity cost, and cultural drain, it is high time to address the incident management challenge.

Putting some thought into your incident management strategy on the front end will help you maximize your engineers’ time and refocus that energy into your product. More than ever, we should all be focused on shipping great products, retaining high-demand engineers, and building trust with customers. And investing in a thoughtful incident management strategy is one way to get there.

Simply put, incident management done well can be a differentiator — in outpacing the competition, and recruiting and retaining talented engineers. It’s an area where investment can give you exponential value beyond the actual incident.

If you’re interested in hearing more on this topic, we'll be hosting The Better Incidents Summer Bonfire in September. It's a 40-minute roundtable discussion on the high costs of poorly managed incidents. Featuring a panel of incident responders from guest companies, the Bonfire will explore the explicit costs, implicit costs, and cultural drain of unoptimized incident management and include time for Q&A. Save your spot now.

See FireHydrant in action

See how service catalog, incident management, and incident communications come together in a live demo.

Get a demo