The hidden costs of poor incident management

With the market proving its unpredictability and budget holders tightening purse strings, we’re all looking to maximize every dollar spent, every hire made, every hour logged. But there’s one cost center you might not be fully thinking about — incident management.

Let’s face it: a goal of zero incidents is unrealistic and will only set your engineering teams up to fail. But a goal of reducing both the explicit and implicit costs associated with how you manage incidents? Now we’re talking. (Turning incidents into an actual value-driver for your company? Let's not get ahead of ourselves, but we've got ideas on that, too.)

But what are those costs? If you’re a company like Amazon, every minute you’re down can be converted into a certain dollar amount in lost sales, so a direct correlation is easy enough to figure out. But that’s not the financial model for a lot of companies, including FireHydrant — and in those cases, it can be harder to determine the impact.

When that’s true, I think about the cost of things like staffing, overhead, workflow, and throughput. In short: how much time is your team spending on incidents? If they’re in an incident, they can’t spend their time doing something else, like building your revenue-generating services or products.

In this blog post, I’ll discuss a few of those explicit and implicit costs associated with incidents, as well as some ideas for how you can reduce those costs.

The hard costs of incidents

These are the costs that are easiest to quantify and that most people think of when they think about incidents. We’re talking downtime, SLA breach paybacks, compliance fines, and the list goes on.

A cold fact of SaaS life is that you can’t make money when your product or website doesn’t work, and those lost dollars add up fast. According to the Uptime Institute’s 2022 Outage Analysis, the share of incidents resulting in at least $100,000 in total losses jumped from 39% in 2019 to 60% in 2022. And the share of outages that cost upward of $1 million increased from 11% to 15% over that same period.

It’s not just about SLA breaches and customers not being able to access your product though. For example, marketing campaigns and ad dollars that direct potential users to your site or product are wasted during an outage. And we also have the headcount expense of actually managing and mitigating the incident itself. When we’re talking about engineers, customer support, and/or account management (and in severe cases, executives, marketing, legal, etc.), the direct costs of an incident inflate quickly.

The opportunity costs of incidents

That tax on employees’ time is where a lot of implicit costs come in as well. There are a number of rote, often manual, tasks that have to be completed. For example, most incident response processes involve declaration and assembly tasks that have to be done before actual problem solving can even start. Then, while problem solving, you have additional periodic housekeeping-type tasks, like communication updates and timeline documentation, that interrupt response efforts and can prolong the outage.

And that’s when you have instructions for those tasks outlined in your response process. Companies that don’t have a documented incident response process often turn to an “all hands on deck” mentality that actually slows down the response effort. We found in our Incident Benchmark Report that teams that counter this mentality by including role assignment on incidents saw a 42% decrease in mean time to resolve.

Pointing engineers at tasks that could be automated or, worse, throwing them into a chaotic response effort not only slows down the mitigation process but also creates a huge opportunity cost elsewhere. When your engineers are busy fighting fires, they’re not working on your revenue-supporting products. In fact, according to Forrester, 47% of companies that experience downtime say it contributes to a loss in productivity.

In effect, you’re slowing everything down with poor (or no) incident response tactics. Suddenly features are shipping slower and technical debt is adding up because engineers are spending time and mental energy elsewhere. This leads to a lack of improvements, makes scaling more difficult, and ultimately impacts morale, our next hidden cost.

The cultural drain created by incidents

Product and engineering teams are at their happiest when they’re consistently creating solutions that provide value to their customers. When they begin to constantly context switch between value-adding deployments and unexpected incidents, burnout can settle in fast.

And when you mismanage your incidents, you don’t learn from them, which means the same incidents keep popping up and you keep using the same crappy process to manage them. This can lead to attrition and can hurt your reputation among potential employees as well. To build and maintain a healthy culture that attracts top talent, you need to have programs and processes worth top-talent brainpower.

It's not just engineers who risk burnout, though; poor incident response efforts affect other teams as well. For example, customer support and account management teams are on the front lines when your company has an outage, too. If communication efforts are stunted and the incident status is unclear, they may create their own ad-hoc process and accidentally distract incident commanders by pestering them for status updates, leading to an unpleasant experience for everyone. This is not only an expensive distraction, it’s also a morale killer.

Ultimately, this can all lead to a damaged reputation among customers, who might opt to go with competitors who have a reputation for being more reliable, even if those competitors aren’t actually more reliable. How you communicate and show ownership over your incidents has the potential to buy you grace with customers. If your competitors have a more even-keeled culture when it comes to incidents, prepare to lose business to them.

So now what?

You can’t control the fact that you have incidents. There have been incidents since the dawn of software development, and we’ll never eradicate them totally. And that’s good! It means we’re innovating, trying new technology, solving new problems. The key is to control what you can — and that’s how you manage the incidents you do have.

If you’re looking to lower the costs associated with incidents in your organization, there are two places I recommend starting:

First, determine how much time your team is spending on incidents. One way to do this is by calculating the mean time to resolution across your incidents (right around 24 hours on average, according to the Incident Benchmark Report). Remember, when your team is working on incidents, they’re not working on something else, so this can be a good way to measure the opportunity cost of incidents.
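To make that calculation concrete, here is a minimal Python sketch. It assumes you can export each incident's open time, resolve time, and number of responders; the `Incident` record, its field names, and the sample data are all hypothetical and not tied to any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record shape; adapt the fields to whatever your tooling exports.
    opened_at: datetime
    resolved_at: datetime
    responders: int  # people pulled into the incident


def mean_time_to_resolution(incidents):
    """Average of (resolved_at - opened_at) across all incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((i.resolved_at - i.opened_at for i in incidents), timedelta())
    return total / len(incidents)


def responder_hours(incidents):
    """Rough upper bound on time spent: duration times headcount, in hours."""
    return sum(
        (i.resolved_at - i.opened_at).total_seconds() / 3600 * i.responders
        for i in incidents
    )


# Made-up sample data for illustration only.
incidents = [
    Incident(datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 11, 0), responders=3),
    Incident(datetime(2024, 1, 2, 10, 0), datetime(2024, 1, 2, 14, 0), responders=2),
]

mttr = mean_time_to_resolution(incidents)
hours = responder_hours(incidents)
print(f"MTTR: {mttr}, responder-hours: {hours}")
```

The responder-hours figure assumes everyone stayed engaged for the full incident, so treat it as a ceiling on the opportunity cost rather than an exact number.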

One area where you can have a fast impact is mean time to respond: the length of time from when the problem is identified to when the mitigation effort begins. This number can tell you a lot about whether you're burning time and money on rote tasks and toil. And, since it's an area that's ripe for automation, it's a great place to start eliminating costs.
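As a rough sketch of how you might measure this, the Python below averages the gap between when a problem was identified and when mitigation began. The function name, the pair-of-timestamps input shape, and the sample values are assumptions for illustration.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_time_to_respond(records):
    """records: (identified_at, mitigation_started_at) datetime pairs."""
    gaps = [
        (started - identified).total_seconds()
        for identified, started in records
    ]
    return timedelta(seconds=mean(gaps))


# Made-up sample data: two incidents with 12- and 8-minute response gaps.
records = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 12)),
    (datetime(2024, 1, 2, 14, 30), datetime(2024, 1, 2, 14, 38)),
]

mttr_respond = mean_time_to_respond(records)
print(f"Mean time to respond: {mttr_respond}")
```

Tracking this number before and after you automate declaration and assembly tasks is one way to see whether the automation is actually paying off.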

And second, get an understanding of where your dysfunction lives, literally. What’s the functionality area or service that’s causing the most problems? By tracking where your incidents are coming from, you can proactively address the issues that impact morale, burnout level, and ultimately, attrition rates.
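One simple way to start tracking this is to tally incidents by the service they were attributed to. The sketch below assumes each incident has already been tagged with a single service name; the function name and the service names are made up for illustration.

```python
from collections import Counter


def incidents_by_service(service_names):
    """Return (service, incident_count) pairs, most incident-prone first."""
    return Counter(service_names).most_common()


# Made-up sample data: one service tag per incident.
tags = ["checkout", "auth", "checkout", "search", "checkout", "auth"]

ranking = incidents_by_service(tags)
for service, count in ranking:
    print(f"{service}: {count}")
```

Even a tally this basic can show whether a handful of services account for most of your incidents, which tells you where proactive work is likely to matter most.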

Streamlining how you manage incidents and automating tasks, like assembly and communication, helps you respond faster to incidents and decrease downtime. By creating a structured process for how you respond to incidents, you can ensure the right people are brought in at the right time to minimize chaos. And by investing in learning from these incidents by tracking analytics and holding retros, you ensure improvement in not just your product but also your people and processes.

Want to dig deeper into this topic? Check out our on-demand webinar, Proving ROI: How to evaluate and improve how you manage incidents.