More than downtime: the opportunity costs of poor incident management
The total cost of incidents goes beyond the time spent resolving them. It also includes the cost of time that otherwise would’ve been focused on developing the next big thing. That’s opportunity cost.
By Robert Ross on 8/24/2023
In my last blog post, I wrote about the explicit costs of incidents: the ones you can easily track based on dollars lost. But the cost of incidents goes beyond the time spent resolving them. While we’re spending our time managing incidents (including mitigation and retrospectives), we’re incurring a large opportunity cost by not releasing the next big thing. Every time an incident is managed poorly, you’re swiping the good ol’ opportunity cost credit card, and the interest is astronomical.
Think back to the last incident you had to handle. How much of your valuable time was consumed by managing the incident (sometimes more difficult than mitigating it!) and taken away from what you love: building excellent features? The messier the incident management practices, the longer it takes to resolve an incident, and the longer teams are pulled away from their core goals.
In our latest ebook, Beyond Downtime: A Practical Guide to Minimizing Incident Costs, we delve into the explicit costs, opportunity costs, and cultural drain caused by incidents. In this post, let’s explore how the right incident management practices can help save time and get you off the incident treadmill and back to building.
To do this, we’ll use data-backed research from Enterprise Strategy Group (ESG) that dives into the average cost of incidents modeled for a high-tech company with 5,000 employees and an average of 51 incidents per month. For a more complete view of the data, be sure to check out the book.
The cost of lost opportunities
Opportunity costs mostly center around lost developer productivity. Instead of building their core revenue-producing products, developers are sidetracked by having to respond to incidents.
That tax on employees’ time is where a lot of implicit costs come in. For example, most incident response processes involve declaration and assembly tasks that have to be done before actual problem-solving can even start. Then, while problem-solving, you have additional periodic housekeeping-type tasks that interrupt response efforts with the potential to prolong the outage. Poor incident response practices like the ones below only delay mitigation and add to the already-rising opportunity cost.
An all-hands-on-deck approach to incidents means too many people are joining each incident.
A lack of service ownership means a scramble to find the point person and rollback plan for each service or functionality.
Rote activities, like communication updates, channel/bridge creation, retro assembly, etc., are all done by hand, wasting time that adds up.
Ultimately, these inefficiencies mean that engineers are diverted from the main reason we hire engineers in the first place: Building things that you can sell.
For their composite organization, ESG estimated the opportunity cost of engineers working on incidents instead of projects at $304 per hour, a 5x uplift over the average engineering salary that reflects the margin contribution to the business. At our 1,500-engineer composite company, inefficient incident response consumes 13,892 person-hours per year, or 7.4 full-time equivalents, that would otherwise go to development. That’s the equivalent of $4.2 million in total annual opportunity cost.
And if this is the cost for existing employees, what’s the impact on onboarding new employees? DevOps turnover is relatively high, so getting engineers up to speed quickly is important but difficult in a reactive, ad-hoc, manual incident response environment. New hires don’t know who to turn to, how to escalate, or where to look.
ESG estimates that the opportunity cost of onboarding and ramp-up for our composite organization can run an additional $90,000 a year, which puts the total opportunity cost at $4.3 million annually.
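If you want to run the same math for your own organization, here is a minimal sketch of the arithmetic behind those figures. The numbers are the ones quoted above from the ESG model; the variable names are mine, and the hours-per-FTE value is simply what the quoted figures imply, not something ESG states.

```python
# Figures quoted in this post from the ESG composite-organization model.
HOURLY_OPPORTUNITY_COST = 304        # dollars per engineer-hour
INEFFICIENT_HOURS_PER_YEAR = 13_892  # person-hours spent on inefficient response
FULL_TIME_EQUIVALENTS = 7.4          # the same hours expressed as FTEs
ONBOARDING_COST_PER_YEAR = 90_000    # dollars for onboarding and ramp-up

# Opportunity cost of inefficient incident response.
response_cost = HOURLY_OPPORTUNITY_COST * INEFFICIENT_HOURS_PER_YEAR

# Total opportunity cost once onboarding is added.
total_cost = response_cost + ONBOARDING_COST_PER_YEAR

# Working hours per FTE implied by the two quoted figures (derived, not from ESG).
hours_per_fte = INEFFICIENT_HOURS_PER_YEAR / FULL_TIME_EQUIVALENTS

print(f"Incident response: ${response_cost:,}")     # $4,223,168, about $4.2 million
print(f"Total:             ${total_cost:,}")        # $4,313,168, about $4.3 million
print(f"Implied hours per FTE: {hours_per_fte:.0f}")
```

Swap in your own hourly cost and hours to get a rough number for your team.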
“Robert, I get it, what can I do about it?” I’m glad you asked!
Lowering the opportunity cost of incidents
As software engineers, we can proactively address incident management challenges and minimize the opportunity cost with a few practical steps. You can implement these best practices yourself or consider purchasing a tool that uses these best practices to streamline incident response for you (like ours).
Robust incident response planning: Work with your team to create a clear, structured incident response plan — preferably one that centers around services and their owners. Define roles, responsibilities, and escalation procedures so everyone knows what to do when an incident occurs. A well-thought-out plan saves time and helps us return to our revenue-producing tasks faster.
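To make "services and their owners" concrete, here is a small sketch of what a service-centered plan could look like as data. Everything in it is hypothetical: the service names, team names, contacts, and runbook paths are placeholders, and this is not FireHydrant’s data model, just one way to write down who owns what and who gets paged next.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple


@dataclass(frozen=True)
class Service:
    name: str
    owner_team: str
    escalation: Tuple[str, ...]  # ordered: first contact to page comes first
    rollback_runbook: str        # where the rollback plan lives


# Hypothetical catalog; replace with your own services and owners.
CATALOG: Dict[str, Service] = {
    "checkout": Service(
        name="checkout",
        owner_team="payments",
        escalation=("payments-oncall", "payments-lead", "vp-engineering"),
        rollback_runbook="runbooks/checkout-rollback.md",
    ),
    "search": Service(
        name="search",
        owner_team="discovery",
        escalation=("discovery-oncall", "discovery-lead"),
        rollback_runbook="runbooks/search-rollback.md",
    ),
}


def next_escalation(service_name: str, already_paged: Set[str]) -> Optional[str]:
    """Return the next contact to page for a service, or None if the chain is exhausted."""
    service = CATALOG[service_name]
    for contact in service.escalation:
        if contact not in already_paged:
            return contact
    return None


print(next_escalation("checkout", set()))
print(next_escalation("checkout", {"payments-oncall"}))
```

With something like this written down ahead of time, nobody has to scramble to find the point person or the rollback plan in the middle of an incident.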
Effective communication and collaboration: Open lines of communication are vital during incidents. Ensure everyone is on the same page and incidents are communicated promptly. Streamline communication channels to avoid confusion and delays. One of the best ways to accomplish this is by having an internal status page that everyone in your organization knows about and references by default during incidents.
Leverage automation: Use incident management platforms like ours to automate repetitive tasks. Automation removes manual burden, reduces errors, and lets us focus on what matters most: mitigating an incident as fast as we can. ESG found that with FireHydrant, lost productivity time is reduced by 10%, which is worth $422,000 per year in recovered work.
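For a sense of where a number like $422,000 comes from, here is a rough sketch. It assumes the 10% reduction applies to the 13,892 inefficient person-hours and the $304 hourly opportunity cost cited earlier in this post; that is my reading of the figures, not ESG’s stated method, and the function name is hypothetical.

```python
def annual_savings(inefficient_hours: int, hourly_cost: int, reduction: float) -> float:
    """Dollar value of the hours recovered when lost time shrinks by `reduction`."""
    return inefficient_hours * hourly_cost * reduction


# Figures from earlier in this post; the 10% reduction is the one ESG reports.
hours_recovered = 13_892 * 0.10
savings = annual_savings(13_892, 304, 0.10)

print(f"Hours recovered per year: {hours_recovered:,.0f}")
print(f"Value of recovered work:  ${savings:,.0f}")  # about $422,000
```

Try other reduction values to see how even modest automation gains add up over a year.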
Prioritize incidents wisely: Not all incidents are equal, right? Establish a transparent prioritization system based on the impact of incidents, and make sure everyone understands it. We recommend focusing on a severity system and then allocating resources accordingly.
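Here is a minimal sketch of a severity system that allocates responders by impact. The thresholds, severity names, and roles are illustrative assumptions, not a standard and not FireHydrant’s defaults; the point is that the rules are written down and that lower severities page fewer people.

```python
from enum import Enum
from typing import Dict, List


class Severity(Enum):
    SEV1 = 1  # critical: core flow down or widespread impact
    SEV2 = 2  # major: a meaningful share of customers affected
    SEV3 = 3  # minor: limited impact


def classify(pct_customers_affected: float, core_flow_down: bool) -> Severity:
    """Map incident impact to a severity. Thresholds here are made-up examples."""
    if core_flow_down or pct_customers_affected >= 25:
        return Severity.SEV1
    if pct_customers_affected >= 5:
        return Severity.SEV2
    return Severity.SEV3


# Hypothetical roles to page per severity; tune these to your organization.
RESPONDERS: Dict[Severity, List[str]] = {
    Severity.SEV1: ["incident-commander", "service-owner", "communications-lead"],
    Severity.SEV2: ["incident-commander", "service-owner"],
    Severity.SEV3: ["service-owner"],
}

severity = classify(pct_customers_affected=8.0, core_flow_down=False)
print(severity.name, RESPONDERS[severity])
```

Tying responders to severity is one way to avoid the all-hands-on-deck pattern, because a minor incident only pulls in the service owner.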
Learn from incidents: Conduct a right-sized retrospective after resolving an incident. Learn from each incident and implement improvements to prevent similar issues in the future. Continuous learning is the key to optimizing incident response.
Invest in skill development: Enhance your team's skills in incident response. Think about running planned and unplanned drills that help build responder confidence. Proper training equips us to resolve incidents efficiently and effectively, minimizing unnecessary steps and delays.
The opportunity cost of poor incident management is a real challenge and shows that the total cost of incidents goes far beyond downtime. We recommend being proactive and implementing these practical tips so you can minimize the impact of incidents on your productivity and, ultimately, your company’s bottom line.
Read more about the total economic impact of incidents and how FireHydrant lowers it in the report, Analyzing the Economic Benefits of FireHydrant Full-cycle Incident Management.