More than downtime: the explicit costs of poor incident management

A cold fact of SaaS Life™ is that you can’t make money when your product or website doesn’t work — and those lost dollars add up fast. Downtime, SLA breach paybacks, compliance fines, and other explicit costs are the easiest to quantify and they’re what most people think of when they think about incidents.

But even when we’re talking about easily quantifiable costs, it’s not as simple as incident duration × average revenue = money lost. For example, marketing campaigns and ad dollars that direct potential users to your site or product are wasted during an outage. And, of course, there’s the labor expense associated with the incident itself: engineers, customer support, and/or account management (and in severe cases, executives, marketing, legal, etc.). The direct costs of an incident inflate quickly, and the longer it takes to mitigate the incident, the more those costs add up.
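To make the gap concrete, here is a minimal sketch comparing the naive formula with one that also counts wasted ad spend and responder labor. The function names, parameters, and dollar figures are all hypothetical placeholders chosen for illustration, not ESG data.

```python
def naive_cost(duration_hours: float, revenue_per_hour: float) -> float:
    """The oversimplified view: lost revenue only."""
    return duration_hours * revenue_per_hour


def fuller_direct_cost(
    duration_hours: float,
    revenue_per_hour: float,
    ad_spend_per_hour: float,
    responders: int,
    labor_rate_per_hour: float,
) -> float:
    """Lost revenue plus wasted ad spend plus responder labor."""
    lost_revenue = naive_cost(duration_hours, revenue_per_hour)
    wasted_ads = duration_hours * ad_spend_per_hour
    labor = duration_hours * responders * labor_rate_per_hour
    return lost_revenue + wasted_ads + labor


# Hypothetical inputs: a 2-hour outage, $1,000/hour in revenue,
# $200/hour in ad spend, and 5 responders at $70/hour.
naive = naive_cost(2.0, 1000.0)
fuller = fuller_direct_cost(2.0, 1000.0, 200.0, 5, 70.0)
print(f"naive: ${naive:,.0f}, fuller: ${fuller:,.0f}")
```

Even this fuller version leaves out SLA paybacks and compliance fines, which would be added as further terms.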

In our latest ebook, Beyond Downtime: A Practical Guide to Minimizing Incident Costs, we delve into the explicit costs, opportunity costs, and cultural drain caused by incidents. In this post, let’s explore how the way an incident is declared, assembled, and managed can pave the way for either a well-orchestrated, tight response effort or a chaotic drain on our time and energy.

To do this, we’ll use data-backed research from Enterprise Strategy Group (ESG) that dives into the average cost of incidents modeled for a high-tech company with 5,000 employees and an average of 51 incidents per month. For a more complete picture of the data, be sure to check out the ebook.

Costs associated with poorly handled incidents

As we’ll see, organizations with streamlined incident management programs are more equipped to jump into action quickly, completing their entire kickoff process with the click of a button in some cases. Suboptimal incident management practices, on the other hand, can complicate what’s already a sensitive moment.

For example, outdated documentation leaves on-call responders hunting for rollback plans and service ownership documents, and responders unknowingly duplicate triage efforts — all resulting in a frantic, all-hands-on-deck approach to incidents that can actually delay mitigation efforts (and frustrate responders). Let’s look at assembly time as an example.

Assembly time is the time that elapses between an incident declaration and when mitigation work actually begins. It includes everything that needs to be done (or at least should be done) for an incident response effort to begin. This can include tasks like declaring an incident, creating meeting space (like a Zoom bridge and Slack channel), pulling in the owners of the affected services, and notifying other internal stakeholders like customer support. As you can imagine, incomplete or outdated processes and documentation consume precious resources at the beginning of the incident.
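As a rough illustration of why codifying these tasks helps, the sketch below models the kickoff as an ordered checklist that runs the same way every time. Every name here (`KickoffStep`, `run_kickoff`, the step actions) is hypothetical; in practice each action would call your own chat, paging, or video tooling rather than return a string.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class KickoffStep:
    name: str
    action: Callable[[str], str]  # takes the incident title, returns a status note


def run_kickoff(incident_title: str, steps: List[KickoffStep]) -> List[str]:
    """Run each assembly step in order and keep a log of what was done."""
    log = []
    for step in steps:
        note = step.action(incident_title)
        log.append(f"{step.name}: {note}")
    return log


# Placeholder actions standing in for real integrations (assumptions, not real APIs).
STEPS = [
    KickoffStep("declare incident", lambda title: f"declared '{title}'"),
    KickoffStep("create meeting space", lambda title: "opened video bridge and chat channel"),
    KickoffStep("page service owners", lambda title: "paged owners of affected services"),
    KickoffStep("notify stakeholders", lambda title: "notified customer support"),
]

kickoff_log = run_kickoff("checkout errors", STEPS)
for line in kickoff_log:
    print(line)
```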

Based on research and interviews, ESG estimated that, depending on the severity of the incident, assembling the right people and information with manual, ad hoc processes and poor documentation can take anywhere from 12 to 90 minutes and involve 5 to 8 responders. That means that for our model organization, slow assembly consumes 3,362 person-hours per year, or the equivalent of 1.8 full-time engineers.
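As a sanity check on that figure, the sketch below computes the lowest and highest annual totals the stated ranges allow for 51 incidents per month. It assumes, purely for illustration, that every incident sits at one extreme or the other; the actual severity mix ESG used is not given in this post.

```python
INCIDENTS_PER_MONTH = 51   # from the ESG model company
MONTHS_PER_YEAR = 12
ESG_ESTIMATE_HOURS = 3362  # annual assembly person-hours quoted above


def annual_assembly_hours(minutes_per_incident: int, responders: int) -> float:
    """Annual person-hours if every incident took this long with this many people."""
    incidents_per_year = INCIDENTS_PER_MONTH * MONTHS_PER_YEAR
    person_minutes = incidents_per_year * minutes_per_incident * responders
    return person_minutes / 60


lower_bound = annual_assembly_hours(12, 5)  # every incident is quick and small
upper_bound = annual_assembly_hours(90, 8)  # every incident is slow and large

print(f"bounds: {lower_bound:,.0f} to {upper_bound:,.0f} person-hours per year")
print(f"ESG estimate within bounds: {lower_bound <= ESG_ESTIMATE_HOURS <= upper_bound}")
```

The ESG estimate falls between the two bounds, which is consistent with a mix of mostly lower-severity incidents and a smaller number of severe ones.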

Of course, this has downstream effects. Bungling the assembly phase leads to inefficiencies throughout the response effort. Some of the common ones reported by both ESG and our own customers are below.

ESG estimates that poor incident response practices — incomplete handoffs, inefficient communication, etc. — add 2 to 8 person-hours per responder to those assembly costs. That’s a total of 10,530 person-hours per year, or the equivalent of 5.6 full-time engineers, and puts our total cost of labor around incidents in the ballpark of $740,000 for our model company.
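The figures above also imply a working-hours-per-engineer and an hourly labor cost, which can be backed out with simple division. This is a sketch that assumes the $740,000 figure covers the full 10,530 person-hours; because the published numbers are rounded, the derived values are approximate.

```python
TOTAL_PERSON_HOURS = 10_530  # annual person-hours quoted above
FTE_EQUIVALENT = 5.6         # full-time engineers quoted above
TOTAL_LABOR_COST = 740_000   # "ballpark" annual labor cost in dollars

# Derived values (approximate, since the inputs are rounded).
implied_hours_per_fte = TOTAL_PERSON_HOURS / FTE_EQUIVALENT
implied_hourly_cost = TOTAL_LABOR_COST / TOTAL_PERSON_HOURS

print(f"implied hours per full-time engineer: {implied_hours_per_fte:,.0f}")
print(f"implied loaded hourly cost: ${implied_hourly_cost:,.2f}")
```

Swapping in your own organization's person-hours and loaded hourly rate gives a comparable estimate for your team.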

Improving incident handling for smoother resolutions

With modern incident management practices and tools, however, organizations can see a significant reduction in assembly time. For example, FireHydrant customers told ESG that they were able to reduce assembly time to mere seconds thanks to streamlined assembly processes and automated toil. By responding more quickly and eliminating unnecessary involvement, organizations see substantial time and cost savings.

ESG’s findings revealed that FireHydrant reduced handling time for low-priority incidents from 2 person-hours to 1.8 and for high-priority incidents from 8 person-hours to 7.2. Overall, ESG puts the annual savings at 2,948 person-hours, equivalent to 1.6 full-time engineers, for a financial benefit of $207,000 per year.
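To put those numbers in proportion, the sketch below expresses them as percentages, using only figures quoted in this post. The helper name `percent_reduction` is our own.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


low_priority = percent_reduction(2, 1.8)   # handling hours, low-priority incidents
high_priority = percent_reduction(8, 7.2)  # handling hours, high-priority incidents

ANNUAL_SAVINGS_HOURS = 2948
TOTAL_PERSON_HOURS = 10_530  # annual total for the model company
savings_share = ANNUAL_SAVINGS_HOURS / TOTAL_PERSON_HOURS * 100

print(f"low-priority handling reduction: {low_priority:.0f}%")
print(f"high-priority handling reduction: {high_priority:.0f}%")
print(f"annual savings as share of total person-hours: {savings_share:.0f}%")
```

The annual savings are a larger share of the total than the per-incident handling reduction alone. The post does not break the savings down, but that gap would be consistent with reduced assembly time being counted as well.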

Best practices for optimizing incident assembly and handling

So how do you achieve these gains? To mitigate the costs and challenges associated with incidents, organizations can adopt best practices like the ones below to optimize incident assembly and handling. These are also incorporated into the FireHydrant incident management platform, of course.

Well-defined incident response plan: Establishing a clear and structured incident response plan is essential. It should outline roles, responsibilities, and escalation procedures, enabling quick identification of the right personnel during an incident. Implement a clear prioritization system for incidents based on their potential impact, enabling effective resource allocation and prompt resolution of high-impact incidents. And you should practice it during peacetime.

Effective communication channels: Ensure open and efficient communication channels to promptly inform all stakeholders during incidents, reducing confusion, interference, and delays. This can be extended externally to customer status pages as well.

Automation and tooling: Leverage incident management platforms with automation and streamlined workflows to expedite incident assembly and handling, automate rote incident tasks, and allow teams to focus on efficient resolution.

Proactive documentation and step-by-step guidance: Maintain up-to-date documentation of systems and services to save valuable time during incidents, providing key contacts and procedures for swift assembly. By removing the question “What do I do next?” you allow responders to move faster. This also makes for smoother handoffs between teams or shifts, ensuring the right information reaches the right people promptly.

Post-incident retrospectives: Conduct thorough retros to identify areas for improvement. When you’re committed to learning from past incidents, you enhance the overall incident management process and invest in your systems’ reliability. Blameless retrospectives foster a positive engineering culture, too, which can help you retain hard-to-replace engineers.

Embrace the benefits of better incident management

If you build and release software, you will have incidents. So it is essential to prioritize efficient response processes that let teams resolve incidents swiftly, reduce costs, and build more reliable and resilient systems.

To get a complete picture of the explicit costs, opportunity costs, and cultural drain of incidents, be sure to read Beyond Downtime: A Practical Guide to Minimizing Incident Costs.