There's a better way: how an incident management tool helps you conquer response challenges
Struggling with incident management? You're not alone. Adopting an incident management tool can help businesses simplify and conquer incident response challenges to increase efficiency and customer satisfaction while decreasing burnout.
By Mike Lacsamana on 10/3/2022
As a solutions engineer for FireHydrant, I speak with a wide variety of companies about their incident management programs — from start-ups with a handful of employees to large enterprise companies with thousands of engineers.
Whether they’re looking to establish their incident management program or mature it, the same questions remain:
How can we make it easy for any engineer to effectively respond to an incident?
How can we build confidence in our teams so they feel prepared during an incident?
And, of course, the holy grail of questions: How can we reduce toil?
These common questions reveal a truth about incident management — companies of all sizes are still figuring out how to do it “right.” So if that’s you, don’t worry, you’re not alone. After talking to hundreds of engineers about their processes, we’ve identified five of the most common challenges we see across companies looking to put more structure behind how they manage their incidents. Maybe you even see yourself in one of these scenarios.
Your response efforts feel ad-hoc...every time
You have a process, but no one's following it
You’re suffering from the Shaq effect
You haven't set expectations with your stakeholders around incident response
Engineers are burnt out from being on call
1. Your response efforts feel ad-hoc…every time
Does this scenario sound familiar? A customer contacts support because your website is down. Support doesn’t know the right person to alert, so they send an “@here” message in an engineering Slack channel (I’m guilty of this). While that conversation is happening, another well-intentioned team member catches the outage on their own and sets up a dedicated Slack channel to discuss the issue.
Now you have multiple conversations about the same issue — more folks are jumping in, communication is getting messier, nobody knows who is doing what, efforts are getting duplicated, nothing is getting documented, and your problem-solvers are so distracted by pings that they can’t get to work on the incident at hand.
Without documented processes and a single source of truth for how you declare, alert, respond, communicate, and learn when an incident occurs, you run the risk of a messy response effort that can lead to more confusion than resolution.
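To make the idea of a single source of truth concrete, here is a minimal Python sketch. Everything in it is hypothetical (the `IncidentRegistry` class, the `declare` method, and the channel naming scheme are illustrative assumptions, not FireHydrant's API). It only shows one possible rule: a second report of the same outage joins the existing incident instead of starting a parallel conversation.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    service: str
    summary: str
    channel: str
    reporters: list = field(default_factory=list)
    resolved: bool = False


class IncidentRegistry:
    """Hypothetical in-memory registry: at most one open incident per service."""

    def __init__(self):
        self._open = {}  # service name -> open Incident
        self._count = 0

    def declare(self, service, summary, reporter):
        # If an incident is already open for this service, join it
        # rather than opening a second channel for the same outage.
        incident = self._open.get(service)
        if incident is None:
            self._count += 1
            channel = f"#incident-{self._count}-{service}"  # assumed naming scheme
            incident = Incident(service, summary, channel)
            self._open[service] = incident
        if reporter not in incident.reporters:
            incident.reporters.append(reporter)
        return incident

    def resolve(self, service):
        # Closing the incident frees the service for a future declaration.
        incident = self._open.pop(service, None)
        if incident is not None:
            incident.resolved = True
        return incident


registry = IncidentRegistry()
first = registry.declare("website", "Site is down", reporter="support")
second = registry.declare("website", "Homepage returns errors", reporter="engineer")
print(first is second, first.channel, first.reporters)
```

In this sketch, both the support report and the engineer's report end up on the same incident and the same channel, which is the behavior the scenario above is missing.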
2. You have a process, but no one’s following it
Recently, I spoke to a large, multinational company dealing with a major problem in their fully built-out incident management practice: their engineers weren’t following it.
It turned out that their engineers weren’t reporting incidents because the process was time-consuming. They were anxious to mitigate the issue as fast as possible and didn’t want to waste time doing administrative work. By the time they got around to reporting incidents, the incidents were already remediated, leaving customer success and leadership far out of the loop and customers in the dark.
Of course, this has downstream effects too. Without documenting the incident timeline in real time, responders found themselves scrolling through hundreds of messages just to pull together a timeline of key events that occurred during the incident. The result is an ineffective retrospective and a missed opportunity to put long-term resolutions into place to improve overall reliability.
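As a rough sketch of what capturing the timeline as you go might look like, the Python below keeps an append-only list of timestamped notes and renders them in order for the retrospective. The `Timeline` class, its methods, and the sample names and times are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TimelineEvent:
    at: datetime
    author: str
    note: str


class Timeline:
    """Hypothetical append-only log of key events for one incident."""

    def __init__(self):
        self._events = []

    def record(self, author, note, at=None):
        # Default to the current UTC time so responders only type the note.
        event = TimelineEvent(at or datetime.now(timezone.utc), author, note)
        self._events.append(event)
        return event

    def render(self):
        # Sort by timestamp so entries added late still land in the right place.
        ordered = sorted(self._events, key=lambda e: e.at)
        return "\n".join(f"{e.at:%H:%M} {e.author}: {e.note}" for e in ordered)


timeline = Timeline()
# Sample entries with explicit, made-up timestamps, recorded out of order.
timeline.record("alice", "Rolled back latest deploy",
                at=datetime(2022, 10, 3, 2, 20, tzinfo=timezone.utc))
timeline.record("bob", "Declared incident",
                at=datetime(2022, 10, 3, 2, 5, tzinfo=timezone.utc))
print(timeline.render())
```

The point of the design is that recording a note costs a few seconds during the incident, while the ordered timeline is already there when the retrospective starts.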
3. You’re suffering from the Shaq effect
We’re all familiar with this story: An engineer who’s been with a company for several years is the hero when it comes to responding to an incident. No matter when the page comes in, they’re the one who answers it and knows who else to call, how to roll back service deploys, and who to communicate with. They have the service catalog (or monolith) in their head and are a one-person incident response team. Around here, we call that the Shaq effect.
The problem with using a Shaq as your incident management strategy is that when Shaq’s not around, the team has no idea what to do, and the rest of the team isn’t learning because they know that Shaq will just come in with a slam dunk. In reality, nobody wins here.
If a superhero solo engineer is the only person who can handle an incident, not only are you risking burnout from one of your clearly valuable (and hard to replace) engineers, but you’re also stunting the growth of other folks on the team. Documented, organized processes will help more fairly balance the load — both the challenges and the wins.
4. You haven't set expectations with your stakeholders around incident response
Incident management isn’t limited to engineers. Depending on the incident, it might also involve members of customer success, support, and sales. And for severe incidents, you might be looking at executive leadership, legal, and even marketing thrown into the mix.
It’s certainly not uncommon for these stakeholders to express frustration or a sense of urgency in the heat of an incident. Most of us would expect that. But if their involvement is becoming more frequent and distracting, it might be a sign that they aren’t getting enough visibility and communication from the incident management team.
For organizations that mandate customer communication in service-level agreements (SLAs), communication breakdowns can be a big (unintentional) violation of the contract. Having an automated solution for this can help reduce the stress of your on-call engineers and ensure that customers stay in the loop.
Between logging incident details and working with your engineering team to resolve the incident, incident management teams have a lot to work on. Trying to manage frequent communications at the same time is a big ask, but without taking the time, the rest of the organization will be left helpless or asking for more.
5. Engineers are burnt out from being on call
Burnout is a common problem in incident management. In fact, burnout due to on-call responsibilities is one of the most common complaints we hear from engineers looking to add structure to their incident programs.
One thing that we can all do to help prevent burnout is to automate the rote tasks associated with coordinating and communicating an incident. When a page goes out at 2 a.m., there are a host of questions the responding engineer has to answer:
Who needs to be involved?
What departments need to be alerted?
How and when is it appropriate to communicate with the customer base?
Most importantly: How can I fix this quickly?
This kind of pressure is an invitation for mistakes and stress. By reducing the cognitive load of coordination, responders save minutes — and when we’re talking about an outage that rips you out of a deep sleep, that can feel like hours.
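One way to picture automating those coordination questions is a simple lookup that turns a severity level into a ready-made checklist. The Python below is only a sketch under assumptions: the severity names (`SEV1` to `SEV3`), the roles, and the `plan_response` function are hypothetical and would need to reflect your own organization.

```python
# Hypothetical routing table: severity -> who to page, who to notify,
# and whether a customer-facing update is expected.
ROUTING = {
    "SEV1": {
        "page": ["on-call engineer", "incident commander"],
        "notify": ["support", "customer success", "leadership"],
        "customer_update": True,
    },
    "SEV2": {
        "page": ["on-call engineer"],
        "notify": ["support"],
        "customer_update": True,
    },
    "SEV3": {
        "page": ["on-call engineer"],
        "notify": [],
        "customer_update": False,
    },
}


def plan_response(severity):
    """Return the coordination steps for a severity as plain-text lines."""
    try:
        rule = ROUTING[severity]
    except KeyError:
        raise ValueError(f"Unknown severity: {severity!r}") from None

    steps = [f"Page: {', '.join(rule['page'])}"]
    if rule["notify"]:
        steps.append(f"Notify: {', '.join(rule['notify'])}")
    if rule["customer_update"]:
        steps.append("Post a customer-facing status update")
    return steps


for step in plan_response("SEV1"):
    print(step)
```

With rules like these written down ahead of time, the 2 a.m. responder only has to pick a severity; the answers to the first three questions come from the table, leaving attention for the fix itself.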
There’s a better way
If you identify with these scenarios, you’re not alone. Companies of all sizes struggle with these challenges. In fact, it was our founders’ experience managing incidents themselves that led to the creation of FireHydrant, the tool they wished they’d had during their on-call response days. Now, we help companies like CircleCI, Snyk, and Spotify tackle these issues by:
Removing the “what do I do next” part of an incident by automating workflows from the start of an incident all the way through to the retro
Integrating with the tools used in most tech stacks, like Slack, Zoom, Statuspage, GitHub, and Jira, so you can manage an incident completely from within Slack with FireHydrant
More easily identifying downstream impacts and automatically pulling in service owners and other stakeholders
Simplifying overall communication, both internally and externally with customers through status pages
Automatically importing data from the incident Slack channel into the retro doc to create a timeline that makes learning from the incident easier than ever
Providing a single source of truth to help unify engineers around incident management
See FireHydrant in action
See how service catalog, incident management, and incident communications come together in a live demo.
Get a demo