Inflation is running rampant, the world stage is unpredictable, and what’s happening in the U.S. markets has been dubbed the “tech wreck.” A common theme in my conversations across industries right now is value: we’re all looking to maximize every dollar spent, every hire made, every hour logged.
For a lot of companies, this means looking at processes and tools with a critical eye for not only cost savings but also cost avoidance. This includes the opportunity cost of building something in-house versus buying a tool to do it for you.
More than ever, we should all be focused on shipping great products, retaining high-demand engineers, and building trust with customers. And investing in a thoughtful incident management strategy is one way to get there. Let’s explore how.
Focus on your product
Engineers come at a premium these days; they’re commanding higher-than-ever salaries, and the job market is competitive. So why waste that valuable headcount on something that isn’t directly related to your revenue-contributing software?
I’ve seen too many companies try to roll their own incident management products, spending months — or even years in some cases — working on a tool that is secondary to their core product. Think about not only the cost of full-time equivalent hours but also the loss of productivity.
Even building a basic bot that helps you declare an incident in Slack might take two months of work for a high-salaried engineer. Add a product person to manage it and an infrastructure engineer to deploy and monitor it, and it gets expensive quickly. Then you actually have an incident where your software is down, and guess what: that shiny new tool might be down too.
Putting some thought into your incident management strategy on the front end will help you maximize your engineers’ time and refocus that energy into your product. For example:
Automated communication tools get everyone in the same room in an instant.
A service catalog helps you map your dependencies and see at a glance what’s down and who owns it.
Runbooks take the questions out of your incident response process. Instead of wondering what to check and in what order, you can simply concentrate on fixing the problem.
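To make the service catalog idea concrete, here is a minimal sketch of what “see at a glance what’s down and who owns it” could look like in code. Everything in it is an assumption for illustration: the service names, the team names, and the helper functions `impacted_by` and `owners_to_page` are hypothetical, and this is not how FireHydrant or any particular product implements its catalog.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Service:
    name: str
    owner: str  # team to page (hypothetical team names)
    depends_on: list[str] = field(default_factory=list)


# Hypothetical catalog: every service, its owner, and its direct dependencies.
CATALOG = {
    s.name: s
    for s in [
        Service("postgres", "data-platform"),
        Service("auth-api", "identity", ["postgres"]),
        Service("checkout", "payments", ["auth-api", "postgres"]),
        Service("storefront", "web", ["checkout"]),
    ]
}


def impacted_by(down: str, catalog: dict[str, Service]) -> list[str]:
    """Return the failing service plus everything that depends on it,
    directly or transitively, in alphabetical order."""
    impacted = {down}
    changed = True
    # Keep sweeping the catalog until no new dependents are found.
    while changed:
        changed = False
        for svc in catalog.values():
            if svc.name not in impacted and any(d in impacted for d in svc.depends_on):
                impacted.add(svc.name)
                changed = True
    return sorted(impacted)


def owners_to_page(down: str, catalog: dict[str, Service]) -> list[str]:
    """Return the distinct owning teams for every impacted service."""
    return sorted({catalog[name].owner for name in impacted_by(down, catalog)})
```

With this toy catalog, `impacted_by("auth-api", CATALOG)` walks the dependency map outward from the failing service, and `owners_to_page("auth-api", CATALOG)` turns that into the list of teams to bring into the room. The point is that the lookup happens in one place instead of three.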
When an incident occurs, your engineering team is working to mitigate it, the marketing team is managing social media, your legal team is bracing for SLA obligations, your customer support team is hit with an influx of tickets, and the list goes on. A thorough and automated approach to incident management helps you get to fixing (and learning) faster, and gets everyone back to their core business goals.
Keep your engineers
One of the reasons I started FireHydrant is that I was that burnt-out, on-call engineer trying to fight fires at 2 a.m. without the right tools or information.
An incident is declared — maybe through a page, maybe by a text from your CEO. Something is broken, but you don’t know what.
So first you need to figure out what’s broken and why. You might be searching for service dashboards or querying Datadog.
And you’re trying to get the right people in the room, but since you don’t know what’s broken, that’s hard to do.
You figure out what’s broken. After looking in three places where your company keeps service listings, you figure out who to involve.
You figure out it was a recent deploy … now to figure out how to roll back that deploy in this particular service.
You get where I’m going. That’s a ton of cost just to get to the part where you say, “I know what needs to be done.” The messier the remediation process, the broader the alerts, and the less specific the SLOs, the more engineer toil, burnout, and attrition you get.
It is exceedingly difficult to find and hire engineers right now, so losing anyone, especially to something as preventable as burnout, is more expensive than it’s ever been, in terms of both opportunity cost and money.
Build trust with your customers
It’s such a trope at this point that I don’t even need to get into it: latency and downtime cause customers to abandon shopping carts, give up on streaming content, and move somewhere else. As I wrote in a previous blog post, reliability is an everyone problem, not just an engineering problem.
The flip side, though, is that you can actually use good incident management to your benefit. When you’re known for handling your worst moments well, people pay attention.
I remember last year when Fastly went down in a big way. Their stock actually went up the next day because people were impressed by not only how quickly they remediated the problem but also how quickly — and clearly — they communicated what was going on.
Slack is another company that does a great job of this. They treat minor bugs like incidents, publicly declaring them even when they might affect only a small percentage of people. When customers are visiting your status page instead of flooding your customer support queue, that’s cost avoidance. And when they know what to expect from you and know they can depend on you, they’re more likely to stick with you.
The bottom line
Simply put, incident management done well can be a differentiator, both in outpacing the competition and in recruiting and retaining talented engineers. It’s an area where investment can pay off many times over: the value is visible during an actual incident, and it reverberates in far-reaching ways long afterward.