New reports stress the importance of strategic incident management practice

Engineers have been managing incidents for as long as they’ve been building software, but the idea of incident management as a strategic practice in its own right is still finding its place. We’re starting to see big shifts in that area, though — more companies are dedicating headcount, resources, and tools to help them better prepare for, respond to, and learn from their incidents.

Two pieces of research published recently show that we’re not the only ones taking notice. Constellation Research’s Shortlist for Incident Management, which came out last month, is one of the first analyst reports dedicated to incident management. And just last week, Digital Enterprise Journal (DEJ) published a new market report, Top 20 Emerging Vendors for Managing IT Performance in 2022, that listed “continuous learning to improve incident resolution” as a key area that organizations should focus on to drive business benefits from technology deployments.

We are honored to be included in both of these publications, but what we’re truly excited about is the fact that industry researchers and analysts are starting to take note of the benefits that can come from implementing a proactive, learning-focused incident management strategy. Let’s talk about three of those benefits here.

Your reactive scramble becomes a proactive strategy

As I wrote in a previous post, incident response is not the same as incident management. Response is a reactive series of actions taken with the intent to remediate an incident. It’s a part of an overall proactive incident management strategy.

Let’s be real though: Reactive responses need to happen. Something goes down, and you need to address it. But good incident management adds a layer of predictability to what can feel like a chaotic and often ad hoc effort. By operationalizing those response efforts to be more consistent and pairing them with a commitment to learning, organizations become more strategic in their approach to an incident, as well as to the health of their systems overall.

This is what incident management is all about: being proactive about how you respond to incidents and turning the lessons you learn from them — about your people, processes, and technology — into actions that ultimately bolster your organization's reliability efforts.

Your engineers become less burnt out

Being on-call and trying to navigate incidents with a poorly defined, completely manual, or ad hoc response process is frustrating, and it's likely not the best way to encourage and engage your engineers. Few things contribute more to burnout than a 2 a.m. page that leaves an engineer scrambling to find the right people and information, only to turn around and start their actual job of building the product at 9 a.m.

By standardizing and documenting an incident management strategy, complete with service dependencies and a severity matrix, companies are able to decrease the cognitive load on the on-call engineer while also ensuring incidents are addressed and managed in a timely fashion with the right level of engineer involvement.

For example, once you put a service dependencies map in place, your teams can more efficiently disseminate critical information to the right parties who need to be notified and directly involve the subject matter experts of the impacted services.
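To make that concrete, here is a minimal sketch of how a service dependencies map could drive notifications. Everything in it is an assumption for illustration: the service names, the owning teams, and the helper functions `impacted_services` and `teams_to_notify` are hypothetical and don't come from any particular tool.

```python
from collections import deque

# Hypothetical dependency map: each service lists the services it depends on.
DEPENDS_ON = {
    "checkout": ["payments", "cart"],
    "cart": ["inventory"],
    "payments": ["auth"],
    "inventory": [],
    "auth": [],
}

# Hypothetical owning team for each service.
OWNERS = {
    "checkout": "storefront",
    "cart": "storefront",
    "payments": "payments-team",
    "inventory": "supply",
    "auth": "identity",
}


def impacted_services(failed, depends_on):
    """Return the failed service plus every service that depends on it,
    directly or transitively."""
    # Invert the map so we can walk from a dependency to its dependents.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first walk over the inverted map.
    seen = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in seen:
                seen.add(service)
                queue.append(service)
    return seen


def teams_to_notify(failed, depends_on, owners):
    """Return the sorted, de-duplicated teams that own an impacted service."""
    impacted = impacted_services(failed, depends_on)
    return sorted({owners[s] for s in impacted if s in owners})


if __name__ == "__main__":
    print(teams_to_notify("auth", DEPENDS_ON, OWNERS))
```

In this toy setup, an outage in `auth` reaches `payments` and `checkout`, so the on-call engineer gets a short list of teams to loop in rather than having to work it out at 2 a.m.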

It's worth noting that, according to the DEJ report, which surveyed more than 3,300 organizations on a variety of topics about managing IT performance, the complexity of IT systems has increased at an average rate of 3.2X over the past two years.

Increased complexity can come with a variety of consequences, often taking the form of lost revenue. DEJ survey respondents reported that revenue loss averages about $634,000 per month due to application slowdowns. This further illustrates why implementing incident management practices, like mapping service dependencies and documenting runbooks that tell your engineers what to do next during an incident, makes sense for minimizing losses when incidents inevitably happen.

You deliver a top-notch customer experience

More and more, the internet — and the software that runs on it — is expected to be a utility that’s always on and always working. Regardless of whether you’re a customer or an engineer (or in DevOps tooling cases like ours, both!), you want a seamless experience all around. For this reason, reliability is increasingly becoming a business metric. Last year, it was added as a fifth metric to the DORA report, and this year’s report doubles down on its impact on businesses.

Organizations leveraging incident management strategies with customer pain and communication at their hearts are the ones that are most likely to succeed. After all, it’s really your customers who get to decide if you — and your products — are reliable. Organizations need to ensure all relevant teams are involved to navigate an issue that can impact the customer experience.

Here’s a quick example: Slack does a great job of managing customer expectations. Their incident management plan calls for the company to treat minor bugs like incidents, which means Slack announces every disruption, even the ones that impact only a small percentage of people. Because Slack has established a method of clear communication, users know what to expect and automatically look to the company’s service page instead of flooding the customer support queue. This is no accident: it’s part of Slack’s incident management strategy, and it works.

The bottom line

When done well, incident management has a positive effect on many aspects of your business. Plenty of companies have already realized this and started to dedicate more budget and resources toward not only responding to incidents efficiently but also learning from them effectively. We’re excited to see that more people are starting to take notice of the topic, because it’s one we certainly feel passionate about.

Learn more about why continuous learning from incidents is one of the keys to better systems performance in the full DEJ report.