An incident can take many forms. It can look like a small issue that locks a few customers out of their accounts or a huge catastrophe that brings down your entire product for a full day. How you respond to the incident should vary based on the impact of the incident. And that’s where severity comes into play.
Defined severity levels are crucial to any good incident management program. They quickly get responders and stakeholders on the same page on the impact of the incident, and they set expectations for the level of response effort — both of which help you get to fixing the problem faster.
Because incidents vary so widely by organization, implementing severity levels for a new incident management program can be difficult. In this post, we’ll give you a starting point for defining your severity levels and some tips on how to make them your own.
Establish a starting point
Severity definitions should be in plain language. You want them understood and used by every member of an organization, not only engineering. Without easily understood definitions, you’ll see severity applied inconsistently across incidents, which can potentially confuse the response.
There are varying ways to define severities (some teams, for example, include a SEV0 to indicate an absolute catastrophe), but we suggest taking the KISS approach (keep it simple to start) and using a SEV1 to SEV3 range. The definitions below can serve as a base for you to adapt to your organization.
SEV1: No customer can use most or all of a product or service. Most pages in our product are not loading or displaying an error message. Data corruption or loss has occurred or will occur. Loss of revenue is happening or imminent.
SEV2: Primary product functionality is severely impacted and unusable. Customers are unable to use a common feature to its fullest ability. Data may not be displayed as expected but is not lost. There is no workaround for customers.
SEV3: Some customers (not all) are receiving intermittent errors on product pages or cannot use the product in certain, possibly obscure, ways. The product may be loading slowly or partially (missing images), and there is a workaround that customers can use.
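To make the definitions above concrete, here is a minimal Python sketch of how they could be encoded as a decision rule. The `Impact` fields and the `classify` function are hypothetical names chosen for illustration, and the rule is only one reading of the SEV1 to SEV3 wording; adapt it to your own definitions.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Lower number means higher severity, matching the SEV1 to SEV3 range."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


@dataclass
class Impact:
    """Hypothetical facts a responder records when declaring an incident."""
    product_unusable_for_all: bool = False  # no customer can use most of the product
    data_loss: bool = False                 # data corruption or loss has occurred or will occur
    revenue_loss: bool = False              # revenue loss is happening or imminent
    core_feature_unusable: bool = False     # primary functionality is severely impacted
    workaround_available: bool = True       # customers have a way around the problem


def classify(impact: Impact) -> Severity:
    """Map recorded impact to a severity level (one possible reading of the definitions)."""
    if impact.product_unusable_for_all or impact.data_loss or impact.revenue_loss:
        return Severity.SEV1
    if impact.core_feature_unusable and not impact.workaround_available:
        return Severity.SEV2
    # Everything else: partial or intermittent impact, with a workaround.
    return Severity.SEV3


if __name__ == "__main__":
    print(classify(Impact(core_feature_unusable=True, workaround_available=False)).name)
```

Keeping the rule this small is deliberate: if a responder can't answer the questions behind each field in a few seconds, the definitions are probably too detailed for a starting point.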
As you learn more about your systems and see your response process in action, you’ll adapt these definitions and make them your own. Let’s walk through a few things to consider during this iterative journey.
The impact on customers
The ultimate measure of an incident is its impact on customers. Generally speaking, the higher the impact on your customer, the higher the severity of the incident.
When defining severity levels, it’s paramount to understand how you define uptime for your customers and to take any contractual obligations, like customer-facing SLAs, into consideration. Breaching an SLA often has punitive impacts on your organization and, more importantly, will surely lead to a poor customer experience. Your SLAs help you set a danger zone: the closer your product gets to an SLA breach, the higher the severity of an incident.
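One way to picture the danger zone is as a downtime budget. The sketch below assumes a hypothetical uptime SLA expressed as a fraction (for example, 0.999 over a 30-day period) and maps the share of the budget already used to a severity. The function names and the 25% and 75% thresholds are illustrative assumptions, not recommended values.

```python
def downtime_budget_minutes(sla_target: float, period_minutes: float) -> float:
    """Minutes of downtime allowed in the period before the SLA is breached."""
    if not 0.0 < sla_target < 1.0:
        raise ValueError("sla_target must be a fraction between 0 and 1")
    return (1.0 - sla_target) * period_minutes


def budget_consumed(downtime_minutes: float, sla_target: float, period_minutes: float) -> float:
    """Fraction of the downtime budget used so far (1.0 or more means a breach)."""
    return downtime_minutes / downtime_budget_minutes(sla_target, period_minutes)


def severity_from_budget(consumed: float) -> str:
    """Hypothetical thresholds: the closer to a breach, the higher the severity."""
    if consumed >= 0.75:
        return "SEV1"
    if consumed >= 0.25:
        return "SEV2"
    return "SEV3"


if __name__ == "__main__":
    period = 30 * 24 * 60  # a 30-day period, in minutes
    used = budget_consumed(downtime_minutes=40, sla_target=0.999, period_minutes=period)
    print(f"{used:.0%} of budget used -> {severity_from_budget(used)}")
```

With a 99.9% target over 30 days, the budget is roughly 43 minutes, so an incident that has already caused 40 minutes of downtime that month sits deep in the danger zone.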
If you don't yet have SLAs in place, think about worst-case scenarios for your customers. For example, your scale of impact might go from not being able to use your product at all to missing website elements that don't impact experience.
The impact on your team
Are you over-rotating on every incident being high severity? If everything is an emergency, you run the risk that your responders will suffer burnout and that people might wind up ignoring critical issues due to alert fatigue.
When adapting your severity definitions, think about how often your team has incidents and when they occur. For example, you might classify the same incident differently if it happens at 2 a.m. when your customers aren’t active, as opposed to at 2 p.m. during peak traffic. The all-hands-on-deck response effort you might employ at 2 p.m. would be overkill at 2 a.m. Clear definitions help both your engineers and stakeholders know what is expected of all parties and provide peace of mind with a right-sized response.
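The 2 a.m. versus 2 p.m. example can be sketched as a small adjustment on top of a base severity. The peak window (09:00 to 18:00) and the one-level downgrade are assumptions made for illustration; your traffic patterns and your tolerance for downgrading will differ.

```python
# Hypothetical peak-traffic window, in local hours (24-hour clock).
PEAK_START_HOUR = 9
PEAK_END_HOUR = 18


def is_peak(hour: int) -> bool:
    """True when the hour falls inside the assumed peak-traffic window."""
    return PEAK_START_HOUR <= hour < PEAK_END_HOUR


def right_size(base_sev: int, hour: int) -> int:
    """Return the severity number to respond at (1 is most severe, 3 is least).

    During peak hours the base severity stands. Off-peak, the incident is
    treated one level less severe, never going below SEV3.
    """
    if base_sev not in (1, 2, 3):
        raise ValueError("base_sev must be 1, 2, or 3")
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    if is_peak(hour):
        return base_sev
    return min(base_sev + 1, 3)


if __name__ == "__main__":
    for hour in (2, 14):
        print(f"{hour:02d}:00 -> SEV{right_size(2, hour)}")
```

A team might reasonably exempt some incidents from this adjustment, such as anything involving data loss, since that impact doesn't depend on how many customers are active.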
Adapting severity levels
If you get too specific about how you define your severity levels in the beginning, you might be boxing yourself into something that doesn’t work down the road. Experience is the best test.
Once you’ve documented your initial severity definitions, take note of how they work for you in action. These definitions shouldn’t be static; a good response process makes space for introspection and revision of your severity definitions (more on that in a minute). We’re living this approach at FireHydrant, too. In an effort to lower the anxiety around incidents, we recently introduced a triage severity and found it to be a huge boon for our engineering culture.
Put them to use
Once you’ve established your base definitions, it’s time to get your team using them. The first step is to document them. If you already have a documented incident response process in place, your severity definitions will be a great addition to it. If you don’t, this is a perfect place to start.
From the definitions to the response process each severity level requires, everyone should know what is expected of them when an incident is declared a SEV2, for example. To that point, documentation also needs to be easily accessible, whether it’s in a spreadsheet, an internal wiki, or part of a tool that guides you through the steps to take for each severity level.
Once your documentation is in place, socialize it. At their core, severity definitions are an agreement among responders and stakeholders on the impact of an incident and the level of response needed. If one party isn’t bought in on that agreement, it’s not very useful.
And then learn and adapt. Your incident retros are a great place to evaluate the size of a response effort relative to its impact (a great reason to start having them if you’re not already). If you find your definitions are missing the mark, simply adjust them.
Feel the impact
By using severity definitions to classify your incidents, you can right-size the approach and get everyone on the same page. These definitions provide the structure for your team to follow a repeatable process that reduces the cognitive overhead for responders and stakeholders. This, in turn, makes it easier for them to decide what to do next and set expectations so they can get to remediating faster.
When everyone is aligned with a common language, they know how to categorize an incident, how to respond, and who is involved. And this means you can more efficiently manage the impact of the incident overall.