Severity and priority can be challenging for a company to nail. When an incident is declared, it's essential to have a system to define the impact and how urgently it should be handled. Incident severity and priority are the two knobs teams can leverage to define scope and urgency, and eventually, the appropriate process to take action. But how should we define them, and what are the differences?
What is Severity?
Incident severity quickly explains the ballpark impact of an incident. You need your severity definitions nailed down before you can manage incidents well. The severity of an incident should be known company-wide, not just within engineering, as it helps everyone understand the impact. After all, reliability is a business metric, not an engineering metric.
There are many ways to define severities, but we recommend using the SEV1-5 system. In some processes, teams will include a SEV0 to indicate an absolute catastrophe. A sound severity system is in plain language and can be leveraged by every member of an organization, not only engineering. Without easily understood definitions, all incidents end up becoming SEV1.
Below, we've mapped out a best-practice severity list that every organization can leverage in its incident response.
- SEV1: No customer can use most or all of a product or service. Most pages in our product are not loading or are displaying an error message. Data corruption or loss has occurred or will occur. Loss of revenue is happening or imminent.
- SEV2: Primary product functionality is severely impacted and unusable. Customers are unable to utilize a common feature to its fullest ability. Data may not be displayed as expected but is not lost. There is no workaround for customers.
- SEV3: Some customers (not all) are receiving intermittent errors on product pages or cannot use the product in possibly obscure ways. The product may be loading slowly or partially (missing images), and there is a workaround that customers can use.
- SEV4: An internal issue, such as an inaccessible account usage dashboard for company admins. Customer email notifications, such as password reset emails, are slightly delayed. Platform problems such as a minor loss in platform redundancy of clusters.
- SEV5: Product problems that customers may notice but don't particularly care about. Sentiment in customer notifications is neutral or positive in that they're trying to help. Custom fonts not loading, comment likes are displaying a frowny face instead of a heart, etc.
Using these severities, anyone on a team should be able to declare an incident with confidence that the severity they've set accurately indicates the impact. Everyone should be able to call the fire department, after all.
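As a rough illustration, the severity list above could be encoded so that tooling and people share the same plain-language wording. This is a minimal sketch, not a prescribed implementation: the names `Severity`, `SEVERITY_SUMMARY`, `describe`, and `most_severe` are hypothetical, and the summaries are condensed from the definitions above.

```python
from enum import IntEnum


class Severity(IntEnum):
    """SEV1 is the most severe; a lower number means a larger impact."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4
    SEV5 = 5


# One-line, plain-language summaries condensed from the list above.
SEVERITY_SUMMARY = {
    Severity.SEV1: "No customer can use most or all of a product or service.",
    Severity.SEV2: "Primary functionality is severely impacted; no workaround.",
    Severity.SEV3: "Some customers see intermittent errors; a workaround exists.",
    Severity.SEV4: "Internal issue or minor loss of platform redundancy.",
    Severity.SEV5: "Noticeable to customers, but they don't particularly care.",
}


def describe(severity: Severity) -> str:
    """Return a label anyone in the company can read at a glance."""
    return f"{severity.name}: {SEVERITY_SUMMARY[severity]}"


def most_severe(severities: list[Severity]) -> Severity:
    """When several symptoms are observed, report the most severe one."""
    return min(severities)
```

For example, `most_severe([Severity.SEV4, Severity.SEV2])` picks `Severity.SEV2`, because the lower number indicates the larger impact.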
What is Priority?
An incident's priority defines when it should be addressed. While severity is a primary driver of importance, several other factors may come into play. For example, the number of currently active incidents or the personnel available to mitigate might drive a SEV1's priority down to a P3. Conversely, a lower-severity issue impacting a strategic customer would warrant a higher priority. Practical priority definitions are short and in plain language.
- P1: Stop the world and pave calendars; nothing is more important than addressing a P1 incident.
- P2: Finish the current meeting you're in but cancel the rest until you resolve a P2 incident.
- P3: Tackle a P3 the next business day; it should cause very little stress.
- P4: A P4 should be addressed, but next week is likely OK.
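One way to picture how severity and other factors combine into a priority is the sketch below. The baseline mapping and the one-step adjustments are assumptions chosen for illustration, not rules from this article, and the `Incident` fields `strategic_customer` and `responders_available` are hypothetical; a real team would tune these to its own process.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    """P1 is the most urgent."""
    P1 = 1
    P2 = 2
    P3 = 3
    P4 = 4


@dataclass
class Incident:
    severity: int                      # 1 (SEV1) through 5 (SEV5)
    strategic_customer: bool = False   # hypothetical factor
    responders_available: bool = True  # hypothetical factor


def suggest_priority(incident: Incident) -> Priority:
    """Suggest a priority from severity plus situational factors."""
    if not 1 <= incident.severity <= 5:
        raise ValueError("severity must be between 1 and 5")
    # Assumed baseline: SEV1 -> P1, SEV2 -> P2, SEV3 -> P3, SEV4/SEV5 -> P4.
    level = min(incident.severity, 4)
    if incident.strategic_customer:
        level -= 1  # assumed rule: strategic impact raises urgency one step
    if not incident.responders_available:
        level += 1  # assumed rule: nobody free to respond lowers it one step
    # Clamp to the valid P1..P4 range.
    return Priority(max(1, min(level, 4)))
```

Under these assumed rules, a SEV3 affecting a strategic customer is suggested as a P2, while a SEV1 with no responders free drops to a P2.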
Priorities also help an on-call engineer understand when to pivot their focus to a higher-priority incident. For example, if an engineer is mitigating a P3 incident, but a P1 incident arises, the responder knows that the P3 incident should be postponed or delegated to someone else. Priorities also help organizations align expectations for a response, facilitating faster and more efficient communication with customers and stakeholders.
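The pivot decision described above reduces to a simple comparison. The following is a minimal sketch that assumes priorities are represented as integers 1 through 4 (P1 through P4); the function name `should_pivot` is hypothetical.

```python
from typing import Optional


def should_pivot(current: Optional[int], incoming: int) -> bool:
    """Decide whether a responder should switch to the incoming incident.

    Priorities are integers 1..4 (P1..P4); a lower number is more urgent.
    `current` is None when the responder is not working an incident.
    """
    if current is None:
        return True
    # Only a strictly more urgent incident justifies postponing or
    # delegating the one already in progress.
    return incoming < current
```

With this rule, `should_pivot(current=3, incoming=1)` evaluates to `True`, matching the P3-to-P1 example, while an incoming incident of equal priority does not interrupt the work in progress.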
A powerful combo
Severity allows a stakeholder to quickly understand the impact an incident is causing, and priority indicates how fast a responder should react to the incident. These toggles help teams and stakeholders rapidly understand the overall scope of an incident and the expected level of response. They're also crucial for enabling teams to build well-defined processes for when (not if) an incident is declared.