The new principles of incident alerting: it’s time to evolve

In the ever-evolving world of software engineering, the landscape is constantly shifting. New technologies emerge, best practices evolve, and how we build and run software continues to change. However, when it comes to incident alerting, it often feels like we're stuck in the past.

Teams are moving to a service ownership model, reliability is being measured with SLOs, and engineers are wearing out the ball bearings of their swivel chairs. Today’s incidents often require coordinated responses from multiple people or, in some cases, multiple teams. Yet current alerting offerings from tools like PagerDuty, Opsgenie, and VictorOps remain largely the same. They're priced the same way, do the same thing, and do little to change how we think about and react to incidents.

It's high time we acknowledge that the way we’re running alerting is stuck in time. To move forward, we must lift the curtain on areas that have to evolve and embrace the new principles of incident alerting.

The alerting cost conundrum

Alerting tools have become notorious for their exorbitant costs. Many of us have scratched our heads, wondering why these tools come with such a hefty price tag. It's not just a matter of budgeting; it's about value for money. In today's economic climate, where CFOs are keeping a close eye on every penny spent, the price of alerting tools has come under scrutiny.

The new principle: cost efficiency

It's time for a paradigm shift. We need alerting tools that not only deliver exceptional performance but are also cost-efficient. The future of incident alerting should prioritize affordability without compromising quality.

The way forward is active-user invoicing: you pay only for the users who were actually paged that month. Alerting is a must-have for most businesses building software in 2023, but they shouldn’t have to pay for hundreds (or even thousands!) of seats that are never used.
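To make the arithmetic concrete, here is a minimal sketch of how an active-user invoice could be computed. The function name, the page-log shape, and the per-user price are all hypothetical and chosen only for illustration; they do not describe any vendor's actual billing.

```python
from datetime import datetime


def active_user_invoice(pages, year, month, price_per_user):
    """Bill only for distinct users who were paged in the given month.

    pages: iterable of (user, paged_at) tuples (hypothetical log format).
    Returns (total_cost, sorted list of billed users).
    """
    active = {
        user
        for user, paged_at in pages
        if paged_at.year == year and paged_at.month == month
    }
    return len(active) * price_per_user, sorted(active)


# Hypothetical page log: "ana" is paged twice in October but billed once;
# "lee" was only paged in September, so is not billed for October.
pages = [
    ("ana", datetime(2023, 10, 3, 2, 15)),
    ("ana", datetime(2023, 10, 19, 14, 0)),
    ("raj", datetime(2023, 10, 7, 9, 30)),
    ("lee", datetime(2023, 9, 28, 23, 5)),
]

total, billed = active_user_invoice(pages, 2023, 10, price_per_user=20.0)
print(total, billed)
```

Under this model, a team with hundreds of configured seats but only two people paged in a month would be invoiced for two users.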

Scheduling and substitution woes

It's baffling that we're still dealing with outdated methods for managing on-call rotations. Even simple round-robin logic took years to implement, and we're still unable to handle everyday tasks like temporary coverage seamlessly. This can take a serious toll on on-call responders and lead to a cultural drain that can impact morale, productivity levels, and turnover.

The new principle: seamless scheduling

It’s time to make scheduling compatible with life. We should be able to schedule and manage on-call rotations with ease, right from within the tools we use daily, like Slack. The future of incident alerting means no more hiccups when someone needs to step away from their desk or switch shifts.
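As a rough illustration of how little logic round-robin rotation with temporary coverage needs, here is a sketch. It assumes weekly shifts and a simple list of override windows; the function name and data shapes are hypothetical, not taken from any particular tool.

```python
from datetime import date


def on_call(rotation, start, day, overrides=()):
    """Return who is on call on `day`.

    rotation:  ordered list of responders, each taking one weekly shift.
    start:     first day of the first shift.
    overrides: (first_day, last_day, substitute) tuples, inclusive,
               representing temporary coverage.
    """
    # Temporary coverage wins over the regular rotation.
    for first_day, last_day, substitute in overrides:
        if first_day <= day <= last_day:
            return substitute
    # Round-robin: count whole weeks since the start and wrap around.
    weeks_elapsed = (day - start).days // 7
    return rotation[weeks_elapsed % len(rotation)]


rotation = ["ana", "raj", "lee"]
start = date(2023, 10, 2)

# "lee" covers for "raj" on two days of raj's week.
overrides = [(date(2023, 10, 10), date(2023, 10, 11), "lee")]

print(on_call(rotation, start, date(2023, 10, 9), overrides))
print(on_call(rotation, start, date(2023, 10, 10), overrides))
```

A swap requested from a chat tool could then amount to appending one override tuple, leaving the underlying rotation untouched.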

Service directory: a make-believe world

The concept of service directories has become a facade. Many organizations use them as a mere representation of teams because there's no better way to page them. This complicates the incident management process and hinders true service ownership.

The new principle: empowering service ownership

Our alerting tools should enable true service ownership by letting us page teams directly instead of standing in fake services for them. No more hacks or workarounds; it's time for incident alerting tools to embrace the reality of service ownership.

The way forward focuses on notifying teams (not services) about incidents. Alerting and service catalogs should be properly integrated so that escalation policies, signal rules, and schedules can be scoped to a team. This means everyone knows what to do and how to jump into action fast.
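One way to picture team-scoped alerting is a catalog in which services point to an owning team and the escalation policy lives on the team. The sketch below is only an assumption about how such a model might look; the `Team` and `Catalog` classes and their fields are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Team:
    name: str
    # Ordered escalation policy: who gets paged first, second, and so on.
    escalation_policy: list


@dataclass
class Catalog:
    teams: dict = field(default_factory=dict)           # team name -> Team
    service_owner: dict = field(default_factory=dict)   # service -> team name

    def add_team(self, team, services):
        """Register a team and the services it owns."""
        self.teams[team.name] = team
        for service in services:
            self.service_owner[service] = team.name

    def route(self, service):
        """Resolve an alert on a service to the owning team."""
        return self.teams[self.service_owner[service]]


catalog = Catalog()
catalog.add_team(
    Team("payments", ["primary on-call", "secondary on-call", "team lead"]),
    services=["checkout-api", "billing-worker"],
)

# Both services resolve to the same team and the same escalation policy.
team = catalog.route("billing-worker")
print(team.name, team.escalation_policy[0])
```

The design choice here is that the policy is defined once per team, so adding a service means adding a single ownership entry rather than another copy of the policy.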

Alerts vs. incidents: a confusing mix

Alerts and incidents are often treated as one and the same, leading to noise and confusion. This blurs the line between what truly requires immediate attention and what can wait, preventing us from gaining valuable insights.

The new principle: a clear distinction

In the future of incident alerting, we need a clear separation between alerts and incidents. This will allow us to accurately measure metrics like alert-to-noise ratio and mean time to detect, giving us a deeper understanding of our systems.

Having clear-as-day analytics can empower teams to have data-backed discussions about which alerts they need and which they can drop entirely. That means happier on-call teams and faster assembly time.
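Once alerts and incidents are separate records, metrics like these become simple to compute. The sketch below assumes each alert optionally links to an incident and treats unlinked alerts as noise; that definition, the record shapes, and the function names are illustrative assumptions rather than a standard.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical records: an alert either links to an incident or it does not.
alerts = [
    {"id": "a1", "incident": "inc-1"},
    {"id": "a2", "incident": None},
    {"id": "a3", "incident": None},
    {"id": "a4", "incident": "inc-2"},
]

incidents = [
    {"id": "inc-1",
     "started_at": datetime(2023, 10, 3, 2, 0),
     "detected_at": datetime(2023, 10, 3, 2, 4)},
    {"id": "inc-2",
     "started_at": datetime(2023, 10, 7, 9, 0),
     "detected_at": datetime(2023, 10, 7, 9, 10)},
]


def noise_ratio(alerts):
    """Share of alerts that never became part of an incident."""
    noisy = sum(1 for alert in alerts if alert["incident"] is None)
    return noisy / len(alerts)


def mean_time_to_detect(incidents):
    """Average gap between an incident starting and being detected."""
    gaps = [
        (incident["detected_at"] - incident["started_at"]).total_seconds()
        for incident in incidents
    ]
    return timedelta(seconds=mean(gaps))


print(noise_ratio(alerts))
print(mean_time_to_detect(incidents))
```

Numbers like these give a team something concrete to point at when deciding which alerts to keep and which to drop.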

Let’s evolve alerting

The time has come for a revolution in incident alerting. The problems we've encountered with current alerting tools are not insurmountable. By embracing these four new principles (cost efficiency, seamless scheduling, empowering service ownership, and a clear alert-incident distinction), we can usher in a new era of incident management that aligns with the dynamic nature of modern software engineering.

So, let's leave the past behind and step confidently into the future of incident alerting. Together, we can build a more efficient, effective, and innovative incident management ecosystem that benefits us all. Learn more about how we’re doing that with Signals, and sign up to be notified when we open the beta.