As an on-call engineer, you might deal with the day-in, day-out occurrence of alerts. These alerts may come from your alerting provider (PagerDuty, OpsGenie, etc.), Slack notifications telling you the site is down, or the ever-concerning text message "Hey, is the site down?". These alerts elicit reactions that range from "shit" to "again?" and, in many cases, both.
Alert fatigue, occasionally referred to as alarm fatigue, is most commonly noted and studied among healthcare professionals. But it’s not hard to imagine that a never-ending stream of alerts will eventually fatigue software engineers as well. Take regular consumers, for example. In the 1970s, the average consumer saw about 500 advertising messages a day, but in the 2000s it was estimated to be as high as 5,000, with the most recent estimates being as high as 10,000 messages a day. So, when it comes to software engineers and alerts related to their jobs, it's only natural that you feel tired and stressed after dealing with the noise that alerts bring. As a species, we didn't evolve to hear and see notifications from our computers and phones while customers struggle to use our services. Our brains become desensitized to the constant inundation of alerts.
What science tells us
There are many studies on the effects that alert fatigue and cognitive, information, and sensory overload have on people and their health.
Spoiler alert: It's not good.
In Overload and Boredom, O.E. Klapp says “...a large amount and high rate of information act like noise when they reach overload: a rate too high for the receiver to process efficiently without distraction, stress, increasing errors and other costs making information poorer.”
And in A Study of Continuity and Change, John Feather describes information overload as “the point where there is so much information that it is no longer possible effectively to use it.”
This line from The Problem of Information Overload in Business Organizations certainly drives the point home: “...there cannot be many people who have not experienced the feeling of having too much information which uses up too much of their time, causing them to feel stressed which, in turn, affects their decision-making. Concurrent with these phenomena is the anxiety generated by worrying whether an important piece of information has been missed in the volume of material that is being processed.”
As our technology systems grow more complex and serve more people, the combination of visual and audible alerts that comes from operating those systems is also increasing. On-call engineers are responsible for keeping the lights blinking green, and a continuous stream of messages adds cognitive stress that is difficult for them to overcome. Over time, these ongoing alerts make outages worse through overload, desensitization, and employee burnout, and they open the door to significant errors.
In healthcare, “distraction has been shown to play a role in nearly 75% of medical errors, and studies have demonstrated that cognitive overload is a cause in 80% of medical device user errors.” While the impact of an error in software may not be as detrimental as in healthcare, errors can still cost your business a pretty penny.
When it comes to alert sounds, continuous noise from operating complex systems keeps our brains in a high-alert setting, which releases stress hormones. Humans evolved acute hearing millions of years ago when we were prey. We had to pinpoint predators, so it is no wonder we find noise stressful. We are hardwired to avoid becoming dinner.
What you can do to fight the fatigue
Convert noise to signal
In my experience of being on call, every team has at least one alert that effectively tells us a computer is doing computer things. These alerts could be for anything from high memory usage to a sudden burst of requests. The problem with these alerts is that they tell the on-call engineer nothing of substance and, practically speaking, don't mean anything is wrong per se. Anyone who has ever cooked in a kitchen has set off a smoke alarm when no fire risk was present.
One of the most important things your team can do is run a dedicated project to understand noisy alerts and then either remove them entirely, make them actionable, or fix the underlying factors that trigger them.
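As a starting point for that kind of project, here is a minimal sketch of how a team might rank its noisiest alerts. It assumes you can export a history of pages along with whether the responder had to do anything. The `Page` record, its field names, and the thresholds are all hypothetical and would need to match your own alerting data.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Page:
    # Hypothetical record exported from your alerting provider.
    alert_name: str
    action_taken: bool  # True if the responder actually had to do something


def noisy_alerts(pages, min_pages=5, max_action_rate=0.2):
    """Return (alert_name, page_count, action_rate) for alerts that fire
    often but rarely need action, noisiest first.

    min_pages and max_action_rate are illustrative thresholds, not
    recommendations.
    """
    fired = Counter(p.alert_name for p in pages)
    acted = Counter(p.alert_name for p in pages if p.action_taken)

    candidates = []
    for name, count in fired.items():
        action_rate = acted[name] / count
        if count >= min_pages and action_rate <= max_action_rate:
            candidates.append((name, count, action_rate))

    # Sort by how often the alert paged someone, highest first.
    return sorted(candidates, key=lambda row: row[1], reverse=True)
```

Alerts that show up in this list are candidates for one of the three outcomes above: remove them, make them actionable, or fix what triggers them.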
My suggestion is to create alerts based on customer impact, much like SoundCloud does with its service level objectives. Doing so accomplishes a couple of things:
It focuses the alert on something that matters to your business.
It aligns the on-call resolver to the functionality that is broken, not just computer vitals.
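To make the idea of alerting on customer impact concrete, here is a rough sketch of an alert condition built on a service level objective rather than on machine vitals. It is not a description of SoundCloud's setup. The function names, the 99.9% target, and the paging threshold are assumptions chosen for illustration only.

```python
def burn_rate(failed, total, slo_target):
    """Return how fast the error budget is being spent.

    A value of 1.0 means requests are failing at exactly the rate the
    SLO allows; higher values mean the budget is burning faster.
    """
    if total == 0:
        return 0.0  # no traffic, so no customer impact to measure
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% target
    return (failed / total) / error_budget


def should_page(failed, total, slo_target=0.999, threshold=14.4):
    """Page only when customers are failing far faster than the SLO allows.

    Both defaults are illustrative assumptions; tune them to your service.
    """
    return burn_rate(failed, total, slo_target) >= threshold
```

With these assumed defaults, 20 failed requests out of 1,000 would page someone, while 1 failed request out of 1,000 would not. A server running at high memory with no failing requests would not page at all, because the condition only looks at what customers experience.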
Having a single person or small group of people on call is a vulnerability for your team. While it may sound strange, it is possible to have people without an operations background on call. In many ways, it's better to split alert triage responsibilities than to bake them all into the same escalation. Since not every alert that fires will be triaged and fixed by every engineer, it stands to reason to spread out the rotation. Including non-engineers can spread workload and risk across the team.
Reduce the stress during the incident
There will be weeks where it seems like the alerts won't stop. It's an unfortunate part of operating complex systems. In those moments when you can't reduce the stress from the noise of alerts, focus on the stress of the incident itself. Automating away all of your incidents is a Sisyphean task. In actuality, you can't. You can, however, guide people to proper resolutions. A simple checklist that gets the wheels turning in high-stress situations can greatly help teammates perform better during an actual incident.
Beat the fatigue
I’ve been on call for years, and I’ve personally felt the effects of constant pages. You can quickly devolve into a state where you no longer care about the reliability of a system. To overcome the inevitable fatigue that comes from never-ending alerts, we must, in some ways, consider the health of our systems as our own.
Strive for reliable systems; your health may depend on it. Literally.