I’ve offered up some tips for folks who are on-call during the COVID-19 crisis, but I thought it would be helpful to get more ideas from people with different perspectives. So I reached out to some people I trust to see what they had to say. They all have different viewpoints, but some themes emerge, like managing alerts, having empathy, and practicing self-care.
The participants, in alphabetical order:
Aaron Aldrich is a Developer Advocate at LaunchDarkly, with a focus on DevOps. Jeff Smith is Director of Production Operations at Centro. Lorin Hochstein is a Senior Software Engineer at Netflix, where he focuses on helping teams learn from incidents. Tiffany Longworth is an SRE Manager at Zapproved.
I want to thank them all for sharing their thoughts with us. If you have additional ideas for teams that are on-call right now, feel free to leave those in the comments below.
It’s an entirely different discussion for me to talk about good Observability and release engineering practices, though they definitely _help_ with on-call cycles, so instead here’s some advice on what to do _right now_ without any engineering changes:
Give yourself space to unpack the stress of being on-call. Globally, we’re going through a long, drawn-out trauma. There’ve been articles and conversations about how we’re all going through grief right now. And to make it worse, you’re dealing with another trauma, another fight-or-flight event, every time you hear that PagerDuty alarm go off. Now, more than ever, take time off after you handle an after-hours incident. Create a policy on your team that encourages you to take real downtime immediately following your incidents. Ensure that your teammates step in to create the positive peer pressure to _actually_ do this, and make sure that downtime is spent in a way that helps each of you de-stress.
Managers, be extra-sensitive to your reports and how they’re doing. Many of us feel like we should be able to continue to perform like we did before we entered a global emergency, and the reality is, we can’t. Asking for help is hard and vulnerable. Consider just making blanket policies that give people more room and more mental space: give everyone one extra day off per week on a rotating basis, reduce the standard workload globally, or, maybe most importantly, start to put a high priority on all those little quality-of-life improvements sitting on the backlog. Brainspace is at a premium: every step to reduce toil, everything that means your engineers get to sleep just a little more soundly at night, is a high priority. Your team can’t ship production features if they’re all burnt out toiling on malicious systems. Or hell, I don’t know, maybe give everyone spending their increasingly precious self-care time on keeping your products running in production a raise.
The first thing for all teams to recognize is that this is not business as usual. Even if your work is easily performed remotely, even if you’re normally 100% remote, it’s not normal. You’re occupying several personas at the same time that seldom overlap. Partner, teacher, parent, son/daughter, employee, manager, and any other role you put on are now all occupying the same space often at the same time. So, the first piece of advice I’d give is to acknowledge that fact and cut yourself a little slack. Take breaks, walk away. It may feel like you’re not giving the full 8 hours a day, but when do you really ever work a full 8 hours? Between drive-by conversations, water cooler gossip, and the occasional long lunch break, the full 8-hour day is less common than our quarantined selves would like to believe.
It’s also very important that you get that break away from the home office, especially for on-call teams. If you work in a noisy-alert shop, tackling the noisy alerts is more critical than ever. Noisy alerts are often driven by evaluation periods that are too short to let peaks and valleys in the metric run their natural course. What this means is that an alert on CPU usage fires after a mere minute or two of high utilization. Maybe there’s a task that consistently uses a lot of CPU but abates after three minutes. Is it worth alerting on that? I say tune your alerts to allow for a larger evaluation period. Instead of alerting on high utilization after 2 minutes, widen it to 5 or even 10 minutes. This allows peaks and valleys to happen without alerting. Plus, with everyone being quarantined, I’m guessing teams are a lot closer to their computers and can respond quickly enough to make up the few minutes lost to the longer evaluation period when an alert is an actual problem. It also helps if you can tie alerts to business impact and monitor the business impact rather than the symptom. If your alerting system allows it, consider creating composite alerts: two conditions that must be met at the same time in order to alert. If a CPU is burning hot but there’s no business impact, do you need to be interrupted? Probably not.
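To make the evaluation-window and composite-alert ideas concrete, here is a minimal Python sketch. It is not tied to any real alerting product: the function names, the one-sample-per-minute assumption, the thresholds, and the use of an error rate as a stand-in for "business impact" are all hypothetical choices for illustration.

```python
from typing import Sequence


def sustained_breach(samples: Sequence[float], threshold: float, window: int) -> bool:
    """Return True only if the most recent `window` samples all exceed `threshold`.

    Assumes one sample per minute, so `window` is the evaluation period in minutes.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if len(samples) < window:
        # Not enough history yet to evaluate the full window.
        return False
    return all(value > threshold for value in samples[-window:])


def composite_alert(
    cpu_samples: Sequence[float],
    error_rate_samples: Sequence[float],
    *,
    cpu_threshold: float = 90.0,        # hypothetical: percent CPU utilization
    error_rate_threshold: float = 0.05,  # hypothetical proxy for business impact
    window: int = 5,
) -> bool:
    """Alert only when the symptom (hot CPU) AND the impact (errors) are both sustained."""
    return sustained_breach(cpu_samples, cpu_threshold, window) and sustained_breach(
        error_rate_samples, error_rate_threshold, window
    )


def fires_at_any_point(samples: Sequence[float], threshold: float, window: int) -> bool:
    """Replay a series minute by minute and report whether the alert would ever fire."""
    return any(
        sustained_breach(samples[:i], threshold, window)
        for i in range(1, len(samples) + 1)
    )


# A task that burns CPU for three minutes and then abates.
cpu = [40, 95, 97, 96, 45, 42]
short_window_fires = fires_at_any_point(cpu, 90.0, window=2)
long_window_fires = fires_at_any_point(cpu, 90.0, window=5)
```

With the sample series above, the two-minute window would page someone during the three-minute spike, while the five-minute window lets the spike pass. The composite check goes one step further and stays quiet on a hot CPU unless the impact metric is also elevated over the same window.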
Everybody is working at reduced capacity, and many organizations that are used to being co-located have been forced to shift to distributed mode. This means that coordination will be harder, and incidents will take longer to diagnose and remediate. In addition, many services are seeing increased load, which increases the chances of an incident.
People will adapt to their new circumstances and get better at coordinating on incident response as these incidents happen. In the meantime, it’s going to be bumpy: people won’t always respond as quickly as you’d like, or be as precise in the wording of their Slack messages. When this happens, it will help to develop an additional store of patience and empathy for your co-workers. We’re all in this together.
My advice for on-call right now is to hold space for yourself. Being on call is stressful enough when there’s not all of this *gestures broadly at the world* going on. What makes us good at being on call is being able to keep a clear head when dealing with urgent issues, but for many of us, our emotional resiliency is already being stretched. We might have been able to “grin and bear it” before, but a lot of the people I know are running out of grins.
Put some time on your calendar to sit and really reflect on what you are capable of doing right now without reaching your breaking point.
Ask your team and manager if you can push deadlines back, share workloads, or just not do low-value work. By pushing back on artificially inflated urgency and performative busyness, you free up more of your emotional reserves for dealing with *actually* urgent work that comes in while you’re on call.
Also, see if you can take a day off afterward to recharge. On-call is a sprint. Working through a global pandemic is a marathon. Pace yourself.
Don’t forget that you are part of your systems, too, and your care and maintenance are essential to the business.