Captain's Log: Diving into our scheduling design

On-call scheduling is tricky. Like, really tricky. It was one of the scariest parts of building a modern alerting system when we started earlier this year. We knew we couldn't cut any corners on Day One of our release: scheduling had to be fully featured for anyone to realistically use our product (and replace an incumbent).

This meant including windowed restrictions, coverage requests, and rotations ranging from simple to complex. After many months of seeing our scheduling design in the wild, I'm excited to give a detailed technical overview of how it works. Buckle up!

`Schedules != Shifts`

At its core, an on-call schedule is a list of shifts ordered by their start and end times. And that’s exactly how we modeled the database for on-call schedules. It looks like this:

We chose this approach because:

It allows reassigning a shift to another user seamlessly without screwing up the entire rotation.
We can make one-off changes to a schedule that fall outside of its strategy (e.g., on-call coverage for a maintenance window).
It makes summarizing how much time someone has been on-call drop-dead simple in analytics.
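To make those three points concrete, here is a minimal sketch in Go. The `Shift` struct, its field names, and the helpers `OnCallTotals`, `Reassign`, and `at` are assumptions made for illustration, not our actual schema. The point is that when every shift is its own record, reassigning one shift is a single-row change and an on-call total is a plain sum.

```go
package main

import (
	"fmt"
	"time"
)

// Shift is a hypothetical stand-in for a shift record: one user on call
// for one contiguous period.
type Shift struct {
	UserID string
	Start  time.Time
	End    time.Time
}

// OnCallTotals sums the duration of every shift per user.
func OnCallTotals(shifts []Shift) map[string]time.Duration {
	totals := make(map[string]time.Duration)
	for _, s := range shifts {
		totals[s.UserID] += s.End.Sub(s.Start)
	}
	return totals
}

// Reassign hands one shift to another user without touching its neighbors.
func Reassign(shifts []Shift, i int, userID string) {
	shifts[i].UserID = userID
}

// at builds a UTC timestamp in January 2024 (illustrative helper).
func at(day, hour int) time.Time {
	return time.Date(2024, time.January, day, hour, 0, 0, 0, time.UTC)
}

func main() {
	shifts := []Shift{
		{UserID: "alice", Start: at(1, 0), End: at(2, 0)},
		{UserID: "bob", Start: at(2, 0), End: at(3, 0)},
		{UserID: "alice", Start: at(3, 0), End: at(4, 0)},
	}
	// Carol takes Alice's second shift; the rest of the rotation is untouched.
	Reassign(shifts, 2, "carol")
	for user, total := range OnCallTotals(shifts) {
		fmt.Println(user, total)
	}
}
```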

The purpose of shifts

In Signals, a shift is a simple record that stores a start and end time and the on-call user during that period. We create shifts based on a schedule's strategy and allow users to create ad-hoc shifts. Here is a first look at the page that defines how we create shifts for being on-call:

When a schedule is created, a background task is queued to create every shift based on the selected strategy and any restrictions applied. The task usually finishes in a few seconds and creates shifts for the next six months. The same job then runs every day for every schedule, so there are always at least six months of shifts created.
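A rolling horizon like this can be sketched in a few lines of Go. Everything here is an assumption for illustration: the `ExtendShifts` name, the fixed-length round-robin strategy, and the `anchor` parameter are hypothetical, and the sketch assumes existing shifts line up with the rotation grid. The idea is that the job picks up where the last shift ends and appends shifts until the horizon is covered, so running it again with the same horizon adds nothing.

```go
package main

import (
	"fmt"
	"time"
)

// Shift is a hypothetical shift record.
type Shift struct {
	UserID string
	Start  time.Time
	End    time.Time
}

const week = 7 * 24 * time.Hour

// ExtendShifts appends fixed-length, round-robin shifts until the schedule
// is covered through `until`. The rotation position is derived from the time
// elapsed since `anchor`, so repeated runs stay in step with earlier ones.
func ExtendShifts(existing []Shift, users []string, anchor time.Time, length time.Duration, until time.Time) []Shift {
	if len(users) == 0 || length <= 0 {
		return existing
	}
	start := anchor
	if n := len(existing); n > 0 {
		start = existing[n-1].End // continue after the last generated shift
	}
	for ; start.Before(until); start = start.Add(length) {
		turn := int(start.Sub(anchor)/length) % len(users)
		existing = append(existing, Shift{UserID: users[turn], Start: start, End: start.Add(length)})
	}
	return existing
}

// jan builds midnight UTC on a day in January 2024 (illustrative helper).
func jan(day int) time.Time {
	return time.Date(2024, time.January, day, 0, 0, 0, 0, time.UTC)
}

func main() {
	users := []string{"alice", "bob", "carol"}
	anchor := jan(1)
	horizon := anchor.AddDate(0, 6, 0) // six months out

	shifts := ExtendShifts(nil, users, anchor, week, horizon)
	// The "daily job": one day later the horizon has moved, so top up.
	shifts = ExtendShifts(shifts, users, anchor, week, horizon.AddDate(0, 0, 1))
	fmt.Println(len(shifts), "shifts, last on call:", shifts[len(shifts)-1].UserID)
}
```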

Restrictions

One of the most complex parts of creating shifts for an on-call schedule is masking the start and end times with a defined restriction. Shift restrictions are a necessary part of an on-call system because they enable teams to:

Create follow-the-sun rotations
Have off-hour-only shifts
Create shifts with lunch breaks built in

For example, below is a schedule restricted to only Monday-Friday, with a lunch break built into the middle.

This creates several shifts for the given windows:
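As a rough illustration of the masking step (not our production code), the Go sketch below intersects one long shift with a set of allowed windows. The `Window` type, `ApplyRestrictions`, and the specific hours (09:00 to 12:00 and 13:00 to 17:00) are assumptions chosen to mirror a Monday-Friday schedule with a lunch break. It also assumes windows are listed in chronological order and that times are in UTC, so daylight saving shifts are ignored.

```go
package main

import (
	"fmt"
	"time"
)

// Shift is a hypothetical shift record.
type Shift struct {
	UserID string
	Start  time.Time
	End    time.Time
}

// Window is a hypothetical restriction: on the listed weekdays, on-call is
// allowed between From and To, both measured as offsets from midnight.
type Window struct {
	Days map[time.Weekday]bool
	From time.Duration
	To   time.Duration
}

// ApplyRestrictions walks the shift day by day and keeps only the parts
// that overlap an allowed window, clamping to the shift's own bounds.
func ApplyRestrictions(s Shift, windows []Window) []Shift {
	var out []Shift
	y, m, d := s.Start.Date()
	day := time.Date(y, m, d, 0, 0, 0, 0, s.Start.Location())
	for ; day.Before(s.End); day = day.AddDate(0, 0, 1) {
		for _, w := range windows {
			if !w.Days[day.Weekday()] {
				continue
			}
			start, end := day.Add(w.From), day.Add(w.To)
			if start.Before(s.Start) {
				start = s.Start
			}
			if end.After(s.End) {
				end = s.End
			}
			if start.Before(end) {
				out = append(out, Shift{UserID: s.UserID, Start: start, End: end})
			}
		}
	}
	return out
}

// businessHours returns Monday-Friday windows with a one-hour lunch gap.
func businessHours() []Window {
	weekdays := map[time.Weekday]bool{
		time.Monday: true, time.Tuesday: true, time.Wednesday: true,
		time.Thursday: true, time.Friday: true,
	}
	return []Window{
		{Days: weekdays, From: 9 * time.Hour, To: 12 * time.Hour},
		{Days: weekdays, From: 13 * time.Hour, To: 17 * time.Hour},
	}
}

// at builds a UTC timestamp in January 2024; January 1 is a Monday.
func at(day, hour int) time.Time {
	return time.Date(2024, time.January, day, hour, 0, 0, 0, time.UTC)
}

func main() {
	week := Shift{UserID: "alice", Start: at(1, 0), End: at(8, 0)}
	for _, s := range ApplyRestrictions(week, businessHours()) {
		fmt.Println(s.Start.Format("Mon 15:04"), "to", s.End.Format("Mon 15:04"))
	}
}
```

With these assumed windows, a single week-long shift becomes ten shorter shifts: a morning and an afternoon shift for each weekday, and nothing on the weekend.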

How we approach coverage requests

Any great on-call software should support overriding upcoming shifts. Because we separated shifts from schedules, this feature was far more straightforward to implement than it would have been had we overloaded the schedule logic itself.

When someone requests coverage (using Slack or the UI), we split the shift they're requesting coverage for into two shifts (or three, if they're in the middle of a shift) and allow someone else to claim the new shift period. Here's a snippet of the code in production that does this:
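Separately from the production code, the splitting idea itself can be sketched in Go as follows. The names `SplitForCoverage` and `at`, and the convention that an empty `UserID` marks an unclaimed shift, are assumptions made for this example. A request that touches the start or end of the shift yields two pieces; a request in the middle yields three.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Shift is a hypothetical shift record. An empty UserID means "unclaimed".
type Shift struct {
	UserID string
	Start  time.Time
	End    time.Time
}

// SplitForCoverage carves the period [from, to) out of a shift. It returns
// the resulting pieces in order plus the index of the unclaimed piece.
func SplitForCoverage(s Shift, from, to time.Time) ([]Shift, int, error) {
	if !from.Before(to) || from.Before(s.Start) || to.After(s.End) {
		return nil, -1, errors.New("coverage window must be non-empty and inside the shift")
	}
	var pieces []Shift
	if s.Start.Before(from) {
		// The original user keeps the part before the coverage window.
		pieces = append(pieces, Shift{UserID: s.UserID, Start: s.Start, End: from})
	}
	open := len(pieces)
	pieces = append(pieces, Shift{Start: from, End: to})
	if to.Before(s.End) {
		// The original user keeps the part after the coverage window.
		pieces = append(pieces, Shift{UserID: s.UserID, Start: to, End: s.End})
	}
	return pieces, open, nil
}

// at builds a UTC timestamp in January 2024 (illustrative helper).
func at(day, hour int) time.Time {
	return time.Date(2024, time.January, day, hour, 0, 0, 0, time.UTC)
}

func main() {
	shift := Shift{UserID: "alice", Start: at(1, 0), End: at(2, 0)}
	pieces, open, err := SplitForCoverage(shift, at(1, 8), at(1, 12))
	if err != nil {
		fmt.Println("cannot split:", err)
		return
	}
	pieces[open].UserID = "bob" // someone claims the open period
	for _, p := range pieces {
		fmt.Println(p.UserID, p.Start.Format("15:04"), "to", p.End.Format("Jan 2 15:04"))
	}
}
```

Because the pieces are ordinary shifts, nothing about the schedule's strategy has to change when someone claims the open one.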

Loading the configuration

As we wrote in the first Captain's Log, we're very focused on the resiliency of Signals. Laddertruck, the application where our API and UI for configuring on-call schedules live, can fail, and Signals can still be dispatched to on-call engineers.

We accomplish this by serializing schedules and their shifts into protocol buffer messages and storing them in object storage. Here is a snippet from that message definition:

When a schedule is targeted by an escalation policy (or other harness) to send an alert to an on-call engineer, we build an interval tree for all the shifts in the schedule. We're using the intervalst Go package in Siren to create this data structure, which lets us rapidly find the current shift and, therefore, the user we need to notify of an incident.

This code enables us to route an alert to the on-call engineer when a Signal comes in based on the current list of shifts for a schedule. And it's lightning fast.
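For readers who want to experiment with the lookup, here is a simplified Go sketch. It deliberately swaps the interval tree for a sorted slice and a binary search from the standard library, which gives the same answer as long as shifts are sorted by start time and do not overlap. It does not use the intervalst API, and `CurrentShift` and `at` are hypothetical names. Shifts are treated as half-open intervals, so a shift covers its start time but not its end time.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Shift is a hypothetical shift record.
type Shift struct {
	UserID string
	Start  time.Time
	End    time.Time
}

// CurrentShift returns the shift covering `now`, if any. It assumes shifts
// are sorted by Start and non-overlapping, so End times are ascending too.
func CurrentShift(shifts []Shift, now time.Time) (Shift, bool) {
	// Find the first shift that has not ended yet.
	i := sort.Search(len(shifts), func(i int) bool {
		return shifts[i].End.After(now)
	})
	// It only counts if it has already started; otherwise we are in a gap.
	if i < len(shifts) && !shifts[i].Start.After(now) {
		return shifts[i], true
	}
	return Shift{}, false
}

// at builds a UTC timestamp in January 2024 (illustrative helper).
func at(day, hour int) time.Time {
	return time.Date(2024, time.January, day, hour, 0, 0, 0, time.UTC)
}

func main() {
	shifts := []Shift{
		{UserID: "alice", Start: at(1, 9), End: at(1, 12)},
		{UserID: "bob", Start: at(1, 13), End: at(1, 17)},
	}
	if s, ok := CurrentShift(shifts, at(1, 14)); ok {
		fmt.Println("notify", s.UserID)
	}
	if _, ok := CurrentShift(shifts, at(1, 12)); !ok {
		fmt.Println("nobody is on call during the lunch gap")
	}
}
```

An interval tree earns its keep once shifts can overlap, for example when several layers or ad-hoc shifts cover the same period; the sorted-slice version above is only a stand-in for the simple case.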

Wrapping up

By separating shifts from schedules, we've built a robust on-call system that will be included in the launch of our open beta for Signals, coming in the first half of this month. This design and architecture have been in production for months now, and they continue to impress me with their simplicity and effectiveness.