The why and how behind running incident response game days

By Jouhné Scott on 4/11/2023

In any high-pressure situation, the key to fast action is preparedness, and incidents are no exception.

Documenting your incident response processes and training your team on them is essential to a coordinated and efficient response effort. Training sessions, or game days as they're sometimes called, are one way to get everyone up to speed.

Why run incident game days

Incident training game days help build confidence among engineers by giving them a chance to familiarize themselves with the incident response process and tools before they're in the throes of a high-pressure incident.

These training days also give you an opportunity to poke holes in your processes. What isn't accounted for? Where are the gaps? Anything you uncover can then be addressed outside of a live incident.

They also keep team members informed about your processes, whether that's someone who recently onboarded and needs to catch up or a seasoned engineer who could use a refresher.

For example, CircleCI uses a tutorial runbook that maps to production (but without the paging), and new engineers run game days as part of onboarding. Having a good practice framework for an engineer who is only three months in, and who may feel some anxiety about declaring an incident, is a huge benefit, they told us.

How to run incident response game days 

At FireHydrant, we hold game days when we roll out new product functionality, particularly when it impacts a core product. We do this in a few different ways. Sometimes, we’ll simulate an incident and run through how the new feature interacts with other parts of our app. Other times, we’ll find a previous incident or bug and test the feature in that scenario, seeing if it resolves past issues. 

Running your own incident response training doesn’t have to be complicated. Here are a few suggestions:

Set a time limit

Despite the name, game days should rarely last more than an hour. The longer responders focus on a single incident, the more likely they are to experience fatigue. An hour should be enough to run through the game day and then hold a follow-up discussion.

Depending on how the training goes, the follow-up discussion could run longer, since giving everyone a chance to share feedback is important. Things can also pop up during training that alter the timeline. Recently, we had to pause a game day and fix a bug in it before we could continue!

Define your goals

Game days let you set goals you might not have time to think about during an incident. Decide what you want your team to achieve before the training session begins. Example goals include reaching mitigation before the end of the session or determining whether your current alerting system meets your team's needs.

Remember that game days are learning exercises

Don’t use a game day as a “Gotcha!” moment for your engineers. Game days are development tools, not a way to place blame on an individual or specific team. 

Give it a go

To help your team, product, and response processes mature, you need to provide learning opportunities. Game days can reveal immediate improvements for your team to implement, helping you build a more reliable product and process.
