How to run a fire drill
Fire drills, also called "gamedays," help align your incident response team and prepare them for real incidents by working through a mock incident scenario.
We recommend running fire drills regularly, both to improve familiarity with the FireHydrant platform and to rehearse any changes made to your incident management processes.
If you're testing by yourself or with a small group, ensure the rest of your organization knows it's just a mock incident. If you're doing a larger team-based drill, schedule it on the calendar and communicate clearly that it is only practice.
Most importantly, throughout the mock incident, don't worry about perfection. Chances are you'll learn things from your first fire drill and make some changes to your runbook(s).
Although a fire drill is meant to be stress-free, not having an agenda can lead to wasted time. It's best to have a plan and choose the scenario you will play out.
We recommend using a recent incident your team handled. Having a real, previously experienced scenario gives participants a clear picture and a point of comparison for how the drill goes.
As stated above, the goals of a fire drill are to improve familiarity with FireHydrant and your overall processes and to identify gaps. Make sure your team members don't get hung up on extraneous details or dive too far into the weeds on anything (e.g., "well ackshually, that incident was caused by clock skew in a distributed cluster, and after searching through 10 GB of logs...").
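As a rough illustration of what "having a plan" can look like, the sketch below captures a drill agenda in a small Python structure and builds an announcement that makes clear the incident is only practice. The `DrillPlan` class, its fields, and the example scenario are hypothetical and are not part of FireHydrant; only the GAMEDAY severity name comes from this guide.

```python
from dataclasses import dataclass, field


@dataclass
class DrillPlan:
    """Hypothetical container for a fire drill agenda (not a FireHydrant object)."""

    scenario: str  # short description of the incident to replay
    participants: list[str] = field(default_factory=list)
    severity: str = "GAMEDAY"  # severity this guide suggests for drills

    def announcement(self) -> str:
        """Build a notice that makes clear the incident is only practice."""
        who = ", ".join(self.participants) or "the incident response team"
        return (
            f"[FIRE DRILL] Mock incident, severity {self.severity}. "
            f"Scenario: {self.scenario}. Participants: {who}. "
            "This is only practice; nothing is actually broken."
        )


# Example values are made up for illustration.
plan = DrillPlan(
    scenario="replay of a recent checkout outage",
    participants=["alice", "bob"],
)
print(plan.announcement())
```

The resulting text could be pasted into a calendar invite or a company-wide channel before the drill starts.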
The point of fire drills is to practice and improve. There are no wrong answers or bad questions, and (hopefully) nothing important is breaking! Reliability is a company-wide metric, and every responder's feedback is important to consider.
These are the recommended steps for working through your fire drill/gameday incident. You are welcome to take these and modify them; they are provided as reasonable defaults based on numerous past customers we've worked with.
There are a number of ways to start the incident, depending on how you've configured things. The simplest way is to enter /fh new into Slack. Use the GAMEDAY severity to signify that this incident is a fire drill.
Teams feeling fancier and more confident can configure Alert Routing and potentially trigger a FireHydrant incident from an external source like PagerDuty.
After the incident starts, your automations in Runbooks should have kicked off. For many users, this involves the creation of an incident channel, a meeting bridge, and various other things like assigning task lists, teams, and more.
The team(s) or individual(s) assigned should respond to the incident. From here, workflows can diverge greatly depending on organization processes and needs, so we will provide a list of common Slack commands run or actions taken to move the incident forward.
You can execute some or all of these, in any order, depending on how your team handles incidents.
Assign teams and roles
/fh assign team
/fh assign role
Usually, FireHydrant recommends assigning roles and teams as part of Runbook automation. However, for rudimentary incident management, or to pull in additional personnel as needed, you can manually assign them with the above Slack commands.
Users who have been assigned to an incident will receive a DM in Slack about their assignment, and we will also automatically add them to the incident channel. For a refresher on this, see Users, Teams, & Roles.
Chat and Upload Artifacts
As a normal part of incident management, users will often chat and post messages, screenshots, and more into the incident channel. FireHydrant tracks all of this activity as part of the incident timeline.
While working through the incident, if a particular message or item is important to the incident overall, we recommend Starring it. For refreshers on these, see:
/fh tasks [@user | unassigned | all]
You can create tasks ad-hoc or predefine lists of tasks and assign whole lists. For a refresher, visit Managing Incident Tasks.
Updating Incident Details
/fh update
/fh add note
As you work your way through the incident, you'll want to update things like the Milestone, post incident updates to both timeline and Statuspages, and more. For a refresher, visit Posting Incident Updates.
Viewing Service Info
/fh service [service name]
You can directly browse services from the service directory for information on responding/owning teams, external links, who's on-call, and latest known changes, if configured. For refreshers on this, see:
Note: Requires a configured service catalog.
/fh on-call
/fh page (service | functionality)
Teams sometimes want to see who's on call for specific services and functionalities, and to page them.
Note: Requires a configured alerting provider and linked services. See Importing services and Linking external services for refreshers.
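To keep track of which parts of the workflow a drill actually exercised, the Slack commands listed above can be collected into a simple checklist. The following is a minimal sketch, assuming you note by hand which commands were run; it does not call FireHydrant or Slack, and the group labels "Tasks" and "On-call and paging" as well as the `unexercised` helper are made up for illustration.

```python
# Slack commands named in this guide, grouped by the activity they exercise.
# Group labels are illustrative; only the command strings come from the guide.
DRILL_COMMANDS = {
    "Assign teams and roles": ["/fh assign team", "/fh assign role"],
    "Tasks": ["/fh tasks"],
    "Updating incident details": ["/fh update", "/fh add note"],
    "Viewing service info": ["/fh service"],
    "On-call and paging": ["/fh on-call", "/fh page"],
}


def unexercised(used_commands):
    """Return the activities for which no listed command was run during the drill."""
    used = set(used_commands)
    return [
        activity
        for activity, commands in DRILL_COMMANDS.items()
        if not used.intersection(commands)
    ]


# Example: the team only assigned a role and posted an update.
gaps = unexercised(["/fh assign role", "/fh update"])
print(gaps)
```

Any activity returned by `unexercised` is a candidate to cover in the next drill.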
Once you've gone through your mock scenario and the incident is deemed "resolved", resolve it in FireHydrant. This will officially close the incident and mark all impacted components as "Operational" again.
Now that the incident is resolved, it's time to run a retrospective! For the most part, you can run your incidents entirely out of Slack; however, retrospectives need to be done inside the FireHydrant UI. We recommend:
- Reviewing the timeline of starred/important events
- Recapping the description, customer impact, impacted components, and involved personnel
- Going through and logging the Contributing Factors that led to the incident
- Answering any questions you've configured in Lessons Learned
- Reviewing completed/open Tasks, and creating any Follow-Ups as needed
- Finally, publishing the retrospective and exporting it to PDF or other destinations
For a refresher on Retrospectives, visit Creating and Running Retrospectives.
Tip: You can re-order and modify the Lessons Learned questions.
Once you've completed the retrospective, head over to the analytics section of the FireHydrant UI to see what types of metrics we track. For a refresher, see Analytics: Getting Started.
The most important questions are essentially byproducts of the default questions we include in the Retrospective.
- What went well?
  - What were the best parts of running through the incident on FireHydrant?
  - What did responders like or find a lot of value in?
  - How can the team double down on these and ensure consistency and training for all responders?
- What could be improved?
  - What were some areas where the process hiccuped?
  - What did responders not like?
  - How can the team improve these pieces and revisit the improvements for feedback?
- Where did we get lucky?
  - Was there anything that went well unintentionally?
  - How can the team take that and repeat it again so that it's intentionally good next time?
- What were we wrong about?
  - What assumptions were proven wrong?
  - Is there a better way to do things than the way we want to do things?
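If you want to collect written feedback from responders after the drill, the questions above can be turned into a plain-text form. This is a minimal sketch, not a FireHydrant feature: the `RETRO_QUESTIONS` structure and the `render_feedback_form` helper are hypothetical, and the question text is copied from the list above.

```python
# The four review questions from this guide, each with its follow-up questions.
RETRO_QUESTIONS = {
    "What went well?": [
        "What were the best parts of running through the incident on FireHydrant?",
        "What did responders like or find a lot of value in?",
        "How can the team double down on these and ensure consistency "
        "and training for all responders?",
    ],
    "What could be improved?": [
        "What were some areas where the process hiccuped?",
        "What did responders not like?",
        "How can the team improve these pieces and revisit the improvements "
        "for feedback?",
    ],
    "Where did we get lucky?": [
        "Was there anything that went well unintentionally?",
        "How can the team take that and repeat it again so that it's "
        "intentionally good next time?",
    ],
    "What were we wrong about?": [
        "What assumptions were proven wrong?",
        "Is there a better way to do things than the way we want to do things?",
    ],
}


def render_feedback_form(questions):
    """Render each topic and its follow-ups as text with a blank answer slot."""
    lines = []
    for topic, follow_ups in questions.items():
        lines.append(topic)
        for question in follow_ups:
            lines.append(f"  - {question}")
            lines.append("    Answer:")
    return "\n".join(lines)


form = render_feedback_form(RETRO_QUESTIONS)
print(form)
```

The printed form can be shared with each responder so that everyone's feedback is captured in the same shape.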
By properly reviewing the outcomes of fire drills, just like with incidents, you can ensure your team has learned from the drill and will make improvements both for next time and for actual, real incidents.
We hope this guide has been helpful to you. As always, if you have questions, you may reach out to our support team and/or your account team, if you are working with one. Happy firefighting!