A better way: 3 incident response areas prime for automation
By automating some rote parts of incident response, you reduce decision fatigue and help responders get to solving the problem faster with less stress. In this post, we talk about three areas of the incident response process that are prime for automation.
By Robert Ross on 1/3/2023
Although the circumstances of an incident can vary widely, from a catastrophic outage to a minor bug, incidents generally follow a similar path and have similar response needs.
For example, maybe your process calls for starting a Slack channel no matter the severity of the incident, or notifying a select group of stakeholders when certain criteria are met. These are the rote tasks that need to get done during an incident while you’re also trying to mitigate it. And therein lies the trouble.
The more decisions we have to make or tasks we have to remember to do while we’re actually responding to an incident, the more likely we are to make a mistake or forget something. Or in more direct terms, “mo’ decisions, mo’ problems.”
It might come as no surprise that at FireHydrant, we’re in favor of automating as many of those rote response tasks as possible. This takes some effort up front because you need to define rules for what happens when, but the payoff lies in reducing decision fatigue for your responders and helping them get to solving the problem faster (and with less stress).
Let’s talk about three areas of the incident response process that are prime for automation.
Incident assembly: Easily kick things off
Assembly is, in its simplest form, the process of making a place for people to communicate and then bringing the right people to that place.
For some organizations, that means creating a dedicated Slack channel and a Zoom bridge; for others, it means posting a message in a broader channel and creating a Jira ticket. For some, it involves notifying the owners of the affected service(s) while for others it means rallying the dedicated incident response team. Whatever the process is, it usually involves tasks like triggering, creating, paging, bat-signaling, calling, and messaging — and many of those tasks need to happen simultaneously.
At a minimum, your documented incident response process should provide a checklist of the tasks required, but this is a great example of an area where you can go a little further and automate these tasks. For example, set up a trigger in Slack to post that checklist every time someone mentions a new incident, or adopt a tool that completes those pre-configured tasks automatically.
By leveraging automation, you cut seconds or minutes off the time needed to kick off your response efforts and decrease the likelihood of forgetting a step. But what’s even better is that you put responders in a position to more quickly jump into problem solving. So you’re not only providing a consistent workflow and minimizing errors, you’re also using your responders’ skills in the most effective way possible — to fix the problem.
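If you are building this yourself, the "define rules for what happens when" idea can be sketched in a few lines of Python. Everything here is an assumption for illustration: the severity names, the task list, and the stand-in actions that only return a description instead of calling Slack, Zoom, Jira, or a paging tool. The tasks run in a thread pool because, as noted above, many of them need to happen at the same time.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable

# Hypothetical severity scale: a lower rank means a more severe incident.
SEVERITY_RANK = {"SEV1": 1, "SEV2": 2, "SEV3": 3}


@dataclass(frozen=True)
class Incident:
    name: str
    severity: str


@dataclass(frozen=True)
class AssemblyTask:
    label: str
    applies: Callable[[Incident], bool]  # rule: should this task run?
    run: Callable[[Incident], str]       # action: what the task does


def at_least(severity: str) -> Callable[[Incident], bool]:
    """Rule that matches incidents at this severity or worse."""
    threshold = SEVERITY_RANK[severity]
    return lambda incident: SEVERITY_RANK[incident.severity] <= threshold


def describe(action: str) -> Callable[[Incident], str]:
    """Stand-in action; a real one would call the relevant service."""
    return lambda incident: f"{action} for {incident.name}"


# Illustrative rules only; your own process defines the real list.
ASSEMBLY_TASKS = [
    AssemblyTask("slack_channel", lambda i: True, describe("create Slack channel")),
    AssemblyTask("jira_ticket", lambda i: True, describe("create Jira ticket")),
    AssemblyTask("zoom_bridge", at_least("SEV2"), describe("open Zoom bridge")),
    AssemblyTask("page_owners", at_least("SEV1"), describe("page service owners")),
]


def assemble(incident: Incident, tasks=ASSEMBLY_TASKS) -> dict:
    """Run every matching task concurrently and collect the results."""
    selected = [t for t in tasks if t.applies(incident)]
    with ThreadPoolExecutor(max_workers=max(1, len(selected))) as pool:
        futures = {t.label: pool.submit(t.run, incident) for t in selected}
        return {label: future.result() for label, future in futures.items()}
```

Calling `assemble(Incident("checkout-outage", "SEV3"))` runs only the two tasks that apply at every severity, while a SEV1 incident triggers all four. The responder declares the incident once and the rules decide the rest.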
Incident communication: Proactively update everyone
We know that the higher the stakes of an incident, the more people want to know what’s going on. Setting expectations for strong, proactive communication efforts internally and externally ultimately builds trust during an incident. And that trust goes a long way in buying grace during mitigation efforts. But there’s definitely a balance that needs to be struck between keeping everyone in the loop and focusing on mitigation.
For example, maybe your response process requires you to post updates in a certain Slack channel every 30 minutes, and also to send them to the executive stakeholder email list (because they’re not in that Slack channel) and to your public status page every hour. That’s a lot of boxes to check while you’re also actively working with the team to mitigate the incident.
One way to automate this process is to set reminders in your Slack channel that prompt you to post updates at a set interval. But an even more seamless way would be to identify what communication tasks need to occur based on severity levels and then set triggers in your incident response tool for those. This helps ensure the proactive communication patterns you want to be known for during an incident but takes the task of remembering to communicate them (and to whom and how) off your plate.
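As a rough sketch of what severity-based communication triggers might look like, the Python below maps each channel to an update interval and the severities it applies to, then reports which updates are due. The channel names, intervals, and severity labels are assumptions chosen to echo the example above, not a recommendation or any particular tool's configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpdateRule:
    channel: str
    every_minutes: int
    severities: frozenset


# Hypothetical policy: who hears about what, and how often.
RULES = [
    UpdateRule("slack", 30, frozenset({"SEV1", "SEV2", "SEV3"})),
    UpdateRule("exec_email", 60, frozenset({"SEV1", "SEV2"})),
    UpdateRule("status_page", 60, frozenset({"SEV1"})),
]


def updates_due(severity: str, last_sent: dict, now_minute: int, rules=RULES) -> list:
    """Return the channels that are due for an update.

    Times are minutes since the incident started. `last_sent` maps a
    channel to the minute of its most recent update; a channel that has
    never been updated is treated as last updated at minute 0.
    """
    due = []
    for rule in rules:
        if severity not in rule.severities:
            continue  # this channel is not part of the plan at this severity
        if now_minute - last_sent.get(rule.channel, 0) >= rule.every_minutes:
            due.append(rule.channel)
    return due
```

A scheduler or bot could call `updates_due` every few minutes and nudge the incident commander, or send a templated update itself, so nobody has to watch the clock.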
Incident documentation: Never lose a detail
Whether for compliance purposes or simply to have a running log during the incident retrospective, most response processes call for some kind of timeline documentation. But you can’t assume you’ll remember everything clearly after the incident is remediated. Time gets fuzzy and distorted during incidents, and often, things move so fast it can be easy to forget something. For this reason, a chronological storyline of what happened during the incident is a must-have.
We say go a step further and write down anything you think could be relevant. Later, you can go back and review what happened to determine what changes can be made to improve your process or your systems. That context is gold in understanding the lifecycle of an incident.
But once again, this adds a level of overhead to the response process. Someone has to ensure that all of that documentation is happening. And, depending on the impact of the incident, it can be an arduous process and critical insights can be overlooked.
By automating your response process, you can have your tool be your scribe and automatically export everything from the incident Slack channel, along with all machine communication (like alerts, deployments, and rollback events), into a Google Doc for use in the retrospective later. This ultimately equips you with detailed information that makes it easier to not only run the retro, but also learn from it.
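At its core, the scribe idea is a merge of human and machine events into one chronological record. Here is a minimal Python sketch, assuming you have already fetched the chat messages and machine events as timestamped entries; the `TimelineEntry` shape and the source labels are hypothetical, and exporting the result to a document is left out.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEntry:
    at: datetime
    source: str  # e.g. "slack", "alert", "deploy" (illustrative labels)
    text: str


def build_timeline(*streams) -> list:
    """Merge any number of event streams into one chronological list."""
    merged = [entry for stream in streams for entry in stream]
    # sorted() is stable, so entries with equal timestamps keep stream order.
    return sorted(merged, key=lambda entry: entry.at)


def render(entries) -> str:
    """Format the timeline as plain text, one entry per line."""
    return "\n".join(f"{e.at:%H:%M:%S} [{e.source}] {e.text}" for e in entries)
```

Because the merge only depends on timestamps, adding another source later (pager acknowledgements, status page posts) means passing one more stream to `build_timeline`.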
Give it a try
Automation shortens the list of rote tasks any responder has to complete during an incident, it’s true. But automating parts of the response process also makes learning from the incident more accessible to any organization — and that’s where incident management truly pays off and becomes an investment in more reliable software for all. It’s an investment worth making, regardless of the path you choose: building your own tool, integrating IFTTT actions, starting simple with Slack reminders, or trying out a tool for response automation like ours.