Assembly time is where you have the most control of an incident
Although we can’t control how long it might take to mitigate an incident, we can exercise a great deal of control over how quickly we get to the scene of the problem and how prepared we are when we arrive. We call that phase of the incident lifecycle “assembly time.”
By Robert Ross on 5/4/2023
The FDNY EMS Command responds to more than 4,000 calls per day. They range from car accidents to building fires to cats stuck in trees, and responses vary accordingly. Sometimes they might take hours, sometimes they take just a few minutes. With such unpredictable conditions, the FDNY focuses on improving what they call “response time.” That’s the amount of time between a 911 call being made and emergency responders arriving on the scene.
This might sound familiar. As incident responders, we also face a variety of incidents, ranging from an issue with delayed forgot-my-password emails to SSL certificates expiring on production. Luckily, lives aren’t on the line with our incidents, but often a lot is.
SLA breach paybacks, lost revenue, employee attrition, reputation hits: the list of costs associated with incidents goes on. In fact, the Uptime Institute’s 2022 Outage Analysis said that 60% of organizations reported failures resulting in at least $100,000 in losses in 2022. Clearly, anything we can do to decrease the costs of incidents (to ourselves and our companies) is valuable.
And just like the FDNY, although we can’t control how long it might take to fix a problem, we can exercise a great deal of control over how quickly we get to the scene of the problem and how prepared we are when we arrive. We call that phase of the incident lifecycle “assembly time.”
In this blog post, I’ll define assembly time and explain how streamlining incident response processes can improve it.
What goes into incident assembly time?
Assembly time is the phase of the incident response process that begins once an incident’s been declared (aka as soon as your team knows the incident exists) and ends when the right people arrive in the right place with the right info to start solving the problem.
One way to think about what goes into assembly time is: What are the tasks that need to be done before we can start solving the problem? Generally speaking, the incident manager or incident commander — often the one who either discovered the incident or was on call when the incident began — is responsible for completing those tasks.
What’s included in assembly time will vary based on your company’s incident response process, but here’s a rundown of what it might look like.
After the incident manager identifies the affected service(s), they’ll generally want to bring in the related subject matter experts. In a “you build it, you run it” organization, this could mean the person who maintains your OAuth API or owns the product functionality for logging in.
Depending on the impact of the incident, “the right people” might also include a representative from customer support (for external issues), as well as stakeholders from engineering, marketing, legal, etc. (If you haven’t already defined roles for your incidents, now is a great time to start.)
Of course, all these folks need a place to work together. For many organizations, that’s a Slack channel, often paired with a meeting bridge such as a Zoom room.
But not everyone who needs to be in the loop also needs to be in that Slack channel (in fact, putting them all there is probably a recipe for disaster), and of course, you’ll want to track the incident and follow-up work through Jira. So assembly time might also mean creating tickets and kicking off communication through internal or external status pages or other channels.
After all of that’s done, you should have the people who know the most about the problem area in a dedicated space ready to investigate, and, ultimately, solve the problem. Assembly time over.
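If you record when an incident was declared and when the right people were in place, assembly time becomes a number you can track from one incident to the next. Here is a minimal sketch of that arithmetic in Python. The class name, field names, and timestamps are made up for illustration and are not taken from any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimeline:
    """Hypothetical record of the key moments in one incident."""
    declared_at: datetime              # the team knows the incident exists
    responders_assembled_at: datetime  # right people, right place, right info
    mitigated_at: datetime             # the problem is under control

    def assembly_time(self) -> timedelta:
        # The phase this post is about: declaration until responders are ready.
        return self.responders_assembled_at - self.declared_at

    def mitigation_time(self) -> timedelta:
        # Everything after assembly, which is much harder to control.
        return self.mitigated_at - self.responders_assembled_at


# Illustrative timestamps only.
timeline = IncidentTimeline(
    declared_at=datetime(2023, 5, 4, 14, 0),
    responders_assembled_at=datetime(2023, 5, 4, 14, 12),
    mitigated_at=datetime(2023, 5, 4, 15, 30),
)

print("Assembly time:", timeline.assembly_time())
print("Mitigation time:", timeline.mitigation_time())
```

Keeping the two durations separate makes it easier to see whether a slow response came from getting people together or from the fix itself.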
How do you improve incident assembly time?
Let’s go back to that FDNY example and think about a few things they do that help them respond faster:
They have everything in place and ready to go. There’s no wondering where the keys to the firetruck are because they have a protocol for where things live and how they should work.
The call is automatically routed to the closest responder. Firefighters know their regions. They know the fastest way to go, they know the buildings. They have areas of specialization.
They practice. You can picture it in your head: the bell goes off, the firefighters slide down the pole, throw their gear on, get in the truck and they’re off. Everyone knows their role and what to do — before it’s time to do it for real.
We can learn from this. And it doesn’t have to be intimidating. In incident management, I think there’s this pressure that folks sometimes feel to go from zero to everything immediately. It doesn’t work that way. Start with the basics of each best practice now, then grow them as you mature your program.
Rally around services
For example, start with defining services and their owners, then documenting that information. This can be done at a higher level (or for monoliths) by breaking your product up into functional areas and owners. Then go further by automating this service catalog, so when an incident involving a certain product area is kicked off, the engineer responsible for that area is automatically pulled in.
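As a rough illustration of that progression, a first service catalog can be as small as a lookup table from service to owner, plus a function that turns a list of affected services into a list of people to pull in. The service names, owners, and the `responders_for` helper below are all hypothetical. This is a sketch of the idea, not how any specific product implements it.

```python
# Hypothetical catalog: each service maps to an owner and a functional area.
SERVICE_CATALOG = {
    "oauth-api": {"owner": "alice", "area": "authentication"},
    "login-ui": {"owner": "bob", "area": "authentication"},
    "password-reset-email": {"owner": "carol", "area": "notifications"},
}


def responders_for(affected_services):
    """Return (owners to pull in, services missing from the catalog)."""
    responders = []
    unknown = []
    for service in affected_services:
        entry = SERVICE_CATALOG.get(service)
        if entry is None:
            # Surface gaps instead of silently skipping them.
            unknown.append(service)
        elif entry["owner"] not in responders:
            responders.append(entry["owner"])
    return responders, unknown


owners, missing = responders_for(["oauth-api", "login-ui", "billing"])
print("Pull in:", owners)
print("No owner on record for:", missing)
```

Returning the unknown services alongside the owners is a deliberate choice: a gap in the catalog is exactly the kind of thing you want to find out about before the next incident.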
Build in communication processes
Start to streamline communications by creating a solid plan for how and where you’ll work on the incident together. For example, maybe you have a general incidents channel in Slack, and anytime there’s any incident, that’s where everyone goes.
As you mature, think about using an individual Slack channel for each incident to serve as a record of the incident that you can look back on later for learning, retros, or reporting. You might also consider adding internal or external status pages to keep stakeholders posted.
This is another thing you can automate, by the way. For example, when we kick off an incident at FireHydrant, we have our runbooks configured to automatically create a Slack channel and a Zoom bridge for the incident, as well as notify a company-wide incidents channel that we’re all a default member of.
Create a single source of truth
You might notice all of these recommendations start with essentially declaring a process. You have to make it dead simple for your response team to get the information, access, and tools they need to start working on the incident, and that starts with creating a source of truth for them to follow.
To start, that might mean simply outlining a list of tasks that must be completed for each incident. Once you start to level up though, think about streamlining that information and even creating spinoff processes for different services or severities. Ultimately, you can think about automating all of this as well so that all you need to do is declare an incident to trigger a waterfall effect of next steps.
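To make that concrete, here is a small sketch of a task list that every incident shares, with spinoff steps added by severity, all produced by a single "declare" call. The step wording, the severity labels, and the `declare_incident` function are assumptions for illustration. The function only builds the plan. Actually creating channels, bridges, or tickets would depend on whatever tooling you use.

```python
# Steps every incident gets, regardless of severity (illustrative wording).
BASE_STEPS = [
    "Create a dedicated incident channel",
    "Open a meeting bridge",
    "Page the owners of the affected services",
    "Create a tracking ticket",
]

# Hypothetical spinoff steps for more severe incidents.
EXTRA_STEPS_BY_SEVERITY = {
    "SEV1": [
        "Notify the company-wide incidents channel",
        "Post to the external status page",
        "Bring in customer support and other stakeholders",
    ],
    "SEV2": [
        "Notify the company-wide incidents channel",
        "Post to the internal status page",
    ],
}


def declare_incident(name, severity):
    """Return the ordered list of next steps for a newly declared incident."""
    steps = list(BASE_STEPS) + EXTRA_STEPS_BY_SEVERITY.get(severity, [])
    return [f"[{name}] {step}" for step in steps]


for step in declare_incident("login-failures", "SEV1"):
    print(step)
```

Starting with plain data like this keeps the process readable by humans first. Once the list is stable, each step is a natural candidate for automation.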
Put it to practice
And then, of course, there’s practicing. No matter how lightweight your process is to start, there’s value in holding team training to ensure a coordinated — and speedy — effort.
When it comes to resolving an incident, solving the technical problem is rarely the hard part. Over my years as a responder, I’ve found that it’s often much more difficult to get the right people in the right place at the right time — but that’s also where these good incident response practices have the most impact. Start moving toward controlling what you can today.