Mean time to assembly (MTTA)

What is mean time to assembly (MTTA)?

Mean time to assembly (MTTA) measures the average time between the moment an incident is declared and the moment mitigation efforts actually begin, that is, when the right people are in the right place with the right information to solve the problem.

This assembly process can include tasks such as:

Creating a Slack channel or Zoom bridge
Notifying stakeholders
Updating a status page
Locating relevant logs, dashboards, or deploys

While each incident differs in technical details and remediation steps, these assembly tasks are largely the same from one incident to the next, which makes them far easier to standardize and control.
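The definition above can be made concrete with a small calculation. The following is a minimal sketch, assuming each incident is recorded as a pair of timestamps: when it was declared and when the response team was assembled. The function name, the data layout, and the sample timestamps are illustrative assumptions, not part of any particular incident management tool.

```python
from datetime import datetime, timedelta


def mean_time_to_assembly(incidents):
    """Return the average of (assembled_at - declared_at) across incidents.

    `incidents` is a list of (declared_at, assembled_at) datetime pairs.
    This layout is an assumption made for the example.
    """
    if not incidents:
        raise ValueError("need at least one incident to compute MTTA")
    total = sum(
        (assembled - declared for declared, assembled in incidents),
        timedelta(),  # start value so the sum works on timedelta objects
    )
    return total / len(incidents)


# Hypothetical sample data: assembly took 12, 8, and 10 minutes.
incidents = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 12)),
    (datetime(2024, 1, 9, 14, 30), datetime(2024, 1, 9, 14, 38)),
    (datetime(2024, 2, 2, 22, 15), datetime(2024, 2, 2, 22, 25)),
]

mtta = mean_time_to_assembly(incidents)
print(mtta)  # 0:10:00
```

In practice, the hard part is agreeing on what counts as "assembled" so that the second timestamp is recorded consistently across incidents.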

Why does MTTA matter?

Often, it’s harder to get the right people in the right place than it is to actually solve the technical issues. Focusing on MTTA means controlling the knowns of an incident so you are better equipped to handle the unknowns.

Best practices for improving MTTA

To improve MTTA, focus on the factors your team can control. Best practices include:

Defining services in a catalog
Building and streamlining communication
Creating a single source of truth, such as a playbook
Practicing your incident response plan with “game days”

Define services

Define services and identify their owners in a service catalog. If you document all of your services and their owners ahead of time, you can quickly bring in the right people if an incident occurs.

Some organizations automate this service catalog. When someone declares an incident involving a specific product area, the catalog pulls in the engineer responsible for that area.
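An automated catalog lookup might look like the sketch below. It assumes the catalog is a simple in-memory mapping from product area to owner. The service names, owner names, channels, and the fallback owner are all hypothetical, and a real catalog would more likely live in a dedicated tool or database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    owner: str    # on-call engineer or rotation (hypothetical values below)
    channel: str  # where the owning team can be reached


# Hypothetical catalog entries, keyed by product area.
CATALOG = {
    "checkout": Service("checkout", "payments-oncall", "#team-payments"),
    "search": Service("search", "discovery-oncall", "#team-discovery"),
}

# Assumed fallback when a product area has no catalog entry.
FALLBACK_OWNER = "incident-commander"


def responders_for(product_areas):
    """Return the owners to pull in, in order, without duplicates."""
    owners = []
    for area in product_areas:
        service = CATALOG.get(area)
        owner = service.owner if service else FALLBACK_OWNER
        if owner not in owners:
            owners.append(owner)
    return owners


print(responders_for(["checkout", "unknown-area"]))
# ['payments-oncall', 'incident-commander']
```

The fallback is a design choice: an incident in an undocumented area still reaches a person, and the gap in the catalog becomes visible.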

Build and streamline communication

Use runbooks (automated incident workflows) to set up the proper communication channels and automatically announce updates in existing ones. These runbooks can automate the following tasks:

Setting up an individual Slack channel for the incident
Sending notifications in a general incidents channel
Updating internal or external status pages
Creating a Zoom bridge for the incident
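One way to picture such a runbook is as an ordered list of steps that run when an incident is declared. The sketch below is a minimal illustration under that assumption. Every step is a stub that only records what it would do; none of them calls a real Slack, Zoom, or status page API, and all names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    slug: str
    summary: str
    log: list = field(default_factory=list)  # record of actions taken


# Each step is a stub; a real one would call the relevant tool's API.
def create_incident_channel(incident):
    incident.log.append(f"created channel #inc-{incident.slug}")


def announce_in_general_channel(incident):
    incident.log.append(f"announced in #incidents: {incident.summary}")


def update_status_page(incident):
    incident.log.append("status page set to 'investigating'")


def create_video_bridge(incident):
    incident.log.append(f"opened video bridge for {incident.slug}")


RUNBOOK = [
    create_incident_channel,
    announce_in_general_channel,
    update_status_page,
    create_video_bridge,
]


def run_runbook(incident, steps=RUNBOOK):
    """Run every step in order; a failing step is logged, not fatal."""
    for step in steps:
        try:
            step(incident)
        except Exception as exc:
            incident.log.append(f"step {step.__name__} failed: {exc}")
    return incident.log


incident = Incident(slug="checkout-errors", summary="Checkout returning 500s")
for entry in run_runbook(incident):
    print(entry)
```

Logging a failed step instead of stopping means one broken integration does not block the rest of the assembly process.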

Define and automate your processes

Make it easy for your response team to find the information, access, and tools they need to start working on the incident. Start by creating a single source of truth for them to follow.

You can start with a playbook: a list of tasks that must be completed at the beginning of every incident. As your incident management program matures, consider streamlining that information and creating tiered response processes for different services or severities.

Some teams automate these kickoff tasks so that declaring an incident triggers a cascade of next steps.
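A tiered playbook of this kind can be sketched as a base checklist plus extra tasks per severity, generated at the moment an incident is declared. The severity labels, task names, and function names below are assumptions chosen for illustration, not a prescribed scheme.

```python
# Tasks every incident starts with (hypothetical examples).
BASE_TASKS = [
    "assign incident lead",
    "open incident channel",
    "link dashboards and recent deploys",
]

# Additional tasks by severity tier (hypothetical labels and tasks).
EXTRA_TASKS_BY_SEVERITY = {
    "sev1": ["page executive on-call", "post to external status page"],
    "sev2": ["notify internal stakeholders"],
    "sev3": [],
}


def kickoff_tasks(severity):
    """Return the full kickoff checklist for a given severity."""
    if severity not in EXTRA_TASKS_BY_SEVERITY:
        raise ValueError(f"unknown severity: {severity!r}")
    return BASE_TASKS + EXTRA_TASKS_BY_SEVERITY[severity]


def declare_incident(title, severity):
    """Declaring an incident immediately produces its checklist."""
    return {
        "title": title,
        "severity": severity,
        "checklist": [
            {"task": task, "done": False} for task in kickoff_tasks(severity)
        ],
    }


incident = declare_incident("Search latency spike", "sev2")
for item in incident["checklist"]:
    print(item["task"])
```

Keeping the base tasks separate from the per-severity extras means the common steps are defined once, and each tier only states what it adds.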

Practice and iterate

Keep practicing your incident response process with “game days.” As you observe what works and what fails during these exercises, refine your process so that it is well rehearsed by the time an actual incident happens.