Remediation

Incident remediation means completely fixing an incident and fully returning a system to its normal state.

What is incident remediation?

Incident remediation focuses on entirely eradicating the issue that caused a system incident. By comparison, mitigation temporarily minimizes the effects of an incident to get the system back online as the team works on a more sustainable fix.

A metric called mean time to resolution (MTTR) gauges the speed of incident remediation: it is the average time the team takes to resolve an incident. Definitions vary by company, but an incident is generally considered “resolved” once the system meets these criteria:

The system is confirmed to be fully operational without relapse.
Responders have removed any temporary fixes that mitigated the issue.
The system behaves as expected.
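
As a rough illustration, MTTR can be computed as the average gap between when an incident is declared and when it meets the resolution criteria above. The sketch below assumes a hypothetical `Incident` record with `declared_at` and `resolved_at` timestamps; real incident trackers name and store these fields differently, and some teams start the clock at detection rather than declaration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record shape, used only for illustration.
    declared_at: datetime
    resolved_at: datetime  # when the resolution criteria were met


def mean_time_to_resolution(incidents: list[Incident]) -> timedelta:
    """Average of (resolved_at - declared_at) across resolved incidents."""
    if not incidents:
        raise ValueError("need at least one resolved incident")
    total = sum((i.resolved_at - i.declared_at for i in incidents), timedelta())
    return total / len(incidents)


incidents = [
    Incident(datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 10, 0)),   # 60 minutes
    Incident(datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 16, 0)),  # 120 minutes
]
mttr = mean_time_to_resolution(incidents)
print(mttr)  # 1:30:00
```

Because this is a plain average, a single long-running incident can dominate the result, so it can help to review individual incident durations alongside the mean.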

Consistent incident remediation requires a proactive incident management strategy. Because remediation focuses on eradicating the underlying issue and takes substantial behind-the-scenes work, an organization must plan and prepare before an incident happens.

Why is remediation important?

Remediation fully resolves an incident's open issue, enabling teams to move forward without facing additional problems caused by the same issue. A robust incident management plan is the backbone of successful remediation, empowering teams to work proactively rather than reactively.

Best practices for incident remediation

To prepare your team to remediate an incident successfully, you can start with these three best practices:

Set up a consistent process

Your team should have a pre-established process to follow as soon as an incident gets declared. This plan must include the right people, informed by the correct alerts and channels. To create this initial communication plan, consider the following questions:

What will you alert on: service level objective (SLO) violations only or every service incident?
Who gets alerted and when? These factors will depend on your company's size and the incident's severity.
How do you communicate with the right stakeholders? This process includes assembling the right people in the right place to start solving the problem. It also involves updating the appropriate internal- and customer-facing information channels (such as status pages).
What’s the process for each involved team? The workflow for the engineering team will look very different from that of customer support.
What are the most likely scenarios? While the unexpected will inevitably happen at some point, it helps to model out the most likely scenarios with a timeline and action items.
Will you automate any of these process steps? Many organizations use runbooks to rapidly complete the rote aspects of an incident management process, such as setting up a Slack channel for all involved parties or publishing regular updates to a status page. This level of automation empowers teams to focus on critical problem-solving rather than using their time and resources to complete menial tasks.
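
One way to capture the answers to these questions is a small routing table that a runbook could consult when an alert arrives. The sketch below is illustrative only: the severity labels, role names, channel naming scheme, and the `plan_response` helper are all assumptions, not part of any particular incident management tool.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical routing table: severity -> who gets paged and which channels get updates.
ROUTING = {
    "sev1": {"page": ["on-call engineer", "incident commander"],
             "update": ["status page", "internal channel"]},
    "sev2": {"page": ["on-call engineer"], "update": ["internal channel"]},
    "sev3": {"page": [], "update": ["internal channel"]},
}


@dataclass
class Alert:
    service: str
    severity: str        # "sev1" (most severe) through "sev3"
    slo_violation: bool  # True if the alert reflects an SLO violation


def plan_response(alert: Alert, slo_only: bool = True) -> Optional[dict]:
    """Return who to page and where to post updates, or None if no alert is raised."""
    if slo_only and not alert.slo_violation:
        return None  # policy choice: alert on SLO violations only
    route = ROUTING[alert.severity]
    return {
        "channel": f"#inc-{alert.service}",  # assumed channel naming scheme
        "page": list(route["page"]),
        "update": list(route["update"]),
    }


plan = plan_response(Alert("checkout", "sev1", slo_violation=True))
ignored = plan_response(Alert("checkout", "sev3", slo_violation=False))
```

Here `plan` pages both roles and updates the status page, while `ignored` is `None` because the default policy alerts on SLO violations only. Passing `slo_only=False` switches to alerting on every service incident.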

Declare ownership around product areas

Before an incident even happens, create a service catalog that shows the dependencies, owners, and rollback plans for every service. Most importantly, every product should have an engineering team assigned as the “first responders” if an incident happens. This clear delineation of ownership enables your teams to jump into action right at incident declaration, leading to faster remediation.
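
A service catalog can start as a simple structured record per service. The following is a minimal sketch, assuming a hypothetical in-memory `CATALOG` with made-up service names, team names, and rollback notes; in practice this data usually lives in a dedicated catalog tool or a versioned configuration file.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceEntry:
    # Hypothetical catalog fields mirroring the text: owner, rollback plan, dependencies.
    name: str
    owner_team: str
    rollback_plan: str
    dependencies: list[str] = field(default_factory=list)


CATALOG = {
    "checkout": ServiceEntry("checkout", "payments-team",
                             "redeploy previous release", ["auth", "inventory"]),
    "auth": ServiceEntry("auth", "identity-team", "revert config change"),
    "inventory": ServiceEntry("inventory", "fulfillment-team",
                              "redeploy previous release"),
}


def first_responders(catalog: dict[str, ServiceEntry], service: str) -> str:
    """The owning team is the first responder for its service."""
    return catalog[service].owner_team


def teams_to_notify(catalog: dict[str, ServiceEntry], service: str) -> list[str]:
    """Owner first, then owners of direct dependencies, without duplicates."""
    teams = [catalog[service].owner_team]
    for dep in catalog[service].dependencies:
        owner = catalog[dep].owner_team
        if owner not in teams:
            teams.append(owner)
    return teams


responders = first_responders(CATALOG, "checkout")
notify = teams_to_notify(CATALOG, "checkout")
```

With this layout, declaring an incident on `checkout` immediately identifies `payments-team` as first responders and the owners of `auth` and `inventory` as teams to keep informed.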

Conduct retrospectives

Even after your team has resolved an incident and is breathing a sigh of relief, you haven’t finished the whole incident management process yet. Your team must conduct a retrospective after an incident gets wholly resolved. Retrospectives give responders a space to process their experiences, deeply understand the incident’s causes, refine the response process itself, and improve the whole customer experience.