Outage

What is an outage?

When an outage occurs, parts of a business system go offline or degrade to the point that they are unusable. Through incident response, engineers, developers, and other stakeholders work quickly to mitigate the issue that caused the outage.

Outages happen for a variety of reasons. A few common causes include:

Spikes in web traffic that the application cannot handle
Human error, such as misconfiguring servers or failing to stay up to date on patches
Software or hardware failures
Circumstances outside the organization’s control, such as natural disasters

How do outages affect organizations?

An outage can be costly for an organization, so resolving it quickly is critical. Uptime Institute found that about 60% of outages result in at least $100,000 in total losses, and 15% cost more than $1 million.

In addition, outages erode customer trust. In some cases, they threaten the organization’s ability to meet a service-level agreement (SLA) or service-level commitment (SLC): a formal statement establishing a standard of application performance the organization must meet. Failing to meet an SLA or SLC can carry significant consequences, such as penalties or even early contract termination if failures keep occurring.
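
To make the stakes of an SLA concrete, the sketch below converts an availability target into a downtime budget. The targets and the 30-day window are illustrative assumptions, not figures from any particular contract, and the function names are hypothetical.

```python
from datetime import timedelta


def downtime_budget(availability_pct: float, period: timedelta) -> timedelta:
    """Return the most downtime a period can absorb while still meeting the target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period * (1 - availability_pct / 100)


def sla_breached(availability_pct: float, period: timedelta, downtime: timedelta) -> bool:
    """Return True if the observed downtime exceeds the budget for the period."""
    return downtime > downtime_budget(availability_pct, period)


# Hypothetical availability targets over a 30-day window (illustrative only).
window = timedelta(days=30)
for target in (99.0, 99.9, 99.99):
    budget = downtime_budget(target, window)
    print(f"{target}% -> {budget.total_seconds() / 60:.1f} minutes of downtime allowed")
```

Under these assumptions, a 99.9% target leaves about 43 minutes of downtime per 30 days, so a single hour-long outage would already exceed it.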

What are some best practices for fixing an outage?

To ensure that outages get fixed as quickly as possible, teams should build an incident response plan ahead of time. The most effective incident response plans include the following best practices:

Assign service ownership

Service ownership means that every service has a team assigned as its owner. In many cases, this team is the same group that originally built the service and was responsible for its development and release. When an incident involves this service, the assigned team must lead the effort to fix it.

Many organizations use a service catalog to outline essential information, such as what each large product surface looks like, which services power that surface area, and which team owns each service in that graph. This way, everyone knows who owns which service. Our Incident Benchmark Report found a 36% decrease in mean time to remediation (MTTR) when a service catalog is used.
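
As a rough illustration of the idea, a service catalog can be modeled as a small graph of services, each tagged with its product surface and owning team. The sketch below is a minimal in-memory version; the class names, service names, and team names are all hypothetical, and a real catalog would typically live in a dedicated tool rather than in code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    owner_team: str
    product_surface: str
    depends_on: tuple[str, ...] = ()


class ServiceCatalog:
    def __init__(self, services):
        self._by_name = {s.name: s for s in services}

    def owner_of(self, service_name: str) -> str:
        """Return the team that owns the given service."""
        return self._by_name[service_name].owner_team

    def services_for_surface(self, surface: str) -> list[str]:
        """Return the names of services that power a product surface, sorted."""
        return sorted(s.name for s in self._by_name.values() if s.product_surface == surface)

    def owners_in_graph(self, service_name: str) -> list[str]:
        """Return the owner of a service plus the owners of its transitive dependencies."""
        seen, stack, teams = set(), [service_name], []
        while stack:
            name = stack.pop()
            if name in seen:
                continue
            seen.add(name)
            svc = self._by_name[name]
            if svc.owner_team not in teams:
                teams.append(svc.owner_team)
            stack.extend(svc.depends_on)
        return teams


# Hypothetical catalog entries for illustration.
catalog = ServiceCatalog([
    Service("checkout-api", "team-checkout", "checkout", ("payments",)),
    Service("payments", "team-payments", "checkout", ("ledger-db",)),
    Service("ledger-db", "team-data", "internal"),
])
print(catalog.owner_of("payments"))
print(catalog.owners_in_graph("checkout-api"))
```

With a structure like this, a responder who sees errors in one service can look up its owner, and the owners of the services it depends on, without asking around.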

Clarify incident roles

Even when every service has a designated owner, an outage often involves multiple services and teams. For that reason, it’s essential to assign a specific Incident Commander role and supporting roles such as a mitigator and a planner. Clarify each party’s responsibilities before an outage happens.
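
One lightweight way to keep roles explicit is to treat them as a fixed set and check that each one has a responder. The sketch below assumes only the three roles named above; the function name and the responder names are hypothetical.

```python
from enum import Enum


class Role(Enum):
    INCIDENT_COMMANDER = "incident commander"
    MITIGATOR = "mitigator"
    PLANNER = "planner"


def unfilled_roles(assignments: dict[Role, str]) -> list[Role]:
    """Return the roles that have no responder assigned, in definition order."""
    return [role for role in Role if not assignments.get(role)]


# Hypothetical responders for illustration.
assignments = {Role.INCIDENT_COMMANDER: "alice", Role.MITIGATOR: "bob"}
missing = unfilled_roles(assignments)
print([role.value for role in missing])  # prints ['planner']
```

A check like this could run when an incident is opened, so that a missing role is noticed at the start rather than partway through the response.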

Formalize an incident declaration process

Establishing, formalizing, and documenting an incident declaration process to follow when an outage occurs is essential. This process will serve as a way for anyone on your team to “dial 911” when an incident happens. Your team will need to make a few decisions about this process. Ask questions like:

Who can declare an incident? (In most cases, the answer should be anyone in your organization.) How do we set up the infrastructure to allow them to do so?
Do we use automation to kick off an incident, manually complete all steps, or use a mix of both (human-in-the-loop)?
What type of incident process do we use? A single process for all incidents? Or a defined process for every severity and product area? Or another approach that’s in between these two options? This decision will mainly depend on the complexity of your system.
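
The questions above can be sketched as a single entry point that anyone can call, with one default process plus extra steps for the highest severity (the "in between" option). Everything here is an assumption made for illustration: the severity labels, the step names, and the `declare_incident` function are hypothetical, not a prescribed process.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from itertools import count

# Hypothetical process definition: one default checklist, plus extra steps
# for the highest severity. Step names are illustrative only.
DEFAULT_STEPS = ["open incident channel", "assign incident commander", "post status update"]
EXTRA_STEPS = {"sev1": ["page service owners", "notify executives"]}
KNOWN_SEVERITIES = {"sev1", "sev2", "sev3"}

_ids = count(1)


@dataclass
class Incident:
    id: int
    title: str
    severity: str
    declared_by: str
    declared_at: datetime
    steps: list[str] = field(default_factory=list)


def declare_incident(title: str, severity: str, declared_by: str) -> Incident:
    """Create an incident record; there is no role check, so anyone may declare."""
    severity = severity.lower()
    if severity not in KNOWN_SEVERITIES:
        raise ValueError(f"unknown severity: {severity!r}")
    # Build a fresh list so the shared defaults are never mutated.
    steps = DEFAULT_STEPS + EXTRA_STEPS.get(severity, [])
    return Incident(next(_ids), title, severity, declared_by,
                    datetime.now(timezone.utc), steps)


incident = declare_incident("Checkout errors spiking", "SEV1", "carol")
for step in incident.steps:
    print(f"[{incident.severity}] {step}")
```

In a human-in-the-loop setup, automation might create the record and the checklist as shown, while a person confirms the severity and works through the steps.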

Run incident retrospectives

Also consider running a retrospective after each incident. A retrospective gives your team the space to discuss what went well, what didn’t, and how to improve the overall process. By conducting incident retrospectives, you empower your organization to learn from each outage.