Talking incidents

Definitions for incident responders

Alert: An alert is a notification sent to a predetermined person or group when certain criteria are met. For example, an alert might be triggered by a certain number of unsuccessful login attempts and sent to the on-call engineer at that time.
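
The failed-login example can be sketched as a sliding-window counter. This is only an illustration: the class name, the threshold of 3, and the 60-second window are assumptions, not values from any particular alerting tool.

```python
from collections import deque


class FailedLoginAlert:
    """Hypothetical alert rule: fire when too many failed logins
    occur within a sliding time window."""

    def __init__(self, threshold=3, window_seconds=60):
        self.threshold = threshold            # assumed value, tune per system
        self.window_seconds = window_seconds  # assumed value, tune per system
        self._failures = deque()              # timestamps of recent failures

    def record_failure(self, timestamp):
        """Record one failed login (timestamp in seconds).

        Returns True when the alert criteria are met."""
        self._failures.append(timestamp)
        # Drop failures that have aged out of the window.
        while timestamp - self._failures[0] > self.window_seconds:
            self._failures.popleft()
        return len(self._failures) >= self.threshold


rule = FailedLoginAlert(threshold=3, window_seconds=60)
fired = [rule.record_failure(t) for t in (0, 10, 20)]
```

In a real setup, a `True` result would be routed to whoever is on call at that time rather than collected in a list.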

Alert fatigue: Alert fatigue is a type of burnout that occurs when on-call engineers are paged to deal with incidents too often. In severe cases, teams become desensitized to alerts and, as a result, may fail to respond to incidents properly.

Cycle time: Cycle time is a performance metric measuring the time it takes from one incident milestone to the next. Some companies track cycle time to evaluate the health of their incident response efforts.
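
As a rough sketch, cycle time can be computed from a chronological list of milestone timestamps. The milestone names used here ("detected", "acknowledged", "resolved") are illustrative assumptions; each organization defines its own.

```python
from datetime import datetime, timedelta


def cycle_times(milestones):
    """Return the elapsed time between each pair of consecutive milestones.

    `milestones` is a list of (name, datetime) tuples in chronological order.
    """
    return {
        f"{a_name} -> {b_name}": b_time - a_time
        for (a_name, a_time), (b_name, b_time) in zip(milestones, milestones[1:])
    }


# Hypothetical incident timeline.
timeline = [
    ("detected", datetime(2024, 1, 1, 10, 0)),
    ("acknowledged", datetime(2024, 1, 1, 10, 5)),
    ("resolved", datetime(2024, 1, 1, 11, 0)),
]
durations = cycle_times(timeline)
```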

Declaration: An incident declaration is when someone at an organization officially documents a system degradation and kicks off the company's incident response process.

Impact matrix: An impact matrix, or impact effort matrix, is a tool for measuring the potential business impact and the amount of effort required to complete a task. Organizations use impact matrices to prioritize assignments.

Incident: Companies define incidents in varying ways, but in general, an incident is any unexpected degradation or interruption of functionality in a company’s product, systems, or website that is significant enough to be noticeable to customers, internal stakeholders, or both.

Incident commander: An incident commander is accountable for resolving an incident from beginning to end. They also lead throughout the incident, directing colleagues to conduct mitigation and retrospective activities.

Incident management: Incident management is a proactive framework and strategy for anticipating, handling, containing, and preventing incidents.

Incident resolution: Incident resolution is the final step in the incident lifecycle. In this phase, customers no longer feel the effects of the incident, and engineers have implemented a long-term, sustainable solution.

Incident responder: Incident responders are highly skilled individuals who harness their expertise to respond to and resolve incidents swiftly. Incident responders are instrumental in coordinating response efforts, facilitating communication, and implementing effective remediation strategies.

Mean time to assembly (MTTA): Mean time to assembly (MTTA) is a performance metric that represents the average time between an incident alert and the start of the response to that incident.

Mean time to resolve (MTTR): Mean time to resolve (MTTR) is a performance metric that represents the average time between when an incident is detected and when the problem is completely resolved, meaning there’s a long-term fix in place.
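
Both MTTA and MTTR are averages over a set of incidents, so a minimal sketch needs only start and end timestamps per incident. The record fields below (`alerted_at`, `response_started_at`, `detected_at`, `resolved_at`) are hypothetical names chosen for illustration.

```python
from datetime import datetime, timedelta


def mean_duration(pairs):
    """Average the elapsed time over (start, end) datetime pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one incident to compute a mean")
    total = sum((end - start for start, end in pairs), timedelta())
    return total / len(pairs)


def mtta(incidents):
    # Alert -> start of response, averaged over all incidents.
    return mean_duration((i["alerted_at"], i["response_started_at"]) for i in incidents)


def mttr(incidents):
    # Detection -> long-term fix in place, averaged over all incidents.
    return mean_duration((i["detected_at"], i["resolved_at"]) for i in incidents)


# Two hypothetical incidents.
incidents = [
    {
        "alerted_at": datetime(2024, 1, 1, 10, 0),
        "response_started_at": datetime(2024, 1, 1, 10, 4),
        "detected_at": datetime(2024, 1, 1, 10, 0),
        "resolved_at": datetime(2024, 1, 1, 11, 0),
    },
    {
        "alerted_at": datetime(2024, 1, 2, 12, 0),
        "response_started_at": datetime(2024, 1, 2, 12, 10),
        "detected_at": datetime(2024, 1, 2, 12, 0),
        "resolved_at": datetime(2024, 1, 2, 12, 30),
    },
]
```

With this sample data, the response delays of 4 and 10 minutes average to 7 minutes, and the resolution times of 60 and 30 minutes average to 45 minutes.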

Milestone: A milestone is a significant event or stage within an incident response process. Milestones describe the active state of an incident and communicate to stakeholders the team's progress in resolving the issue.

Mitigation: Incident mitigation means taking temporary corrective measures to minimize the impact of an issue as the responders continue to work on a more permanent fix.

Observability: Observability is the process of understanding the state of your system by analyzing its telemetry data.

On call: Being on call means an employee is designated to work whenever an employer calls on them. In the DevOps world, on-call refers to the practice of designating specific engineers to be available in the case of an outage or major incident. Teams usually take turns being the on-call staff member so that someone is always available 24/7.

Outage: A system outage means that applications and other functions are temporarily unavailable or degraded, likely impacting customers and slowing down business activities.

Remediation: Incident remediation means completely fixing an incident and fully returning a system to its normal state.

Retrospective: Holding a retrospective is the final step in your incident management plan. Conducting a retrospective means analyzing and reviewing an incident to identify root causes and areas for improvement to prevent similar incidents from happening in the future.

Runbook: A runbook automates rote tasks that must kick off when an incident occurs. It improves assembly time by executing crucial actions for addressing the incident.

Service catalog: A service catalog is a list of all services or product functionalities and their respective owners or subject matter experts. More advanced service catalogs also include things like rollback plans and documentation for each service. The aim of a service catalog is to help incident response teams quickly bring in the owner or expert related to an impacted service or functionality during an incident.
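
One way to picture a service catalog is as a lookup table from service name to owner, with optional extras such as a rollback plan. The sketch below is a simplified assumption of that shape; the service names, team names, and field names are all made up.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    owner: str               # team or subject matter expert to bring in
    rollback_plan: str = ""  # optional, found in more advanced catalogs


# Hypothetical catalog entries.
CATALOG = {
    service.name: service
    for service in [
        Service("checkout", owner="payments-team", rollback_plan="redeploy previous release"),
        Service("search", owner="discovery-team"),
    ]
}


def find_owner(service_name):
    """Return the owner to bring in for an impacted service, or None if unlisted."""
    service = CATALOG.get(service_name)
    return service.owner if service else None
```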

Severity: Most organizations define severity using a level system, a matrix, or sometimes both. Severity levels are predefined categories used to classify and prioritize incidents based on their potential impact.
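
A level system and a matrix can work together, with the matrix mapping incident attributes to a level. The sketch below is an assumption for illustration only: the three levels, the two axes (customer impact and scope), and the cell values are invented, and real organizations choose their own.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Lower number means more severe, a common but not universal convention.
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


# Hypothetical matrix keyed by (customer impact, scope).
SEVERITY_MATRIX = {
    ("high", "widespread"): Severity.SEV1,
    ("high", "limited"): Severity.SEV2,
    ("low", "widespread"): Severity.SEV2,
    ("low", "limited"): Severity.SEV3,
}


def classify(impact, scope):
    """Look up the severity level for an incident's impact and scope."""
    return SEVERITY_MATRIX[(impact, scope)]


# Sorting by severity puts the most urgent incident first.
queue = sorted([classify("low", "limited"), classify("high", "widespread")])
```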

Site Reliability Engineering (SRE): Site reliability engineering (SRE) uses software tools to automate IT infrastructure tasks. Ideal SRE tasks include system management and application monitoring. Engineers tasked with reliability split their time between operational work and on-call duties. Their responsibilities may include implementing automation, creating new features, or scaling a system to increase site reliability and performance.

Status page: A status page informs external and internal stakeholders about an incident. It’s a quick and simple way to communicate outages or scheduled maintenance to the right audiences.
