Talking incidents

Definitions for incident responders

Alert

An alert is a notification sent to a predetermined person or group when certain criteria are met. For example, an alert might be triggered by a certain number of unsuccessful login attempts and sent to the on-call engineer at that time.
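As a rough illustration of the failed-login example above, here is a minimal sketch of a threshold-based alert. The class name, the `threshold` and `window` parameters, and the `notify` callback are all hypothetical choices made for this sketch; real alerting tools define their own criteria and routing.

```python
from collections import deque
from datetime import datetime, timedelta
from typing import Callable, Deque


class FailedLoginAlert:
    """Fires when too many failed logins occur within a sliding time window."""

    def __init__(self, threshold: int, window: timedelta,
                 notify: Callable[[str], None]) -> None:
        self.threshold = threshold  # assumed: failures needed to trigger the alert
        self.window = window        # assumed: how far back failures are counted
        self.notify = notify        # assumed: callable that reaches the on-call engineer
        self._failures: Deque[datetime] = deque()

    def record_failure(self, at: datetime) -> bool:
        """Record one failed login; return True if the alert fired."""
        self._failures.append(at)
        # Drop failures that have aged out of the window.
        while self._failures and at - self._failures[0] > self.window:
            self._failures.popleft()
        if len(self._failures) >= self.threshold:
            self.notify(f"{len(self._failures)} failed logins within {self.window}")
            # Reset so the same burst does not page repeatedly.
            self._failures.clear()
            return True
        return False
```

Clearing the counter after each notification is one simple way to avoid paging repeatedly for the same burst, which is also one of the ways teams try to limit alert fatigue.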

Alert fatigue

Alert fatigue is a type of burnout that occurs when on-call engineers are paged too often. In severe cases, teams become desensitized to alerts and, as a result, may fail to respond properly to incidents.

Cycle time

Cycle time is a performance metric that measures the time elapsed between one incident milestone and the next. Some companies track cycle time to evaluate the health of their incident response efforts.
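To make the metric concrete, the sketch below computes the cycle time between each pair of consecutive milestones for a single incident. The function name and the milestone names used in the usage note are assumptions for illustration; organizations choose their own milestones.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


def cycle_times(milestones: List[Tuple[str, datetime]]) -> Dict[str, timedelta]:
    """Return the elapsed time between each pair of consecutive milestones.

    `milestones` is a list of (name, timestamp) tuples in chronological order.
    """
    return {
        f"{earlier_name} -> {later_name}": later_time - earlier_time
        for (earlier_name, earlier_time), (later_name, later_time)
        in zip(milestones, milestones[1:])
    }
```

For example, passing hypothetical milestones named `detected`, `acknowledged`, and `mitigated` yields two entries, keyed `detected -> acknowledged` and `acknowledged -> mitigated`.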

Declaration

An incident declaration is when someone at an organization officially documents a system degradation and kicks off the company's incident response process.

Impact matrix

An impact matrix, or impact-effort matrix, is a tool for weighing a task's potential business impact against the amount of effort required to complete it. Organizations use impact matrices to prioritize assignments.

Incident

Companies define incidents in varying ways, but in general an incident is any unexpected degradation or interruption of functionality in a company’s product, systems, or website that is significant enough to be noticeable to customers, internal stakeholders, or both.

Incident commander

An incident commander is accountable for resolving an incident from beginning to end. They also lead throughout the incident, directing colleagues to conduct mitigation and retrospective activities.

Incident management

Incident management is a proactive framework and strategy for anticipating, handling, containing, and preventing incidents.

Incident resolution

Incident resolution is the final step in the incident lifecycle. In this phase, customers no longer feel the effects of the incident, and engineers have implemented a long-term, sustainable solution.

Incident responder

Incident responders are highly skilled individuals who harness their expertise to respond to and resolve incidents swiftly. Incident responders are instrumental in coordinating response efforts, facilitating communication, and implementing effective remediation strategies.

Mean time to assembly (MTTA)

Mean time to assembly (MTTA) is the average time between an incident alert and the start of the response to that incident.

Mean time to resolve (MTTR)

Mean time to resolve (MTTR) is a performance metric that represents the average time between when an incident is detected and when the problem is completely resolved, meaning there’s a long-term fix in place.
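The average described above can be sketched in a few lines. This is an illustrative calculation only: the function name and the input shape, a list of (detected, resolved) timestamp pairs, are assumptions, and incidents that are still open are skipped so that they do not skew the mean.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


def mean_time_to_resolve(
    incidents: List[Tuple[datetime, Optional[datetime]]]
) -> Optional[timedelta]:
    """Average of (resolved - detected) across resolved incidents.

    Each incident is a (detected_at, resolved_at) pair; resolved_at is None
    while the incident is still open. Returns None if nothing is resolved yet.
    """
    durations = [
        resolved - detected
        for detected, resolved in incidents
        if resolved is not None
    ]
    if not durations:
        return None
    # sum() needs a timedelta start value because the default start is the int 0.
    return sum(durations, timedelta()) / len(durations)
```

With one incident resolved in one hour and another in three hours, the function returns a two-hour `timedelta`.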

Milestone

A milestone is a significant event or stage within an incident response process. Milestones describe the active state of an incident and communicate to stakeholders the team's progress in resolving the issue.

Mitigation

Incident mitigation means taking temporary corrective measures to minimize the impact of an issue as the responders continue to work on a more permanent fix. 

Observability

Observability is the process of understanding the state of your system by analyzing its telemetry data.

On call

Being on call means an employee is designated to work whenever an employer calls on them. In the DevOps world, being on call refers to the practice of designating specific engineers to be available in the case of an outage or major incident. Team members usually take turns on call so that someone is available 24/7.

Outage

A system outage means that applications and other functions are temporarily unavailable or degraded, likely impacting customers and slowing down business activities.

Remediation

Incident remediation means completely fixing an incident and fully returning a system to its normal state.

Retrospective

Holding a retrospective is the final step in your incident management plan. Conducting a retrospective means analyzing and reviewing an incident to identify root causes and areas for improvement to prevent similar incidents from happening in the future.

Runbook

A runbook automates rote tasks that must kick off when an incident occurs. It improves assembly time by executing crucial actions for addressing the incident.

Service catalog

A service catalog is a list of all services or product functionalities and their respective owners or subject matter experts. More advanced service catalogs also include things like rollback plans and documentation for each service. The aim of a service catalog is to help incident response teams quickly bring in the owner or expert related to an impacted service or functionality during an incident.

Severity

Most organizations define severity using a level system, a matrix, or both. Severity levels are predefined categories used to classify and prioritize incidents based on their potential impact.

Site Reliability Engineering (SRE)

Site reliability engineering (SRE) uses software tools to automate IT infrastructure tasks. Ideal SRE tasks include system management and application monitoring. Engineers tasked with reliability split their time between operational work and on-call duties. Their responsibilities may include implementing automation, creating new features, or scaling a system to increase site reliability and performance.

Status page

A status page informs external and internal stakeholders about an incident. It’s a quick and simple way to communicate outages or scheduled maintenance to the right audiences.
