Talking incidents
Definitions for incident responders
Definitions
- Alert
- Alert fatigue
- Cycle time
- Declaration
- Impact matrix
- Incident
- Incident commander
- Incident management
- Incident resolution
- Incident responder
- Mean time to assembly (MTTA)
- Mean time to resolve (MTTR)
- Milestone
- Mitigation
- Observability
- On call
- Outage
- Remediation
- Retrospective
- Runbook
- Service catalog
- Severity
- Site Reliability Engineering (SRE)
- Status page

Alert
An alert is a notification sent to a predetermined person or group when certain criteria are met. For example, a certain number of unsuccessful login attempts might trigger an alert that is sent to the engineer on call at that time.
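
As a rough illustration of the login example above, the following sketch checks a count of failed logins against a threshold and produces a notification message. The `AlertRule` class, the `evaluate` function, and the threshold of 5 are all hypothetical names and values chosen for this sketch, not part of any particular alerting tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    name: str
    threshold: int  # hypothetical: number of failures that triggers the alert
    notify: str     # hypothetical: the predetermined person or group


def evaluate(rule: AlertRule, failed_logins: int) -> Optional[str]:
    """Return a notification message when the criteria are met, else None."""
    if failed_logins >= rule.threshold:
        return f"[{rule.name}] {failed_logins} failed logins; notifying {rule.notify}"
    return None


rule = AlertRule(name="login-failures", threshold=5, notify="on-call engineer")
print(evaluate(rule, 3))  # below the threshold, so no alert is produced
print(evaluate(rule, 7))  # at or above the threshold, so a message is produced
```

Real alerting systems usually evaluate such criteria over a time window; the sketch leaves that out to keep the core idea visible.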

Alert fatigue
Alert fatigue is a type of burnout that occurs when on-call engineers are paged to deal with incidents too often. Alert fatigue can become so severe that teams grow desensitized to incidents and, as a result, fail to respond to them properly.

Cycle time
Cycle time is a performance metric that measures the time it takes to move from one incident milestone to the next. Some companies track cycle time to evaluate the health of their incident response efforts.
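
One way to picture cycle time is as the gap between consecutive milestone timestamps. The sketch below assumes an incident record is simply an ordered list of (milestone name, timestamp) pairs; the milestone names and times are invented for illustration.

```python
from datetime import datetime

# Hypothetical milestones for a single incident, in the order they occurred.
milestones = [
    ("started", datetime(2024, 1, 1, 10, 0)),
    ("acknowledged", datetime(2024, 1, 1, 10, 4)),
    ("mitigated", datetime(2024, 1, 1, 10, 34)),
    ("resolved", datetime(2024, 1, 1, 12, 4)),
]


def cycle_times(milestones):
    """Return minutes elapsed between each consecutive pair of milestones."""
    result = {}
    for (prev_name, prev_at), (name, at) in zip(milestones, milestones[1:]):
        result[f"{prev_name} -> {name}"] = (at - prev_at).total_seconds() / 60
    return result


for transition, minutes in cycle_times(milestones).items():
    print(f"{transition}: {minutes:.0f} min")
```

Tracking each transition separately shows where a response spends its time, which a single end-to-end duration would hide.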

Declaration
An incident declaration is when someone at an organization officially documents a system degradation and kicks off the company's incident response process.

Impact matrix
An impact matrix, or impact effort matrix, is a tool for weighing a task's potential business impact against the amount of effort required to complete it. Organizations use impact matrices to prioritize assignments.

Incident
Companies define incidents in varying ways, but in general an incident is any unexpected degradation or interruption of functionality in a company’s product, systems, or website that is significant enough to be noticeable to customers, internal stakeholders, or both.

Incident commander
An incident commander is accountable for resolving an incident from beginning to end. They also lead throughout the incident, directing colleagues to conduct mitigation and retrospective activities.

Incident management
Incident management is a proactive framework and strategy for anticipating, handling, containing, and preventing incidents.

Incident resolution
Incident resolution is the final step in the incident lifecycle. In this phase, not only do customers no longer feel the effects of an incident, but engineers have also implemented a long-term, sustainable solution.

Incident responder
Incident responders are highly skilled individuals who harness their expertise to respond to and resolve incidents swiftly. Incident responders are instrumental in coordinating response efforts, facilitating communication, and implementing effective remediation strategies.

Mean time to assembly (MTTA)
Mean time to assembly (MTTA) is the average time between an incident alert and the start of the response to that incident.
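
As a minimal sketch of how MTTA could be computed, assume each incident record carries the time the alert fired and the time the response began. The field names `alerted_at` and `response_started_at` and the sample data are hypothetical.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records with invented timestamps.
incidents = [
    {"alerted_at": datetime(2024, 3, 1, 9, 0),
     "response_started_at": datetime(2024, 3, 1, 9, 5)},
    {"alerted_at": datetime(2024, 3, 2, 14, 0),
     "response_started_at": datetime(2024, 3, 2, 14, 10)},
    {"alerted_at": datetime(2024, 3, 3, 22, 30),
     "response_started_at": datetime(2024, 3, 3, 22, 33)},
]


def mean_time_to_assembly(incidents):
    """Average minutes between the alert and the start of the response."""
    minutes = [
        (i["response_started_at"] - i["alerted_at"]).total_seconds() / 60
        for i in incidents
    ]
    return mean(minutes)


print(mean_time_to_assembly(incidents))  # (5 + 10 + 3) / 3 = 6.0 minutes
```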

Mean time to resolve (MTTR)
Mean time to resolve (MTTR) is a performance metric that represents the average time between when an incident is detected and when the problem is completely resolved, meaning there’s a long-term fix in place.
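
Written as a formula, the average described above might look like the following. The symbols are assumptions introduced here: n is the number of incidents in the period, and the two timestamps mark when incident i was detected and when it was fully resolved.

```latex
\[
\mathrm{MTTR} = \frac{1}{n} \sum_{i=1}^{n}
  \left( t_i^{\text{resolved}} - t_i^{\text{detected}} \right)
\]
% Worked example with invented durations of 60, 90, and 150 minutes:
\[
\mathrm{MTTR} = \frac{60 + 90 + 150}{3} = 100 \text{ minutes}
\]
```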

Milestone
A milestone is a significant event or stage within an incident response process. Milestones describe the active state of an incident and communicate to stakeholders the team's progress in resolving the issue.

Mitigation
Incident mitigation means taking temporary corrective measures to minimize the impact of an issue while responders continue to work on a more permanent fix.

Observability
Observability is the process of understanding the state of your system by analyzing its telemetry data.

On call
Being on call means an employee is designated to work whenever an employer calls on them. In the DevOps world, being on call refers to the practice of designating specific engineers to be available in the case of an outage or major incident. Team members usually take turns being on call so that someone is always available.
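
Taking turns is often implemented as a simple rotation. The sketch below assumes a weekly round-robin schedule; the `on_call_for` function, the engineer names, and the weekly cadence are all hypothetical choices for illustration.

```python
from datetime import date
from typing import Sequence


def on_call_for(day: date, engineers: Sequence[str], rotation_start: date) -> str:
    """Return who is on call on `day`, rotating weekly through `engineers`."""
    weeks_elapsed = (day - rotation_start).days // 7
    return engineers[weeks_elapsed % len(engineers)]


engineers = ["ana", "bo", "cy"]     # hypothetical team members
rotation_start = date(2024, 1, 1)   # hypothetical first day of the rotation

print(on_call_for(date(2024, 1, 3), engineers, rotation_start))   # first week
print(on_call_for(date(2024, 1, 8), engineers, rotation_start))   # second week
print(on_call_for(date(2024, 1, 22), engineers, rotation_start))  # wraps around
```

Real schedules typically add overrides, handoff times, and secondary responders on top of a base rotation like this one.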

Outage
A system outage means that applications and other functions are temporarily unavailable or degraded, likely impacting customers and slowing down business activities.

Remediation
Incident remediation means completely fixing an incident and fully returning a system to its normal state.

Retrospective
Holding a retrospective is the final step in your incident management plan. Conducting a retrospective means analyzing and reviewing an incident to identify root causes and areas for improvement to prevent similar incidents from happening in the future.

Runbook
A runbook automates rote tasks that must kick off when an incident occurs. It improves assembly time by executing crucial actions for addressing the incident.

Service catalog
A service catalog is a list of all services or product functionalities and their respective owners or subject matter experts. More advanced service catalogs also include rollback plans and documentation for each service. The aim of a service catalog is to help incident response teams quickly bring in the owner or expert for an impacted service or functionality during an incident.
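
At its simplest, a service catalog can be modeled as a mapping from service name to an entry that records the owner and any supporting material. The `ServiceEntry` fields, the service names, and the team names below are hypothetical and only meant to show the lookup a responder would perform.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ServiceEntry:
    name: str
    owner: str               # team or subject matter expert to bring in
    rollback_plan: str = ""  # optional, found in more advanced catalogs
    docs: str = ""           # optional pointer to documentation


# Hypothetical catalog contents.
catalog: Dict[str, ServiceEntry] = {
    "checkout": ServiceEntry("checkout", owner="payments-team",
                             rollback_plan="redeploy previous release"),
    "search": ServiceEntry("search", owner="discovery-team"),
}


def find_owner(catalog: Dict[str, ServiceEntry], service: str) -> Optional[str]:
    """Return the owner of an impacted service, or None if it is not cataloged."""
    entry = catalog.get(service)
    return entry.owner if entry else None


print(find_owner(catalog, "checkout"))  # the owner to bring into the incident
print(find_owner(catalog, "billing"))   # not in the catalog
```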

Severity
Most organizations define severity using a level system, a matrix, or both. Severity levels are predefined categories used to classify and prioritize incidents based on their potential impact.
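
To show how levels and a matrix can work together, the sketch below looks up a severity level from two yes/no questions about impact and then sorts incidents by that level. The level names, the two questions, and the matrix contents are assumptions made for this example; each organization defines its own.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = 1  # most severe
    SEV2 = 2
    SEV3 = 3  # least severe


# Hypothetical matrix keyed by (customer_facing, fully_down).
SEVERITY_MATRIX = {
    (True, True): Severity.SEV1,
    (True, False): Severity.SEV2,
    (False, True): Severity.SEV2,
    (False, False): Severity.SEV3,
}


def classify(customer_facing: bool, fully_down: bool) -> Severity:
    """Look up the predefined severity level for an incident's impact."""
    return SEVERITY_MATRIX[(customer_facing, fully_down)]


# Prioritize invented incidents so the most severe comes first.
incidents = [
    ("internal dashboard slow", classify(False, False)),
    ("checkout unavailable", classify(True, True)),
    ("search results degraded", classify(True, False)),
]
for title, severity in sorted(incidents, key=lambda item: item[1].value):
    print(severity.name, title)
```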

Site Reliability Engineering (SRE)
Site reliability engineering (SRE) uses software tools to automate IT infrastructure tasks. Ideal SRE tasks include system management and application monitoring. Engineers tasked with reliability split their time between operational work and on-call duties. These responsibilities may include implementing automation, creating new features, or scaling a system to increase site reliability and performance.

Status page
A status page informs external and internal stakeholders about an incident. It’s a quick and simple way to communicate outages or scheduled maintenance to the right audiences.