Analytics - Charts Deep Dive

Incidents by Infrastructure

Understanding where you have the most incidents is critical for any business. This chart shows how incidents are distributed across your services or functionalities, so you can identify the areas where investing in improvement will matter most.

For this chart, we are looking specifically at the Active Incidents in your organization. We pull in data for any incident that was in an Active milestone state, between Started and Resolved, at some point within the given date range.

IncidentsByInfraChart.png

Summary Cards Explained

  • Active Incidents. This card is a count of Active Incidents based on the set query parameters. Active Incidents are incidents that started before or during the date range; they do not have to be resolved to be included in this count. We do not include GAMEDAY or MAINTENANCE incidents in this total.
  • Functionalities / Services Impacted. This card changes based on your selection in the Infrastructure drop-down. We do not include the 'None assigned' category shown in the pie chart in this count. 'None assigned' means an incident does not have a functionality/service attached at the time of the query.
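
The Active Incidents definition above can be read as a date-range overlap test. The following is a minimal Python sketch of that reading, not the product's actual query. The `Incident` class, its field names, the `SEV1`/`SEV2` severity labels, and the rule that an incident resolved before the range begins is not counted are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Severities the summary card leaves out, per the text above.
EXCLUDED_SEVERITIES = {"GAMEDAY", "MAINTENANCE"}


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    name: str
    severity: str
    started_at: datetime
    resolved_at: Optional[datetime] = None  # None means still unresolved


def is_active_in_range(incident: Incident, range_start: datetime, range_end: datetime) -> bool:
    """Return True if the incident was Active at some point in the range."""
    if incident.severity in EXCLUDED_SEVERITIES:
        return False
    started_by_range_end = incident.started_at <= range_end
    # Assumption: an incident resolved before the range begins is not Active in it.
    open_during_range = incident.resolved_at is None or incident.resolved_at >= range_start
    return started_by_range_end and open_during_range


range_start, range_end = datetime(2023, 11, 1), datetime(2023, 11, 30, 23, 59, 59)
incidents = [
    Incident("started before, unresolved", "SEV1", datetime(2023, 10, 20)),
    Incident("started and resolved in range", "SEV2", datetime(2023, 11, 5), datetime(2023, 11, 6)),
    Incident("resolved before range", "SEV2", datetime(2023, 10, 1), datetime(2023, 10, 2)),
    Incident("game day exercise", "GAMEDAY", datetime(2023, 11, 10)),
]
active = [i.name for i in incidents if is_active_in_range(i, range_start, range_end)]
print(len(active), active)
```

In this sample, only the first two incidents are counted: one started before the range and is still open, and one started and resolved inside it.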

Chart Explained

See which services and functionalities have the most incidents associated with them. Each segment of the pie chart represents either a defined Functionality or Service, with the addition of a None assigned segment to count the incidents without a marked impacted infrastructure. The counts within the segments represent the incident count.

It is important to note that a single incident can have multiple impacted infrastructures. When that happens, the incident is counted once in every segment it touches, so the total across the pie chart can exceed the count of Active Incidents in the summary card. For instance, you could have 4 Active Incidents and 2 Functionalities impacted based on the query selection. If one of those incidents impacted both Functionality A and Functionality B, and each of the other three impacted exactly one, the segments would sum to 5 even though the card shows 4.
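
To make the double count concrete, here is a small Python sketch using made-up incident names and functionalities. It mirrors the example above (4 incidents, 2 functionalities) and is only an illustration of the counting logic, not how the chart is computed internally.

```python
from collections import Counter

# Hypothetical data: incident name -> impacted functionalities.
incidents = {
    "INC-1": ["Functionality A", "Functionality B"],  # impacts two functionalities
    "INC-2": ["Functionality A"],
    "INC-3": ["Functionality B"],
    "INC-4": ["Functionality A"],
}

segment_counts = Counter()
for impacted in incidents.values():
    # An incident with nothing attached lands in the 'None assigned' segment.
    for name in impacted or ["None assigned"]:
        segment_counts[name] += 1

active_incidents = len(incidents)         # what the summary card counts
pie_total = sum(segment_counts.values())  # what the segments add up to
impacted_count = len([n for n in segment_counts if n != "None assigned"])

print(active_incidents, pie_total, dict(segment_counts))
```

The card counts incidents, while the pie chart counts incident-to-infrastructure links, which is why the two totals can differ.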

New Incidents by Impacted Infrastructure

One of the most important ways to understand the health of your system is to look at the trends over time for each of the services and functionalities in that system. This chart focuses on the performance of your top-impacted infrastructures, charting incident occurrence against days, weeks, or months based on your selected resolution.

For this chart, we are looking specifically at New Incidents created during the selected date range. This chart is useful for seeing when issues were introduced, allowing you to tie the information back to a release cycle or specific point in time.

NewIncidentsByImpactedInfraChart.png

Summary Cards Explained

  • Top Impacted Functionalities/Services. This card counts the number of impacted functionalities/services displayed in the chart. To help you focus on what matters, we limit this view to the top 10 impacted infrastructure components. This card changes based on your selection in the Infrastructure drop-down. We do not include GAMEDAY or MAINTENANCE incidents when looking for the top-impacted infrastructure.
  • Incidents Displayed. This card counts the number of incidents created during the specified date range, omitting any GAMEDAY or MAINTENANCE incidents and any incidents not associated with the Top Impacted Functionalities/Services. The total does not include ongoing incidents that were declared before the start of the date range. This allows for an event-driven analysis, where you can use the timeframe to determine whether a pattern of underlying factors led to a spike or decrease in incidents.
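
The two cards above can be sketched as a filter followed by a ranking. The Python below is an illustrative sketch only: the dictionary keys (`created_on`, `severity`, `impacted`), the sample service names, and the severity labels are hypothetical, and `limit` is set to 2 instead of 10 to keep the sample small.

```python
from collections import Counter
from datetime import date

EXCLUDED_SEVERITIES = {"GAMEDAY", "MAINTENANCE"}


def top_impacted_and_displayed(incidents, range_start, range_end, limit=10):
    """Return (top infrastructure names, incidents displayed) for the range."""
    # New incidents only: created inside the range, excluded severities dropped.
    new_incidents = [
        inc for inc in incidents
        if range_start <= inc["created_on"] <= range_end
        and inc["severity"] not in EXCLUDED_SEVERITIES
    ]
    counts = Counter(name for inc in new_incidents for name in inc["impacted"])
    top = [name for name, _ in counts.most_common(limit)]
    # Keep only incidents tied to at least one top-impacted component.
    displayed = [inc for inc in new_incidents if set(inc["impacted"]) & set(top)]
    return top, displayed


incidents = [
    {"created_on": date(2023, 11, 3), "severity": "SEV2", "impacted": ["api", "checkout"]},
    {"created_on": date(2023, 11, 7), "severity": "SEV1", "impacted": ["api"]},
    {"created_on": date(2023, 11, 9), "severity": "SEV3", "impacted": ["checkout"]},
    {"created_on": date(2023, 11, 12), "severity": "SEV3", "impacted": ["search"]},
    {"created_on": date(2023, 10, 15), "severity": "SEV1", "impacted": ["api"]},  # declared before range
    {"created_on": date(2023, 11, 20), "severity": "MAINTENANCE", "impacted": ["search"]},
    {"created_on": date(2023, 11, 21), "severity": "SEV2", "impacted": ["api"]},
]
top, displayed = top_impacted_and_displayed(
    incidents, date(2023, 11, 1), date(2023, 11, 30), limit=2
)
print(top, len(displayed))
```

Here the incident declared in October and the MAINTENANCE incident are dropped first, and the `search` incident is dropped because `search` does not make the top 2.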

Chart Explained

Identify potential trends around when incidents come up. Was there a spike for Service ABC last month? Does Functionality XYZ have a recurring degraded status at the end of each week, corresponding to a manual release schedule? Use this chart as a starting point to dig deeper and surface questions for retrospective analysis.

Each stacked bar has the impacted service or functionality represented by a colored segment, tracking the number of new incidents over the selected resolution.

It is important to note that a single incident can have multiple impacted infrastructures. As a result, the sum of the incidents represented in the stacked bar chart can exceed the number of incidents totaled in the summary card. For instance, you could have 2 Top Impacted Functionalities and 4 Incidents Displayed based on the query selection. If one of the incidents impacted both Functionality A and Functionality B, the stacked bar chart would reflect the double count.
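
As a rough illustration of how incidents could be grouped into bars and segments, the Python sketch below buckets hypothetical incidents by resolution and counts one segment entry per impacted functionality. The resolution names, the Monday week start, and the sample data are assumptions, not the product's documented behavior.

```python
from collections import Counter
from datetime import date, timedelta


def bucket_start(day: date, resolution: str) -> date:
    """Map a date to the first day of its resolution group."""
    if resolution == "daily":
        return day
    if resolution == "weekly":
        # Assumption: weeks start on Monday.
        return day - timedelta(days=day.weekday())
    if resolution == "monthly":
        return day.replace(day=1)
    raise ValueError(f"unknown resolution: {resolution}")


def stacked_counts(incidents, resolution):
    """Count incidents per (resolution group, functionality) segment."""
    segments = Counter()
    for created_on, impacted in incidents:
        for name in impacted:
            segments[(bucket_start(created_on, resolution), name)] += 1
    return segments


# Hypothetical new incidents: (creation date, impacted functionalities).
incidents = [
    (date(2023, 11, 6), ["Functionality A", "Functionality B"]),
    (date(2023, 11, 8), ["Functionality A"]),
    (date(2023, 11, 14), ["Functionality B"]),
    (date(2023, 11, 16), ["Functionality A"]),
]
weekly = stacked_counts(incidents, "weekly")
for (week, name), count in sorted(weekly.items()):
    print(week, name, count)
print("incidents:", len(incidents), "segment total:", sum(weekly.values()))
```

With this sample, 4 incidents produce a segment total of 5, because the first incident appears in both the Functionality A and Functionality B segments of its week.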

New Incidents by Severity

Regarding overall system health, teams don't just care about when incidents occurred but also want to see how severe the incidents were. This chart focuses on the organization's overall performance, charting the incident occurrence and severity against days, weeks, or months based on your resolution.

For this chart, we are looking specifically at New Incidents created during the selected date range. This chart is useful for seeing when issues were introduced, allowing you to tie the information back to a release cycle or specific point in time.

NewIncidentsBySevChart.png

Summary Card Explained

  • Incidents Displayed. This card counts the incidents created during the specified date range, omitting any with a GAMEDAY, MAINTENANCE, or UNSET severity. The total does not include ongoing incidents that were declared before the start of the date range. This allows for an event-driven analysis, where you can use the timeframe to determine whether a pattern of underlying factors led to a spike or decrease in incidents.

Chart Explained

Identify potential trends in when incidents come up. For a more granular view, this chart is a great place to apply additional conditional filters that isolate trends in specific infrastructure components. You may want to understand whether older services have more frequent but lower-severity incidents, providing supporting data for refactoring or deprecation work. Use this chart as a starting point to dig deeper and surface questions for retrospective analysis.

Each stacked bar has each severity represented by a colored segment, tracking the number of new incidents over the selected resolution.

Mean Time Metrics

Time is of the essence when you are working through an incident. The mean times for an incident to be detected, acknowledged, mitigated, and resolved are standard measurements of how efficiently your teams are moving through incidents, providing insights into your incident management process and potential bottlenecks.

MeanTimeMetricsChart.png

The general calculation for each milestone is

image3.png

To learn more about Incident Milestones, check out this document.

Summary Cards Explained

  • MTTD (Mean time to Detected). Using the above general calculation methodology, this metric is most helpful in tracking incidents created based on an alert. This metric gives insight into the timeliness of your alerts. If no alert is associated with the incident, the Detected milestone will remain null unless manually set. Only incidents that transitioned to the Detected milestone during the selected date range are included in this calculation.
  • MTTA (Mean time to Acknowledged). Using the above general calculation methodology, this metric can tell you how efficiently you're getting the right people to participate in the incident. Teams can be assigned using Runbooks, and the first step could be to move the incident to the Acknowledged milestone. This metric gives insight into the efficiency of your incident response process. Only incidents that transitioned to the Acknowledged milestone during the selected date range are included in this calculation.
  • MTTM (Mean time to Mitigated). Using the above general calculation methodology, this metric provides insight into the incident's span of impact: for how long did the outage or degraded service impact customers? This metric can unearth training opportunities where limited documentation leads to a longer time to mitigate, or it may point to a bottleneck in shipping and deployment. Only incidents that transitioned to the Mitigated milestone during the selected date range are included in this calculation.
  • MTTR (Mean time to Resolved). Using the above general calculation methodology, you can see the overall time your team invests in incident response and resolution. Given that so many teams are involved in proper resolution, this could be a metric that your organization rallies around to improve consistently, setting different target SLAs for varying severities. Only incidents that transitioned to the Resolved milestone during the selected date range are included in the calculation.
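
As a hedged illustration of how a mean-time metric of this kind is commonly computed, the Python sketch below averages the time from an incident's start to a given milestone, using only incidents that reached the milestone within the date range. This is an assumption about the calculation, since the formula itself is not reproduced in this text; the milestone keys and sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def mean_time_to(milestone: str, incidents: list,
                 range_start: datetime, range_end: datetime) -> Optional[timedelta]:
    """Average (milestone time - started time) over qualifying incidents."""
    durations = []
    for milestones in incidents:
        reached_at = milestones.get(milestone)
        if reached_at is None:
            continue  # milestone never set, e.g. no alert for Detected
        if not (range_start <= reached_at <= range_end):
            continue  # transitioned outside the selected date range
        # Assumption: durations are measured from the Started milestone.
        durations.append(reached_at - milestones["started"])
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


# Hypothetical incidents: milestone name -> timestamp.
incidents = [
    {"started": datetime(2023, 11, 2, 10, 0), "acknowledged": datetime(2023, 11, 2, 10, 10)},
    {"started": datetime(2023, 11, 9, 14, 0), "acknowledged": datetime(2023, 11, 9, 14, 30)},
    {"started": datetime(2023, 11, 20, 8, 0)},  # never acknowledged
]
range_start, range_end = datetime(2023, 11, 1), datetime(2023, 12, 1)
mtta = mean_time_to("acknowledged", incidents, range_start, range_end)
mttd = mean_time_to("detected", incidents, range_start, range_end)
print(mtta, mttd)
```

In this sample the two acknowledged incidents took 10 and 30 minutes, so the mean is 20 minutes, and the Detected metric is empty because no incident has that milestone set.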

Chart Explained

See all of your Milestone markers in one chart. Using the stacked line graph, get a more granular breakdown of the MTTX metrics by resolution group. For example, if you select Weekly resolution, MTTX is calculated separately for each week. Hover over each node in the chart to see the numerical value for the resolution group. This view can help you determine whether anomalous behavior aligns with staffing, coverage, or an emerging trend that could be mitigated with thorough training. Perhaps problems are taking longer to solve due to burdensome technical debt or missing technical documentation.

Incident Resolution

Another vital piece of understanding the health of your incident response program is understanding if you are resolving incidents as they come in or if your team is accumulating a backlog of lingering incidents.

IncidentResolutionChart.png

Summary Cards Explained

  • Incidents Created. This card shows the number of new incidents declared during the selected date range within the filtering criteria. We calculate this using the incident start_date timestamp. We omit any incidents with GAMEDAY, MAINTENANCE, or UNSET severities.
  • Incidents Resolved. This card shows the number of incidents resolved during the selected date range within the filtering criteria. We calculate this using the Resolved milestone occurred_at timestamp. We omit any incidents with GAMEDAY, MAINTENANCE, or UNSET severities.
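
The two counts above use different timestamps, which is what lets you spot a growing backlog. The Python sketch below illustrates this with hypothetical records; the `resolved_at` key stands in for the Resolved milestone timestamp, and the severity labels are made up.

```python
from datetime import datetime

EXCLUDED_SEVERITIES = {"GAMEDAY", "MAINTENANCE", "UNSET"}


def created_and_resolved(incidents, range_start, range_end):
    """Return (incidents created, incidents resolved) within the range."""
    counted = [i for i in incidents if i["severity"] not in EXCLUDED_SEVERITIES]
    created = sum(1 for i in counted if range_start <= i["start_date"] <= range_end)
    resolved = sum(
        1 for i in counted
        if i["resolved_at"] is not None and range_start <= i["resolved_at"] <= range_end
    )
    return created, resolved


incidents = [
    # Declared before the range, resolved inside it: counts as resolved only.
    {"severity": "SEV1", "start_date": datetime(2023, 10, 28), "resolved_at": datetime(2023, 11, 2)},
    {"severity": "SEV2", "start_date": datetime(2023, 11, 3), "resolved_at": datetime(2023, 11, 4)},
    {"severity": "SEV2", "start_date": datetime(2023, 11, 10), "resolved_at": None},
    {"severity": "SEV3", "start_date": datetime(2023, 11, 15), "resolved_at": datetime(2023, 12, 5)},
    {"severity": "UNSET", "start_date": datetime(2023, 11, 12), "resolved_at": datetime(2023, 11, 12)},
]
created, resolved = created_and_resolved(incidents, datetime(2023, 11, 1), datetime(2023, 11, 30))
print(created, resolved, created - resolved)
```

A positive difference between created and resolved over several periods suggests that lingering incidents are accumulating.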

Chart Explained

Identify potential trends in when incidents come up and when they are resolved. Using this chart, you can quickly compare newly created incidents against resolved incidents for each resolution group to understand incident engagement in your organization. For a more granular view, this is a great place to apply additional conditional filters that isolate trends in specific infrastructure components. You may want to understand whether older services are impacted less frequently yet still have a high volume of outstanding incidents, which can provide supporting data for refactoring or deprecation work. Use this chart as a starting point to dig deeper and surface questions for retrospective analysis.

Hover over each node in the chart to see the numerical value for the resolution group.

Retrospective Completion

What happens after an incident is just as important as what happens during an incident. Retrospectives are valuable as they lead to learning, shared accountability, and continuous improvement. This chart provides insight into your Retrospective habits after incidents are resolved. Add filters to see how different assigned teams might have different rituals or how frequencies differ based on an incident's severity.

RetroCompletionChart.png

Summary Cards Explained

  • Incidents Resolved. This card shows the number of incidents resolved during the selected date range within the filtering criteria. We calculate this using the Resolved milestone occurred_at timestamp. We omit any incidents with GAMEDAY, MAINTENANCE, or UNSET severities.
  • Retrospectives Started. This card shows the number of retrospectives started during the selected date range within the filtering criteria. We measure this using the Retrospective Started milestone occurred_at timestamp. We omit any incidents with GAMEDAY, MAINTENANCE, or UNSET severities.
  • Retrospectives Completed. This card shows the number of retrospectives completed during the selected date range within the filtering criteria. We measure this using the Retrospective Completed milestone occurred_at timestamp. We omit any incidents with GAMEDAY, MAINTENANCE, or UNSET severities.

Chart Explained

Based on your selected resolution, the stacked line graph lets you see these critical milestones over time. Using this chart, you can see how timely and diligent your teams are in having Retrospectives. Hover over each node in the chart to see the numerical value for the resolution group.

New Incidents by Team

Understanding which teams are most often assigned to incidents is critical to your organization's culture and your product's success. This chart provides an overview of which teams are at risk of burnout from their sheer involvement in incidents and highlights potential single points of failure. In either case, you can use this information to better understand how your teams are working and what support they might need.

For this chart, we are looking specifically at the Active Incidents in your organization. This means that we will pull in the data for an incident that has been in an Active milestone state, between Started and Resolved, within the given date range.

NewIncidentsbyTeamChart.png

Summary Cards Explained

  • Top Impacted Teams. This card looks at the teams assigned to the most incidents in the given time frame. We do not include the 'None assigned' category shown in the chart in this count. 'None assigned' means an incident does not have a team assigned at query time. To help you focus on what matters, we limit this view to the top 10 impacted teams.
  • Incidents Displayed. This card is a count of Active Incidents based on the set query parameters. Active Incidents are incidents that started before or during the date range; they do not have to be resolved to be included in this count. We do not include GAMEDAY or MAINTENANCE incidents in this total.

Chart Explained

Identify when teams are assigned ownership of an incident. Depending on your selection, this stacked bar chart tracks team assignments against a daily, weekly, or monthly resolution. You can see how many incidents each team has been assigned and determine whether the impact is spread evenly or a subset of teams is significantly impacted. The time variable can indicate that a team needs to slow down shipping, or it can serve as a starting point to dig into other root causes that could lead to an influx of incidents. You may even see a senior team consistently assigned ownership simply because they have the most experience in your codebase. Use this chart to start conversations, help foster blameless retrospectives, and drive accountability for tackling root causes.

Task and Follow-up Completion

What happens after an incident is just as important as what happens during an incident. Tasks and Follow-Ups are great tools for accountability and joint responsibility. Add filters to see how different assigned teams might have different rituals or how frequencies differ based on an incident's severity.

TaskFollowUpCompletionChart.png

Summary Cards Explained

  • Tasks/Follow-ups Created. This card shows the number of tasks or follow-ups created during the specified date range. We are omitting tasks or follow-ups associated with incidents marked as GAME DAY or MAINTENANCE. We also exclude tasks or follow-ups that were Canceled. This card will change based on your selection in the Action Item drop-down.
  • Tasks/Follow-ups Completed. This card shows the number of tasks or follow-ups completed during the specified date range. Tasks and Follow-ups are identified as completed based on their done status and last updated_at timestamp. We are omitting tasks or follow-ups associated with incidents marked as GAME DAY or MAINTENANCE. This card will change based on your selection in the Action Item drop-down.
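
The rules in these two cards can be sketched as two independent counts over the same action items. The Python below is a minimal illustration with hypothetical field names (`state`, `created_on`, `updated_on`, `incident_severity`) and state values; it is not the product's implementation.

```python
from datetime import date

EXCLUDED_SEVERITIES = {"GAMEDAY", "MAINTENANCE"}


def action_item_counts(items, range_start, range_end):
    """Return (created, completed) counts for tasks or follow-ups in the range."""
    eligible = [it for it in items if it["incident_severity"] not in EXCLUDED_SEVERITIES]
    # Created: canceled items are left out.
    created = sum(
        1 for it in eligible
        if it["state"] != "canceled" and range_start <= it["created_on"] <= range_end
    )
    # Completed: done status, dated by the last update.
    completed = sum(
        1 for it in eligible
        if it["state"] == "done" and range_start <= it["updated_on"] <= range_end
    )
    return created, completed


items = [
    {"state": "done", "created_on": date(2023, 11, 2), "updated_on": date(2023, 11, 9), "incident_severity": "SEV1"},
    {"state": "open", "created_on": date(2023, 11, 5), "updated_on": date(2023, 11, 5), "incident_severity": "SEV2"},
    {"state": "open", "created_on": date(2023, 11, 6), "updated_on": date(2023, 11, 6), "incident_severity": "SEV2"},
    {"state": "canceled", "created_on": date(2023, 11, 6), "updated_on": date(2023, 11, 7), "incident_severity": "SEV2"},
    # Created before the range but finished inside it: completed only.
    {"state": "done", "created_on": date(2023, 10, 20), "updated_on": date(2023, 11, 3), "incident_severity": "SEV3"},
    {"state": "done", "created_on": date(2023, 11, 8), "updated_on": date(2023, 11, 8), "incident_severity": "GAMEDAY"},
]
created, completed = action_item_counts(items, date(2023, 11, 1), date(2023, 11, 30))
print(created, completed)
```

Because completion is dated by the last update, an item created before the range can still count as completed within it.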

Chart Explained

How often have you looked into a bug only to find a lingering #TODO memo inline? See the burndown of action items in this stacked line chart, which compares the number of tasks created with the number of tasks completed. While this is a simple metric, it can prompt conversations such as "Are we creating meaningful follow-ups?" and "Are we holding people accountable for completing the follow-ups?" Over time, use this data to see whether completing follow-ups positively impacts overall system performance or other metrics, such as MTTM or MTTR.

Last updated on 12/7/2023