Observability

What is observability?

System observability follows the principle of determining the internal state of a system by looking at its outputs. Most of these measurable outputs fall into three categories, known as the pillars of system observability:

Logs: records of specific events that happen as code runs
Metrics: quantitative data that help teams deduce the effectiveness of the system over time (e.g., number of requests per second or amount of processing power used by an app)
Traces: records of a request’s path through the system, which provide the context for identifying which logs and metrics you need to answer a specific question about your system
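
To make the three pillars concrete, here is a minimal sketch in Python that emits one of each while handling a single request. It uses only the standard library; the service name `checkout`, the `Span` class, and the `handle_request` function are hypothetical names chosen for illustration, not part of any particular observability product.

```python
import logging
import time
import uuid
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("checkout")  # hypothetical service name

# Metric: a running count of requests handled per endpoint.
request_count = Counter()


@dataclass
class Span:
    """One timed step of a request; spans sharing a trace_id form a trace."""
    trace_id: str
    name: str
    start: float = field(default_factory=time.perf_counter)
    duration_s: Optional[float] = None

    def finish(self) -> None:
        # Record how long this step took.
        self.duration_s = time.perf_counter() - self.start


def handle_request(endpoint: str) -> Span:
    trace_id = uuid.uuid4().hex
    span = Span(trace_id=trace_id, name=endpoint)
    request_count[endpoint] += 1                              # metric
    log.info("handling %s trace_id=%s", endpoint, trace_id)   # log
    span.finish()                                             # trace
    return span


span = handle_request("/pay")
```

Including the same `trace_id` in the log line and the span is what lets a team later connect an individual event to the request that produced it.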

By looking at these three pillars, teams can answer questions like, “Which services keep causing slowdowns and need to be optimized?” or “Where and why did a given error occur?”

However, it can be challenging to fully observe and understand everything in a modern-day system because of complexities like cloud infrastructure, containerization, and microservices. Compiling telemetry data requires teams to look through several tools for monitoring, tracing, log analysis, metrics, etc. This process involves a lot of context-switching and guesswork.

Because this manual process can take so long, many organizations implement an observability solution such as Honeycomb. Observability tools centralize telemetry data into a single location, enabling users to drill down or zoom out and deduce what this data means. Many incident management teams also rely on a service catalog to foster observability across their organizations’ apps and services.
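
As a rough illustration of what centralizing telemetry buys you, the sketch below groups logs and spans from different services by a shared trace ID and then answers the question of which service accounts for the most time. The record shapes, field names, and sample values are assumptions made for this example; real observability tools use their own data models and query interfaces.

```python
from collections import defaultdict

# Hypothetical telemetry, as it might arrive from separate tools.
logs = [
    {"trace_id": "t1", "service": "checkout", "message": "request received"},
    {"trace_id": "t1", "service": "payments", "message": "card declined"},
    {"trace_id": "t2", "service": "checkout", "message": "request received"},
]
spans = [
    {"trace_id": "t1", "service": "checkout", "duration_ms": 40.0},
    {"trace_id": "t1", "service": "payments", "duration_ms": 900.0},
    {"trace_id": "t2", "service": "checkout", "duration_ms": 35.0},
]


def centralize(logs, spans):
    """Group all telemetry by trace ID so one request can be read in one place."""
    by_trace = defaultdict(lambda: {"logs": [], "spans": []})
    for entry in logs:
        by_trace[entry["trace_id"]]["logs"].append(entry)
    for span in spans:
        by_trace[span["trace_id"]]["spans"].append(span)
    return dict(by_trace)


def slowest_service(spans):
    """Return the service with the largest total span duration."""
    totals = defaultdict(float)
    for span in spans:
        totals[span["service"]] += span["duration_ms"]
    return max(totals, key=totals.get)


view = centralize(logs, spans)
culprit = slowest_service(spans)
```

With the data in one place, drilling down means reading `view["t1"]` for a single request, and zooming out means aggregating across all spans, as `slowest_service` does.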

Why is observability important?

Collecting concrete metrics, logs, and traces that answer questions about your systems’ performance can help you quickly identify and solve issues. But without centralized, holistic observability, teams have no choice but to manually compile the data from each tool when a problem arises, which consumes time and resources that could go toward fixing it.

How does observability help with incident management?

Observability is essential to successful incident management. When an incident occurs, it’s up to several team members to diagnose the problem and find a solution as quickly as possible. Just as a doctor would look at a patient’s symptoms to diagnose their sickness or injury, incident management teams must look at the metrics, logs, and traces to determine the root cause of an incident. Without a centralized way to check the system’s “symptoms,” incident management teams have to gather that evidence by hand before they can begin diagnosing.

Along with an observability platform, many incident management teams use a service catalog to observe and collect additional details on their organization’s apps and services. A few of the extra details that a service catalog offers include:

Information on each deployment and dashboard across the organization
Centralized data about each service running in the business
Lists of which teams are responsible for the well-being of which services and apps
Automated incident processes for notifying the right team when a specific service goes down
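
The items above can be sketched as a small data structure. This is a minimal, hypothetical model of a service catalog entry and an ownership lookup; the `Service` class, its field names, the team names, the `example.com` URLs, and the `team_to_notify` function are all assumptions for illustration and do not represent any vendor’s actual schema or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Service:
    """One catalog entry: what the service is, who owns it, where to look."""
    name: str
    owner_team: str
    dashboard_url: str
    dependencies: List[str] = field(default_factory=list)


# Hypothetical catalog keyed by service name.
catalog: Dict[str, Service] = {
    "checkout": Service(
        name="checkout",
        owner_team="payments-team",
        dashboard_url="https://example.com/dashboards/checkout",
        dependencies=["inventory"],
    ),
    "inventory": Service(
        name="inventory",
        owner_team="warehouse-team",
        dashboard_url="https://example.com/dashboards/inventory",
    ),
}


def team_to_notify(catalog: Dict[str, Service], service_name: str,
                   fallback: str = "on-call") -> str:
    """Return the owning team for a failing service, or a fallback if unlisted."""
    service = catalog.get(service_name)
    return service.owner_team if service else fallback
```

For example, `team_to_notify(catalog, "checkout")` returns `"payments-team"`, while a service missing from the catalog falls back to `"on-call"`, which is one reason to keep the catalog complete.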

Once your team establishes a service catalog, there are several ways to ensure that it matures with your organization. The ultimate goal is a fully fleshed-out service catalog that includes dependencies, owners, and links to operational documentation. You can start with the basics, then level up from there with enriched metadata and automations.
