True reliability takes into account all of the services that exist in your software environment — which is why it can get so complicated. An ecommerce site, for example, might have services that update current inventory in near real time, process payments in the shopping cart, trigger email receipts to send, kick off fulfillment orders, etc. And if one of these services isn’t operating at its best, that can mean money — and in some cases, customers — lost for the company.
There is a way to add accountability to these complex systems and empower the product and engineering organizations as a whole to think critically about their reliability: Service Level Objectives (SLOs).
What is an SLO?
An SLO, or service level objective, is a targeted level of service. It is measured by Service Level Indicators (SLIs) and underpins Service Level Agreements (SLAs), the contracts between a vendor and a customer that spell out the consequences if the objective isn’t met. For example, an SLO might target the percentage of orders successfully processed across a specified period (that percentage is the indicator), while the SLA dictates how the vendor compensates the customer if that number isn’t met through vendor error or downtime (that’s the agreement).
Other examples include:
Percentage of images that load correctly within 1 minute
Measurement of uptime for a site over the course of 24 hours
Percentage of credit card orders processed successfully over a specified time period
Really, what we’re measuring with an SLO is the functionality provided to the end user. Are they successfully able to do what they set out to do on your site or with your product? A “functionality provided” mindset can be a catalyst for a much more effective way of alerting our engineers at 2 a.m. Too often, we find ourselves alerting on-call engineers about high CPU, low memory, or disk capacity, all of which focus the engineer on the pain a computer is feeling rather than on the customer’s woes. Crafting good SLOs that keep the end user in mind can help minimize the background noise so your on-call engineers can focus only on impactful problems.
What makes a good SLO?
A good objective:
Is crystal clear and not up for interpretation. You should be able to create and measure it with proper indicators (which we’ll explore shortly).
Almost always revolves around your users and their needs, so indicators should measure the “discomfort” of your users while your objective should be the amount of user discomfort you are willing to tolerate.
Takes a time limit into account and almost always focuses on an objective you’re trying to meet as an organization.
Let’s use an example of a restaurant processing its orders on our platform for this post. If our alerts revolve around the action our customer is trying to take and a reasonable timeframe in which it should happen, we can change the perception of what an alert is all about. So in this case, an SLO might be considered violated if 2% of orders fail within a 5-minute timeframe, and we could write this as: "This objective is met if the failure rate of restaurant orders processed is less than 2%, measured every five minutes."
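As a rough illustration of how such an objective could be checked, the sketch below groups orders into five-minute windows and reports the windows that break the objective. It assumes the 2% threshold mentioned above and a hypothetical input format of (Unix timestamp, succeeded) pairs; the function and constant names are made up for this example.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60   # five-minute evaluation window from the text
MAX_FAILURE_RATE = 0.02   # assumed threshold: the 2% figure above


def violated_windows(orders):
    """Return the start times of windows whose failure rate breaks the SLO.

    `orders` is an iterable of (unix_timestamp, succeeded) pairs.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for timestamp, succeeded in orders:
        # Round the timestamp down to the start of its five-minute window.
        window_start = int(timestamp) // WINDOW_SECONDS * WINDOW_SECONDS
        totals[window_start] += 1
        if not succeeded:
            failures[window_start] += 1
    # The objective is met only while the failure rate stays below the threshold.
    return sorted(
        start for start, total in totals.items()
        if failures[start] / total >= MAX_FAILURE_RATE
    )


# First window: 1 failure in 100 orders (1%). Second window: 2 in 50 (4%).
sample = [(0, True)] * 99 + [(10, False)] + [(300, True)] * 48 + [(310, False)] * 2
breached = violated_windows(sample)
```

Here only the second window is flagged, which is the kind of signal worth paging someone about.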
Clearly, using this approach will involve more than just your engineering organization. You’ll want to take business goals into account, and that can include product goals, customer support, sales, your executive team, etc. This doesn’t have to be overly complicated though, especially if you’re new to setting SLOs. The key steps here are:
Determine what teams should be involved and create a squad to ensure representation.
Get alignment on SLOs based on inputs from the identified teams.
Set up a tool to accurately track and measure your SLOs.
Implement the SLOs, gather feedback and learnings, and revise as needed. This isn’t a set-and-forget process; as your business grows or your goals change, your SLOs might, too.
SLIs vs. SLOs vs. SLAs
SLIs, SLOs, and SLAs build on each other, in that order. Without indicators we can’t define objectives, i.e., what’s normal or abnormal. And without objectives, we can’t create agreements, which spell out what happens when we miss our objectives. To break it down:
An SLI is a clear metric that tells us how we are doing in relation to our SLO.
An SLO is the objective for a service — or functionality — we offer. When we violate this objective, we face the consequence defined by our SLA.
An SLA defines the consequences we suffer when we violate our SLOs.
SLOs vs. SLAs
SLOs let you define SLAs. It’s important to make sure you define the ramifications that will occur in the event of a breach of SLO. When you’re defining an SLA, step away from the engineering and product teams and into the world of sales, executives, and customers. Defining an SLA should have the customer as the primary focal point to ensure you create something that makes sense for both you and them.
For our use case, our leadership team decides that we’ll credit each restaurant its average revenue per minute multiplied by the total number of minutes we are in breach of our SLO. We can state this as:
"In the event of our online order system not meeting our SLO, customers are entitled to a reimbursement of their average revenue per minute, times the number of minutes of downtime."
SLOs vs. SLIs
SLIs let you evaluate whether you’ve missed an SLO. For example, let’s imagine we break our failure rate SLO for nine minutes. The restaurant has a monthly average revenue of $10,000 through our platform. If we use 43,800 as the number of minutes in a month, then the restaurant has an average revenue of about $0.23 a minute ($10,000 / 43,800 ≈ $0.2283). So for a nine-minute outage, we would owe our customer 9 × $0.2283, or about $2.05.
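The credit arithmetic can be sketched in a few lines. This is only an illustration of the calculation described above; the function name `sla_credit` is hypothetical, and the per-minute rate is kept unrounded until the end, so the result can differ by a few cents from hand arithmetic that rounds the rate first.

```python
MINUTES_PER_MONTH = 43_800  # figure used in the text (365 * 24 * 60 / 12)


def sla_credit(monthly_revenue: float, breach_minutes: float) -> float:
    """Credit owed: average revenue per minute times minutes in breach."""
    revenue_per_minute = monthly_revenue / MINUTES_PER_MONTH
    return revenue_per_minute * breach_minutes


credit = sla_credit(monthly_revenue=10_000, breach_minutes=9)
print(f"${credit:.2f}")  # prints $2.05
```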
Once you have an objective and agreement well defined and have agreed to it across the team (product, engineers, etc.), you are ready to create Service Level Indicators that inform you if you are meeting that objective or not. An indicator requires one thing: metrics. Metrics are not always the same as indicators, although sometimes they might be. Metrics should be queryable in a system like Grafana or SignalFx and be collected for every part of your application’s stack.
If nothing else, you should be collecting what’s known as RED metrics: Rate, Errors, and Duration. These are enough to create and measure a few different service level objectives. For example, if you have a web application where users are able to place orders online for a restaurant near them, your RED metrics would be:
Number of orders
Number of order errors (a 5xx HTTP response code)
How long each order request is taking to complete
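To make the three RED metrics concrete, here is a minimal in-memory sketch of recording them for order requests. The `RedMetrics` class is hypothetical and only for illustration; in practice these values would be emitted to whatever metrics system you already query.

```python
from dataclasses import dataclass, field


@dataclass
class RedMetrics:
    """In-memory RED counters for order requests (illustrative only)."""
    orders: int = 0   # Rate: how many order requests we saw
    errors: int = 0   # Errors: how many returned a 5xx status
    durations: list = field(default_factory=list)  # Duration: seconds per request

    def record(self, status_code: int, duration_seconds: float) -> None:
        self.orders += 1
        if 500 <= status_code <= 599:
            self.errors += 1
        self.durations.append(duration_seconds)


metrics = RedMetrics()
metrics.record(200, 0.31)
metrics.record(503, 1.20)  # a 5xx response counts as an order error
metrics.record(201, 0.27)
```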
Since we have agreed to an objective of restaurant orders processed, we can figure out which metrics to measure to accurately capture that objective. In this case, we can create an SLI called “order failure rate”.
Failure rate is simply the total number of errors divided by the total number of orders. For example, if our restaurant application receives 500 orders and 10 of them fail, we have an order failure rate of 2%.
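As a small sketch, the indicator can be computed directly from the two counts. The function name is hypothetical, and treating a period with zero orders as a 0% failure rate is an assumption made here, not something the definition above prescribes.

```python
def order_failure_rate(failed_orders: int, total_orders: int) -> float:
    """Return failed orders as a fraction of all orders."""
    if total_orders == 0:
        return 0.0  # assumption: no orders means nothing failed
    return failed_orders / total_orders


rate = order_failure_rate(failed_orders=10, total_orders=500)
print(f"{rate:.1%}")  # prints 2.0%
```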
Managing your SLOs effectively
By including the right teams from the get-go, you can avoid many of the biggest challenges organizations encounter in setting their SLOs, like understanding how metrics map to company goals and achieving alignment on objectives. These objectives are also a great way to define alerts and pages for your teams, because they measure user impact in a clear way that responders can act on quickly.
In this age of post-digital transformation, it’s not really a matter of if a service will go down, but when. By setting ownership with a service catalog and keeping it maintained, understanding and implementing best practices for incident management, and having runbooks ready for when an incident happens, you can minimize the impact an incident will have on your objectives.
Although that might sound like a lot, you don’t have to build out the process yourself. Incident management platforms like FireHydrant can provide end-to-end incident management and a sound framework for getting you from alert handoff to resolution quickly by unifying your team, providing reliable consistency, and integrating with the tools you already use.