Reliability is not an engineering metric

If you're an engineer reading this, you might be wondering what I mean by the title. You might be a Site Reliability Engineer whose primary responsibility is to maintain the reliability of your company’s product/solution. You might be a software builder, a programmer responsible for building new capabilities and shipping them to production.

All of these roles are important for any business to remain competitive. The website needs to be available for people to use (and trust you), and it requires constant enhancements to its functionality for users to stay long term. But here's the catch: the engineering team isn't the sole owner of reliability. Being "reliable" is the responsibility of everyone at the company (spooky quotes intentional).

The Customer is Always Right

There's been an interesting trend in software over the last few years, and we've had a front-row seat for it here at FireHydrant. Reliability is a hot topic at nearly every company we talk to, and predominantly, it's the engineering team that wants to improve it. However, I think it's time we start thinking about what reliability really is.

If Netflix goes down, no one in the world is saying, "Netflix's engineering team is having an outage." Instead, they'll usually say, "Netflix is down." That's it. No assignment to a team, no "partially degraded" talk, nothing. Right now, Netflix simply doesn't work. (No shame to Netflix here; it's just that everyone knows who you are.)

Reliability isn't a dial on a dashboard you get to define; it's defined by your customers. If your customers don't think you're reliable, you're not. If a customer comes to you and says, "I can't check out my shopping cart right now," and you show them a dashboard that says, "Our uptime metric says we're ok right now," said customer will not be happy with that response.

Defining Services

We have a problem in the engineering world right now: services. Depending on who you are, you might've read the word "services" as:

Microservices
Product Functionalities
Customer Success or Support
Professional Services
Maybe even the restaurant service industry

Dictionary.com has over 30 definitions for the word service. And therein lies the problem: we have an overloaded term wreaking havoc on how we talk about reliability at a company level. I can't go to my sales team and say, "our services are unreliable lately"; they'll think I'm talking about professional services, not our L7 load balancer.

We can see this manifesting in the usage of service level objectives (SLOs). Engineering teams are tying service level objectives directly to the applications they build and run in their stack. While this isn't the worst thing in the world, it raises the question: does it really matter if a low-level service is issuing auth tokens at a 99.999% success rate, or does it matter that a user can log in and their session is persisted?

We'd be better off fastening our level of service to product functionalities, not to how fast or completely the bits go from one end of the wire to the other. Our objectives should be directly aligned with what a customer would complain about (what I like to call "pitchfork alerting"). When they click "Place Order," the customer doesn't care if there are 50 services under the hood performing tax calculation, credit card processing, and shipping label creation. They simply care that the next page says that their order has been confirmed. A functionality not performing the way a customer expects it to is the only thing we should be measuring and paging an on-call engineer about.
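As a rough illustration of that idea, here is a minimal Python sketch of an objective attached to a customer-facing functionality rather than to an individual service. The `FunctionalityObjective` class, the `should_page` function, and the 99% target are all hypothetical names and numbers chosen for the example, not anything FireHydrant ships.

```python
from dataclasses import dataclass


@dataclass
class FunctionalityObjective:
    """A hypothetical objective tied to something a customer does."""
    name: str      # e.g. "Place Order", as the customer would describe it
    target: float  # minimum acceptable fraction of successful attempts


def should_page(objective: FunctionalityObjective, attempts: int, successes: int) -> bool:
    """Return True when customers are failing more often than the target allows.

    Only the end result counts (did the order get confirmed?), not the
    health of the individual services behind it.
    """
    if attempts == 0:
        # No customer attempts in the window means no customer pain to page on.
        return False
    return successes / attempts < objective.target


# Assumed numbers for illustration: 1,000 order attempts, 980 confirmed.
place_order = FunctionalityObjective(name="Place Order", target=0.99)
page_on_call = should_page(place_order, attempts=1000, successes=980)
```

The design choice worth noting is that nothing in the objective names a service: tax calculation, card processing, and label creation only matter here through whether the order was confirmed.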

This raises the question: should we call them "Functionality Level Objectives" or even "Happiness Level Objectives" instead?

Engineering doesn't own reliability.

Engineering indeed owns many things that make a system reliable, but customers view reliability as a business metric.

Let's take Fastly, which experienced a global outage this year, as an example. For about an hour on June 8th, 2021, people worldwide couldn't access thousands of websites that used Fastly as their CDN. A global outage like that involves every team at a company, such as:

Engineering - Incident response teams mitigating the outage.
Marketing - Managing public relations and responding to journalist inquiries.
Legal - Looking into contracts and SLA obligations that were negotiated, bracing for potential lawsuits.
Sales - Replying to their accounts that are reaching out, smoothing over relationships. Rescheduling demos that day.
Customer Support - Probably hit the hardest with an influx of ticket volume; they need to reply quickly and efficiently.
People & HR - Helping folks internally with the stress, potentially offering days off and care packages.

It takes the entirety of a company to be reliable. People view reliability as something a company provides, not the engineering organization alone. Our customers say, "We use FireHydrant," not "We use the features FireHydrant's engineering team built."

Where to now?

It's my opinion that you should take what your customers think about your reliability more seriously than any metric on your dashboards. Luckily, there are likely ways you can collect this data today.

Talk to your solutions engineering team.

FireHydrant has a solutions engineering team, and they act like customers every day when giving demos of our product. There was a point where a solutions engineer felt uneasy demoing a particular part of our product because it sometimes behaved in unexpected ways. There were no errors in our logs and no exceptions in our Bugsnag account, but he felt it was unreliable, so it was.

Tag support requests to product areas

Our customer success and support team tags every ticket we receive with the product area it's associated with. Doing so allows us to quickly see which parts of our product give customers the most problems, directly mapping our customers' view of reliability (enough to file a ticket) to our product's functionalities.
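If your ticketing tool can export tickets with their tags, a few lines are enough to get that ranking. This is a minimal sketch that assumes each ticket is a dictionary with a `product_area` key; the field name, the `rank_product_areas` helper, and the sample areas are made up for illustration and don't reflect any particular support tool.

```python
from collections import Counter


def rank_product_areas(tickets):
    """Count tickets per product area, most-reported area first."""
    counts = Counter(ticket["product_area"] for ticket in tickets)
    return counts.most_common()


# Hypothetical export from a support tool.
tickets = [
    {"id": 1, "product_area": "checkout"},
    {"id": 2, "product_area": "login"},
    {"id": 3, "product_area": "checkout"},
]

ranking = rank_product_areas(tickets)
```

The areas at the top of the ranking are where customers feel the product is least reliable, whatever the dashboards say.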

Ask your product team.

Your product team likely talks to customers all the time, and I would bet my future farm that they hear customer complaints about random things around your product. A poor user experience will ultimately be mapped to reliability for a customer.

Reliability is a business metric.

Maybe instead of measuring our uptime as our reliability metric, we should calculate how much of our available time the company is spending on unexpected customer pain. From engineering to sales, marketing, and customer support, it all rolls up to the one thing our customers want: a consistently reliable experience.
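One way such a number could be computed is sketched below. It assumes each team can report the hours it spent on unexpected customer pain and the hours it had available in the same period; the `customer_pain_ratio` function, the team names, and the hours are all hypothetical.

```python
def customer_pain_ratio(pain_hours_by_team, available_hours_by_team):
    """Fraction of the company's available time spent on unexpected customer pain.

    Both arguments map a team name to hours in the same period.
    Teams that reported no pain hours count as zero.
    """
    total_available = sum(available_hours_by_team.values())
    if total_available == 0:
        raise ValueError("available hours must be greater than zero")
    total_pain = sum(
        pain_hours_by_team.get(team, 0) for team in available_hours_by_team
    )
    return total_pain / total_available


# Assumed figures for one week.
pain = {"engineering": 20, "support": 30}
available = {"engineering": 400, "support": 200, "sales": 400}

ratio = customer_pain_ratio(pain, available)  # 50 pain hours out of 1,000 available
```

Because the ratio sums across every team, an outage that costs support and sales a week of cleanup shows up even if engineering fixed it in an hour.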