There’s no better time than now to dedicate effort to reliable software.
Say what now?
If it wasn’t apparent before, this past year made it more evident than ever: people expect their software tools to work every time, all the time. The shift in how end users think about software was inevitable as everyday applications entered our lives, much the way water and electricity entered our homes. A global pandemic accelerated that expectation as our lives moved predominantly online for entertainment, saying hello to loved ones, and doing our jobs.
Customers want reliable software.
If you’re reading this, there’s a great chance an outage has disrupted your life in one way or another. It could be the hours-long Slack, Gmail, or Zoom outages. Or trying to watch a show on Disney+ or Hulu. Or when market volume skyrocketed, taking down several brokerages like TD Ameritrade and Robinhood.
Website blips and outages had a level of empathy associated with them in the internet’s early days. Dial-up connections would be disconnected by someone merely picking up the phone. Servers would get turned off on weekends for government websites. Restarting our computers was a common remedy that was largely tolerated. Those days are behind us, and ahead is a new expectation from people using our services: they expect them to be reliable.
The software industry is wising up to the fact that incident response already has precedents. Google’s incident management process is modeled largely on FEMA’s disaster response processes in the United States. Remediation runbooks for restarting servers take cues from the airline industry, with quick actions operators can perform. Software teams perform incident retrospectives and publicize their findings to their customers and the world, much like airline disaster investigations have been published for years.
Customer trust is sacred, and poor reliability is one of the top ways to erode that trust.
Operators want reliable software.
Not once in my career have I been paged and felt excited to open my laptop at a restaurant, though I’ve had to several times. Reliable software means software engineers can stay focused on company initiatives such as shipping new products or improving internal development processes. When reliability becomes an afterthought, the implications are wide-reaching. For example, software engineers can easily fall into the trap of happy-path-only development, ignoring the possibility of failure entirely.
Over time, disruptions caused by a lack of reliable software push an engineering team into a brutal interrupt-driven development phase. Interrupt-driven development is also the easiest way to burn out an entire engineering team. When burned-out engineers suddenly depart, an organization can overhire to cover lost ground. Hiring at a breakneck pace can shift the company culture, prompting even more employees to leave. It’s a vicious cycle.
If you write software or manage a team that writes software, try performing proactive reviews of software that is about to be deployed. Ask questions like “how can this potentially break?” and “what is a potential side effect of this change?” Ask engineers to add their answers to pull requests, design docs, and release notes. Incident retrospectives are a lagging indicator of how people thought the system would behave; a few questions asked up front can often reveal problems before they become problems.
How do we achieve reliable reliability?
Reliability is an infinite game. There are no agreed-upon rules, no time limit, no teams to play against. Reliability is what you want it to be, and it’s up to your team and leadership to decide the level of reliability you want to achieve. There’s no panacea for software reliability; it’s what you do every day, week, and month that creates it. It’s the simple consistency of caring that drives reliability.
With that said, here are a few ways to shift a team into first gear on the road to site reliability.
Leverage your incident retrospectives and put your learnings into action
As a starting point, analyze and learn from incidents through retrospectives (p.s. drop the term “postmortem,” no one died). Take the time to understand what everyone was doing to resolve the incident and why they were doing it, and under no circumstances blame anyone for an outage. Primarily, you’re trying to understand how an incident came to be. By identifying multiple contributing factors rather than a singular root cause, you can spend time in each of the identified areas and plan actions accordingly.
While retrospectives are a great way to understand why something failed, it’s almost more important to know why things work at all. If our SLAs say we’re available 99.99% of the time, why do we spend our time combing through the 0.01% for ways to be more reliable? Read the paper on Safety-II for more on this idea. I also loved being in the audience for this talk by Ryan Kitchens.
Push the limits with Chaos engineering
Trying to “break things on purpose,” a Gremlin mantra, is a great way to build resilience and reliability. Intentionally causing outages is the vaccine: your team gets to react under controlled conditions and build better software. Chaos engineering is one of my favorite ways to engineer better systems.
I also believe chaos engineering practices apply to more than just breaking software; it should include breaking processes, too. Watch Incident Ready: How to Chaos Engineering Your Incident Response Process to see what I mean.
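To make the idea concrete, here is a minimal, toy fault-injection sketch. Everything in it is hypothetical: real chaos tools like Gremlin operate at the network or host level, but the principle of deliberately degrading a dependency to see how your system copes is the same.

```python
import random
import time
from functools import wraps

def inject_latency(probability=0.1, delay_seconds=2.0):
    """Wrap a function so a fraction of calls are artificially slowed.

    A toy fault injector for illustration only; production chaos tooling
    injects faults at the infrastructure layer, not in application code.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < probability:
                time.sleep(delay_seconds)  # simulate a slow dependency
            return func(*args, **kwargs)
        return wrapper
    return decorator

# Hypothetical service call, always slowed here so the effect is visible.
@inject_latency(probability=1.0, delay_seconds=0.1)
def fetch_profile(user_id):
    return {"id": user_id, "name": "example"}
```

Running an experiment like this during business hours, with the team watching dashboards, turns "what happens if this dependency gets slow?" from a guess into an observation.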
Enable your teams to quickly turn things on and off with release & feature flags
Feature flags aren’t a new idea, but there are more reasons than ever to start using them in your software stack. Complex systems and deploys sometimes need a light switch to make something go dark in the event of a failure. I suspect this is why LaunchDarkly chose the name they did. FireHydrant uses LaunchDarkly extensively in nearly every new feature we build to roll them out to our customers slowly.
I’ll give you an example of a real-life feature flag: airplane spar valves. If an engine shutdown fails, pilots can close a spar valve completely, cutting off fuel to the engine and allowing it to starve and shut down. I imagine we’ll start seeing feature flags and release flags used more frequently as circuit breakers in complex applications, the same way pilots use spar valves to turn off engines in dire situations.
Know what’s in your complex system by cataloging your service
In any sufficiently complex system, the number of running applications will surpass an individual’s cognitive ability to keep track of them all. The inner workings of dozens, hundreds, or thousands of microservices exceed our brainpower. It’s essential to have an “address book” of all the services you rely on. A Yellow Pages (remember those?), if you will, so we know where each service is, who to call when it degrades, and what its purpose is. Document what is out there, what each service does, who owns it, and who knows the most about it (hint: these are commonly not the same people).
Dispatchers are a critical part of the response process for getting the right fire station to the emergency as fast as possible. Dispatchers have a catalog of battalions, fire stations, and the neighborhoods they serve. Without knowing who is where and where emergencies are, dispatching emergency services would be significantly less efficient.
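Here is what a minimal "address book" entry could look like. This is a sketch under my own assumptions: the service name, team, contact, and runbook URL are all invented, and a real catalog would live in YAML files or a service-catalog tool rather than a Python dict.

```python
# Hypothetical service catalog; every field below is illustrative.
SERVICE_CATALOG = {
    "billing-api": {
        "description": "Creates invoices and charges payment methods",
        "owner": "payments-team",        # who is accountable for it
        "expert": "alice@example.com",   # who knows it best (often not the owner!)
        "runbook": "https://wiki.example.com/runbooks/billing-api",
    },
}

def lookup_service(name):
    """Dispatcher-style lookup: find who to call when a service degrades."""
    entry = SERVICE_CATALOG.get(name)
    if entry is None:
        raise KeyError(f"{name} is not cataloged; add it before it pages you")
    return entry
```

Even a flat file with these four fields per service beats reconstructing ownership from memory at 3 a.m. during an incident.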
Have a plan and a central knowledge base with simple runbooks
Nearly every company I’ve worked for has had an alert configured to fire when a database approaches its maximum disk capacity. After some time, the alert inevitably fires, and everyone trades a collective “ok, what now?” look. Simple runbooks are an excellent way for a team to develop the beginnings of shared knowledge, breaking down the tribal knowledge that likely exists and is only revealed in a crisis.
Runbooks are nothing more than checklists, which Atul Gawande’s book The Checklist Manifesto: How to Get Things Right covers extensively. With simple checklists, ICUs in Michigan saw infection rates fall by 66%. Within 18 months, those simple checklists saved an estimated 1,500 lives.
These simple lists work, and we can achieve even more reliable software with them.
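A runbook for the disk-capacity alert above really can be this small. The steps here are illustrative assumptions, not a prescription for any particular database; the point is that the list exists and is findable before the alert fires.

```python
# Hypothetical runbook for a "database disk nearly full" alert.
DISK_CAPACITY_RUNBOOK = [
    "Confirm which database host fired the alert",
    "Check current disk usage and its growth rate",
    "Prune old logs or archived WAL files if safe to do so",
    "Expand the volume or add capacity",
    "Verify the alert clears and note findings in the incident channel",
]

def print_runbook(steps):
    # Render the checklist the way a responder would tick through it.
    for i, step in enumerate(steps, start=1):
        print(f"[ ] {i}. {step}")
```

Storing runbooks as data like this also makes them easy to surface automatically in the alert itself, so "ok, what now?" has an answer attached.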
Reliability is hard to measure, but implementing service level objectives is a great way to start. My recommendation: create SLOs that don’t measure CPU or memory usage but instead measure customer happiness. I fear most on-call engineers have been paged for benign problems such as “memory is at 80%.” By alerting on SLOs that measure customer experience, not computer vitals, we target a better sense of what reliability in our system truly means.
As for a book recommendation on implementing practical service level objectives, I highly recommend Implementing Service Level Objectives by Alex Hidalgo.
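To show the difference between customer-facing SLOs and machine vitals, here is a sketch of an availability SLO computed from request outcomes. The "good request" thresholds (non-5xx status, under 300 ms) and the 99.9% target are my own illustrative assumptions, not values from the book.

```python
# Hypothetical SLO: 99.9% of requests should be "good" over the window.
SLO_TARGET = 0.999

def is_good(request):
    # Good = the customer got a successful, fast response.
    return request["status"] < 500 and request["latency_ms"] < 300

def slo_report(requests):
    good = sum(1 for r in requests if is_good(r))
    total = len(requests)
    availability = good / total
    # Error budget: fraction of allowed bad requests already consumed.
    allowed_bad = (1 - SLO_TARGET) * total
    bad = total - good
    budget_consumed = bad / allowed_bad if allowed_bad else float("inf")
    return {"availability": availability, "budget_consumed": budget_consumed}
```

Paging when the error budget burns down quickly, rather than when memory crosses 80%, means every page corresponds to customers actually having a bad time.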
Every day is a chance to be more reliable.
When striving for reliable systems, I urge you to remember that shavings make a pile. There’s no singular action you can take to build reliable software. Reliability is a habit that everyone must buy into and practice every day.