3 mistakes I’ve made at the beginning of an incident (and how not to make them)

The first few minutes of an incident are often the hardest. Tension and adrenaline levels are high, and if you don’t have a well-documented incident management plan in place, mistakes are inevitable.

It was actually the years I spent managing incidents without the right tools in those high-tension moments that inspired me to build FireHydrant. I built the tool I wished I’d had when I was trying to move fast at the start of incidents.

Let’s look at three mistakes I’ve made in those stressful first moments of an incident, and discuss how you can avoid making them.

Mistake 1: We didn’t have a plan.

A production database got dropped. An engineer thought they were connected to their local machine, but they were actually connected to prod when they ran a command to drop a database. Big oops. It took down the whole product; customers couldn’t do anything at all. And, of course, this was at 5:30 p.m. on a Friday, so everyone was on their way to happy hour as the pages started going out. On top of that (oh yes, it gets better), our office was having off-hours network maintenance done at 6 p.m., so the internet was about to go out. Chaos ensued.

Everyone was having their own version of a panic attack and speculating, “Did we get hacked? How is this possible?” We found a small conference room on another floor of the building and crammed 14 people into it. There were people doing wall squats with their laptops on their knees and people holding a laptop in one hand and typing with the other, while the C-levels frantically texted engineers who were already in the room. It was mayhem.

And that didn’t even include the folks not in the building. This incident had a huge blast radius, so everyone was getting paged. We had disparate remote engineers unknowingly doing work that was already being done by someone else. There was confusion over domain expertise and tiffs about what the right move was.

The outage lasted several hours. We made so many mistakes: we didn’t start a war room, we didn’t name an incident commander, we didn’t timestamp, we didn’t communicate, we didn’t have a task list, and we didn’t declare ownership in advance. It all comes down to: We didn’t have a plan.

At that time, we didn’t have an incident management tool to automate and centralize all the information and communication we needed to be successful. But there are foundational incident management processes that could’ve been put in place that just weren’t there.

So how do you not be me and that team? At a minimum, you should set and document a clear process, declare service ownership by product area, and run a drill in order to identify deficiencies in your plan. An even better move would then be to automate all of that. That way, when you get paged while on your way to margaritas on a Friday after work, you can trigger an incident that spins up tickets, gets everyone in the same Slack channel, and gets communication going while you get to work on remediation.
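As a rough illustration of what "automate all of that" could look like, here is a minimal sketch that turns the gaps listed above (no commander, no timestamps, no task list, no declared ownership) into a single function call. Everything in it is hypothetical: the `SERVICE_OWNERS` map, the team names, and the task wording are placeholders, and a real setup would call your paging, ticketing, and chat tools instead of building strings.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical ownership map (product area -> owning team), declared in advance.
SERVICE_OWNERS = {
    "checkout": "payments-team",
    "database": "platform-team",
}


@dataclass
class Incident:
    summary: str
    commander: str
    affected_areas: list[str]
    timeline: list[tuple[datetime, str]] = field(default_factory=list)
    tasks: list[str] = field(default_factory=list)

    def log(self, message: str) -> None:
        # Every entry is timestamped so the timeline can be reconstructed later.
        self.timeline.append((datetime.now(timezone.utc), message))


def declare_incident(summary: str, commander: str, affected_areas: list[str]) -> Incident:
    """Create an incident with a named commander, a timeline, and a starter task list."""
    incident = Incident(summary, commander, list(affected_areas))
    incident.log(f"Incident declared: {summary}")
    incident.log(f"Incident commander: {commander}")
    for area in affected_areas:
        # Areas without a declared owner are flagged instead of silently skipped.
        owner = SERVICE_OWNERS.get(area, "unowned (escalate)")
        incident.tasks.append(f"Page {owner} for {area}")
    incident.tasks.append("Open a single war-room channel and post the link")
    incident.tasks.append("Send first status update to stakeholders")
    return incident
```

Calling `declare_incident("prod database dropped", "alex", ["database", "search"])` would produce a task to page the hypothetical platform team, an escalation task for the unowned `search` area, and two timestamped timeline entries. Running a drill against something like this is a cheap way to find the areas that still have no owner.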

Mistake 2: We weren’t production ready.

On to the next incident! At this company, I had moved into more of a site reliability engineer role (I was their Shaq). We updated our deployment pipelines to use Spinnaker. Because a lot of big companies used it, we assumed it was production ready and perfect for our situation too. We set it up so all deployments and the data about them were stored in Redis, and that’s how Spinnaker knew the current state of each deploy. And since that wasn’t complex enough for us, we then used Jenkins to run our tests and build deployable artifacts; when the build turned green, it would launch a deployment pipeline in Spinnaker.

Then one day, Redis died on us, which meant Spinnaker lost all context for what it should be doing and, crucially, what it had done in the past. It didn’t even throw errors. We found out there was a problem because our customer support team notified us that the fonts looked different on our website. Then came alerts along the lines of, “Hey, where’d this page go?” Because Spinnaker had no idea that it had already executed a deployment pipeline for a successful build three months ago, it kicked off thousands of deployments. The entire website was reverted to one that was three months old. Chaos ensued, and we just turned Spinnaker off. We literally just cut the power to it, then manually deployed the website’s current version.

Our mistake? We didn’t have production readiness. We didn’t even know what that meant for that particular service, for Spinnaker, and that was because we didn’t step back and think about what “production ready” meant for us specifically.

If we’d sat down for a day in a room and talked about all the components involved in using Spinnaker for deployments and all the ways it could fail, if we’d made sure we had production-readiness checklists for every component, we would’ve eventually gotten to Redis. And we would have put in place a backup version that could be switched to in an incident. It’s actually an easy solution … we just didn’t take the time to get there.

There is so much value in tiering services. We would’ve likely labeled Spinnaker and Redis as Tier 1 services, which would’ve called for a different level of production-ready rigor. It’s part of why FireHydrant’s service catalog now has production-ready checklists that can be added to services based on tiers.
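To make the tiering idea concrete, here is a small sketch of tier-based readiness checklists. The tiers and checklist items are invented for illustration and are not FireHydrant’s actual checklists; the point is only that a Tier 1 service has to clear more items than a Tier 3 one.

```python
# Hypothetical checklists per tier; lower tier number means more critical.
CHECKLISTS = {
    1: [
        "owner assigned",
        "backup or failover for every datastore",
        "failure modes documented",
        "alerting on dependencies",
        "recovery drill run",
    ],
    2: [
        "owner assigned",
        "failure modes documented",
        "alerting on dependencies",
    ],
    3: ["owner assigned"],
}


def missing_items(tier: int, completed: set[str]) -> list[str]:
    """Return the checklist items for this tier that are not done yet, in order."""
    return [item for item in CHECKLISTS[tier] if item not in completed]


def is_production_ready(tier: int, completed: set[str]) -> bool:
    """A service is ready only when nothing on its tier's checklist is missing."""
    return not missing_items(tier, completed)
```

With a setup like this, a deployment tool backed by a single datastore and labeled Tier 1 would fail the check until the "backup or failover for every datastore" item was completed, which is exactly the conversation we never had about Redis.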

Mistake 3: We fell down a cognitive tunnel.

Cognitive tunneling is a very common, very human thing to do. You rely on what worked in the past (i.e., a narrow set of data) and concentrate on it so much that you lose the ability to step back and consider other possibilities.

At this company, like so many, we employed code freezes from the November/December holiday season into the first couple of weeks of January. (For the record, I’m very anti-code freeze, and this incident might be why.) We were many weeks into the moratorium when the product just went down, turned off entirely.

We were all thinking, “How did this happen? What deployed? We’re in a moratorium, so nothing! So why is the site down?” We started flailing around because we had never experienced that type of incident before, so our default strategy of looking at recent deploys didn’t apply.

We went down the path of looking at traffic, and eventually we noticed a weird pattern in Datadog: memory looked like the peaks and valleys of an EKG readout right up until the moratorium. It turned out that we had a pretty bad memory leak in the Rails app, and it looked like an EKG because we only deployed on Tuesdays and Thursdays. Every time we deployed, memory would drop. Once we stopped deploying, nothing reset it, and memory kept climbing. After three weeks, it maxed out, the database started swapping, and the Rails app crashed due to a request queue overflow.
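A sawtooth like this can be caught before a freeze by looking at the slope between resets rather than the absolute level. The sketch below is one assumed way to do that: it fits a straight line to memory samples taken since the last restart and estimates how long until a limit is reached. The sample values and the limit are made up, it uses no monitoring vendor’s API, and real leaks are rarely perfectly linear, so treat the result as a rough warning signal.

```python
from __future__ import annotations


def hours_until_limit(samples: list[tuple[float, float]], limit_mb: float) -> float | None:
    """Estimate hours until memory reaches limit_mb.

    samples: (hours_since_last_restart, memory_mb) pairs, oldest first.
    Returns None when there is too little data or memory is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Ordinary least-squares slope, in MB per hour.
    slope = sum((t - mean_t) * (m - mean_m) for t, m in samples) / var_t
    if slope <= 0:
        return None
    # Project forward from the most recent sample at the fitted growth rate.
    _, last_m = samples[-1]
    return max(0.0, (limit_mb - last_m) / slope)
```

For example, samples of `(0, 1000)`, `(24, 1240)`, and `(48, 1480)` grow at 10 MB per hour, so with a hypothetical 4000 MB limit the estimate is 252 hours, or about ten and a half days. A twice-weekly deploy cadence hides that; a multi-week freeze does not.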

The cause ended up being a bug in Rails that had already been patched, but we were behind several versions. It was creating new prepared statements anytime someone searched for something by date, and these new instances drove up memory. We realized all of this after we’d already said screw it and restarted the app — which is not something you want to do in a moratorium for a variety of reasons, including potential SOC compliance violations for going around a well-defined control.

The mistake was that we were so locked on to incidents of the past that we couldn’t adjust our strategy for a new one. We didn’t have the right mindset going into it. Once you’re in the cognitive tunnel, it’s so hard to get out. What could’ve been a couple-minute glitch ended up being a several-hour incident.

Avoid this trap by asking yourself once in a while, “Am I cognitive tunneling right now?” You could even consider adding a runbook step that prompts the incident commander at intervals to check on the team and make sure they’re not down a tunnel.

Even for a lot of severe incidents, the fix is usually not that complex. There are plenty of exceptions, of course, but for a large number of incidents, “restart or roll back” will do the trick. So if you’re down a rabbit hole of investigation for more than 30 minutes, it’s time to do a mental reset. Getting out of the tunnel sooner would’ve solved this particular problem and almost every other incident in my career that lasted over an hour.
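The 30-minute reset can be built into a runbook as a timed prompt for the incident commander. Here is a minimal sketch of that check, assuming you already record when the investigation started; the function names and the message wording are illustrative, and the 30-minute default simply mirrors the rule of thumb above.

```python
from __future__ import annotations

from datetime import datetime, timedelta


def prompts_due(started: datetime, now: datetime,
                interval: timedelta = timedelta(minutes=30)) -> int:
    """Return how many reset prompts should have fired since the investigation began."""
    elapsed = now - started
    if elapsed < timedelta(0):
        return 0
    # Dividing one timedelta by another with // yields a whole number of intervals.
    return elapsed // interval


def tunnel_check_message(started: datetime, now: datetime,
                         interval: timedelta = timedelta(minutes=30)) -> str | None:
    """Build the prompt for the incident commander, or None if no prompt is due yet."""
    if prompts_due(started, now, interval) == 0:
        return None
    minutes = int((now - started).total_seconds() // 60)
    return f"{minutes} minutes in: are we tunneling? Would a restart or rollback fix this?"
```

A bot or a scheduled job could call `tunnel_check_message` every few minutes and post the result to the incident channel whenever it is not `None`, so the reset does not depend on anyone remembering to ask.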

Conclusion

For most people, the hardest time to think logically and methodically is during an emergency. That’s no poor reflection on us as individuals; it’s our evolutionary fight, flight, or freeze reaction going into overdrive. The best thing we can do is be prepared for situations like these. The fewer rote decisions you have to make and the less context shifting you have to do, the faster you can remediate and get to making sure it doesn’t happen again.