Failover Conf Wrap-up

Failover Conf was held on April 21, 2020, online. The folks at Gremlin came up with the idea of a virtual conference about reliability after many in-person conferences started being postponed or canceled due to COVID-19. The conference was a lot of fun to attend. I’ll be sharing some of my thoughts on the event and the talks I was able to catch. I’ll link to videos of the individual talks below, or you can watch the playlist on YouTube. Disclosure: my employer, FireHydrant, was a sponsor of the event.

I missed almost all of the first talk due to some initial problems with the conference hosting platform. I tried to join the stream with no luck, even though I knew folks who were connected. I wasn’t surprised, as I knew there were over 7000 registrants, and that’s a lot of people trying to join at once. Eventually, someone in the conference Slack figured out that you could join by selecting the second talk from the schedule and clicking the link to join it, which might mean there was some rate limiting based on the individual talk slots. I was bummed to miss Tammy Butow’s talk on Chaos Engineering, but this was one of the few technical blips I experienced during what was a very well-run event.

Up next was one of my favorite people, Matty Stratton, or Marty as my phone’s autocorrect likes to call him. Matty’s talk was about healing organizational trauma, and it resonated with me a lot (view the video). People tend to discuss trauma inflicted on individuals, but organizations can carry trauma too. I once worked in a shop that didn’t ship new features for over a year because of stability and process problems. It was a place where people routinely threw each other under the bus because they were scared of being punished for making mistakes.

Matty used the analogy of a zebra being attacked by a predator. The zebra will play dead and then shake off the event afterward, but humans can’t just shake off trauma. Organizations can operate in fight-or-flight mode, or even freeze.

I also appreciated this point that Matty made:

I think that telling and listening to stories is a basic human need, and good use of stories in incident reviews/retrospectives can help people learn.

Up next was Jennifer Petoff from Google, talking about how Google trains its SREs (view the video). I hadn’t seen Jennifer speak before and I really enjoyed her talk. I think there’s not nearly enough focus on training for DevOps and SRE folks, including training for on-call. Jennifer surveyed the attendees during her session, asking how people had learned in their jobs. Many of the attendees reported that they had learned more through “sink or swim,” or self-study, than through detailed onboarding. That’s been my experience in the industry as well. Jennifer said that the specifics of your org help determine how much to invest in training:

My favorite part of her talk was the acronym ASSBAT, which stands for A Student Should Be Able To. It’s a great way of defining educational goals.

One cool part of the conference was the visual drawings of the talks by Mind’s Eye Creative. I like seeing these at in-person events, but I think they are even more useful at remote events, where people might have to step away for a bit during a talk. Here’s the one for Jennifer’s talk:

You can find more of them on the Gremlin Twitter account.

The next speaker was Gunnar Grosch from Opsio, who has emerged as one of the experts on doing chaos engineering with serverless (view the video). Gunnar speaks a lot on this topic and has done a lot for the chaos engineering community. In his talk, Gunnar did a great job explaining why we should do chaos engineering:

He also touched on some of the differences in doing chaos engineering for serverless.

And Gunnar said something that still has me grinning:

He wrapped up with a great demo.

Resilience Engineering is becoming a much bigger topic in DevOps and SRE, and Amy Tobey from Blameless did a great job explaining a lot of the key concepts (view the video). One of those is that the systems we build and operate are sociotechnical systems:

Another big Resilience Engineering concept is adaptive capacity.

Amy also discussed topics that are very important right now, like cognitive load, common ground, and joint-cognitive systems.

I’m familiar with a lot of these concepts but still learned some things.

Up next was Taylor Barnett from Transposit, talking about automation (view the video). Taylor spoke about what she called human-in-the-loop automation:

Her hand-drawn slides were adorable. Taylor also gave a shoutout to a paper that’s a favorite of mine and many other people: Ironies of Automation by Lisanne Bainbridge.

If you haven’t read it, I strongly recommend it. Direct PDF link is here.

And she talked about mental models, a critical topic.

This was a great talk.

A few times during the day I skipped talks because my brain was getting full. As much as I liked the event, I did wish there had been a longer break or two in the schedule, like a traditional lunch-sized break. At the same time, I can probably guess some of the reasoning behind not doing that. Attendees were in different time zones around the world, so there wasn’t one clear lunchtime. And I have a feeling the organizers anticipated people coming and going during the day. I’m sure many people did what I did and skipped a talk or two to get some time to decompress a little.

There were some shorter breaks between talks during the day, where videos from sponsors played (FireHydrant’s is here), and people met one-on-one in icebreaker sessions to discuss questions that were posed by the sponsors. I found the icebreaker session I participated in to be a lot of fun. There was also a #hallway-track channel in the Slack for more random conversations and Q&A Slack channels for each talk. I think the conference organizers did a great job of finding ways for people to connect at a virtual event.

The next talk I saw was Danyel Fisher and Liz Fong-Jones from Honeycomb talking about pitfalls in measuring SLOs (view the video). It was a great topic that I was pleased to see addressed. Unfortunately, Danyel had some audio problems, but there was a lot of great information in the talk. I’m happy that I caught it. Measuring SLOs is something that I think is easy to understand at a very high level, but can be more difficult when you get into the specifics. One thing I loved about this talk was Liz and Danyel sharing details from some of their incidents, and also about their SLOs.

Liz worked as an SRE at Google for many years, and she’s one of the main experts on SRE practices in my view. I’m always happy to see her talk about these topics. I thought this concept was very important:

(If you’re interested in SLOs, I recently interviewed Alex Hidalgo, who is writing a book about them for O’Reilly Media.)

Heidi Waterhouse from LaunchDarkly did a very interesting presentation called “Y2K and Other Disappointing Disasters: Risk Reduction and Harm Mitigation” (view the video). This is one of the sessions I was looking forward to the most, as Heidi is a fantastic speaker. I was also in charge of Y2K auditing and remediation for my Ops team at WebMD in 1999, so it’s an example that resonates with me a lot. People often see Y2K as an overhyped event, but the reality is that lots of us worked very hard to make sure potential problems were avoided.

This is super relevant to what we’re experiencing right now with COVID-19, as some people are pointing at successes achieved by social distancing as evidence that the threat was overhyped. That’s actually something that some epidemiologists predicted.

I love this idea:

I often try to think about what I can control versus what I can’t, as a way of managing my anxiety. But I’ve not gone as far as creating lists of those things. It seems like it could be very helpful.

Heidi mentioned that failure is inevitable but disaster is not, and I loved that distinction. Disasters tend to occur due to multiple failures, not just one, as Dr. Richard Cook pointed out in his classic paper How Complex Systems Fail (Direct PDF link).

Heidi touched on a lot of other important topics, like testing restoration of backups, and reliability patterns like kill switches and circuit breakers.

This was a great talk, and if you enjoyed it I also recommend Heidi’s super relevant talk from 2018 called Disaster Resilience the Waffle House Way, about the Waffle House Index.

Last up for the day was my friend J. Paul Reed from Netflix talking about Resilience Engineering (view the video). I met Paul several years ago at Monitorama, and he’s someone that I’ve learned a lot from. He was my main entry point into the ideas behind Resilience Engineering and that community, which I’m very thankful for. Paul is a pilot and his talks sometimes include examples of airline safety incidents, and I’ve joked that he’s the reason I couldn’t get on a plane without taking a Xanax for a couple of years. I had one prediction for his talk:

Narrator: He did.

This was a super thoughtful and helpful talk, as I expected. Paul has a long background in the field of Safety, and also a lot of empathy. He started out with an overview of some Resilience Engineering concepts and then moved on to talk about how they related to COVID-19.

Adaptive capacity came up again, and it’s a very important thing right now.

I agree with Paul that we have to think of this situation we’re in as a marathon, not a sprint, although I love this thought that Jez Humble tweeted at me recently:

Some teams are reacting to the uncertainty we’re in by slowing down deployments or adding additional process overhead. One of my favorite things Paul mentioned was the idea of the people at the Sharp End of a problem (practitioners who are dealing with it directly), and people at the Blunt End (people separated from the situation, like management or a Change Advisory Board).

It’s a very different thing for a team to say they need to stop deploys for a while than for upper management to decree that everyone will stop deploying. The teams that are closest to the deployments are the ones that are going to have the best view of the situation, and this is why it’s important to give teams autonomy. Other people I respect, like Dr. Nicole Forsgren, have also talked about the dangers of top-down code freezes right now.

Paul also talked about the cognitive load of the reality we’re currently in, which I agree is a huge issue. He offered some strategies for coping with it.

The thing that did end up terrifying me in Paul’s talk was this amazing video. Take a moment and watch it if you missed the talk:

I started freaking out at the point when the plane descended below the highway signs. I was pretty certain that it would end up in some sort of crash. But as Paul pointed out, the plane landed safely and the folks around it on the road all adapted. He also mentioned that the people adapting were all at the sharp end of the problem, the ones closest to it. There was no police or government presence telling people what to do; they just reacted and adapted. While the video had me on the edge of my seat for a moment, the message ended up being positive. This was a great talk to close the day out.

I really enjoyed Failover Conf. I expected it to be a solid event, as I previously worked at Gremlin and helped organize conferences with the people who made this one happen. The Gremlin Events team is top-notch (hi @kimbrelancaster and @whereiskarli), as are the other people involved. Congrats to them all on such a great conference.

The organizers clearly thought about finding ways for people to connect, and I appreciate that effort. In the end, an event like this isn’t a total replacement for meeting people in person. The hallway track is hard to replicate, even with a Slack channel for it. But I think this event was extremely well planned and executed. It may be as close as we can get to an in-person conference online.
