OOPS! Learning from Surprise at Netflix

Many of you are stuck at home looking for things to do, so I thought this would be a great time to share some of my favorite talks on incident response, SRE, and related topics. Before you dive in and watch Tiger King for the third time on Netflix, why not learn about how Netflix investigates incidents? The first entry in this series is an excellent talk by Lorin Hochstein from Spinnaker Summit 2019 called “OOPS! Learning from Surprise at Netflix.”

Lorin is one of the smartest folks around in my book when it comes to incidents. He’s on a team that’s focused on investigating incidents, and in the talk he shares some of what he’s learned. At Netflix they refer to an incident as an “operational surprise,” or OOPS. I love that term, as it emphasizes that our ability to respond to unexpected events is important.

In the presentation, Lorin talks through several examples of incidents related to Spinnaker, the tool that Netflix uses for orchestrating deployments. I’m always happy to see people talking publicly about their outages, and Lorin’s examples are fascinating. In one of the cases, a change that was made to help with reliability ended up making the system less stable, which is a situation that Lorin says isn’t uncommon. I’ve also seen it crop up in my career. Another example involves unexpected autoscaling behavior, which is something I’m sure many of us have encountered at some point. Lorin also gives tips on investigating incidents, like interviewing participants and understanding their context at the time they took actions.

Enjoy the talk, and follow Lorin on Twitter for more of his insights. He also maintains a great list of resources on GitHub for people who want to learn more about Resilience Engineering.