Incident response change management without disruption

Snyk's SRE team decided it was time to bring in an incident management tool to enable consistency and scalability. But they didn’t want just any tool; they wanted one that would integrate with their existing systems and enhance their performance.

Snyk is a developer security platform that helps users find and fix vulnerabilities faster. They're growing and evolving quickly, so incidents are a fact of life, albeit one that still requires swift attention.

Snyk’s company culture encourages ownership and independence, a philosophy that extends to incident management. Amir Mehler, SRE manager, calls their philosophy NoOps: “NoOps means you don’t have an ops team. You build it, you run it. Everyone carries a pager, and it’s part of their duty.”

The NoOps approach meant that for a long time, incident management at Snyk was highly decentralized. Each team had its own product onboarding guides, which could differ completely from team to team, staff SRE Ben Cordero told us.

Early last year, the SRE team decided it was time to adopt an incident management tool to bring consistency and scalability to their process. But they didn’t want just any tool; they wanted one that would integrate with their existing systems and enhance their performance.

Balancing consistency and flexibility

As the company grew, it needed to consolidate its approach to incident management. Disparate processes throughout the company made it difficult to achieve a standard level of quality and to train new hires. In fact, an internal survey showed that 63% of engineers never received formal on-call training.

It was important for Snyk to find the right tool to consolidate its incident response process and learnings without being prescriptive. They didn’t want to force teams to change processes that were already working well. In Q1 of 2022, they began an extensive vendor analysis.

“We had a whole list of requirements we wanted to use, both based on Snyk’s existing incident management process, and the world that we wanted to go to,” Cordero told us. That list included analytics capabilities, predefined roles, and API access.

Perhaps most important, however, was that Snyk wanted a tool that meshed well with their existing processes and their philosophy of ownership toward incidents, without introducing new workflows.

“If people can do practically everything in Slack, I didn’t want this to be the tool that limits them,” Mehler said. The ability for anyone to declare and then manage an incident in Slack ultimately led Snyk to choose FireHydrant.

Change management made easy

Snyk implemented FireHydrant in June 2022 with an SRE team of four people. Onboarding was simple: the SRE team made a 10-minute video on how to use FireHydrant, announced it to the team, and everyone was using it within the week, they told us.

This is no small feat: with a global team of more than 400 engineers, Snyk needed to ensure the rollout was frictionless to fit with the NoOps culture.

Cordero described the onboarding experience as “fast, frictionless adoption. From the week FireHydrant became available to the team, we were quickly at 100% of the business using it, without much friction and no outages.”

Part of the reason adoption was so smooth is that FireHydrant’s tool mapped easily to Snyk’s existing approach to running incidents. “We didn’t have to do too much in terms of retraining the process,” Cordero said. “We immediately got benefits without disruption.”

Data-based decision making

After a seamless installation process, Snyk’s SRE team turned their energy toward using FireHydrant to improve their approach to incident mitigation.

FireHydrant helps standardize and measure the incident response process so the SRE team can leverage data to improve how they manage incidents at Snyk. “Previous to this, Amir and I were literally just going through the incident announcements channel and counting the number of incidents by hand,” Cordero told us.

The SRE team also established what they call their incident response guild, a group of people across the company interested in improving how incidents are run at Snyk. The guild helped create custom incident response playbooks with FireHydrant’s configurable Runbooks to standardize the incident management process. With these playbooks available to any user, FireHydrant has helped the team better document and distribute these resources across the organization.

“FireHydrant is going to be one of the foundational tools for us as a team for communicating SRE practices throughout the company,” Cordero said about this initiative. The guild also helps run company-wide incident review sessions, where engineers present an incident and get their work and recommendations in front of higher-ups at the company. These meetings aren’t mandatory but are popular, often attended by 50 to 100 people with high engagement.

Now readily equipped with data, the next step for the SRE team is synthesizing that data to make incident response at Snyk even better.

“We’ve been through the phase of not having data and then having data that’s not necessarily relevant or true,” Mehler said. “Now we’re finally at the phase of having the right data for the right things, and we can start making decisions based on that data.”
