A single source of truth: how CircleCI got 200 engineers in lockstep on incident management
Thousands of companies use CircleCI to automate testing and development cycles so they can focus on building better products. The CI/CD platform promises “confidence in every commit,” so CircleCI’s own team has to be confident in its ability to keep the product up and running smoothly in order to build and maintain customer trust.
“Reliability is not a feature for us; it’s table stakes,” Erol Blakely, formerly CircleCI’s director of platform engineering, told us. “When you offer a critical service, you have to have your shit together.”
A major part of ensuring that their product is able to meet customers’ needs is having fewer incidents overall and resolving the ones they do have more quickly. “We have to be able to respond effectively and swiftly — we can’t stab around in the dark,” Erol added.
But as the company experienced rapid growth in their engineering organization, they began to see deficiencies in their in-house incident management program; the manual processes they were using were no longer cutting it. By bringing in FireHydrant to help improve their incident management practices, they’ve created a single source of truth that has helped them onboard engineers more easily and get them comfortable declaring and managing incidents faster.
Or as Erol put it, “We needed everything in FireHydrant because we need 200 engineers all in the same place.” And now they’re looking at even greater investment in incident management.
Like a lot of organizations using home-grown incident management, CircleCI relied on manual processes scattered across spreadsheets, docs, data inputs, and Slack channels that varied by team and product area. These proved hard to keep updated and even harder to train new engineers on. The result was more manual work during incidents themselves: looping in the right people and teams, remembering which data to record in which spreadsheet, and finding the right information to guide decision making.
They needed to keep everything in one easily accessible place and get every engineer working the same way, both to resolve incidents more quickly and to learn from them on the way to having fewer incidents overall. FireHydrant gave CircleCI a framework to replace tribal knowledge with uniform documentation, enforce consistent practices across teams, and remediate incidents faster by putting the right information at everyone’s fingertips.
“Before FireHydrant, we had things all over the place – it was hard to find data, you’d have to look back through the incident Slack channel to find notes,” said Engineering Manager Rob Braden. “Now, we import that info into FireHydrant. It’s given us a common language and tooling to share.”
This additional structure has also helped the team identify priority areas when it comes to improving overall availability and reliability, a company-wide goal for CircleCI. For example, they use data from FireHydrant to feed internal SLOs and track against the team’s goal of reducing MTTR. By standardizing reporting, analytics, and retros, they’ve been able to pull out common themes to identify areas where more investment is needed.
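To make the MTTR goal concrete, here is a minimal sketch of how a mean time to resolve could be computed from exported incident records. The record fields (`started_at`, `resolved_at`) and the sample timestamps are hypothetical, chosen only for illustration; they are not FireHydrant’s export schema or CircleCI’s data.

```python
from datetime import datetime, timedelta

# Hypothetical incident records. The field names and timestamps are
# illustrative assumptions, not FireHydrant's actual export format.
incidents = [
    {"started_at": "2023-01-05T10:00:00", "resolved_at": "2023-01-05T10:45:00"},
    {"started_at": "2023-01-12T14:30:00", "resolved_at": "2023-01-12T16:00:00"},
    {"started_at": "2023-01-20T08:15:00", "resolved_at": "2023-01-20T08:30:00"},
]


def mean_time_to_resolve(records):
    """Return the average start-to-resolution duration as a timedelta."""
    if not records:
        raise ValueError("need at least one resolved incident")
    durations = [
        datetime.fromisoformat(r["resolved_at"]) - datetime.fromisoformat(r["started_at"])
        for r in records
    ]
    # Sum the timedeltas (starting from a zero timedelta) and divide by the count.
    return sum(durations, timedelta()) / len(durations)


mttr = mean_time_to_resolve(incidents)
print(f"MTTR: {mttr.total_seconds() / 60:.0f} minutes")
```

Tracking this number over successive reporting periods is one simple way to see whether the trend is moving toward the team’s target.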
Beyond getting everyone on the same page on process and communication during incidents, FireHydrant Runbooks help the CircleCI team build confidence in declaring incidents and running drills.
By using a tutorial runbook that maps to production (but without the paging), new engineers run “game days” as part of onboarding. Having a good practice framework for an engineer who is only three months in, and who may feel some anxiety around declaring an incident, is a huge benefit, Rob told us.
“New incident commanders or engineers don’t want the first time they run an incident to be a live one,” he said. “Game days give them the context and confidence that they’ll be able to run an incident by letting them practice outside of production.”
CircleCI has seen great improvements from using FireHydrant to automate its incident management practices. But the working relationship with the FireHydrant team is what really sets the company apart, and it helped CircleCI build an incident management program grounded in best practices that they believe in, Erol told us.
“It’s given us a lens to examine how we handle incidents,” he said. “And the more we use it and the more we talk to the team at FireHydrant, the more we realize how much of a priority maturing incident management is for us.”
Next up, CircleCI has plans to make more use of the FireHydrant Service Catalog. Erol told us they’re looking at this as an extension of the “source of truth” they’ve come to view FireHydrant as: “We see the value of the source of truth — the roles, the organization, the services — that Service Catalog brings.”