FireHydrant + Recharge: consistent incident response, deeper learning

In 2022, Recharge experienced a spike in incidents that revealed their homegrown incident management processes weren’t scaling as well as the team wanted. They needed a more streamlined and automated way to consistently declare, manage, and report on incidents, and they decided to bring on FireHydrant after team members described the benefits they had seen from the platform at previous companies.

A year later, the Recharge team is using FireHydrant to respond quickly and consistently and to learn more from their incidents. And since Recharge’s ecommerce customers use their platform to turn one-time buyers into loyal, repeat customers through subscriptions, they know that minimizing downtime and improving reliability are table stakes.

“Incidents suck, quite frankly,” said Manager of InfraOps Ryan Kish (who goes by Kish). “These incidents can directly impact our customers’ ability to generate revenue. And if your revenue stream is being impacted by your vendor, how long are you going to sit on that before you look at other vendors? It's very important to us that we're not just handling incidents as efficiently as possible, but we're reducing them so we can ensure that our customers are having the best possible experience.”

More consistent incident response

Recharge has two teams that share the on-call burden: DevOps and InfraOps. The incident management program sits on the InfraOps team and is managed by Kish. During business hours, Kish’s team handles all triaging of problems and incidents, but after hours and on weekends, they use an on-call rotation.

When a degradation occurs, the first thing the on-call engineer does is decide if it qualifies as an incident. If it does, whoever’s on-call puts on the commander hat and kicks the incident off. “We understand the process really, really well,” Kish said. “And our goal is to shepherd anyone else in the incident through our process.”

The commander declares the incident using Slack, then begins to assemble an incident response team that usually includes a technical lead — the person responsible for the area of degradation — and also a member of Recharge’s technical support team to serve as a communications lead.

The communications lead is responsible for updating status pages and getting information to customers, while the technical lead understands the affected program or application and takes point on directing the troubleshooting and mitigation process. One of the ways the team enforces consistency is by using checklists for these roles, which help responders remember everything they need to do.

“We had that with Slack workflows, but there was no accountability,” Kish told us. “So now you have to click that check to say, ‘Yes, I did this.’”

By using FireHydrant’s customizable runbooks for each incident type, the team quickly understands the scope of the response effort needed, such as whether external communications will be sent. At first, the team was just using runbooks for their own service degradation, but in the last six months, they started managing partner incidents similarly. When a tool critical to the engineering workflow goes down, for example, the team gives it the “same level of attention to detail and resolves it as an incident because it is impacting our ability to do our jobs,” Kish told us.

“No one stands alone in the world of the internet,” he added. “We rely on other companies and their products to make our products work. So if a partner incident degrades our product, we need to make sure that our customers are fully aware of what's going on, which also helps reduce the burden on our technical support organization.”

An easier way to capture learnings

Although the Recharge team has always made reporting on and learning from incidents a priority, this was one of the areas where scaling was difficult with their old processes. The team collects and reports on a lot of information, including:

Number of incidents
Number of retrospectives
Uptime: All services, and broken into critical and non-critical services
Mean time to detection
Mean time to mitigation
Total incident time
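
As a rough illustration of how the timing metrics in this list can be derived from incident timestamps, here is a minimal Python sketch. The `Incident` record and its field names are hypothetical and are not FireHydrant's data model, and measuring time to mitigation from the start of the degradation is an assumption; Recharge's exact definitions are not described here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record shape; field names are illustrative only.
    started_at: datetime    # when the degradation began
    detected_at: datetime   # when it was noticed or declared
    mitigated_at: datetime  # when customer impact stopped
    resolved_at: datetime   # when the incident was closed


def mean_minutes(deltas):
    """Average a non-empty list of timedeltas and return minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def summarize(incidents):
    """Compute count and timing metrics for a non-empty batch of incidents."""
    total = sum((i.resolved_at - i.started_at for i in incidents), timedelta())
    return {
        "incident_count": len(incidents),
        # Assumption: both means are measured from the start of the degradation.
        "mean_time_to_detection_min": mean_minutes(
            [i.detected_at - i.started_at for i in incidents]
        ),
        "mean_time_to_mitigation_min": mean_minutes(
            [i.mitigated_at - i.started_at for i in incidents]
        ),
        "total_incident_time_min": total.total_seconds() / 60,
    }
```

Calling `summarize` on a week's worth of records returns a small dictionary that can be dropped into a report. An empty list raises an error from `statistics.mean`, so a real report would need to handle weeks with no incidents.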

“It's a lot less work for my team to assemble all of this information because of FireHydrant,” Kish explained. “Previously, we would have to go through and get timings out of Slack, and if you were having many, many incidents in a given week, my people were just shredded. We don't spend as much time building the postmortem report, because FireHydrant makes it so much easier to do things like port the timeline.”

This also gives engineers that time back for their core work, Kish added. “That's time that my engineers could be engineering something rather than doing work that I could outsource to somebody that doesn’t have that level of skill.”

Using the FireHydrant API, Kish is also able to pull incident data from FireHydrant into a slide deck he generates weekly to help internal teams and stakeholders get a full picture of incidents (without making stakeholders seat holders). In this way, the team is able to continuously refine and improve its processes.
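
To make the weekly reporting step concrete, the sketch below groups incident records by ISO week and produces one slide-ready line per week. It is an illustration only: the record keys (`declared_on`, `retro_done`, `duration_min`) are hypothetical, no FireHydrant API call is shown, and the source does not describe how Kish's actual deck is generated.

```python
from collections import defaultdict
from datetime import date


def weekly_slide_lines(incidents):
    """Group incident records by ISO week and return one summary line per week.

    Each record is a dict with hypothetical keys:
      declared_on (date), retro_done (bool), duration_min (int).
    """
    by_week = defaultdict(list)
    for inc in incidents:
        year, week, _ = inc["declared_on"].isocalendar()
        by_week[(year, week)].append(inc)

    lines = []
    for (year, week), items in sorted(by_week.items()):
        retros = sum(1 for inc in items if inc["retro_done"])
        minutes = sum(inc["duration_min"] for inc in items)
        lines.append(
            f"{year}-W{week:02d}: {len(items)} incidents, "
            f"{retros} retrospectives, {minutes} min total"
        )
    return lines
```

In practice the records would come from whatever incident data the team exports, mapped into this shape before the summary is built.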

“There are a lot of people outside of engineering and technical support that would love to have access to the data, and that's where my reporting comes in,” Kish said. “It’s a way for me to bubble up this information so we can hold our teams accountable.”