Consistency in a complex environment
By creating playbooks for their services as Runbooks in FireHydrant, Avalara is able to get the right people in the right place at the right time faster, as well as document service-specific nuances and processes for the many applications monitored.
Trust is everything for a company like Avalara, which helps its global customers automate and navigate tax- and compliance-related tasks. In order to keep the confidence of their customers, the Avalara platform must be reliable and scalable — which includes minimal downtime — so they employ a 24x7x365 monitoring program to identify and resolve issues around the clock.
However, with a widely distributed team of engineers and applications, enforcing consistency in how issues were declared, mitigated, and resolved was a tall order.
“When arriving at Avalara, the incident process in place was manual and incidents were not tracked as consistently as I wanted,” said Jim Bachesta, Senior Director of Reliability Engineering. “We did individual training, which created the challenge of needing to retrain regularly. We needed a better plan to ensure a smooth incident response process.”
Jim saw the need for a formal incident management process to break down knowledge silos and ensure consistency across the team. He brought incident management experience from the public sector, where the standard is based on the Incident Command System (ICS). He knew that ICS would scale, given that it is used by the Federal Emergency Management Agency (FEMA) across many disciplines, including fire, police, and regional emergencies. ICS support was therefore a core requirement for Avalara’s incident management improvements. Jim brought FireHydrant on board in 2020 because it supported ICS and made it possible to create custom incident response playbooks that are configurable and integrate with the technical stack already in use: Slack, OpsGenie, Jira, and Status Page. Two years later, the team is using FireHydrant to respond quickly, track program metrics, train responders, perform resilience testing, keep executives and C-level staff informed, and more.
Consistency in a complex environment
Incident management falls under Avalara’s Engineering Operations Center (EOC), which Jim leads. The group monitors dozens of applications around the clock with people in the United Kingdom, India, and the United States. They use sophisticated dashboarding across their products and alert based on key metrics such as error rates and latency. The philosophy is that anyone can flag an issue to the EOC, which then serves as incident commander and can bring in other experts depending on the severity level and the expertise needed.
With an environment this complex, it’s understandable that the team was experiencing knowledge silos around services, duplicated tasks, and a lack of clarity around incident response processes.
“Humans just aren’t good at consistency,” said Jim. “We needed a consistent process that ensured everyone followed the same steps and protocol when we had an incident. FireHydrant has helped us with that.” By creating playbooks for their services as Runbooks in FireHydrant, the team has streamlined their incident response process across the board. They’re able to get the right people in the right place at the right time faster, and to document service-specific nuances and processes for the many applications monitored.
“We manage a matrix of people that is integrated within the workflows and created a set of Runbooks to notify all stakeholders via Slack, text, and alerts regarding the incident in their department,” Jim told us. “They’re notified based on severity level and get periodic updates on the status, too.”
Now, everyone knows their role when an incident is declared, and the team knows how to reach relevant stakeholders, as well as when and how to communicate with both stakeholders and customers. “We identify an issue, we engage the right resources, resolve issues, ensure that the related teams are involved in the process… and the foundation of that is FireHydrant,” Jim said.
Continuous learning to drive reliability
With a low barrier to declaring an incident, FireHydrant enables Avalara to continuously improve their incident management process. Anyone can open an incident to investigate or talk through an issue, which helps the team not only identify issues before they escalate but also practice their incident response process. This ultimately keeps stress levels lower during higher-severity incidents because people know what to do.
Avalara also uses FireHydrant for training exercises for EOC team members. Playing off the traditional orange vs. blue simulation, Avalara’s orange team inserts problems in the staging environment, and the blue team must identify and resolve incidents. This helps Avalara practice incident management, test alerts, and develop deeper knowledge across the technology stack. The simulation has been such a successful training exercise that Avalara has expanded its use. The company also uses production-ready checklists and testing in FireHydrant before launching a new product or application, ensuring the EOC knows the product before an incident occurs. The checklist validates alerts, builds out runbooks, and verifies staff escalation processes. Ultimately, this training and preparation has helped the EOC scale without bringing on additional headcount.
“We learn through improvements we make to our incident process — from how we interact on Zoom calls to the way we communicate in Slack,” said Jim. “We proactively look for things to ensure we can deliver the highest level of performance. FireHydrant is really helping us enhance our incident management maturity and helps ensure we have it right.”
Avalara received a discount on the price of FireHydrant’s services in exchange for doing a testimonial.