The “people problem” of incident management

Managing incidents is tricky enough on its own, and you want to get to mitigation as quickly as possible. But sometimes organizing everything surrounding an incident feels more difficult than solving the actual technical problem, and you end up delayed or sidetracked during mitigation efforts. We call that scenario the “people problem” of incident management.

Some symptoms of the people problem include:

Too many well-intentioned people jump into the incident channel to help or ask for an update
Mitigation efforts are unknowingly duplicated across the engineering org
Customer success is left hanging, waiting for updates they can pass along to customers

In short, you know you have a people problem when responding to an incident feels like herding cats.

If this sounds familiar, you're not alone. This has been a hot topic in a lot of conversations I've had recently. By putting processes in place that clearly define roles and responsibilities and set expectations for cross-department and stakeholder communication, you can draw boundaries around who does what during an incident while also putting everyone at ease. In this blog post, I'll tell you how.

Make sure everyone knows their role

Assigning roles and responsibilities during an incident is critical to help minimize confusion and ensure that the right people are working on the appropriate tasks. The Incident Benchmark Report, which analyzed 50,000 incidents resolved on FireHydrant, showed that when roles are assigned during an incident, the average length of incidents decreases by a whopping 42%.

Use a light hand though; the key to assigning roles is thinking about what your specific company needs and when. For example, for SEV1 incidents, you may assign a communications commander whose sole responsibility is keeping status pages, customer support channels, and stakeholders updated. For less severe incidents though, communications responsibilities might just fall under the incident commander’s role. Check out this blog post for a detailed rundown of incident roles and when you might use them.

Once you know what roles you need, make clear what each role is responsible for. Task lists can help ensure everything a role includes is completed, taking the overhead of remembering that list out of the hands of a busy responder. For example, the team at Recharge told us they use role checklists to enforce consistency and help responders more easily remember everything they need to do during an incident.
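To make the idea of severity-based roles and role checklists concrete, here is a minimal Python sketch. It is not how FireHydrant or Recharge implement this; the role names, task wording, and the `roles_for` and `Checklist` helpers are all hypothetical, and the only rule taken from the text is that a SEV1 gets a dedicated communications commander while lower severities fold those duties into the incident commander role.

```python
from dataclasses import dataclass, field

# Hypothetical roles and tasks, for illustration only.
ROLE_TASKS = {
    "incident_commander": [
        "Confirm severity",
        "Assign remaining roles",
        "Decide when the incident is mitigated",
    ],
    "communications_commander": [
        "Update the status page",
        "Brief customer support channels",
        "Update stakeholders",
    ],
}


def roles_for(severity: str) -> dict[str, list[str]]:
    """Return the roles to assign, each with its task checklist.

    Assumed policy: SEV1 gets a dedicated communications commander;
    anything less severe merges those tasks into the incident commander.
    """
    if severity == "SEV1":
        return {role: list(tasks) for role, tasks in ROLE_TASKS.items()}
    merged = ROLE_TASKS["incident_commander"] + ROLE_TASKS["communications_commander"]
    return {"incident_commander": merged}


@dataclass
class Checklist:
    """Tracks which tasks a role has completed during an incident."""

    role: str
    tasks: list[str]
    done: set[str] = field(default_factory=set)

    def complete(self, task: str) -> None:
        if task not in self.tasks:
            raise ValueError(f"{task!r} is not on the {self.role} checklist")
        self.done.add(task)

    def remaining(self) -> list[str]:
        # Preserve the original task order so responders see what is next.
        return [t for t in self.tasks if t not in self.done]
```

A responder's tooling could call `roles_for("SEV1")` when an incident is declared, build one `Checklist` per role, and surface `remaining()` in the incident channel so nobody has to hold the list in their head.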

And then, practice! You don’t want the first time a new engineer sees your incident response process to be when they’re paged. Think about instituting periodic game days as a way to get everyone up to speed.

Open the lines to customer success

Effective cross-department communication is essential and can be challenging during an incident. Establishing communication protocols in advance ensures that everyone is on the same page. For example, you might determine based on severity level how often you’ll post an update in the all-engineering channel, or under what circumstances you notify the executive team of an incident. One line of communication you’ll want to always keep open though is the one to customer success.

Customer success should be considered a part of the response effort for any customer-facing incident. Some companies go further and give their CS reps the ability to declare an incident. After all, it’s not uncommon (or shameful) to have an incident identified by a customer. This collaboration promotes transparency and trust between the company and its customers and ultimately buys you grace during the incident.

In high-severity incidents, you might have a CS representative as one of the assigned roles on your responder team. For less-severe incidents, maybe one of the engineer responders posts an update to the CS team every 30 minutes. Find what works for you, and don’t forget to include CS in training exercises too.
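One way to establish these protocols in advance is to write them down as data that your tooling can check. The sketch below is only an illustration: the `CommsPolicy` fields, the `cs_update_due` helper, and every interval are assumptions, apart from the 30-minute CS cadence for less-severe incidents mentioned above. Pick values that fit your own organization.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class CommsPolicy:
    """Communication expectations for one severity level (hypothetical shape)."""

    engineering_update_every: timedelta  # cadence for the all-engineering channel
    cs_update_every: timedelta           # cadence for updates to customer success
    notify_executives: bool              # whether the executive team is notified
    dedicated_cs_role: bool              # whether a CS rep joins the responder team


# Assumed example values; only the 30-minute CS cadence comes from the post.
POLICIES = {
    "SEV1": CommsPolicy(timedelta(minutes=15), timedelta(minutes=15), True, True),
    "SEV2": CommsPolicy(timedelta(minutes=30), timedelta(minutes=30), False, False),
    "SEV3": CommsPolicy(timedelta(hours=1), timedelta(minutes=30), False, False),
}


def cs_update_due(severity: str, last_update: datetime, now: datetime) -> bool:
    """Return True when customer success is owed another update."""
    policy = POLICIES[severity]
    return now - last_update >= policy.cs_update_every
```

Keeping the policy in one table means the answer to "how often do we update CS for a SEV2?" is decided before the incident, not debated during it.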

Proactively offer periodic updates

Stakeholders in an incident include employees, customers, vendors, and the general public. Keeping stakeholders informed throughout the incident is crucial to maintaining their trust and confidence in your team and the organization. It also lets you stay focused on the incident at hand.

Effective communication with stakeholders can minimize the perceived impact of the incident and ensure that the organization's reputation remains intact. And by building communication best practices into your incident response plan, you can set expectations that even help you (kindly) steer well-intentioned but misguided internal voices out of the incident channel and toward a status page.

Snyk is a company that uses a cool tactic to do this. They told us they have a lot of people who are interested in incidents but don't need to be in the fray during response efforts. As a way to keep them engaged, the SRE team holds company-wide incident review sessions, where engineers present an incident and get their work and recommendations in front of higher-ups at the company. These meetings aren’t mandatory but are popular, often attended by 50 to 100 people with high engagement.

Some companies use internal status pages to communicate with leadership and interested team members and provide updates on the progress being made to resolve the issue. These pages often include information such as which systems are affected, what the root cause of the issue is, and what steps are being taken to fix it.

External status pages, on the other hand, are used to communicate with customers and other stakeholders who may be impacted by the incident. These pages typically provide information on the current status of the issue, estimated time to resolution, and any workarounds that may be available. They are important for ensuring that customers are informed and have realistic expectations about when the issue will be resolved. Read more about status page best practices.
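The split between internal and external status pages can be sketched as two views over a single incident record. This is a hypothetical data model, not any status page product's API: the `IncidentStatus` class and its field names are assumptions that mirror the kinds of information listed above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class IncidentStatus:
    """One incident record rendered two ways (hypothetical model)."""

    status: str                                   # e.g. "investigating", "mitigated"
    affected_systems: list[str]
    root_cause: Optional[str] = None              # may be unknown early on
    remediation_steps: list[str] = field(default_factory=list)
    estimated_resolution: Optional[str] = None    # free text, e.g. "within 2 hours"
    workarounds: list[str] = field(default_factory=list)

    def internal_view(self) -> dict:
        # For leadership and interested team members: systems, cause, fix progress.
        return {
            "status": self.status,
            "affected_systems": self.affected_systems,
            "root_cause": self.root_cause,
            "remediation_steps": self.remediation_steps,
        }

    def external_view(self) -> dict:
        # For customers: current status, expected resolution, and workarounds.
        return {
            "status": self.status,
            "estimated_resolution": self.estimated_resolution,
            "workarounds": self.workarounds,
        }
```

Deriving both pages from one record keeps them from drifting apart, while the separate views keep internal detail such as root cause off the customer-facing page unless you choose to share it.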

Put it in play

If you're spending more time managing a "people problem" than you are mitigating incidents, it might be time to think about implementing some of these best practices. By tackling the people problem head-on, you can navigate incidents more efficiently while maintaining your organization's reputation and keeping stakeholders informed. Read more about ways to improve your incident management program in our latest ebook: 3 ways to improve your incident management program in 2023.