At a going-away party a few years back, when I was leaving a job, my VP of engineering told a story I didn’t even remember but that I know subconsciously shaped how I viewed my role on that team: toward the end of my very first day at the company, there was some internal system issue, and with pretty much zero context I pulled out my laptop, figured out what was going on, and helped fix it. After that, this VP said, every time there was an incident, as soon as he saw my name in the Slack chat he felt a sense of relief and knew we’d be okay.
If you’ve been this person at your company, you know how good that can feel. You also know how exhausting it can be. By taking some steps to operationalize the hero’s work you’ve been doing, you can actually set yourself and your organization up for long-term success.
We can’t all be Shaq
Mine isn’t a unique story. In fact, most organizations I talk to have a handful of people or teams who swoop in and save the day when a technical crisis arises. I like to refer to these folks as Shaqs.
Shaquille O’Neal is one of the most celebrated NBA players of all time. He played for six teams over his 19-year career and won countless awards — and for good reason. When the team needed two points, they knew they could throw the ball to Shaq, and let him go to work. He’d inevitably push someone around (hard not to do at 7’1” and 300+ pounds), dunk the ball, and the crowd would go wild.
The skills (and probably the size) are different, but there are a lot of engineers playing the Shaq role at their company. When the 2 a.m. outage page goes out, they’re the first to respond. These heroes find the problem, determine the affected areas, fix the issue (or know who to call to fix it), wake up the VP, draft messages to send to customers and stakeholders, and create tickets to address why things went wrong. Then at 9 a.m., they go back to the job they were hired to do. Backs are patted, and life goes back to normal until the next 2 a.m. page.
So what’s the problem here? Games are getting won, incidents are getting solved, it’s true. But are you truly setting your organization or team up for success by just “doing it for them” every time? A win doesn’t always have to come from a backboard-shattering slam dunk, and relying on the most dominant player to save the day isn’t a scalable solution for an organization’s continued success.
It feels good to be Shaq. Unfortunately, Shaq can’t play every minute of every game all season. He’d risk burnout and injury, block the development of a better team with better technique, and when he finally did get injured, it would be disastrous for the team. The same downsides are true for incident management heroes.
It’s time to pass the ball
One of the major issues with this hero scenario is that because incidents are getting remediated — and nobody else is really feeling the pain — your company might not think there’s a problem. And without the tools or headcount to make a sweeping change, it turns into a self-fulfilling prophecy: there’s not a great system in place but you know what to do, so you just end up continuing to do it … which negates the need for a system and puts the onus back on you.
To get out of this cycle, you have to commit to changing reliability culture at your organization. I always recommend people start with a few small steps and then grow from there.
1. Start documenting what you do during an incident
As a company’s technology platform evolves, subject matter experts (SMEs) and responders can be engineers from varying technical backgrounds and specialties who may not have the technical, tribal, or social knowledge to understand all of the intertwined components in operation. If the contents of your head make up your company’s service catalog, dependencies documentation, and incident management communication workflow, you’re in trouble. What happens when you’re not around? Your team might be looking at a big fat L on that day.
Take the first step toward formalizing an incident management runbook. This doesn’t have to look like setting aside a full day to write a step-by-step process. Instead, take the start-small approach of talking to yourself during an incident. The next time you respond, start a thread in the incident channel where you literally just think out loud. Be over-communicative, being sure not to assume your teammates understand why you’re taking the actions you are. Think in terms of:
I just got paged, what’s the first thing I do?
Where are the places I look to check on the status of our services?
How do I know who to call when I discover what service is down?
How do I know how to revert the last deploy for that service?
What impact does this incident have on customers and internal teams?
What are my thoughts on how to fix this issue going forward?
These are all the questions you’re answering in your head in an organic way because you’re the one who knows how to do this. By documenting your process, you’re taking the first step toward getting that info out of your head and eventually into an incident management tool or company wiki, breaking that silo.
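One low-effort way to make the think-aloud habit stick is to keep the prompts somewhere you can paste them into the incident channel at the start of each response. Here’s a minimal sketch in Python; the prompt wording, the `checklist` helper, and the incident ID format are all hypothetical, so adapt them to your own services and tooling.

```python
# A hypothetical sketch: turning the think-aloud questions into a reusable
# checklist you can paste into an incident thread. Prompt wording and the
# incident ID format are made up for illustration.

RUNBOOK_PROMPTS = [
    "I just got paged -- what's the first thing I do?",
    "Where do I look to check the status of our services?",
    "Who do I call once I know which service is down?",
    "How do I revert the last deploy for that service?",
    "What impact does this have on customers and internal teams?",
    "How do we prevent this issue going forward?",
]

def checklist(incident_id: str) -> str:
    """Render the prompts as a checkbox list for an incident thread."""
    lines = [f"Incident {incident_id} -- working notes:"]
    lines += [f"[ ] {prompt}" for prompt in RUNBOOK_PROMPTS]
    return "\n".join(lines)

print(checklist("INC-123"))
```

Answering the checklist out loud in the thread, incident after incident, leaves you with a transcript that’s already most of the way to a runbook.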
2. Put yourself into an incident commander role
You know who looks up to Shaq? Everyone (and not only because he’s a big dude). And the same is probably true of you. Give the other members of your team the opportunity to learn, to expand their own skills, and to bask in that hero’s glow. A team benefits from players who have differing roles and specialties all working together under sound coaching and organizational direction.
The best responders facilitate communication and collaboration. So the next time an incident arises, instead of taking on the Shaq role, take on the Phil Jackson role, and act as coach. Simply hang back. Provide guidance, give people the info they need (or better yet, help them figure out where to find it), and lend a hand when you’re asked.
Once you’ve done this a couple of times, maybe miss a game or two. See if skipping an on-call rotation is an option, not only for the sake of your own health but also to give others room to step up. Instead, work on a special project (like formalizing the documentation you started in the first step). There’s no better way to flag a weakness up the chain of command than by demonstrating what happens when you’re not there to fix things, and that’s a surefire way to direct some resources toward incident management.
3. Question the status quo
What does on-call look like at your company? Depending on the state of your systems (i.e., how many incidents you’re having and how often), being on call for a week might be a totally unreasonable amount of time. What tools are you using to manage your incidents? Why were they chosen, when, and by whom? How are incidents declared and under what circumstances? Are you alerting on every down instance, or only the ones that impact your customers? Relatedly, are you using service level objectives? If not, what’s standing in the way of adopting them? Are you conducting retrospectives? How do they influence roadmaps?
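If service level objectives are new to your team, the core idea is simple arithmetic: an SLO target implies an error budget, and alerting on budget burn (rather than every down instance) is what keeps pages tied to customer impact. Here’s a hedged sketch; the 99.9% target and the request counts are made-up numbers for illustration.

```python
# A sketch of the error-budget arithmetic behind an SLO. The target and
# request counts below are illustrative, not a recommendation.

def error_budget(slo_target: float, total_requests: int, failed_requests: int):
    """Return (allowed_failures, fraction_of_budget_remaining) for the window."""
    allowed = total_requests * (1.0 - slo_target)  # failures the SLO tolerates
    remaining = 1.0 - (failed_requests / allowed) if allowed else 0.0
    return allowed, remaining

# With a 99.9% SLO over 1,000,000 requests, the budget is 1,000 failures.
allowed, remaining = error_budget(0.999, 1_000_000, 250)
print(f"allowed failures: {allowed:.0f}, budget remaining: {remaining:.0%}")
```

Framing incidents as budget spend gives you a concrete answer to “should this page someone at 2 a.m.?”: page when the budget is burning fast, file a ticket when it isn’t.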
We have to improve the on-call experience for engineers in order to reduce burnout, retain our colleagues, and work on the projects that directly impact our revenue-producing products and features. There is a lot of opportunity cost associated with a messy incident management program — and a lot of benefit when it’s done well. Direct your hero energy toward making the entire process better, not just remediating ad hoc incidents, only to have them pop up again later.
A winning strategy
These first steps are all moving toward a common goal, and that’s to move away from whack-a-mole-style incident response to more strategic and holistic incident management. If you’re playing the hero role at your organization, you might be unintentionally masking the need for better incident management practices. This isn’t your fault though, and you’re not alone. By helping our companies shift toward a better incident management posture, we can improve things for our customers, for our teammates, and for ourselves.