Beyond the Headlines: The Unsung Art of Software Outage Management
When the internet breaks, who fixes it? Dive into the high-stakes world of tech crisis teams. From virtual war rooms to digital firefighting, uncover how software teams battle major outages.
By Robert Ross on 7/19/2024
Today, the entire world is feeling the pain of a major software outage. While we know a lot about these occurrences—our entire business is built on helping companies manage incidents and outages effectively—we’re not here to share our opinion on it. Instead, we’d like to help those unfamiliar with the incident lifecycle understand what happens when an outage like this occurs, who is responsible for what, and what companies ultimately do to get things working again.
What constitutes an incident/outage?
You may have heard the terms incident and outage used interchangeably, but they are two distinct classifications, and the difference comes down to how significant the impact is to downstream customers.
An incident is declared when there is something wrong with the technology stack that needs to be fixed immediately
An outage refers to a customer-impacting incident that causes an inability to access parts of an application (or even an entire system)
What defines the difference between the two can vary from company to company and depends on internal definitions of “impact”. An incident may not be a total outage, but an outage is always an incident.
Just as fire departments have a leveling system (i.e. a “5-alarm fire”) to indicate how severe a fire is, engineering teams have a similar measurement called incident severity. It’s commonly referred to with “SEV” followed by a number, and it indicates the level of impact an incident has on its users.
Low severity incident (SEV3, 4, or 5) - You usually won’t hear about these; maybe an internal system or rarely used piece of functionality is not working
High severity incident (SEV1 or 2) - If you’re a regular user and a platform experiences one of these, then you’ll know about it. Examples are a page not loading or a part of the application not working
Urgent severity incident (SEV0) - If the company is big enough, it will be headline news
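The leveling scheme above can be sketched as a simple lookup table. The paging policies here are hypothetical illustrations, assuming a SEV0–SEV5 ladder; real teams define their own levels and responses:

```python
# Hypothetical severity ladder; every company tunes these definitions to its own needs.
SEVERITIES = {
    0: ("Urgent", "page everyone and open a war room immediately"),
    1: ("High", "page the on-call team and an incident commander"),
    2: ("High", "page the on-call team"),
    3: ("Low", "notify the owning team during business hours"),
    4: ("Low", "file a ticket for follow-up"),
    5: ("Low", "log for awareness only"),
}

def describe(sev: int) -> str:
    """Return a human-readable summary, e.g. 'SEV1 (High): page the on-call team...'."""
    level, action = SEVERITIES[sev]
    return f"SEV{sev} ({level}): {action}"
```

The key property is that the number drives the response: a lower number means broader impact and a louder, faster escalation.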
Companies the size of Microsoft or CrowdStrike can have dozens to hundreds of incidents a week, but most customers won’t know about them. This is because the impact of these incidents is minor and contained quickly. But there are plenty of examples of SEV0 incidents in the news that you may remember: the 2021 Facebook outage, the Eras Tour Ticketmaster outage, and the recent mobile roaming outage that impacted AT&T, T-Mobile, and Verizon, to name a few.
What happens as an incident unfolds?
When a team declares an incident, it typically goes through three phases: detection, assembly, and mitigation.
Detection
Businesses and their software teams have a wide range of software-powered “smoke detectors” that set off alarms when an incident happens. These teams are on-call, much like doctors, and will be paged via phone or email when a significant issue occurs. For the outage today, hundreds, if not thousands, of on-call engineers were almost certainly paged to respond to the detected incident.
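A minimal version of one of those "smoke detectors" is just a threshold check over a service's error rate. The names and the 5% threshold below are assumptions for this sketch, not any real monitoring product's API:

```python
# Minimal "smoke detector": decide whether to page on-call when too many requests fail.
# THRESHOLD, error_rate, and should_page are illustrative names, not a real API.

THRESHOLD = 0.05  # page when more than 5% of requests fail


def error_rate(successes: int, failures: int) -> float:
    """Fraction of requests that failed; 0.0 when there is no traffic."""
    total = successes + failures
    return failures / total if total else 0.0


def should_page(successes: int, failures: int) -> bool:
    """True when the error rate crosses the paging threshold."""
    return error_rate(successes, failures) > THRESHOLD
```

Production systems layer far more sophistication on top (rolling windows, anomaly detection, deduplication), but the core idea is the same: a machine watches a signal and wakes a human when it crosses a line.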
Assembly
Once a major incident is declared, a massive mobilization occurs. Software engineers, public relations specialists, executives, customer support staff, lawyers, and more are brought into a “war room” of sorts to diagnose the problem, coordinate communication, and update stakeholders on any potential customer-facing impact. If the company has people in an office, it’s not uncommon for teams to take over a conference room where you’ll see several engineers hunched over their laptops diagnosing and addressing the issue. For remote teams, there’s almost always a dedicated conference bridge (Zoom, Webex, MS Teams). It’s common for a team to receive a page and then be on a call or in a room within a few minutes to address the issue.
Mitigation
Once the initial team has assembled, the most critical thing they must do is stop the bleeding and prevent the impact from spreading further. Incident response teams have a variety of tools that they can use to diagnose the source of the issue, identify the customer impact, and pull in the team best positioned to fix what’s wrong. You may see a variety of graphs and dashboards being used to determine when the incident started, whether any software updates happened recently, and whether any customer data was negatively impacted (hopefully not!).
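One of the most common diagnostic questions in that war room is "what shipped right before this started?" A toy version of that triage step, with the one-hour window and data shapes assumed purely for illustration, might look like:

```python
from datetime import datetime, timedelta

# Illustrative triage helper: flag deploys that landed shortly before an incident began.
# The one-hour window and the dict shape are assumptions for this sketch.

def suspect_deploys(deploys, incident_start, window=timedelta(hours=1)):
    """Return deploys that shipped within `window` before the incident started."""
    return [d for d in deploys if incident_start - window <= d["at"] <= incident_start]

deploys = [
    {"service": "checkout", "at": datetime(2024, 7, 19, 4, 30)},
    {"service": "search",   "at": datetime(2024, 7, 19, 1, 0)},
]
incident_start = datetime(2024, 7, 19, 5, 0)
```

In this example, only the "checkout" deploy falls inside the window, so it becomes the first place responders look. Recent changes aren't always the cause, but they're almost always the first suspect.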
Who does what in an incident?
Incidents are high-stress, high-stakes events where every second matters. Software teams have clear roles in an incident to provide clarity and speed to resolution. These roles have a hierarchy based on FEMA’s incident command system (yes, that FEMA).
Roles:
Incident commander
The incident commander is usually the primary leader for a given incident, responsible for coordinating response efforts, ensuring communication goes out in a timely manner, and shielding responding engineers from distractions.
Responder
Responders are the people primarily responsible for remediating the issue. Often, they are subject matter experts for a specific part of an application or a special team of engineers trained in incident response practices.
Communications
The communications lead manages internal and external communication channels. The more severe an incident, the more vital it is to have a dedicated communications lead to coordinate engineering, PR, customer support, legal, and executive teams so everyone can access up-to-date, accurate information.
What about after an incident?
After an incident has been mitigated, a retrospective or post-mortem is often scheduled to discuss the root cause analysis (RCA), contributing factors, and next steps to ensure the problem doesn’t happen again. These retrospectives are ideally blameless towards individuals and focus on actionable tasks to be done to improve response time, decrease risk factors, and build toward a more fault-tolerant system.
Often, as a result of these meetings, teams will produce a document that can be shared more widely and sometimes publicly. Here are a few of our favorites that balance an explanation of the technical issues with appropriate takeaways for how they plan to improve in the future:
Netflix (which was taken down by the 2011 AWS outage)
Conclusion
Software incidents of this scale are a business’s worst nightmare. However, with proper communication and tooling, teams can mitigate issues, communicate with stakeholders, and maintain customer trust.
Shout out to all the first responders affected by today's major incident. We know firsthand what it’s like to be that on-call engineer who gets woken up to a massive outage. This one will certainly go down in history.
P.S. If you’re an engineer who reads this going “I wish our team did that!” then maybe you should submit a demo request and see how FireHydrant makes these processes considerably easier with automation.