FireHydrant Incident retrospective: June 24, 2022
Between 2022-06-23 20:25 and 2022-06-24 21:39, FireHydrant experienced an incident resulting in customers being unable to authorize the FireHydrant Slack app. This is the incident retrospective.
By Jouhné Scott on 7/7/2022
What follows is a technical incident report that details a recent FireHydrant incident and its resolution. But because we're focused on helping organizations improve their reliability, we want to call out an important learning so that you don't have to read between the lines: institutional knowledge is a reliability killer. And it's especially painful when what appear to be novel failures are immediately obvious to those with historical context. Teams must invest in actively extracting this knowledge on an ongoing basis. “Subject Matter Expert” is really just a euphemism for a single point of failure.
Between 2022-06-23 20:25 and 2022-06-24 21:39, FireHydrant experienced an incident resulting in customers being unable to authorize the FireHydrant Slack app. Clients saw a “FireHydrant could not be installed” error message on our customer-facing application (app.firehydrant.io).
Initially, we thought this issue was related to clients needing to update their Slack integration in order to receive the new bookmark functionality. Upon further monitoring, we discovered that upgrading the app did not mitigate the issue. We then identified the actual cause: our application was still requesting a scope, bookmarks:read, that had been dropped during Slack’s app review. This post describes the timeline, contributing factors, and changes we’ve made to prevent similar failures in the future.
FireHydrant shipped a new feature in Slack that required additional scopes. The app went through Slack’s app review and approval process. During this process, Slack provided feedback that we could drop the bookmarks:read scope; we agreed to drop it since we were not using it and did not want to ingest more potentially sensitive data.
Updates from the Slack app review team were relayed to engineers rather than received by them directly, leading to asymmetry in the information known by team members and delays in the team’s response.
2022-06-23 20:25 Deployment went out that added Slack channel bookmarks for Command Center and Internal Status Page
2022-06-24 17:52 A customer support ticket was submitted indicating that a user received an error when trying to link Slack/FireHydrant
2022-06-24 18:00 Incident Management team members discussed this error and initially thought the error was due to clients needing to upgrade their Slack integration to receive the new updates.
2022-06-24 18:14 An incident was declared to loop in customer success team members and get communication out to our customers informing them of the needed update.
2022-06-24 17:03:00 The following communication was sent to customers:
Subject: FireHydrant Slack Update | Action Required
We recently added new functionality to the FireHydrant Slack application, which now requests an additional OAuth scope; this requires our customers to update their Slack Integration in order for users to link their FireHydrant and Slack accounts. We apologize for the inconvenience and moving forward, we will provide updates ahead of time. See Slack Integration update steps below:
Org Owners should click the Integration link in the left navigation bar
Click the pencil icon to edit the Slack integration
Click `Upgrade your Slack Integration` button
2022-06-24 17:07 We updated the incident to Mitigated and monitored for any additional issues.
2022-06-24 17:46 Customer Success reported customers were still receiving the “FireHydrant could not be installed” error after clicking the `Upgrade your Slack Integration` button.
2022-06-24 17:49 We updated the Incident to Investigating from Mitigated and resumed trying to identify the issue.
2022-06-24 17:53 We identified the issue was that our application kept requesting bookmarks:read after Slack required us to drop it.
2022-06-24 17:56:00 We shipped a fix removing the bookmarks:read scope
🍂 Contributing Factors
Lack of structure around the Slack review process
We leverage feature flags to quickly deliver incremental progress on new features, but that workflow isn’t available to us for Slack. This means we have to point Slack’s reviewers at a separate environment to validate the functionality, and for convenience that has historically been our staging environment and Slack app.
Slack does not permit installing the App Store version of your app with scopes that are pending approval, even within your own organization. Slack Apps that aren’t in the App Store don’t have the same constraint around approved scopes, so these issues don’t manifest until the changes are deployed to production. We have processes for managing these reviews, but we execute them so rarely that much of the knowledge was never captured in the documentation.
Inadequate review environment
We use our Staging environment for Slack review, which means we have to take extra care in deploying Slack changes to that environment while a review is pending. Additionally, we were unable to reproduce the error impacting our customers because it only manifested in Production. Slack doesn’t provide a callback or notification when the OAuth flow results in a failure, so we need to do manual validation of the scopes once they’re requested in production.
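Part of that manual validation can be scripted. As a minimal sketch, assuming the production install link is a standard Slack OAuth v2 authorize URL that carries comma-separated `scope` and `user_scope` query parameters, the hypothetical helper below lists the scopes a link actually requests so they can be compared against the approved list. The function name, client ID, and scope list are illustrative placeholders, not FireHydrant’s tooling or configuration.

```python
from urllib.parse import urlparse, parse_qs


def requested_scopes(authorize_url: str) -> set[str]:
    """Return every scope named in a Slack OAuth v2 authorize URL."""
    query = parse_qs(urlparse(authorize_url).query)
    scopes: set[str] = set()
    # Bot scopes travel in `scope` and user scopes in `user_scope`,
    # each as a comma-separated list.
    for param in ("scope", "user_scope"):
        for value in query.get(param, []):
            scopes.update(s.strip() for s in value.split(",") if s.strip())
    return scopes


# Placeholder URL: the client_id and scope list are illustrative only.
EXAMPLE_URL = (
    "https://slack.com/oauth/v2/authorize"
    "?client_id=0000.0000&scope=bookmarks:read,bookmarks:write,chat:write"
)

if __name__ == "__main__":
    for scope in sorted(requested_scopes(EXAMPLE_URL)):
        print(scope)
```

Running the helper against the link served by each environment makes a stray scope visible without having to walk through the install flow by hand.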
🧑‍🚒 Response Plans
Improved review environment
We are discussing how we can move the Slack review process to a stable environment that is representative of our production environment.
Lint for scope
We have new safeguards in place to ensure that Slack scopes requested by our production app have been approved in the Slack App Store. If a scope has not yet been approved for our App Store app, engineers are blocked from deploying a change that requests it.
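A check of this kind can be sketched as a set difference between the scopes the app requests and the scopes Slack has approved. This is a minimal illustration under stated assumptions, not our actual safeguard: the two scope sets are hypothetical and hard-coded here, where a real check would read them from the app’s OAuth configuration and from a checked-in record of approved scopes.

```python
# Hypothetical data: in practice these would come from the app's OAuth
# configuration and from a checked-in record of scopes Slack has approved.
REQUESTED_SCOPES = {"bookmarks:read", "bookmarks:write", "chat:write"}
APPROVED_SCOPES = {"bookmarks:write", "chat:write"}


def unapproved_scopes(requested: set[str], approved: set[str]) -> list[str]:
    """Return requested scopes that are missing from the approved set, sorted."""
    return sorted(requested - approved)


def check() -> int:
    """Report unapproved scopes and return a process exit code (0 = pass)."""
    missing = unapproved_scopes(REQUESTED_SCOPES, APPROVED_SCOPES)
    if missing:
        print("Unapproved Slack scopes requested: " + ", ".join(missing))
        return 1  # a non-zero exit code can be used to fail the pipeline
    print("All requested Slack scopes are approved.")
    return 0
```

A CI step could end with `sys.exit(check())` so that any unapproved scope fails the build before it reaches production.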
Loop in stakeholders
Slack app review communications now go to a larger group of individuals, including Incident Management stakeholders. This now allows members of the team responsible for the FireHydrant Slack App to receive updates directly from Slack.
Slack reviews are an incident
We’re utilizing FireHydrant to drive our Slack app reviews in the future. We’ve defined an Incident Type with attached Runbooks and Task Lists to capture the review from submission to approval and release to customers. We’re able to track the progress of the review, assign tasks to specific engineers and product management staff, socialize the progress of the review through our Internal Status Page and keep the rest of the team up to date with its progress.
If you have any questions about this retrospective, please reach out to firstname.lastname@example.org