Service Status Update: April 29, 2026
On April 29, 2026 at approximately 18:44 UTC, we experienced an issue where the FireHydrant Web UI failed to load for users across all regions due to an unexpected configuration issue with our content delivery network (CDN) account.
Synopsis
On April 29, 2026 at 18:44 UTC, our internal synthetic monitors began alerting on Web UI login failures. The root cause was traced to a configuration issue with our CDN account that prevented our static assets from being served. From the moment the issue was detected, our API remained fully reachable, and the impact was limited to the web interface that depends on the affected static assets.
The first synthetic recovery confirming Web UI restoration was observed at 19:17 UTC, 33 minutes after the first synthetic failure. Slack and Microsoft Teams incident management, Signals alert delivery, and our public API were not impacted at any point during this incident. Customers using the Slack or Teams integrations were able to declare, manage, and resolve incidents normally throughout the event.
Our synthetics caught the failure before any customer reports came in, which allowed our on-call team to begin mitigation immediately. We were not able to update our public status page during the impact window itself. We have three mechanisms for posting status page updates, and each was unavailable for a different reason: our primary path through our Slack integration was misconfigured at the time of this incident; our secondary path through the Web UI was unreachable as a direct consequence of the incident itself; and our tertiary path requires our infrastructure engineer to use internal engineering tooling, but that same engineer was leading the CDN bypass that ultimately restored service. We are addressing each of these gaps directly in the corrective measures below.
Primary Issue
An unexpected configuration issue rendered the CDN account that fronts the FireHydrant Web UI unable to serve our static assets.
Because the Web UI loads its static assets through this CDN, the issue rendered the user interface unable to load for all customers across all regions. The FireHydrant API was unaffected and continued to serve traffic normally throughout the incident, which meant that programmatic incident creation, Slack/Teams workflows, Signals alert delivery, and webhook integrations all continued to function as expected.
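To illustrate why the blast radius was limited to the web interface, here is a minimal sketch (hypothetical names, not FireHydrant's actual monitoring code) of how a synthetic-style health classifier can treat UI and API probes independently:

```python
def classify_health(ui_probe_ok: bool, api_probe_ok: bool) -> str:
    """Classify overall service health from independent UI and API probes.

    Hypothetical sketch: a CDN failure takes down the UI probe, while the
    API probe, which does not depend on CDN-served static assets, stays green.
    """
    if ui_probe_ok and api_probe_ok:
        return "healthy"
    if api_probe_ok:
        # Static assets unavailable but programmatic paths still work:
        # the impact profile of this incident.
        return "ui-degraded"
    if ui_probe_ok:
        return "api-degraded"
    return "outage"
```

Probing the two surfaces separately is what let our monitors report a "ui-degraded" state rather than a full outage, and why Slack/Teams and API workflows kept functioning.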
Timeline (UTC)
- April 29, 2026 at 18:44 UTC - The first Datadog synthetic alert fires: the Web UI Log In flow begins failing in US production. Engineering convenes a FireHydrant incident to triage.
- April 29, 2026 at 19:06 UTC - Root cause identified: a configuration issue has left our CDN account in a non-functional state. The team attempts to post a status page update and finds all three available paths unavailable: our Slack integration is misconfigured, the Web UI path is unreachable, and the tertiary engineering-tooling path requires the same engineer leading the technical mitigation. Customer-facing communication is deferred until restoration.
- April 29, 2026 at 19:10 UTC - Our existing CDN failover procedure is initiated but does not propagate quickly enough to meet restoration targets. The team pivots at 19:14 UTC to a direct CDN bypass of the affected static assets, while working to restore the underlying CDN configuration in parallel.
- April 29, 2026 at 19:17 UTC - First synthetic recovery confirms Web UI restoration. The public status page is updated retroactively. Restoration of the CDN account configuration completes shortly after, at 19:19 UTC.
Total time to resolve: 33 minutes, measured from first synthetic failure (18:44 UTC) to first synthetic resolution (19:17 UTC).
Customer Impact
The FireHydrant Web UI was unavailable or degraded for all customers across all regions (US, EU, AU, IN) for the duration of the incident.
The FireHydrant API remained fully reachable throughout the event. Programmatic incident creation, Slack and Microsoft Teams incident management, Signals alert delivery, runbook execution, and webhook integrations all continued to function as expected. Customers were able to declare, run, and resolve incidents from Slack and Teams with no disruption.
Because the impact was limited to static assets served through the CDN, no data was lost, no notifications were dropped, and no incident or alert state was affected. Customers who attempted to access the Web UI during the impact window saw load failures or 500-class errors and were directed to the status page, which, as noted above, did not reflect the incident until after restoration.
Recovery
We pursued mitigation in two stages:
- Primary mitigation: Bypass the CDN for static asset delivery so the Web UI no longer depended on the affected account. Our existing CDN failover procedure was triggered first but did not propagate quickly enough to meet our restoration targets, so the team pivoted to a direct bypass of the static asset path. The first synthetic recovery confirming restoration was observed at 19:17 UTC.
- Follow-on remediation: Restore the CDN account configuration to a working state in parallel with the bypass. Configuration restoration completed at 19:19 UTC.
Once the bypass had taken effect and the underlying account was restored, all regions were verified through synthetic checks and manual validation. As detailed in the synopsis, we were unable to post public status page updates during the impact window itself because all three of our status-page mechanisms were unavailable at once.
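The bypass described above can be sketched as a small asset-URL resolver that swaps the CDN hostname for the origin when a bypass flag is set. The hostnames and function names here are illustrative, not our actual configuration:

```python
# Illustrative hostnames; the real CDN and origin endpoints differ.
CDN_HOST = "assets.cdn.example.com"
ORIGIN_HOST = "origin-assets.example.com"

def asset_url(path: str, bypass_cdn: bool = False) -> str:
    """Build a static-asset URL, optionally bypassing the CDN.

    Under normal operation, assets are served from the CDN edge; when the
    bypass flag is set (as during this incident), the same paths are
    served directly from the origin.
    """
    host = ORIGIN_HOST if bypass_cdn else CDN_HOST
    return f"https://{host}/{path.lstrip('/')}"
```

Flipping a single flag repoints every asset reference at once, which is why a direct bypass can restore service faster than waiting for a CDN-level failover to propagate.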
Corrective Measures
We are taking the following actions as a direct result of this incident.
- Speed up CDN failover. Our existing CDN failover procedure did not propagate quickly enough during this incident. A change to accelerate the failover path is already in flight and will be merged and validated as a top priority, so future provider issues can be mitigated by an automated failover rather than an in-incident bypass.
- Stand up redundancy for non-US regions. EU, AU, and IN origins currently rely on a single CDN provider. We are evaluating a backup CDN configuration so that these regions have the same provider redundancy we are building toward in our US production environment.
- Repair our Slack integration. Our primary path for posting status page updates is through our Slack integration, which was misconfigured at the time of this incident. Restoring this integration to a known-good state is being prioritized so that the primary communication path is dependable.
- Make every status-page path independent and operable by more than one person. Each of our three status-page mechanisms had a single point of failure during this incident. We are restoring the Slack-based primary path, establishing an out-of-band secondary path that does not depend on the Web UI, and ensuring the tertiary engineering-tooling path can be operated by more than one engineer.
- Strengthen change management for account and integration changes. We are extending our change-management discovery process to ensure the configuration of every account and integration is explicitly verified after any change so that legacy state cannot leave a service in a non-functional condition.
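The "more than one independent path" measure above can be sketched as a simple fallback chain that tries each publishing mechanism in priority order and reports which one succeeded. The path names and callables are hypothetical:

```python
from typing import Callable, Optional

def post_status_update(
    message: str,
    paths: list[tuple[str, Callable[[str], bool]]],
) -> Optional[str]:
    """Try each status-page publishing path in priority order.

    Returns the name of the first path that accepts the update, or None
    if every path failed: the situation this incident exposed, where all
    three mechanisms were unavailable at the same time.
    """
    for name, publish in paths:
        try:
            if publish(message):
                return name
        except Exception:
            continue  # fall through to the next independent path
    return None
```

The value of this structure only holds if the paths fail independently; a path that depends on the Web UI, or on a single engineer, collapses back into a single point of failure, which is exactly what the corrective measures are meant to eliminate.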
Commitment to Reliability
We understand that you rely on our services for your most critical operations, and we know that an unavailable Web UI, even with the API and Slack/Teams integrations intact, falls short of the experience you expect from us. The fact that our synthetics caught this issue before any customer report came in is a sign that our detection investments are working, and our team's response time reflects the priority we place on Web UI availability.
We also recognize that we did not communicate publicly during the impact window. Customers visiting our status page during the incident did not see a real-time update, which is not the standard we hold ourselves to. The corrective measures above directly address that gap so that our public communication is independent of any system that could itself be impacted by an outage.
We are committed to maintaining the highest levels of service reliability and transparency. We encourage you to subscribe to our Status Page for communication during future events. The corrective measures above will help ensure we continue to meet the standards you expect from us.