Service Status Update: October 20, 2025
On October 20, 2025 at 13:10:56 UTC, we experienced widespread integration failures affecting multiple FireHydrant services due to an AWS us-east-1 outage that began at 07:11:00 UTC. The incident cascaded through numerous third-party providers, impacting SMS, Voice, and WhatsApp notification delivery, incident management workflows, and authentication services.
Critical functionality, including the Signals UI and API, multi-org SSO authentication, and feature flag evaluation, was disrupted when LaunchDarkly became unreachable at 19:20 UTC. Emergency mitigations were deployed immediately to hardcode critical feature flags and restore service availability.
Core incident declaration and management capabilities were preserved. Automated paging and push notification delivery for Signals remained operational throughout the incident. The incident was fully resolved at 13:09:43 UTC on October 21, 2025.
What Happened
At 07:11:00 UTC, AWS began experiencing DNS resolution failures in the us-east-1 region, initially affecting DynamoDB and rapidly cascading to multiple AWS services. FireHydrant's monitoring detected service impacts at 13:10:56 UTC when multiple integration partners began failing simultaneously.
The AWS outage created a domino effect across our integration ecosystem:
- Atlassian services (Jira, Confluence, OpsGenie) became unreachable
- PagerDuty experienced elevated API errors and latencies
- Recall.ai services failed, preventing Scribe from joining meetings
- SendGrid email delivery was disrupted
- Slack integration functionality degraded
- Twilio Voice/SMS/WhatsApp delivery failed completely
- Zoom services functionality degraded
An additional downstream effect severely impacted our ability to deploy fixes quickly: CircleCI became unreliable.
At 19:08 UTC, a critical secondary failure occurred when LaunchDarkly's feature flag service became unreachable and cached flag values expired. Feature flags then evaluated to false by default, resulting in:
- Signals UI and API becoming inaccessible
- Custom RBAC roles becoming unreachable and defaulting to DENY
- Multi-org SSO authentication defaulting to DENY for one enterprise customer
The engineering team immediately deployed emergency patches to hardcode critical feature flags to appropriate default values, bypassing LaunchDarkly's unavailability and restoring core functionality.
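To illustrate the mitigation pattern described above, here is a minimal sketch of flag evaluation that falls back to hardcoded safe values when the flag provider cannot be reached. The FlagClient interface, flag names, and default values are hypothetical and shown only to convey the approach, not FireHydrant's actual code or LaunchDarkly's SDK.

```typescript
// Hypothetical sketch: wrap flag evaluation so critical flags fall back to
// known-safe hardcoded values when the flag provider is unreachable.
interface FlagClient {
  // Resolves the flag value; rejects if the provider cannot be reached.
  variation(key: string, defaultValue: boolean): Promise<boolean>;
}

// Safe values to use when the provider is down. Defaulting everything to
// `false` is what disabled critical functionality; critical capabilities
// should default to "enabled"/"allow" instead.
const SAFE_DEFAULTS: Record<string, boolean> = {
  "signals-ui-enabled": true,        // illustrative flag name
  "multi-org-sso-enabled": true,     // illustrative flag name
  "custom-rbac-enforcement": true,   // illustrative flag name
};

export async function evaluateFlag(
  client: FlagClient,
  key: string,
): Promise<boolean> {
  const fallback = SAFE_DEFAULTS[key] ?? false;
  try {
    return await client.variation(key, fallback);
  } catch {
    // Provider unreachable: use the hardcoded safe default instead of `false`.
    return fallback;
  }
}
```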
Customer Impact
Several customers were significantly impacted:
- Some reported being unable to manually page through Signals
- One customer experienced SSO authentication failures for 1 hour and 26 minutes (19:08-20:34 UTC)
- Multiple customers experienced runbook execution failures for steps dependent on affected integrations
Specific service impacts included:
- SMS and WhatsApp alert delivery severely degraded for approximately 8 minutes (13:18-13:26 UTC)
- Scribe unable to join Zoom or Google Meet calls due to Slack, Zoom, and Recall.ai failures (06:49-19:43 UTC)
- Runbook steps dependent on Jira, Confluence, OpsGenie, PagerDuty, and Slack failed or experienced significant delays
- Feature flag dependent UI components became inconsistently available
- AI summary generation experienced timeouts in Slack due to synchronous processing
- Manual paging via the UI and Slack was not available for approximately 1 hour and 14 minutes (19:08-20:22 UTC)
No customer data was lost or compromised during this incident. Throughout the incident, users were able to declare and manage incidents in Slack and MS Teams, though with some delay due to integration dependencies.
All Signals notifications for alert ingestion continued to function normally.
Technical Timeline
October 20, 2025
07:11:00 UTC - AWS us-east-1 begins experiencing DNS resolution issues affecting DynamoDB
07:26:41 UTC - AWS confirms significant error rates across multiple services
07:51:09 UTC - AWS identifies DNS as root cause, begins mitigation
09:27:33 UTC - AWS reports significant recovery signs
10:35:37 UTC - AWS declares DNS issue fully mitigated, services recovering
13:10:56 UTC - FireHydrant incident declared, initial reports of paging failures
13:15-13:51 UTC - Systematic validation of all integration partners completed
- Recovered: Zoom, Asana, BugSnag, Checkly, GitHub, Honeycomb, Linear, Shortcut, Slack
- Still impacted: PagerDuty, ZenDesk, Jira, OpsGenie, Confluence
13:26-13:42 UTC - Siren delivery latency SLO breach confirmed for 8-minute window
14:01-14:50 UTC - Root causes identified: Twilio (SMS/WhatsApp) and Recall.ai outages
16:34-16:53 UTC - Manual Temporal failover triggered to us-east-2
19:20 UTC - LaunchDarkly outage begins impacting feature flags
19:58-20:18 UTC - Emergency patches deployed for multi-org SSO
20:22 UTC - Feature flag manifest updated with hardcoded defaults
22:47-23:07 UTC - All patches deployed and incident mitigated; monitoring continues
October 21, 2025
13:09:43 UTC - Incident resolved after confirmation of full upstream recovery
Resolution
The engineering team implemented multiple emergency mitigations:
- Feature Flag Hardcoding: Critical feature flags were hardcoded in multiple pull requests to restore baseline functionality
- SSO Authentication Fix: Hardcoded account IDs for multi-org SSO to restore login capability
- Asynchronous Processing: Updated AI summary generation to run asynchronously, preventing Slack timeouts
- Web UI Resilience: Added backup fetching mechanisms with a 15-second timeout for summary availability (see the sketch after this list)
- Temporal Failover: Manually triggered failover from us-east-1 to us-east-2 for workflow orchestration to maintain alert delivery pipeline
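The backup-fetch behavior can be illustrated with a minimal sketch: try the primary summary endpoint, abort after 15 seconds, and fall back to a secondary source. The endpoint paths and cache query parameter below are assumptions for illustration, not our actual API surface.

```typescript
// Hypothetical sketch of a backup fetch with a 15-second timeout.
const SUMMARY_TIMEOUT_MS = 15_000;

async function fetchWithTimeout(url: string, ms: number): Promise<Response> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await fetch(url, { signal: controller.signal });
  } finally {
    clearTimeout(timer);
  }
}

export async function loadIncidentSummary(incidentId: string): Promise<string> {
  try {
    // Primary source: the live summary endpoint (illustrative path).
    const res = await fetchWithTimeout(
      `/api/incidents/${incidentId}/summary`,
      SUMMARY_TIMEOUT_MS,
    );
    if (res.ok) return await res.text();
  } catch {
    // Timed out or failed: fall through to the backup source below.
  }
  // Backup source: a cached or last-known-good summary (illustrative path).
  const backup = await fetch(`/api/incidents/${incidentId}/summary?source=cache`);
  return backup.ok ? backup.text() : "Summary temporarily unavailable";
}
```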
Next Steps
We are implementing several improvements to prevent similar incidents:
- Feature Flag Resilience: Updating all production feature flags to default to safe values when LaunchDarkly is unavailable
- Asynchronous Workflows: Converting all non-critical synchronous operations to asynchronous patterns (see the sketch after this list)
- Integration Health Monitoring: Enhanced monitoring for third-party service dependencies and updated status page components for end users to quickly evaluate integration health
- Documentation Updates: Engineering runbooks updated with emergency mitigation procedures
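The asynchronous pattern referenced above can be sketched as follows, assuming a generic background job queue: the Slack interaction is acknowledged immediately, and the slow AI summary generation runs in a worker with its own retries and timeouts. The queue interface, job name, and handler shape are illustrative assumptions, not our actual implementation.

```typescript
// Hypothetical sketch: acknowledge the Slack request right away and do the
// slow, failure-prone work asynchronously, so an upstream outage cannot
// cause the Slack interaction itself to time out.
interface JobQueue {
  enqueue(jobName: string, payload: unknown): Promise<void>;
}

export async function handleSummaryCommand(
  queue: JobQueue,
  incidentId: string,
  respond: (text: string) => Promise<void>,
): Promise<void> {
  // Respond within Slack's interaction deadline, before doing any slow work.
  await respond("Generating the incident summary; it will be posted here shortly.");

  // The AI call and result posting happen in a background worker.
  await queue.enqueue("generate-incident-summary", { incidentId });
}
```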
Commitment to Reliability
We understand that you rely on our services for your critical operations. We are committed to maintaining the highest levels of service reliability and transparency, and encourage you to subscribe to our Status Page for timely communication. The measures we're implementing will help ensure we continue to meet the standards you expect from us.
Questions or Concerns?
If you experienced any issues during this incident that have not been resolved, please don't hesitate to contact our support team. We're here to help and will be happy to provide additional information or assistance.
Thank you for your understanding and continued trust in our service as we work to meet the standards we expect of ourselves.
Sincerely,
The FireHydrant Team