Service Status Update: October 20, 2025
On October 20, 2025 at 13:10:56 UTC, we experienced widespread integration failures affecting multiple FireHydrant services due to an AWS us-east-1 outage that began at 07:11:00 UTC. The incident cascaded through numerous third-party providers, impacting SMS, Voice, and WhatsApp notification delivery, incident management workflows, and authentication services.
Critical functionality, including the Signals UI and API, multi-org SSO authentication, and feature flag evaluation, was disrupted when LaunchDarkly became unreachable at 19:20 UTC. Emergency mitigations were deployed immediately to hardcode critical feature flags and restore service availability.
Core incident declaration and management capabilities were preserved. Automated paging and push notification delivery for Signals remained operational throughout the incident. The incident was fully resolved at 13:09:43 UTC on October 21, 2025.
What Happened
At 07:11:00 UTC, AWS began experiencing DNS resolution failures in the us-east-1 region, initially affecting DynamoDB and rapidly cascading to multiple AWS services. FireHydrant's monitoring detected service impacts at 13:10:56 UTC when multiple integration partners began failing simultaneously.
The AWS outage created a domino effect across our integration ecosystem:
- Atlassian services (Jira, Confluence, OpsGenie) became unreachable
- PagerDuty experienced elevated API errors and latencies
- Recall.ai services failed, preventing Scribe from joining meetings
- SendGrid email delivery was disrupted
- Slack integration functionality degraded
- Twilio Voice/SMS/WhatsApp delivery failed completely
- Zoom services functionality degraded
An additional downstream effect severely impacted our ability to deploy fixes quickly: CircleCI became unreliable.
At 19:08 UTC, a critical secondary failure occurred when LaunchDarkly's feature flag service became unreachable and cached flag values expired. Feature flags then evaluated to false by default, resulting in:
- Signals UI and API becoming inaccessible
- Custom RBAC roles becoming unreachable and defaulting to DENY
- Multi-org SSO authentication defaulting to DENY for one enterprise customer
The engineering team immediately deployed emergency patches to hardcode critical feature flags to appropriate default values, bypassing LaunchDarkly's unavailability and restoring core functionality.
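To illustrate the mitigation pattern described above, here is a minimal sketch of flag evaluation that falls back to hardcoded safe values when the flag provider cannot be reached. The FlagClient interface, flag names, and default values are hypothetical and shown only to convey the approach, not FireHydrant's actual code or LaunchDarkly's SDK.

```typescript
// Hypothetical sketch: wrap flag evaluation so critical flags fall back to
// known-safe hardcoded values when the flag provider is unreachable.
interface FlagClient {
  // Resolves the flag value; rejects if the provider cannot be reached.
  variation(key: string, defaultValue: boolean): Promise<boolean>;
}

// Safe values to use when the provider is down. Defaulting everything to
// `false` is what disabled critical functionality; critical capabilities
// should default to "enabled"/"allow" instead.
const SAFE_DEFAULTS: Record<string, boolean> = {
  "signals-ui-enabled": true,        // illustrative flag name
  "multi-org-sso-enabled": true,     // illustrative flag name
  "custom-rbac-enforcement": true,   // illustrative flag name
};

export async function evaluateFlag(
  client: FlagClient,
  key: string,
): Promise<boolean> {
  const fallback = SAFE_DEFAULTS[key] ?? false;
  try {
    return await client.variation(key, fallback);
  } catch {
    // Provider unreachable: use the hardcoded safe default instead of `false`.
    return fallback;
  }
}
```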
Customer Impact
Several customers were significantly impacted:
- Some reported being unable to manually page through Signals
- One customer experienced SSO authentication failures for 1 hour and 26 minutes (19:08-20:34 UTC)
- Multiple customers experienced runbook execution failures for steps dependent on affected integrations
Specific service impacts included:
- SMS and WhatsApp alert delivery severely degraded for approximately 8 minutes (13:18-13:26 UTC)
- Scribe unable to join Zoom or Google Meet calls due to Slack, Zoom, and Recall.ai failures (06:49-19:43 UTC)
- Runbook steps dependent on Jira, Confluence, OpsGenie, PagerDuty, and Slack failed or experienced significant delays
- Feature flag dependent UI components became inconsistently available
- AI summary generation experienced timeouts in Slack due to synchronous processing
- Manual paging via the UI and Slack was not available for approximately 1 hour and 14 minutes (19:08-20:22 UTC)
No customer data was lost or compromised during this incident. Throughout the incident, users were able to declare and manage incidents in Slack and MS Teams, though with some delay due to integration dependencies.
All Signals notifications for alert ingestion continued to function normally.
Technical Timeline
October 20, 2025
07:11:00 UTC - AWS us-east-1 begins experiencing DNS resolution issues affecting DynamoDB
07:26:41 UTC - AWS confirms significant error rates across multiple services
07:51:09 UTC - AWS identifies DNS as root cause, begins mitigation
09:27:33 UTC - AWS reports significant recovery signs
10:35:37 UTC - AWS declares DNS issue fully mitigated, services recovering
13:10:56 UTC - FireHydrant incident declared, initial reports of paging failures
13:15-13:51 UTC - Systematic validation of all integration partners completed
- Recovered: Zoom, Asana, BugSnag, Checkly, GitHub, Honeycomb, Linear, Shortcut, Slack
- Still impacted: PagerDuty, ZenDesk, Jira, OpsGenie, Confluence
13:26-13:42 UTC - Siren delivery latency SLO breach confirmed for 8-minute window
14:01-14:50 UTC - Root causes identified: Twilio (SMS/WhatsApp) and Recall.ai outages
16:34-16:53 UTC - Manual Temporal failover triggered to us-east-2
19:20 UTC - LaunchDarkly outage begins impacting feature flags
19:58-20:18 UTC - Emergency patches deployed for multi-org SSO
20:22 UTC - Feature flag manifest updated with hardcoded defaults
22:47-23:07 UTC - All patches deployed and incident mitigated; monitoring continues
October 21, 2025
13:09:43 UTC - Incident resolved after confirmation of full upstream recovery
Resolution
The engineering team implemented multiple emergency mitigations:
- Feature Flag Hardcoding: Critical feature flags were hardcoded in multiple pull requests to restore baseline functionality
- SSO Authentication Fix: Hardcoded account IDs for multi-org SSO to restore login capability
- Asynchronous Processing: Updated AI summary generation to run asynchronously, preventing Slack timeouts
- Web UI Resilience: Added backup fetching mechanisms with a 15-second timeout for summary availability (see the sketch after this list)
- Temporal Failover: Manually triggered failover from us-east-1 to us-east-2 for workflow orchestration to maintain alert delivery pipeline
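The backup-fetch behavior can be illustrated with a minimal sketch: try the primary summary endpoint, abort after 15 seconds, and fall back to a secondary source. The endpoint paths and cache query parameter below are assumptions for illustration, not our actual API surface.

```typescript
// Hypothetical sketch of a backup fetch with a 15-second timeout.
const SUMMARY_TIMEOUT_MS = 15_000;

async function fetchWithTimeout(url: string, ms: number): Promise<Response> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await fetch(url, { signal: controller.signal });
  } finally {
    clearTimeout(timer);
  }
}

export async function loadIncidentSummary(incidentId: string): Promise<string> {
  try {
    // Primary source: the live summary endpoint (illustrative path).
    const res = await fetchWithTimeout(
      `/api/incidents/${incidentId}/summary`,
      SUMMARY_TIMEOUT_MS,
    );
    if (res.ok) return await res.text();
  } catch {
    // Timed out or failed: fall through to the backup source below.
  }
  // Backup source: a cached or last-known-good summary (illustrative path).
  const backup = await fetch(`/api/incidents/${incidentId}/summary?source=cache`);
  return backup.ok ? backup.text() : "Summary temporarily unavailable";
}
```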
Next Steps
We are implementing several improvements to prevent similar incidents:
- Feature Flag Resilience: Updating all production feature flags to default to safe values when LaunchDarkly is unavailable
- Asynchronous Workflows: Converting all non-critical synchronous operations to asynchronous patterns (see the sketch after this list)
- Integration Health Monitoring: Enhanced monitoring for third-party service dependencies and updated status page components for end users to quickly evaluate integration health
- Documentation Updates: Engineering runbooks updated with emergency mitigation procedures
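The asynchronous pattern referenced above can be sketched as follows, assuming a generic background job queue: the Slack interaction is acknowledged immediately, and the slow AI summary generation runs in a worker with its own retries and timeouts. The queue interface, job name, and handler shape are illustrative assumptions, not our actual implementation.

```typescript
// Hypothetical sketch: acknowledge the Slack request right away and do the
// slow, failure-prone work asynchronously, so an upstream outage cannot
// cause the Slack interaction itself to time out.
interface JobQueue {
  enqueue(jobName: string, payload: unknown): Promise<void>;
}

export async function handleSummaryCommand(
  queue: JobQueue,
  incidentId: string,
  respond: (text: string) => Promise<void>,
): Promise<void> {
  // Respond within Slack's interaction deadline, before doing any slow work.
  await respond("Generating the incident summary; it will be posted here shortly.");

  // The AI call and result posting happen in a background worker.
  await queue.enqueue("generate-incident-summary", { incidentId });
}
```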
Commitment to Reliability
We understand that you rely on our services for your critical operations. We are committed to maintaining the highest levels of service reliability and transparency, and encourage you to subscribe to our Status Page for timely communication. The measures we're implementing will help ensure we continue to meet the standards you expect from us.
Questions or Concerns?
If you experienced any issues during this incident that have not been resolved, please don't hesitate to contact our support team. We're here to help and will be happy to provide additional information or assistance.
Thank you for your understanding and continued trust in our service as we work to meet the standards we expect of ourselves.
Sincerely,
The FireHydrant Team