Service Status Update: December 4, 2024
On December 4, 2024 at 16:16 UTC, we experienced an outage that caused delays in our event processing system related to Incident Management. Signals notification delivery was not affected.
By Danielle Leong on 12/4/2024
Summary
On December 4, 2024 at 16:16 UTC, we experienced an outage that caused delays in our event processing system related to Incident Management. This resulted in slower than normal processing times for some customer requests and degraded performance across our web application, Slack, and API for our Incident Management product. The incident has been fully resolved, and all systems are operating normally. We sincerely apologize for any disruption this may have caused to your operations.
Signals notification delivery was not affected.
What Happened
Our monitoring systems detected unusual delays in our event processing pipeline. Investigation revealed that an automated workflow system had entered an infinite loop with an integration’s automation rules, causing a buildup of events that needed to be processed. This led to increased processing times for normal system operations and degraded site performance.
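For readers curious how this class of failure arises: when two sets of automation rules can each emit events that match the other, every processed event enqueues another, and the queue grows without bound. A common guard is a per-event hop count that breaks the cycle after a fixed number of automated hand-offs. The sketch below is illustrative only and does not reflect FireHydrant's actual implementation; all names and the MAX_HOPS value are assumptions.

```python
from dataclasses import dataclass

MAX_HOPS = 10  # illustrative threshold: beyond this, assume rules are re-triggering each other

@dataclass
class Event:
    kind: str
    hops: int = 0  # how many automation rules have already fired on this chain

dropped = []  # events discarded by the loop guard (in practice this would alert on-call)

def guard(event):
    """Return True if the event may be re-enqueued, False if the loop guard trips."""
    if event.hops > MAX_HOPS:
        dropped.append(event)
        return False
    return True

def apply_rule(event):
    """A rule that always emits a follow-up event -- the pathological looping case."""
    return Event(kind=event.kind, hops=event.hops + 1)

# Simulate two integrations whose rules trigger each other indefinitely.
queue = [Event(kind="incident.updated")]
processed = 0
while queue:
    event = queue.pop()
    processed += 1
    follow_up = apply_rule(event)
    if guard(follow_up):
        queue.append(follow_up)
```

Without the guard, the while loop never terminates; with it, the chain is cut off after MAX_HOPS automated hand-offs and the dropped event can be surfaced to the user instead.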
Customer Impact
Customers experienced severe delays in incident declaration and updates for 184 minutes
System response times were much slower than usual, causing Slack update timeouts
Delivery of incident-related Slack messages was delayed by an additional 98 minutes
No customer data was lost or compromised during this incident.
All Signals notifications were delivered as expected with no delays.
Resolution
Our engineering team took immediate action to:
Identify the source of the infinite loop and disconnect the affected services
Clear the backlog of pending event requests
Process the backlog of Slack update requests
Restore normal processing speeds
All systems were returned to normal operation by 19:12 UTC.
Prevention Measures
We take service reliability very seriously. To prevent similar incidents in the future, we are:
Implementing additional safeguards in our automation systems to notify us sooner of backlogged queues
Improving circuit breakers on event queues to allow for faster recovery on backlogged queues
Implementing rate limiting on automation steps and improved notifications for users if an infinite loop is detected
Improving our Slack notification retry logic to prevent a thundering herd issue
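The post does not describe the retry change in detail, but a standard remedy for the thundering-herd problem it names is exponential backoff with "full jitter": each failed delivery waits a random duration drawn from an exponentially growing window, so a backlog of failed Slack updates does not retry all at once. The function below is a generic sketch with illustrative names and parameters, not FireHydrant's code.

```python
import random

def backoff_with_jitter(attempt, base=1.0, cap=60.0):
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt)].

    The exponential window spaces successive attempts further apart, and the
    randomization spreads many clients' retries across the window instead of
    synchronizing them at the same instant.
    """
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

A caller would sleep for backoff_with_jitter(attempt) between delivery attempts; the cap keeps the worst-case delay bounded even after many failures.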
Commitment to Reliability
We understand that you rely on our services for your critical operations. We are committed to maintaining the highest levels of service reliability and transparency, and encourage you to subscribe to our Status Page for timely communication. The measures we're implementing will help ensure we continue to meet the standards you expect from us.
Questions or Concerns?
If you experienced any issues during this incident that have not been resolved, please don't hesitate to contact our support team. We're here to help and will be happy to provide additional information or assistance.
Thank you for your understanding and continued trust in our service as we rise to the standards we expect of ourselves.
Sincerely,
The FireHydrant Team