Sidekiq to Temporal: a zero-downtime migration strategy
A safe, repeatable process for migrating real-time production software. How we transformed high-risk rewrites into in-place delivery with a 99.995% success rate at launch

Scaling always presents good problems to solve. Runbooks are the heart of our incident management platform, and the automation they provide allows customers to focus on fighting fires faster. As FireHydrant grew, our Runbooks automation evolved from simple background jobs into complex, stateful orchestration. The increasingly sophisticated tools resonated with our customers: average monthly users increased by over 50% in 2025 alone. We’ve been huge fans of Temporal for a while, so it was an easy decision to lean further into the technology.
Migrating a live, highly-available system is no easy task. We used a method called Wrap, Refactor, Reroute, Retire (pronounced "whirr") to ship the changes safely — delivering a 99.995% success rate in the first 24 hours after go-live, a rate that has only improved since. Here's why we chose Temporal, and how WRRR de-risked the migration.
Why Migrations are Hard
FireHydrant was originally built on Sidekiq, the gold standard for Ruby on Rails job queues. As Runbooks grew to support multi-step logic, conditional branching, and third-party API integrations, we needed more robust patterns for error handling, retries, and stateful orchestration. After successfully deploying Temporal in our Signals product (see case study), migrating Runbooks was an easy decision — just not an easy task.
Framework migrations carry two distinct risks. The first is correctness: every system change is an opportunity for error. A faulty cutover could drop live incidents at exactly the wrong moment. "Big bang" rewrites are never a great idea with high-availability systems since all changes must be simultaneously correct when they ship together. The second risk is incompleteness: a stalled migration can leave the system in a more complex state than before. Intermediate states have a way of becoming permanent as priorities shift, and incomplete migrations don't just add technical debt: they multiply it.
A well-architected migration minimizes both risks. Changes should be small, independently deployable, and easily reversible. So even if the migration stalls, the team isn't left maintaining duplicated code or debugging customer issues across two systems at once.

The Zero-Downtime Migration Playbook
With those risks in mind, we turned a high-risk rewrite into a safe, repeatable process: Wrap, Refactor, Reroute, Retire.
1. Wrap: Enclose the existing implementation inside a CompatibilityContext. No logic changes here; this step just introduces the new context boundary for future use.
2. Refactor: Break the protected logic into Temporal workflows and activities, with each API call becoming its own activity. At this stage, activities still run synchronously on Sidekiq, but the code is now structured for Temporal. Each step can be deployed and verified in isolation, without changing runtimes.
3. Reroute: Shift all new jobs to Temporal by updating the call sites that enqueue Sidekiq jobs. Existing Sidekiq jobs continue to process safely through the wrapper, so there's no hard cutover.
4. Retire: Once the legacy Sidekiq queues drain, delete the Sidekiq job. The workflow now runs entirely in Temporal.
Because refactoring the code and rerouting the infrastructure are decoupled, every step is independently deployable and easily reversible. A job can be shimmed into Temporal one API call at a time, making the final Temporal launch a non-event rather than a moment of risk.
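The Reroute step can be sketched as a call-site switch: new work starts a Temporal workflow, while the legacy enqueue path stays available until its queues drain. The dispatcher, flag, and client names below are illustrative stand-ins, not our actual code.

```ruby
# Hypothetical sketch of the Reroute step: the call site that used to
# enqueue a Sidekiq job starts a Temporal workflow instead.

# Stand-in for the legacy Sidekiq job from earlier in the article.
class TheLegacyJob
  def self.perform_async(id)
    :enqueued_on_sidekiq
  end
end

class RunbookStepDispatcher
  def initialize(temporal_client:, use_temporal: true)
    @temporal_client = temporal_client
    @use_temporal = use_temporal
  end

  def dispatch(step_id)
    if @use_temporal
      # New path: start the workflow on the Temporal server.
      @temporal_client.start_workflow("TheWorkflow", step_id)
    else
      # Legacy path: enqueue on Sidekiq; kept until the old queues drain.
      TheLegacyJob.perform_async(step_id)
    end
  end
end
```

Because the flag only changes where new work starts, flipping it back is a one-line rollback, and in-flight Sidekiq jobs are unaffected either way.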
Why Temporal?
Let’s start with a fundamental requirement for incident management: reliability. Incidents are high-stress situations. The last thing anybody wants during an incident is an incident with the incident management – an incident².
Here’s a real-world reliability problem: a third-party API is temporarily overloaded or offline. The runbook needs to retry that API call until it succeeds. A rate-limiting event can last 30 seconds. The straightforward approach–busy-looping until success–locks up workers to hurry up and wait. Locking the workers also has bad downstream effects like gating software changes on empty job queues.
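As a concrete sketch of that anti-pattern, here is roughly what busy-looping a rate-limited call looks like. The `RateLimited` error and helper method are hypothetical stand-ins, not real code from our stack.

```ruby
# Hypothetical illustration of the busy-loop anti-pattern described above.
class RateLimited < StandardError; end

def call_with_busy_loop(max_attempts: 5, base_delay: 1)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue RateLimited
    raise if attempts >= max_attempts
    # The worker sits idle for the entire backoff window, occupying a
    # slot that could be processing other jobs.
    sleep(base_delay * (2**attempts))
    retry
  end
end
```

All progress lives in local variables, so a worker restart loses everything mid-retry; the durable alternative is to persist state and re-enter on the next attempt.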
Resilience at scale means saving progress, exiting gracefully, and reloading state on the next attempt. How hard could that be? What’s a little state management among friends? Sidekiq’s recommended best practice is idempotency, and database transactions address many needs, but third-party tools operate outside the transaction.
Even a simple workflow like paging responders poses several challenges. We might need to call escalation policy and team schedule APIs. We shouldn’t repeat these on retry, both to protect system load and to not repeat points of failure. And the last thing we want is to accidentally page teams multiple times during an already stressful incident and run into issues managing the states of already-ack’d pages.

As more steps are strung together into more workflows, the result is a non-linear web of job callbacks with increasingly large parameter bundles. Complexity here impacts the system in every respect: development, debugging, operation, and planning.
Temporal solves these issues out of the box. The workflow history provides execution state and, crucially, linearizes the apparent control flow. Engineers can write straightforward code to perform an action for a set of inputs and Temporal tracks state across retries. Context exists within the workflow definition, not scattered across callbacks.
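As an illustration of that linearized style, here is roughly how a paging workflow reads once each external call is its own activity. The class names are hypothetical, and the stub base class below stands in for the real SDK so the sketch is self-contained; in a real Temporal runtime, each activity result would be recorded in the workflow history and skipped on replay.

```ruby
# Illustrative sketch only: the stand-in base class runs the activity
# inline, where the real SDK would schedule it and record it durably.
class StubActivity
  def self.execute(*args)
    new.run(*args)
  end
end

class FetchEscalationPolicy < StubActivity
  def run(team_id)
    { team_id: team_id, responders: ["alice", "bob"] }  # imagine an API call
  end
end

class PageResponder < StubActivity
  def run(responder)
    "paged:#{responder}"  # imagine a call to the paging provider
  end
end

class PageTeamWorkflow
  # Straight-line code: fetch the policy once, then page each responder.
  # With Temporal tracking state across retries, a re-run resumes from the
  # history instead of re-calling APIs that already succeeded (and
  # double-paging an already stressed team).
  def execute(team_id)
    policy = FetchEscalationPolicy.execute(team_id)
    policy[:responders].map { |r| PageResponder.execute(r) }
  end
end
```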
This new architecture was a big win for our technical stack. Solving this necessary complexity of distributed systems boosts velocity for the engineering team. More importantly, it translates directly to customer value: more reliability in the face of transient failures, providing peace of mind when it’s needed most.
Now let’s talk about actually performing the migration.
High-availability versus duplication
Some duplication is necessary to migrate highly-available software. To execute a zero-downtime migration, the legacy & target runtimes must coexist until (1) no more legacy jobs will be enqueued, and (2) all legacy jobs have been processed.
Hence the logic of duplicating code: the easiest way to support jobs in both runtimes is to copy & adapt the job into the new runtime. Consider a job that iterates through a set of keys and makes an API call for each one:
```ruby
class TheLegacyJob < BaseJob
  def perform(id)
    keys = get_keys(id)
    keys.each { |k| do_the_thing(k) }
  end

  def do_the_thing(k)
    api_client.call(k)
  end
end
```
This seemingly simple job hides real complexity. What happens if the API call fails partway through the keys? When we retry, how do we avoid calling the external API again for already-processed keys? All these questions are why state management is hard. Sidekiq batches & callbacks help, at the cost of complex non-linearity.
This is where Temporal shines: it's so easy to implement idempotent asynchronous jobs. The Temporal equivalent would be:
```ruby
class TheWorkflow < BaseWorkflow
  def execute(id)
    keys = workflow.side_effect { get_keys(id) }
    keys.map { |k| TheActivity.execute(k) }.each(&:wait)
  end
end

class TheActivity < BaseActivity
  def execute(k)
    api_client.call(k)
  end
end
```
Even in this simple example, it’s not obvious how to share much code across these two implementations. After all, a Sidekiq job has no concept of waiting on an activity execution. We’ve ended up with duplication like this:

This idealized example makes it easy to see that the Sidekiq job and the Temporal workflow do the same thing: get the keys from an input id, then make an API call for each. But in practice, these code paths are rarely so simple. Business logic typically varies by configuration, filters data based on settings and permissions, and much more.
Duplicating so much logic puts us at risk of intermediate states becoming permanent.
If the migration stalls, the codebase is left straddling two systems. This makes the system more complex and harder to maintain than before we started. Maintaining two code paths for the same logic is a recipe for problems. It also makes it harder to validate changes, making future code changes more complex (again: technical debt). This is particularly expensive for AI-powered platforms where automated validation is key to safe velocity.
But the Temporal paradigm is simply different from Sidekiq. So how can we not duplicate code? How can a Sidekiq job run code that uses the Temporal runtime? The answer is: don’t – use a fake, synchronous Temporal runtime instead.
Bridging the Gap: The Compatibility Context
The ‘Strangler Fig’ pattern is a popular approach to modernize a codebase by growing the new system around the legacy system. But we specifically did not want mission-critical operations split across systems. We also couldn’t commit to rewriting–and revalidating–each runbook step before closing the intermediate state.
Our solution was to migrate the system inside out, bridging the gap between runtimes until it became small and easily verifiable. We needed to prepare Sidekiq jobs for Temporal constructs: putting remote calls behind activities, but not actually launching them. The key is to run the “workflow” code in-place, synchronously, within the local runtime (and without any of the Temporal state management).
How? Enter the AbstractTemporal::CompatibilityContext.
The compatibility context is a migration shim and testing utility that runs code designed for the Temporal runtime in non-Temporal environments. Instead of dispatching commands through the Temporal server and relying on its state management, workflows and activities are executed immediately within the current process.
With this shim in place, TheLegacyJob transforms from a duplicated orchestration into a tiny wrapper:
```ruby
class TheLegacyJob < BaseJob
  def perform(id)
    AbstractTemporal::CompatibilityContext.activate do
      TheWorkflow.execute(id)
    end
  end
end
```
Now we have a single source of truth for our core logic:

During the transition period, we can be confident the logic is the same across runtimes because indeed, the same TheWorkflow code is running in both environments.
This lets us migrate the Sidekiq job “in place” with Temporal constructs, but execute them synchronously until we are ready to route traffic to the Temporal servers. At that point, the workflow that was running within the compatibility context will run in a real Temporal context that launches & waits for asynchronous function calls.
Under the Hood: How the Ruby Context Works
To understand how the compatibility context works, let’s look at the Temporal SDK. We adopted Temporal before the official Ruby SDK so we’re using the Coinbase Temporal Ruby SDK.
The SDK defines a Temporal activity, and its included workflow convenience methods define the class method execute like so: [src]
```ruby
def execute(*input, **args)
  context = Temporal::ThreadLocalContext.get
  raise 'Called Activity#execute outside of a Workflow context' unless context
  context.execute_activity(self, *input, **args)
end
```
The SDK uses Thread.current to determine the current Temporal runtime. In a real Temporal environment, those context methods are quite complex; they communicate with the Temporal server, read from the workflow history, etc.
The compatibility context implements the same interface as the regular context. The activate block modifies the Thread.current's local data to insert our custom compatibility context. Rather than interface with the Temporal runtime, operations like TheWorkflow.execute are processed inline. Next, we support asynchronous behavior, like waiting for a fan-out to complete:
```ruby
futures = [1, 2, 3].map { |i| AnActivity.execute(i) }
results = futures.map(&:wait)
```
In a real Temporal runtime, execute returns a Future, schedules the future activity, then wait pauses the worker until the result is available. In the compatibility runtime, AnActivity.execute runs immediately inline and yields a mocked future that already contains the resolved result. Calling .wait on it simply returns that result.
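To make the mechanics concrete, here is a minimal, self-contained sketch of how such a compatibility context can be built on the thread-local lookup shown above. Class and method names are illustrative, not FireHydrant’s actual implementation.

```ruby
# A future whose value is already known: .wait just returns it.
class ImmediateFuture
  def initialize(value)
    @value = value
  end

  def wait
    @value
  end
end

class CompatibilityContext
  KEY = :compatibility_temporal_context

  # Install this context in Thread.current for the duration of the block,
  # mirroring how the SDK resolves its thread-local context.
  def self.activate
    previous = Thread.current[KEY]
    Thread.current[KEY] = new
    yield
  ensure
    Thread.current[KEY] = previous
  end

  def self.current
    Thread.current[KEY]
  end

  # Instead of dispatching a command to the Temporal server, run the
  # activity inline and wrap the result in an already-resolved future.
  def execute_activity(activity_class, *input)
    ImmediateFuture.new(activity_class.new.run(*input))
  end
end

# An activity whose class-level execute dispatches through the current
# context, shaped like the SDK snippet above.
class AnActivity
  def self.execute(*input)
    context = CompatibilityContext.current
    raise "Called Activity#execute outside of a Workflow context" unless context
    context.execute_activity(self, *input)
  end

  def run(n)
    n * 2
  end
end

# The fan-out example now runs synchronously, inline:
CompatibilityContext.activate do
  futures = [1, 2, 3].map { |i| AnActivity.execute(i) }
  futures.map(&:wait)  # => [2, 4, 6]
end
```

The `ensure` block restores the previous thread-local value, so nested or re-entrant activations unwind cleanly, and code outside the block still fails loudly if it tries to execute an activity with no context installed.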
This mechanism ensures that the synchronous-looking, linearized Temporal workflows execute the same logic whether they’re backed by the actual Temporal runtime or running synchronously inside a legacy Sidekiq process. There are some caveats: the shim isn’t meant to support both runtimes indefinitely, and applying it is more complex when the Sidekiq job has callbacks.
Conclusion
As systems scale, refining system architecture is less luxury and more necessity. Simple, linear background jobs quickly evolve into complex workflows. Next thing you know, you’re solving distributed runtime state management and building an asynchronous idempotency system just to handle an API’s rate limits.
Temporal solves this by linearizing the control flow with a durable workflow state engine. This makes resilience an easy feature rather than an advanced architectural pattern, enabling more focus on product features instead of systems logistics.
Getting to that future state requires navigating the risky, intermediate migration period. By investing in tooling like the Compatibility Context, we derisked the cutover with small incremental changes to avoid unwieldy duplication. We bridged the gap between correctness and architectural simplicity, avoiding both errors and technical debt.
The resulting migration delivered a solid 99.995% success rate during its first 24 hours, and improved runtime visibility lets us identify root causes more quickly. It’s a valuable reminder about system migrations at scale: delivering an iterative transition is more effective than risking the complex results of an abandoned rewrite.