How to Run Code (Safely) on Repeat Forever

Jon Sully

@jon-sully

We found ourselves in a less-than-common situation: we had a chunk of code that we wanted to run nearly-constantly (at least once every couple seconds) but which should also run in a single-threaded style (subsequent runs of that logic shouldn’t overlap if any take longer to process). Let me explain...

The Use-Case

We take on a lot of traffic here at Judoscale. Almost all of it is real-time metrics for the various applications we autoscale — thousands of POSTs to our servers containing request and job queue times from our clients’ applications as they run.

An ocean of data...

Ultimately this data needs to be sorted, filed, and aggregated for it to be of any use to us. For performance reasons, our architecture for those POST-handling controllers is to, essentially, save the data quickly and process it later. More technically, we push the POST data straight into Redis and let a different piece of our architecture handle it. We want our POSTs to yield fast 200’s! (And they do — our average response time is just 10ms)
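
As a rough sketch of that "save fast, process later" idea, the controller side can be as simple as pushing the raw payload onto a Redis list and returning immediately. The controller and key names below are purely illustrative (not our actual code), and it assumes some Rails.redis connection helper like the one used in the locking example later in this post:

# ~/app/controllers/metrics_controller.rb (illustrative)

class MetricsController < ApplicationController
  def create
    # Stash the raw payload for later processing; respond right away
    Rails.redis { |r| r.rpush("incoming_metrics", request.raw_post) }
    head :ok
  end
end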

But that of course leaves the question: how do we ‘process it later’ at that scale and speed?

Well, before we get there, let’s talk about the constraints. Loosely, we have at least a thousand bundles of data to process every second. And those bundles can contain a lot of data, so processing may itself take a few moments. So first, we need this system to run often. Really often. Letting our fire-hose of data queue up is bad for memory, bad for processing times, and bad for Judoscale’s ability to scale up your application quickly. Second, given that processing this data requires non-trivial resources, we don’t want multiple copies of the processing to run at once — they might both attempt to process the same incoming data and/or step on each other’s toes, which is wasteful and potentially bug-causing.

So, in short, we want our incoming data processing to behave like a single-threaded loop: a chunk of code that runs over and over sequentially, but which cannot step on itself since the current loop iteration must finish before the next begins. We want to fully-process all of our incoming data bits then repeat, making sure that only one actor is doing so at a time. We’re going to call this continuous code from here on out! Let’s dive in.

A Few Concepts

There are plenty of ways one could implement the continuous code concept, but we considered three. These break down into two camps: setups that make use of a background job system and setups that don’t. On the ‘use background jobs’ side, we’ll talk through continuous code via:

  • Self-re-enqueuing background jobs
  • A scheduler + global lock

And on the not-background-jobs side, we’ll cover continuous code via:

  • A forever-running Rake task

Though, it should be noted that each of these methodologies comes with its own pros and cons, and (I hear your groans) there is no single right answer for every application out there. Our goal here is to walk you through why we chose the best path for us. Yours might be different!

To Background Job or To Not Background Job

Before diving into the implementations, it’s worth talking about the tradeoffs in using your background job system for Continuous Code.

Diagram comparing work as jobs in a background process vs. a standalone, looping process

Generally, a background job system isn't built for this task. Background job systems exist to allow you to run some code asynchronously outside of a web-request thread (a ‘background job’), but the guarantees and styles of these systems are different from what the Continuous Code concept needs. For starters, background job systems tend to be multi-process, multi-threaded, or both — you want several different job runners to handle all the various jobs that might be fired off. This can stand at odds with Continuous Code, where we need to ensure that the next pass only begins after the current pass finishes. Additionally, background jobs aren’t typically designed to loop. They’re designed to run a specific chunk of code with particular arguments, mark it completed, and move on to the next job. Overcoming these differences in design is added complexity.

On the other hand, not using a background job system means we’re not taking advantage of infrastructure and frameworks that we already run, develop for, and support. Inevitably, if we’re not using a background job system, we’ll end up implementing some sort of new runner that requires its own processing power and upkeep. This too is added complexity.

These tradeoffs will become more apparent as we talk through the specific implementations we considered, but keep them in mind along the way.

The Self-re-enqueueing Background Job

repeating job diagram

The Idea

We build a job that does its work then immediately re-enqueues a new copy of itself back into the job queue. Something like this (a Sidekiq and Rails example):

# ~/app/jobs/cycle_job.rb

class CycleJob
  include Sidekiq::Worker
  
  def perform
    # Do this pass's work...
    results = SomeDatabaseQuery.run
    aggregated_data = Aggregator.call(results)
    AggregatedStuff.insert(aggregated_data)

    # ...then enqueue the next link in the chain
    CycleJob.perform_async
  end
end

(And, of course, we’d want some error handling to ensure that the job gets re-enqueued even if the current iteration errors for some reason.)
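
As one sketch of that, the re-enqueue could live in an ensure block so the next link gets created even when the work raises (same hypothetical classes as above):

# ~/app/jobs/cycle_job.rb (illustrative error handling)

class CycleJob
  include Sidekiq::Worker

  def perform
    results = SomeDatabaseQuery.run
    aggregated_data = Aggregator.call(results)
    AggregatedStuff.insert(aggregated_data)
  ensure
    # Enqueue the next link even if this pass raised
    CycleJob.perform_async
  end
end

Even this has sharp edges, though: if your job library also retries failed jobs (as Sidekiq does by default), an ensure-style re-enqueue plus a retry can end up creating more than one "next" link.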

Does it satisfy the “run often” premise? Well... usually. At first glance it appears that it’ll run nearly as fast as the work can be done, but we should be careful with that assumption. It’s true that the next copy of the job gets enqueued about as fast as the work can be done, but we should remember that there’s a latency between when a job is submitted and when it’s picked up by a job-runner. We should also remember that the job is submitted to a queue. Which is to say, there could be other jobs ahead of it! But let’s assume that our application is following healthy standards for background jobs and keeps low queue times. Does it ‘run often’? Yes.

Does it satisfy the “don’t run multiple copies at a time” premise? Well... usually. (I know — frustrating, right?) Again, at first glance it appears as though one job kicking itself off again would grant us the guarantee that never more than one copy is running at a time. And that is true almost all of the time. So for now, we’ll just say yes.

The Problem(s)

I should note first that we ran this exact setup at Judoscale for years and became very aware of the following two ways this structure can fail. Generally in the middle of the night, of course 😅

The chain is broken. A job enqueuing the next copy of itself repeatedly forever is essentially a chain. Each link forges the next. If for some reason, somehow, one link doesn’t create the next, the chain stops. It has no recovery at that point. The only recourse is for us to manually create a new chain again.

Now, in theory, this shouldn’t happen! We have error handling! We have smart tools! Yeah... and yet. No matter how far we dove down the rabbit hole, the self-enqueuing model always yielded a broken chain after some amount of time. Could it be our own incompetence? Entirely possible. But regardless, we lost confidence.

The chain is... duplicated? Continuing our chain metaphor (and the theme of inexplicable results), we occasionally observed situations where we somehow had two chains of the same job going. This implies that somewhere along the way, a particular chain link decided to create two links after itself instead of one. And thus, we’ve broken the “don’t run multiple copies at a time” premise!

The Benefit(s)

The one stand-out benefit of this setup is the lack of added complexity. You don’t need any kind of third-party scheduler, you don’t need any additional processes, and you don’t need to make your architecture more complex. You simply have a job that kicks itself off again when it’s done. And it works almost all of the time. That’s a real complexity benefit! If your team and/or use-case is open to the premise that the job might just need to be manually started again once in a while, or you’re working on an early-stage proof-of-concept, this can be a real win. It’s faster to implement, easier to understand, and quicker to get running than any other option. (Just make sure you have reliable monitoring in place!)

Our Verdict

The self-re-enqueuing background job is intuitive, but the structure does come with a cost in the reliability department. Some of this comes back to an earlier point: background job systems weren’t intended to offer the guarantees we’re after, so trying to get those guarantees yields friction and fringe-cases. Should the chain ever break? No! Should the chain ever branch/duplicate? No! Do those things still happen somehow? ...yes.

Given that our use-case is a critical backbone of our system — a chain that should never break — we decided to move away from this pattern. That said, we ran this setup for years and, while we did have chain-breaking issues, they were rare enough that we lived with them for that long 😉. This is likely the right setup for lots of early-stage applications and/or less-critical needs.

The Scheduled Background Job with a Global Lock

That’s a mouthful!

clock-based job diagram

The Idea

This one isn’t too far from the above, but instead of one instance of a job kicking off the next, we instead have a scheduler process running somewhere that’s kicking off jobs itself on some very quick schedule. Maybe something like this:

# ~/clock.rb

class Clock
  include SomeScheduleFramework
  
  every 2.seconds { CycleJob.perform_async }
  every 5.minutes { SomeOtherJob.perform_async }
end

# ~/app/jobs/cycle_job.rb

class CycleJob
  include Sidekiq::Worker

  LOCK_KEY = "cycle-job-lock"

  def perform
    # Lock against other attempts (skip this pass if another copy holds the lock)
    return unless Rails.redis { |r| r.set LOCK_KEY, "busy", nx: true }

    begin
      results = SomeDatabaseQuery.run
      aggregated_data = Aggregator.call(results)
      AggregatedStuff.insert(aggregated_data)
    ensure
      # Unlock for the next pass (even if this pass raised)
      Rails.redis { |r| r.del LOCK_KEY }
    end
  end
end

But there’s a bit of nuance here: we’ve introduced a new chunk of architecture, the Clock process, as well as a locking system.

The Clock process is a simple idea that we've written about before — an additional process that exists solely to kick off asynchronous jobs at various time intervals. This is generally just a Ruby implementation or clone of Cron. But it is additional overhead and is best run as a separate process when deployed in a production environment. This gives it the isolation and consistency needed for reliability. And separate clock processes, like Cron, are indeed very reliable.
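
For a concrete flavor of what a Clock process can look like, here's a minimal sketch using the clockwork gem (one of several scheduler gems; the jobs are the same hypothetical ones from above):

# ~/clock.rb

require "clockwork"
require_relative "config/boot"
require_relative "config/environment"

module Clockwork
  every(2.seconds, "cycle_job")      { CycleJob.perform_async }
  every(5.minutes, "some_other_job") { SomeOtherJob.perform_async }
end

The process itself is then started with something like bundle exec clockwork clock.rb, typically declared as its own entry in your Procfile.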

The second piece of nuance here is the locking. The code above shows an example of how one might accomplish this locking using Redis, but any style or implementation of pessimistic locking should work just the same. The idea is simply that, while the Clock should only kick off the job once every two seconds, it’s possible that one copy of the job may run for longer (for reasons unknown). If that were the case, the next copy of the job would realize the lock is still checked out and skip itself.
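
If you'd rather not involve Redis, here's a rough sketch of the same guard built on a Postgres advisory lock instead. It's illustrative only: the numeric lock key is arbitrary and the classes are the same hypothetical ones from above.

# ~/app/jobs/cycle_job.rb (Postgres advisory-lock variant, illustrative)

class CycleJob
  include Sidekiq::Worker

  LOCK_KEY = 42 # arbitrary application-chosen lock number

  def perform
    # pg_try_advisory_lock returns immediately: truthy only if we got the lock
    got_lock = ActiveRecord::Base.connection.select_value(
      "SELECT pg_try_advisory_lock(#{LOCK_KEY})"
    )
    return unless got_lock == true || got_lock == "t"

    begin
      results = SomeDatabaseQuery.run
      AggregatedStuff.insert(Aggregator.call(results))
    ensure
      # Release the session-level lock so a later pass can acquire it
      ActiveRecord::Base.connection.execute("SELECT pg_advisory_unlock(#{LOCK_KEY})")
    end
  end
end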

Does it satisfy the “run often” premise? Absolutely. And the clarity in the syntax is refreshing. As it reads in plain English, “every 2 seconds do the thing” is extremely easy to grok.

Does it satisfy the “don’t run multiple copies at a time” premise? Yes, we can guarantee that multiple copies of the job won’t run concurrently with our pessimistic locking approach. Technically this doesn’t mean that another loop of the code will begin as soon as the former finishes, but since we’re running quickly enough (every 2 seconds in our example), that should be fine.

The Problem(s)

This approach is pretty safe; it shouldn’t have ‘problems’ in the sense of the code stopping or needing manual intervention. Clock processes (as with Cron) are regarded as extremely reliable, as are pessimistic locking systems built on tools like Redis, MySQL, or Postgres.

That said, there are a couple of concerns worth keeping in mind with this approach. The first is that you’ll need to spin up more infrastructure — an additional process that will run indefinitely. While often minimal, this is still cost and complexity. The second is the general mixing of patterns. If you’re going to run a Clock process, you’ll probably want to do that for all of your scheduled background jobs. Mixing a Clock process for just your continuous code with another scheduling system for your other background jobs is likely not worth the mental complexity.

On the other side of that coin, if you’re already running a Clock process for your scheduled background jobs, this setup may be an easy add-on for your application!

The Benefit(s)

While this approach brings a bit of added complexity compared to others, the biggest benefit is definitely its reliability. As I’ve mentioned a couple times, Clock processes in Ruby (there are a few gems that do this) are essentially clones of the venerable Unix OS-level scheduler, Cron. Both Cron and its Ruby counterparts are tremendously reliable. They just work.

Additionally, this approach requires us to implement pessimistic locking to guarantee single-threaded-style execution. This too is added complexity, but once again, extremely reliable. Whether you opt to build the locking using Redis, MySQL, Postgres, or SQLite, all of these tools will handle our second-by-second locks with ease.

So, while this approach may have one or two layers more than other Continuous Code implementations, it remains the most reliable overall. And, to us, that means the most peaceful.

Our Verdict

Compared to the self-enqueueing job approach, the scheduled background job (+global lock) approach is harder to grok and it requires a bit more tooling. But it’s worth it. The added framework and tooling all work to support a high level of reliability and consistency — both of these yielding peace for a development team. The entire premise of Continuous Code is implicitly built upon the “it should just work” mindset, and this approach just works.

This is ultimately the pattern we moved toward with Judoscale. We haven’t seen a single issue since. It’s actually been really great.

The Forever-Running Rake Task

forever-running task diagram

The Idea

While it’s not the approach we ended up going with, it’s the alternative we almost went with. The idea is essentially to not use a background job system at all. We said at the beginning of this article:

in short, we want our incoming data processing to behave like a single-threaded loop

So... why not try exactly that? If we encode our continuous-code into a Rake task that simply loops forever, that would be a single-threaded loop. That might look like this:

# ~/lib/tasks/continuous.rake

namespace :continuous do
  task aggregate: :environment do
    loop do
      results = SomeDatabaseQuery.run
      aggregated_data = Aggregator.call(results)
      AggregatedStuff.insert(aggregated_data)
    end
  end
end

Then we can spin up a new process in our production environment that just runs that task: bundle exec rails continuous:aggregate and voilà!
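
On Heroku (or any Procfile-based platform), for example, that standalone process would typically get its own entry in the Procfile. A sketch, with an arbitrary process name:

aggregate: bundle exec rails continuous:aggregate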

Does it satisfy the “run often” premise? As often as you prefer! By default it will run as fast as the steps take to complete, but we could also add a sleep call to slow it down as preferred.
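
For example, a hypothetical speed limit inside the same task might look like:

loop do
  results = SomeDatabaseQuery.run
  aggregated_data = Aggregator.call(results)
  AggregatedStuff.insert(aggregated_data)

  sleep 1 # take a breath so an empty queue doesn't spin the CPU at full tilt
end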

Does it satisfy the “don’t run multiple copies at a time” premise? Yup! By constraining all of our code into a single loop, we benefit from the guarantees of a single-threaded loop — each iteration of the loop must complete before the next iteration begins.

The Problem(s)

Like the prior approach, this one is pretty safe too. We shouldn’t have problems with this approach that ultimately impact whether or not the code is executed. Just a few concerns around why you might, or might not, choose this approach.

The first of those concerns is simply inefficiency. It’s very likely that you’ll want to give this task a speed-limit — a sleep call that spaces out each iteration of the loop just a bit (be it 250ms, 1 second, or 5 seconds). That said, time spent sleeping is time that process sits idle. That’s wasted money: hosting resources we’re paying for but not actually using! Ultimately it may not be very much money and/or much sleeping, depending on how you balance the code, but it is something we wanted to point out.

The second concern with this pattern is that it doesn't scale very well. And I don’t mean scaling in the sense of running multiple dynos/services for this process (don’t do that!), but rather scaling with the number of Continuous Code jobs your application needs to run. Judoscale has four or five different chunks of code that need to run continuously. If we took this approach, we’d need to spin up four or five new processes, run the four or five Rake tasks that house that code, and pay the inefficiency costs multiplied by four or five. For applications with only one chunk of code that needs to run continuously, this approach is probably great! We just didn’t want to spin up all those processes.

The last concern with this pattern is that it doesn’t use a background job system to do the work. That can be a pro at times too, rather than a con. But, like most things, it can be both. The reason it falls on the con/concern side is that it expands the mental footprint of the app: when you think “oh, the work is being done in the background”, you now have to retain and recall that there are two totally separate systems by which work may be getting done ‘in the background’. It’s just more mental overhead.

The Benefit(s)

Like the self-re-enqueueing background job approach, this approach has the benefit of simplicity. Maybe even more so, since grokking this approach skips background job systems entirely — this is a single loop of code that runs forever outside of any background jobs. It’s intuitive, runs just like you’d run the same code in a local development console, and reads simply.

Additionally, this approach gains all of the benefits inherent to a long-running Rake task. When using a hosting service like Heroku, if the task ever fails and crashes, Heroku will automatically restart it. That’s great! Many platforms are well-prepared for these sorts of long-running tasks, and using them to implement Continuous Code gives us those benefits out of the box.

Our Verdict

We really liked this approach in many respects, but ultimately decided against it. While the simplicity and clarity of the syntax and execution style felt really nice, we didn’t love spinning up so many processes in our apps. Judoscale just has too many chunks of code that each need to run continuously for this approach to still feel simple.

That said, we do recommend experimenting with this approach if it might fit your application’s needs. It wasn’t the right fit for us, but it might be for you!

Wrap it Up

So there you have it. Three unique approaches to implementing the Continuous Code concept, each with their own particular tradeoffs, costs, and benefits. Hopefully this breakdown gave you some insights and/or questions to think through for your own codebase.