Any time a worker dyno is shut down (due to autoscaling, a deploy, or a daily restart), background jobs are at risk of being terminated.
Heroku and job backends like Sidekiq work together to gracefully handle this in most cases. When Heroku shuts down a dyno, processes are given 30 seconds to shut down cleanly. During this shutdown period, Sidekiq stops accepting new work and allows running jobs 25 seconds to complete. Any jobs still running after that are forcefully terminated and re-enqueued.
Sidekiq also recommends that our jobs be idempotent and transactional so that if they are prematurely terminated, they can safely re-run. This is good advice for all job backends on Heroku since Heroku can reboot your dynos at any time.
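As a rough illustration of what "idempotent" means here, the sketch below shows a job that can run twice with the same argument and only take effect once. It is plain Ruby rather than a real Sidekiq job, and the names `ReceiptLog` and `SendReceiptJob` are hypothetical stand-ins invented for this example. In a real app, the in-memory log would be a database table, and the check and the side effect would share a transaction.

```ruby
require "set"

# Hypothetical in-memory stand-in for a database table of sent receipts.
class ReceiptLog
  def initialize
    @sent = Set.new
  end

  # Returns true the first time an order is recorded, false on repeats.
  # (Set#add? returns nil when the element is already present.)
  def record(order_id)
    !@sent.add?(order_id).nil?
  end

  def count
    @sent.size
  end
end

# Hypothetical job class. A real job would also include the job backend's
# module and be enqueued rather than called directly.
class SendReceiptJob
  def initialize(log)
    @log = log
  end

  def perform(order_id)
    # If a previous run already handled this order, do nothing.
    return :skipped unless @log.record(order_id)

    # The actual side effect (sending the receipt) would happen here.
    :sent
  end
end

log = ReceiptLog.new
job = SendReceiptJob.new(log)

first  = job.perform(42) # first run does the work
second = job.perform(42) # a re-run after termination is a no-op
```

The point of the design is that a re-enqueued job checks for evidence of earlier work before acting, so a dyno restart in the middle of a queue does not produce duplicate side effects.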
If we're following these best practices, we'll have no issues with long-running jobs and autoscaling worker dynos. Our apps are imperfect, though, so we may find ourselves with long-running jobs that cannot be safely terminated and re-run.
Rails Autoscale provides a mechanism to avoid downscaling your worker dynos if any jobs are currently running. To enable this option, ensure you're on the latest gem version, and set the following config var for your app:
heroku config:add RAILS_AUTOSCALE_LONG_JOBS=true
This tells the Rails Autoscale agent to report the number of "busy" workers (actively running jobs) for each queue.
Once these metrics are being collected, you'll see a new advanced setting in Rails Autoscale.
Check this option, and you're good to go! Rails Autoscale will suppress downscaling if there are any busy workers (running jobs) for the relevant queues.
Be careful, though. If you have fairly constant job activity, your workers will never have a chance to downscale. 😬 This feature is intended for queues with sporadic, long-running jobs.
You must be running version 0.10.1 or higher to use this feature.
gem 'rails_autoscale_agent', '>= 0.10.1'
As of this writing, only Sidekiq and Delayed Job support this feature. Email me if you need this feature for Resque or Que.
This optional configuration mitigates the problem of automatic downscaling killing long-running jobs, but it does not make long-running jobs safe. Deploys and restarts can still terminate them, so if you're able, you should break up your large jobs into batches of small jobs.
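The batching suggestion above can be sketched in plain Ruby. This is only an illustration under stated assumptions: `BATCH_SIZE`, `batch_arguments`, and the `ImportBatchJob` mentioned in the comment are hypothetical names, and the right batch size depends on how long each record takes to process.

```ruby
# Hypothetical batch size; pick one so each small job finishes well
# within the shutdown window.
BATCH_SIZE = 100

# Split a list of record IDs into arrays of at most batch_size IDs.
def batch_arguments(ids, batch_size = BATCH_SIZE)
  ids.each_slice(batch_size).to_a
end

batches = batch_arguments((1..250).to_a)

# Each batch would be enqueued as its own small job, for example with a
# hypothetical ImportBatchJob.perform_async(batch). Here we only print.
batches.each_with_index do |batch, index|
  puts "batch #{index}: #{batch.size} ids"
end
```

With many short jobs instead of one long job, a terminated job only repeats a small slice of the work when it is re-enqueued.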