Scaling background jobs

Most systems often have recurring background jobs that execute at regular intervals to perform database maintenance tasks, importing data from other systems, caching data, newsletters, etc. When you have a multi tenant platform, you usually have these recurring jobs per tenant, which sometimes means hundreds or thousands of concurrent jobs. Some jobs may even be long-running jobs, while others could be very cpu intensive jobs.

Lots of open source schedulers allow you to configure (dynamically or not) when your jobs run, but often they have to run in-proc, as in, in the scheduler’s process by implementing an IJob interface to execute the job’s code.

public interface IJob
    Task Execute(JobRequest job);

This is required so that the scheduler can measure total execution time and ensure the interval between executions is respected, but also to know whether the job has completed or failed in order to retry it. Scaling schedulers often requires distributed locks to synchronize executions, splitting jobs amongst multiple scheduler instances, while at the same time, executing the actual jobs. When you have thousands of concurrent jobs this is a serious problem, as you often end up with hundreds of threads in your scheduler processes and/or massive cpu/memory usage causing a lot of strain in the scheduler with some jobs affecting the stability of others. This happens because you’re mixing two different things: scheduling vs processing of the jobs.

In order to achieve higher scalability you need to offload jobs and run them outside the scheduler. The most common solution is to use message queues, where you tipically only need to worry about scaling your message handlers by simply adding more machines/processes. If your queue becomes a bottle neck you can check whether your queuing technology supports partitioning, and if it doesn’t you can use multiple queues (perhaps one per tenant or per group of tenants).

The scheduler still needs to be notified about job completions in order to be able to ensure proper intervals and avoid concurrent executions of the same job. If your scheduler has an API where you can report job completion status from the message handlers, great. Otherwise, don’t despair.. it’s relatively easy to implement one using messaging and using a jobId as correlation, and have your scheduler’s IJob await until the actual job completes.

Continue reading