
Distributed cron replacement
Closed, ResolvedPublic

Description

A common requirement in infrastructure maintenance is the ability to execute tasks at scheduled times and intervals. On Unix systems (and, by extension, Linux) this is traditionally handled by a cron daemon. Traditional crons, however, run on a single server and are therefore unscalable and create single points of failure. While there are a few open source alternatives to cron that provide distributed scheduling, they either depend on a specific "cloud" management system or on other complex external components, or are not generally compatible with cron.

Wikimedia Labs needs a scheduler that:

  • Is configurable by traditional crontabs;
  • Can run on more than one server, distributing execution between them; and
  • Guarantees that scheduled events execute as long as at least one server is operational.

The ideal distributed cron replacement would have as few external dependencies as possible and be general enough to be usable by other infrastructures (that is, it should not rely on anything specific to Wikimedia Labs).


Version: unspecified
Severity: enhancement
OS: Linux

Details

Reference
bz57613

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 2:21 AM
bzimport set Reference to bz57613.
bzimport added a subscriber: Unknown Object (MLST).

The Facebook Open Academy program [1] is interested in this project. We need to decide whether we commit to mentoring it, and how many students, with what skills, we are looking for.

Note that this isn't a full-time program like Google Summer of Code. Each student will be working on this project about 5-8 hours a week between February and June, approximately, as if it were a university course.

[1] http://lists.wikimedia.org/pipermail/wikitech-l/2013-November/073226.html

The specification for this needs a bit more. Some questions:

  • How distributed? GitHub is "distributed", but at the same time centralized. Do we just want the task execution to be distributed, or will the entire system be distributed? IMO the best solution will most likely be something that's completely distributed.
  • Where are tasks consumed from? The description says it needs to be configurable by traditional crontabs. Does this mean a single crontab on one computer is enough to schedule a task for the system? Or should it be entered in every crontab on every computer? Neither of these is ideal, and I don't think crontab configuration in such an environment is the most practical idea.
  • Are the tasks themselves distributed? Or will a task be isolated to a single machine? With most tasks I'd imagine the latter.

If we assume the sort of answers I gave to the questions are correct, then this will be a fun task to undertake. The network would have to be configured as a full mesh, and we'd likely use something like Raft to determine consensus (Paxos works as well, but since Raft has leader election it's probably a better candidate, not to mention it has a Go implementation).

However, consensus protocols generally do not work well with large clusters, which means we'd need to implement some sort of task sharding so that smaller clusters can operate independently. This brings to mind a sort of worker approach, where a consensus algorithm is performed among the five or six masters, who then hand off tasks to a worker cluster they own. This would allow for failover and tolerate network partitions while remaining performant. But it also brings up the problem of how a master will handle failures in its own worker cluster.

OK, here's my proposal:

  • A cluster of masters, with a maximum of 10 or so, which can double as workers
  • One or more masters can manage a cluster of workers. The collection of masters can use Raft to select which will be the leader and which will be failovers.
  • Tasks will be specified in an official crontab that will be synced across the masters. The crontab can be changed on one server, and a CLI tool will trigger a sync.
  • The crontab will specify what task to execute, how often, any restrictions on which machines the task may execute on, whether the task accesses external networks, and whether the task is idempotent and/or atomic (see the sketch after this list).
  • The masters use Paxos to claim tasks for execution and to determine the official crontab specification. The master of a cluster will decide which workers will execute which tasks.
  • Workers will report back to the master when a task finishes or fails. Masters will report back to the network when a task finishes or fails. How failures are handled depends on the specification of the task.
  • We can use Go (and its Raft implementation) for the daemon, and ZeroMQ as a means of message sending.
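
To make those crontab extensions concrete, here is a minimal sketch (in Go, since that's the proposed implementation language) of what a parsed task specification could carry. All field names are hypothetical, not a settled format:

```go
package main

import "fmt"

// TaskSpec is a hypothetical representation of one extended crontab entry,
// mirroring the fields proposed in the list above.
type TaskSpec struct {
	Schedule     string // standard cron expression, e.g. "*/5 * * * *"
	Command      string // what to execute
	HostPattern  string // restriction on which machines may run it ("" = any)
	NeedsNetwork bool   // whether the task accesses external networks
	Idempotent   bool   // safe to re-execute after an ambiguous failure
	Atomic       bool   // either completes fully or has no effect
}

func main() {
	// A sample entry: a non-idempotent cleanup job pinned to db hosts.
	t := TaskSpec{
		Schedule:    "0 3 * * *",
		Command:     "/usr/local/bin/cleanup.sh",
		HostPattern: "db*",
		Idempotent:  false,
		Atomic:      true,
	}
	fmt.Printf("%+v\n", t)
}
```

How these flags are encoded in the crontab syntax itself is left open here.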

The advantages of this method are:

  1. workers and masters can both fail and some node will still be able to execute the task;
  2. it doesn't require a large number of nodes, since masters can double as workers -- a really small cluster can have just masters, with each master treating itself as its own worker cluster; and
  3. it scales to larger clusters, since the hierarchy of masters allows for better handling of network partitions.

The one problem is a non-idempotent task that is executed just before a network partition occurs, so the network cannot determine whether the task finished. The task cannot be re-executed because that might cause unwanted side effects, and the network cannot simply wait for the issue to resolve itself, since Paxos assumes messages can take arbitrarily long to deliver. Another issue is that this design does not handle Byzantine failures, although with any luck that should not be a concern for this daemon.

Thoughts?

I think that's a sound approach; my own initial thought experiments mostly centered on a network of peers (similar to your master-only scenario), but I agree that conceptually separating masters from workers is the best approach (even if, in practice, masters will also be workers in small installations).

The problem of non-idempotent jobs potentially being executed more than once is, I think, manageable:

  1. have a means to mark a job as having side effects; and
  2. have a deterministic mapping from each such job to exactly one (randomly selected) worker,

such that a job marked as having side effects and absolutely needing exactly one run can only be scheduled on the mapped node and will simply fail otherwise. This way, in case of network partition, only the masters that can still speak to the selected node would even attempt the run.

This trades the risk of the run being executed more than once for the risk that it isn't executed at all -- but since it's selectable, the user can make the appropriate tradeoff depending on what is best for the specific situation.
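
For what it's worth, such a deterministic map is easy to build without any coordination; rendezvous (highest-random-weight) hashing is one standard technique where every master, given the same job ID and worker list, independently computes the same worker. A minimal sketch, with hypothetical names:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickWorker deterministically maps a job to exactly one worker using
// rendezvous (highest-random-weight) hashing: every master computes the
// same answer from the same inputs, with no coordination needed.
func pickWorker(jobID string, workers []string) string {
	var best string
	var bestScore uint64
	for _, w := range workers {
		h := fnv.New64a()
		h.Write([]byte(jobID))
		h.Write([]byte(w))
		if s := h.Sum64(); best == "" || s > bestScore {
			best, bestScore = w, s
		}
	}
	return best
}

func main() {
	workers := []string{"worker1", "worker2", "worker3"}
	// A side-effecting job always lands on the same node; if that node is
	// unreachable, the run simply fails, as described above.
	fmt.Println(pickWorker("purge-db-backups", workers))
}
```

A nice property of rendezvous hashing is that removing a worker from the list only remaps the jobs that were pinned to it; per the scheme above, though, a side-effecting job would simply fail rather than be remapped while its node is merely unreachable.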

I don't understand why we need to write something from scratch for this, or why it needs to be fully decentralized. Decentralized things are *hard*, especially when building them from scratch. Is there really something wrong with gearman or jenkins?

(In reply to comment #6)

Is there really something wrong with gearman or jenkins?

The fact that neither of them solves the stated problem set? (Well, I suppose Jenkins could be coerced into doing something similar by combining its job monitoring with some manually crafted schedules; but that's sorta like using a car to crack nuts by driving over them -- it'll /work/ but it's hardly the best solution.) Also, Jenkins is exactly as much a SPOF as just running cron is, in which case it provides no benefit beyond the ability to farm out the actual run.

I'm not even sure where you see Gearman as related. It's a fairly nice distributed RPC-like dispatcher, but it still requires a process to schedule and start jobs, which leads us back to the initial problem.

Neither of those provides any help with the stated objective of "make sure X happens at time T (or interval I)".

Chronos[1] gets close to the stated objective, but (a) it is highly complex to use and (b) it has several boatloads of heavy external dependencies.

This is not a new problem, and it's still actively discussed. Many places have constructed workarounds[2] and ad-hoc systems to do just that[3]; I'm hoping we can make a solution that is applicable to the general case.

[1] http://nerds.airbnb.com/introducing-chronos/
[2] http://kvz.io/blog/2012/12/31/lock-your-cronjobs/
[3] http://act.yapc.eu/lpw2012/talk/4291
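
For context, the workaround in [2] amounts to wrapping each cron job in a non-blocking lock so duplicate invocations bail out early. A rough Go equivalent of that pattern (a hypothetical wrapper, assuming Linux flock semantics):

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// runLocked executes cmd only if it can take an exclusive, non-blocking
// flock on lockPath; a second concurrent invocation exits immediately.
func runLocked(lockPath string, cmd []string) error {
	f, err := os.OpenFile(lockPath, os.O_CREATE|os.O_RDWR, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		return fmt.Errorf("another instance holds the lock: %w", err)
	}
	c := exec.Command(cmd[0], cmd[1:]...)
	c.Stdout, c.Stderr = os.Stdout, os.Stderr
	return c.Run()
}

func main() {
	if err := runLocked("/tmp/myjob.lock", []string{"/usr/local/bin/myjob"}); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Note that this only deduplicates runs on a single host; it does nothing to make the schedule itself redundant, which is exactly the gap this task is about.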

vladjohn2013 wrote:

Hi, this project is still listed at https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects#Distributed_cron_replacement

Should this project be still listed in that page? If not, please remove it. If it still makes sense, then it could be moved to the "Featured projects" section if it has community support and mentors.

(In reply to comment #8)

Should this project be still listed in that page?

I'm not sure what the usual workflow is for projects of this sort. I'm certainly willing to mentor, and it's certainly in scope for a semester project for a small team.

Quim? What do you opine our next step should be?

I'd like to echo Ryan here; this proposal sounds very NIH in nature to me.

I expect I'm failing to express the concept properly. Let's try again. There is a need for something that:

  • allows end users to schedule arbitrary snippets of code to run at specified times or intervals (with their privileges); (using crontabs is a convenience bonus, not a requirement)
  • does not rely on a single non-redundant daemon (no SPOF); and
  • does not rely on a complex infrastructure or set of dependencies (like an entire cluster manager system).

The problem with Jenkins or Gearman isn't that they don't do this perfectly -- it's that they do *none* of it at all. They're not a solution rejected because of NIH; they're just not a solution at all (at least not to this use case).

At best, they could be used as components of a solution (though, brittle and overcomplicated (for the task) as it is, Jenkins doesn't seem promising even for that).

(In reply to comment #11)

I expect I'm failing to express the concept properly. Let's try again. There is a need for something that:

  • allows end users to schedule arbitrary snippets of code to run at specified times or intervals (with their privileges); (using crontabs is a convenience bonus, not a requirement)
  • does not rely on a single non-redundant daemon (no SPOF); and
  • does not rely on a complex infrastructure or set of dependencies (like an entire cluster manager system).

I'm reminded a little of [[Sun Grid Engine]], which the Toolserver used to use.

I'd suggest writing an [[mw:RFC]] with hard and soft requirements: Bugzilla feels like the wrong forum to suss out requirements.

As Ryan L. and Faidon have suggested, it's possible something like this already exists, though I don't know of anything off-hand. Rather than starting from scratch, I wonder whether, for example, modifying (Vixie) cron to support multiple hosts would make sense.

This sounds like a great idea; I have wanted this for some time as well. No matter the result of the Facebook thing, if you decide to start developing this, ping me -- I would like to contribute myself, as long as it's going to be written in a sane language (like native C).

Native C probably wouldn't be a sane language for this; something high-level would be good. It's unlikely that raw speed is a major need here.

Anyway, I still think this is a problem that's being overkilled. There's no problem with having a SPOF for something that could simply schedule things onto the grid.

Another feature that would be quite nice is the ability to define batches of tasks.

At a previous job we used IBM Tivoli Workload Scheduler. I never used it directly myself, since I was merely handing scenarios to operations. On each server you have an agent which is controlled from a central box; it runs the scenario task by task and can roll back if something fails. Most batches being run overnight, operations ends up in the morning with a nice report of tasks that passed/failed.

Anyway, it would be nice to start a page listing the requirements; there are some listed in the comments above already. Then we can go hunt for open source software that supports them.

On whether this should be a project for the Facebook Open Academy program or not: the project could simply be based on the definition of the problem to solve, which doesn't currently seem to be covered by any piece of software that you can install and run.

The project could start with an investigation of the possible solutions, including all the options mentioned here. The actual development doesn't start until February, so there is time to discuss the solution to implement.

This project has been picked up by the Facebook Open Academy.

Part of the first "step" the students will be tasked with is to clearly define the scope and requirements and investigate whether there are other open source components that can be leveraged or built upon - and it seems clear that there may well be.

I did do an extensive investigation myself to find out whether something could be used as-is (nothing fits the bill), but there are a number of partial solutions out there that can be built upon. Good design also means knowing when not to design new wheels every time you want a car; and IMO doing that exercise is, in itself, a valuable part of the project.

This sounds to me like a job for a scheduler. Aside from the old Sun Grid Engine, I briefly used OAR as a user; it is used in research contexts for HPC applications (e.g. Grid'5000 is a partner). Compared to the former it seems more "distributed", as you want: it claims to be multi-scheduler, and obviously the tasks/jobs are distributed across a cluster, and each task can itself be distributed/parallel. As a user it was more flexible than the old SGE, and it seems the admins have many tools to supervise the whole system. http://oar.imag.fr

I strongly agree with Ryan that we should not be writing something from scratch unless we absolutely have to. I haven't used the Sun Grid Engine, but it looks like an option. There's also Apache Mesos (https://mesos.apache.org/), which doesn't solve the whole problem but provides some of the architecture.

Also, my friend suggested using something like https://github.com/coreos/etcd to manage distributed state and then just having the cluster machines check that.
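
To illustrate the etcd idea: each machine campaigns for a leadership key, and only the current leader fires cron ticks; if the leader dies, its lease expires and another machine takes over. A sketch against the current etcd v3 Go client (untested; endpoints and key names are placeholders):

```go
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	host, _ := os.Hostname()

	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// The session's lease expires if this host dies, so leadership
	// automatically moves to another machine.
	sess, err := concurrency.NewSession(cli, concurrency.WithTTL(10))
	if err != nil {
		panic(err)
	}
	defer sess.Close()

	// Block until this host becomes the cron leader.
	election := concurrency.NewElection(sess, "/cron/leader")
	if err := election.Campaign(context.Background(), host); err != nil {
		panic(err)
	}

	// Only the leader runs scheduled jobs; the others just wait their turn.
	for range time.Tick(time.Minute) {
		fmt.Println("leader", host, "would fire due cron jobs here")
	}
}
```

(At the time of this discussion etcd exposed an HTTP API with TTL'd keys rather than this client, but the shape of the solution is the same: a lease-backed key names the active scheduler, and its expiry triggers failover.)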

(In reply to comment #17)

This project has been picked up by the Facebook Open Academy.

Is there a place (mailing list, wiki page, etc.) to follow the progress of this?

I may not understand the whole essence of the problem...
Why can't you start the tasks over SSH on the servers from a single cron?
Would a script that takes the task and the list of servers, then runs the task on all these servers and waits for them to finish, suit you?

I'm with the NIH critics. cronie already provides crontabs on a cluster (-> redundancy), and that can be used with SGE (or anything else) for load distribution and resource management. Plus, it has a userbase in the millions.

The only problem at the moment is that Ubuntu doesn't ship it (cf. https://blueprints.launchpad.net/ubuntu/+spec/security-karmic-replace-cron).

So if the scope of this bug is to package cronie for Ubuntu (or pimp systemd or upstart for clusters), hooray, but if we end up with yet-another something with new bugs and nobody maintaining it, I'd rather pass on that.

(In reply to comment #23)

I'm with the NIH critics. cronie already provides crontabs on a cluster (-> redundancy), and that can be used with SGE (or anything else) for load distribution and resource management. Plus, it has a userbase in the millions.

The only problem at the moment is that Ubuntu doesn't ship it (cf. https://blueprints.launchpad.net/ubuntu/+spec/security-karmic-replace-cron).

So if the scope of this bug is to package cronie for Ubuntu (or pimp systemd or upstart for clusters), hooray, but if we end up with yet-another something with new bugs and nobody maintaining it, I'd rather pass on that.

Could you possibly explain how cronie is used for cluster execution? Because I cannot find any documentation on it, nor do I see any evidence of it in the source code. (To the contrary, the README specifically says cronie is *not* redundant.)

(In reply to comment #24)

[...]
Could you possibly explain how cronie is used for cluster execution? Because I cannot find any documentation on it, nor do I see any evidence of it in the source code. (To the contrary, the README specifically says cronie is *not* redundant.)

Which README are you referring to?

cronie - as used on the Toolserver for quite some time now - offers an option "-c" for its crond that enables cluster behaviour. With that, you can share /var/spool/cron for example via NFS. /var/spool/cron/.cron.hostname contains the name of the current host that executes jobs. On failover, you set it to another (remaining) crond host.
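
To make the failover step concrete: since .cron.hostname just names the active host, each node could run a trivial watchdog that promotes itself when the named host stops responding. A hypothetical sketch (the SSH-port probe and paths are illustrative only):

```go
package main

import (
	"net"
	"os"
	"strings"
	"time"
)

const hostFile = "/var/spool/cron/.cron.hostname" // NFS-shared, per cronie -c

// activeHostAlive probes the host currently named in .cron.hostname.
// Dialing the SSH port is just an illustrative liveness check.
func activeHostAlive() (string, bool) {
	b, err := os.ReadFile(hostFile)
	if err != nil {
		return "", false
	}
	host := strings.TrimSpace(string(b))
	conn, err := net.DialTimeout("tcp", host+":22", 5*time.Second)
	if err != nil {
		return host, false
	}
	conn.Close()
	return host, true
}

func main() {
	self, _ := os.Hostname()
	for range time.Tick(time.Minute) {
		if host, ok := activeHostAlive(); !ok && host != self {
			// Promote ourselves: cronie's cluster mode will now run jobs here.
			os.WriteFile(hostFile, []byte(self+"\n"), 0o644)
		}
	}
}
```

A real deployment would need to serialize concurrent takeovers (e.g. with a lock on the shared filesystem), but the failover mechanism itself really is this simple.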

(In reply to comment #25)

cronie - as used on the Toolserver for quite some time now - offers an option "-c" for its crond that enables cluster behaviour. With that, you can share /var/spool/cron for example via NFS. /var/spool/cron/.cron.hostname contains the name of the current host that executes jobs. On failover, you set it to another (remaining) crond host.

Ah, just found it. Thanks.

(In reply to comment #27)

Do any of the following provide a solution, or something that could be used as a base for a solution?

  • Quora's distributed cron - http://engineering.quora.com/Quoras-Distributed-Cron-Architecture
  • The Amazon approach: http://stackoverflow.com/questions/10061843/how-to-convert-linux-cron-jobs-to-the-amazon-way

The Quora approach looks half-assed considering they use MySQL anyway.

Chronos, on the other hand, looks really interesting and applicable. I pointed out the possibility of using Apache Mesos as a framework for this project above, but it seems Chronos has beaten me to it.

[not tracking any other tickets; removing keyword]

Closing as fixed since cron just submits to the grid on toollabs now.