Page MenuHomePhabricator

Evaluate software packages for job/task/workflow management
Open, HighPublic


List and evaluate software packages that we might use for job/task/workflow management for the Dumps 2.0 Rewrite. These should also be usable for the current architecture, no reason why not.

Once we have a list of requirements and a list of possible software packages we can generate a grid and assign points for each desired/needed feature.

Event Timeline

As part of a first pass I will be testing a simple setup of each workflow manager for invocation of the current dumps script with various arguments. This is pretty much the absolute miminum level of functionality we need (tbh we need more than this but it's a start), so if any package takes far too much work to get it right or simply isn't intended for this use, it can be tossed from the list.

I'll be documenting what was needed for each test, and the results, on this task.

Underway now: celery, as one of the lowest-level packages.

ArielGlenn moved this task from Backlog to active on the Dumps-Rewrite board.Aug 22 2016, 2:18 PM Keeping my notes and scripts there; these are not scripts that will ever be reused, no point in cluttering up a WMF repo with them.

Notes from first pass at using celery:

I did not test out the multi worker backend. I know it's there, the mechanics aren't going to be any different than what has been tested already.

Logging is a PITA. Redirection of some messages to some log files and others to stdout without logging, is also a PITA.

Getting the names of modules right in the import statements is another PITA. E.g. read this on relative imports: and then this: and it still won't be clear.

For anything that's not doable right out of the box, we will need to use the api. It's got a lot of moving parts, the documentation is a bit sparse, it changed dramatically from Celery 2.x to 3.x.

We'll be doing a lot of building on top of celery in order to handle runs of dump subjobs and not just dump tasks.

It's not clear to me that we can specify limits on how many of task type X run at a time, though I think you can put them in their own queue and put limits that way.

Tasks can be chained; it looks like that's how dependencies get handled is by passing a chain of tasks in the app wrapper or whatever invoked the celery 'apps' (tasks) you have described. In case we don't want to run all the tasks in the chain at once, it appears that there would have to be manual logic in whatever calls the 'apps' (tasks).

Results from the task are whatever the task expclicitly returns. Redis is the standard back end, I don't know what we might do about persistence for it.

Retries appear to work as advertised, but you must build the retries into your celery 'app' right away, it's not automatic.

Preventing multiple jobs with all the same parameters from running at once: no idea yet. Likewise, I haven't checked into how you would remove a job from the worker queue or if it is even possible.

Adding new tasks seems pretty straightforward, so there's that.

Any sort of map/reduce functionality would be done in the scripts that call the apps (tasks).

Notes from a first test of airflow are now available at, see the airflow directory.

Airflow is built on top of celery so it can execute jobs on different remote workers out of the box.

When testing, all output, messages from the dump script go to console, and when running a 'production' run, the messages go to the logs subdirectory of AIRFLOW_HOME, in subdirs by dag and task name, filed by date.

I have not tested retries yet, but the number f retries and the delay are both configurable and are passed as arguments to the dag which in turn is passed to the task definition.

Jobs can be removed from the queue via the web api; no idea how this could be done from the command line. Jobs that have been queued are all listed in the web ui. You can filter via regexp, which is handy. You can't filter based on params to the task though (i.e. laguage name or wiki type).

Map-reduce functionality woul dhave to be added outside of airflow AFAICT.

As with celery, this is not intended to be a comprehensive eval but a first rough overview.

Notes and sorry-ass test scripts moved to thanks to Chad.

Notes added on luigi eval, to the repo again unfortunately. There's something funky with the invite to me for the wikimedia repo, as soon as that's worked out I'll move the files over.

Adding oozie to the eval list, as it's already in use in-house. Even if not selected, we may be able to feed some things to it for retrieval of certain dump-related information from the analytics cluster, so it merits an in-depth look. I'm checking into software for creating mini hadoop clusters with hdfs out of docker containers right now.

For those keeping track, I bailed on the hadoop docker containers. After much hair-pulling and chest-beating I finally have hadoop and yarn set up appropriately for local single-node testing and have successfully run a job on it, oozie is yet to come. However I have been hacking together the workflow.xml and properties files plus some wrapper scripts that would be needed for this test case. I will have extensive review notes soon, there's a lot to comment on here.

oozie is set up locally in pseudo-node mode. I have a lot to say about Apache's documentation, not all of it flattering. Nonetheless, I've successfully run one of the examples without it being just a matter of copy-paste. There are a few links still broken with the yarn web ui, which I need to fix up. Apache Tomcat (oozie ui) is whining, probably a classpath issue. I should be able to get my test case going soon though, since jobs run properly to completion.

The apache tomcat issue turns out to be a known bug with openjdk 1.8.0 and cdh 5; the solution is to downgrade to openjdk 1.7.0. (Why is this not advertised in the install instructions?)

The yarn web ui is still a problem. I actually looked at the code out of desperation, and it looks like it should do what it says: get the value for yar.resourcemanager.webapp.address from the config file or, if there is none, fall back to the default. But it's not doing so. I've checked and quadriple-checked the yarn-site.xml file and changed everything related to make sure every possible setting points to localhost; no joy. Officially giving up. Nonetheless, the rest seems to work as advertised, and the oozie web console is nice for an admin. It's less nice for an enduser who is trying to keep track of, say, the next dump for their project.

Now off to finish up my test case ( dump script) and run it.

At last github has sorted out whatever issues it had with some admin things. The invite ostriches sent to me to admin finally stopped being a 404 and showed up today. The luigi notes and test scripts are now available there.

Halelujah, two O'Reillybooks and a lot of grief later, I got a script referring to another script by absolute local path (and not hdfs) to run. This should mean that we don't have to push out all the copies of mediawiki to all the HDFS nodes.

Yes, it took several days before there was any new update here. If one doesn't happen to find via Google an example of exactly what one is trying to run, there's no hope for it in the docs. I quote from "Hadoop: The Definitive Guide":

"In general, you can tell the component where a property should be set by its name, so the fact that
yarn.nodemanager.resource.memory-mb starts with yarn.nodemanager gives you a clue that it can
be set only for the node manager daemon. This is not a hard and fast rule, however, so in some cases
you may need to resort to trial and error, or even to reading the source."

That is just to figure out where a certain configuration property ought to be set, let alone how to use it. I am now to wrap my python script so there's a cwd as appropriate, stdout is redirected to a file (can't seem to get oozie to put it anywhere unless it's provided to the workflow as Java Properties formatted output with max 2KB) and see if we can't finally wrap this test up.

At last! I have a successful run of one job for one data for one wiki via oozie, which is all that I have done for any of the other evals. I'll be writing up a full set of notes to go along with my horrid test scripts into the eval repo.

I've added the sample scripts for oozie testing at
More notes on the actual testing, various gotchas, general observations on Oozie and the Hadoop infrastructure, and some unresolved issues, coming soon.

Notes on installation, configuration, setup, unsolved issues, and wmf-specific questions have been added. Still to come: notes on job configuration and testing, general observations on Hadoop/Oozie.

Notes on job configuration and testing, general architecture and interesting tidbits added for oozie/hadoop. Still more to do.

ArielGlenn added a comment.EditedOct 17 2016, 2:48 PM

Note that the priority values in this list have been shifted around a little from T143206, as I added a few more things, and combined in the list at the bottom of that ticket which had no priorities. After the values were set, I filled in the grid. Here we go:

User auth for web status info if sensitive data/operations3-X-X("flower")
Centralized logging4-- (can be sent to S 3, Google cloud. eyeroll)--
Encryption between hosts for sensitive data4no cross-host dataonly for ldap/authX-
Has packages for ubuntu/debian4--XX
Stable code base/apis4X- (apache incubator)XX
Email alert on task failure5XXXX
Remove prev output and rerun5----
Multiple queues with distinct priorities5no notion of central queuesXXX
Chained jobs (output of one job as input to the next, entire chain may be run at once)5XXXX
Task/job deletion5-- (but a running DAG can be paused/resumed)- (but can be suspended/resumed/killed)you can "revoke" a task
No duplicate jobs/tasks running at the same time5XX ("max active runs")X (via oozie coordinator)locking:
Responsive upstream developers/maintainers6XXXX
Healthy support community6small but responsivesmall but activeXX
Map/reduce hooks6X via hadoop- (unless one can do this via hive/pig/other similar)X-
Task history available6XXXX
Support for tasks other than python scripts7XXXwith python wrapper
Job concurrency limit across all workers7X (via arbitary resource tag)X (slot pool)XX
Configurable retries7XX- (only for action start failure, if it starts successfully, the executor (hadoop or whatever) must retryX
Task priorities8XX- (only by queues)-
Task dependencies8XXXX ("chords")
Written in a language ops knows and can support (Python preferred)8XXJavaX
Resource specification/management (at least CPU cores per job)8Xdepends on task typeX- (can set number of workers per node but no notion of cores or memory per task)
FIFO-ish (queue processing mostly in order submitted)8arbitrary from all jobs with dependencies met- (sorted by priority and desired start date/time)XX
Parallelization/recombine hooks9---X ("chords")
Monitoring (see table below)916/230/235/23 *0/23
Task arguments9XXXX
Distributed jobs across workers per task9-XXX
API for extending functionality (Python preferred)9XXX(java)X
State recovery after crash (implies persistent storage of job statuses)9scheduler does not keep stateX (mysql supported)X (mariadb supported)X (mysql supported)
API or cli for job status checks9XXXX

* use of "HUE" may add some capabilities

Monitoring gets its own section here, weights from 1 to 5. None of these are blockers but the greater the total, the better.

available for monitoringweightluigiairflowoozie+hadoopcelery
Estimated start time for a job or task1----
Estimated completion time for a job or task1----
Show completed jobs or tasks for a given time frame (restrict on regexp)3no regexp but has param matching---
Current status of jobs/tasks, restricting on regexp3no regexp but has param matching---
Temporary/permanent failure of job/task, number of retries done/left, error output, start/end time5- (afaik)---
Jobs/tasks on given worker(s) (restrict on regexp)5no regexp but has param matching---
Jobs/tasks claimed but not run5no notion of 'claimed' jobs, jobs submitted to a worker can be shown- (they can be "prefetched" but this isn't a status that is available afaik)not exactly, has "pending"-

Sum totals are:
luigi 121, oozie/hadoop 130, airflow 133, celery 150

Now a few comments on each of these.

Oozie/hadoop is intended for data that has been added as records to HDFS, is going to be massively parallel processed via some query and a result generated. That's not really what we are doing with the dumps. Nonetheless, its capability list is impressive. The learning curve is quite high, but we do have some in-house expertise. As far and away the most mature project, it has the benefit of packages for all platforms and a large support community.

Celery is quite flexibile but it's really a library meant for you to build your own platform; only if the other choices were well off the mark would we resort to this, even though there is in-house celery experience.

Airflow uses celery under the hood, so it gets the benefit of some of those features. If needed, we could probably submit upstream patches to expose functionality celery has and Airflow does not, such as deletion of tasks from a queue. Note that Airflow is an immature project, though with a lot of users jumping on the bandwagon. It's still in the apache incubator, so I would expect API and other changes more often than with the other options.

Luigi was the easiest to set up and get going, but it's almost too lightweight for what we want. There's no central scheduler that stuffs jobs in a queue; the user must start workers directly on worker nodes and submit tasks to them, typically via cron. Additionally, all jobs in a task run on the same worker. These issues could be worked around but it would really only be worth it if there was a slam dunk on the rest of the requirements.

I'll be commenting shortly on the other alternatives in the eval list.

Notes on the other contenders in the eval list, below. Most were designed for specialized uses and so not suitable for us, or depended on particular back ends such as AWS or Grid Engine. A couple were simply very inactive and thus not investigated further.

  • Pinball - NO DOCS. thread-based python app (hmm), no substantial commits since April, user mailing list is pretty dead for the past year
  • Spiff - very inactive, a couple small changes in repo 3 and 10 months ago, after that activity 2 years ago
  • TaskFlow - OpenStack product, designed for orchestration, monitoring and management of OpenStack projects and resources, pretty specific to that infrastructure
  • Fireworks - intended for calculation workflows, reuires MongoDB and a backend of Torque, GridEngine, etc. Job queueing seems wonky, see and
  • Cosmos2 - developed for genomics, front end intended for use with AWS, StarCluster, Grid Engine, etc.
  • Toil - developed for genomics, uses CWL for workflow definitions, depends on back end for resource management etc, intended for use with AWS, Google Cloud Storage, etc.
  • Soma-workflow - has a nice GUI and an API but... designed as front end for submission/montoring of tasks to GridEngine, Torque, etc. Depends on third party computing cluster back end.
  • Dask - very active and widely used but intended for workflows for NumPy, DataFrame, etc, not general purpose
  • Workflow - last release Oct 2014. state-machine-based front end to celery, requires coding celery "apps", much less functionality than airflow, developed as part of invenio (managing digital bibliographic data)
Hydriz added a subscriber: Hydriz.Oct 25 2016, 6:51 AM
Volans added a subscriber: Volans.Oct 27 2016, 10:55 AM

@ArielGlenn regarding the Debian packages luigi has a debian/ directory in the repo, so might be straight forward to build it.

Nuria added a subscriber: Nuria.EditedNov 2 2016, 4:57 PM

This evaluation (as far as I can see) doesn't seem to take into account that oozie is deployed and working at a scale at WMF to do a very similar job than what the dump reconstruction would need. Ozzie used in conjunction with hadoop, that is.

Argh! I had in my notes to add 'in use in the org' (celery, oozie) and forgot to add it. Good catch!

Aklapper removed ArielGlenn as the assignee of this task.Jun 19 2020, 4:18 PM

This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you still realistically work or plan to work on this task - it would be welcome!

For tips how to manage individual work in Phabricator (noisy notifications, lists of task, etc.), see for available options.
(For the records, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)