Evaluate software packages for job/task/workflow management
Open, High, Public

Assigned To

None

Authored By

	ArielGlenn
	Aug 17 2016, 1:24 PM

Description

List and evaluate software packages that we might use for job/task/workflow management for the Dumps 2.0 Rewrite. These should also be usable for the current architecture, no reason why not.

Once we have a list of requirements and a list of possible software packages we can generate a grid and assign points for each desired/needed feature.

Related Objects

Status	Assigned	Task
Open	None	T128513 Dumps 2.0 Platform design questions
Open	None	T128520 What already available software can we build on for a job scheduler?
Stalled	None	T146070 Review progress of dumps rewrite
Open	None	T143205 Draft usage scenarios for job/workflow manager
Open	None	T143206 List requirements needed for task/job/workflow manager.
Open	None	T143207 Evaluate software packages for job/task/workflow management

Event Timeline

ArielGlenn created this task. Aug 17 2016, 1:24 PM

ArielGlenn added a parent task: T143206: List requirements needed for task/job/workflow manager..

Some possibilities in Python:

Airflow (https://github.com/apache/incubator-airflow)
Luigi (https://github.com/spotify/luigi)
Pinball (https://github.com/pinterest/pinball)
Spiff (https://github.com/knipknap/SpiffWorkflow)
Celery (https://github.com/celery/celery)
TaskFlow (https://github.com/openstack/taskflow)
Fireworks (https://github.com/materialsproject/fireworks)
Cosmos2 (https://github.com/LPM-HMS/COSMOS2)
Toil (https://github.com/BD2KGenomics/toil)
Soma-workflow (https://github.com/neurospin/soma-workflow)
Dask (https://github.com/dask/dask)
Workflow (https://github.com/inveniosoftware/workflow)

Some of these may be geared to special purposes and may be eliminated early on from consideration.

As part of a first pass I will be testing a simple setup of each workflow manager for invocation of the current dumps worker.py script with various arguments. This is pretty much the absolute minimum level of functionality we need (tbh we need more than this but it's a start), so if any package takes far too much work to get it right or simply isn't intended for this use, it can be tossed from the list.
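The baseline test case can be sketched without any workflow manager at all: build a command line for the dump script and run it. This is only a rough sketch; the option names (--wiki, --date, --job) are hypothetical stand-ins and are not taken from the real worker.py.

```python
import subprocess
import sys


def build_command(script, wiki, date, job):
    """Build the argument list for one run of the dump script.

    The option names are hypothetical, chosen for illustration only;
    they are not the real worker.py command-line interface.
    """
    return [sys.executable, script, "--wiki", wiki, "--date", date, "--job", job]


def run_dump(script, wiki, date, job):
    """Run one dump job and return its exit code (0 means success)."""
    result = subprocess.run(build_command(script, wiki, date, job))
    return result.returncode
```

Each workflow manager under test would then only need to call something like run_dump() from whatever it uses as a task definition.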

I'll be documenting what was needed for each test, and the results, on this task.

Underway now: celery, as one of the lowest-level packages.

ArielGlenn moved this task from Backlog to active on the Dumps-Rewrite board. Aug 22 2016, 2:18 PM

https://github.com/apergos/scheduler_eval Keeping my notes and scripts there; these are not scripts that will ever be reused, no point in cluttering up a WMF repo with them.

Notes from first pass at using celery:

I did not test out the multi worker backend. I know it's there, and the mechanics aren't going to be any different from what has been tested already.

Logging is a PITA. Redirecting some messages to log files and others to stdout without logging is also a PITA.

Getting the names of modules right in the import statements is another PITA. E.g. read this on relative imports: https://hairycode.org/2013/07/23/first-steps-with-celery-how-to-not-trip/ and then this: http://docs.celeryproject.org/en/latest/userguide/tasks.html#names and it still won't be clear.

For anything that's not doable right out of the box, we will need to use the api. It's got a lot of moving parts, the documentation is a bit sparse, and it changed dramatically from Celery 2.x to 3.x.

We'll be doing a lot of building on top of celery in order to handle runs of dump subjobs and not just dump tasks.

It's not clear to me that we can specify limits on how many of task type X run at a time, though I think you can put them in their own queue and put limits that way.

Tasks can be chained; it looks like dependencies get handled by passing a chain of tasks in the app wrapper or whatever invokes the celery 'apps' (tasks) you have described. If we don't want to run all the tasks in the chain at once, it appears that there would have to be manual logic in whatever calls the 'apps' (tasks).
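To make the "manual logic" point concrete, here is a minimal sketch of chaining in plain Python. It is not Celery's own chaining API; run_chain, stop_after and the two step functions are hypothetical names used only for illustration.

```python
def run_chain(steps, initial, stop_after=None):
    """Run steps in order, feeding each result to the next step.

    stop_after is an optional count of steps to run before returning
    early; it stands in for the manual logic the caller would need
    when the whole chain should not run at once.
    """
    value = initial
    for index, step in enumerate(steps):
        if stop_after is not None and index >= stop_after:
            break
        value = step(value)
    return value


def make_stubs(wiki):
    # Hypothetical first step: pretend to produce a stubs file name.
    return wiki + "-stubs.xml"


def make_history(stubs_file):
    # Hypothetical second step: derive the history file name from the stubs file.
    return stubs_file.replace("stubs", "history")


if __name__ == "__main__":
    print(run_chain([make_stubs, make_history], "elwiki"))
    print(run_chain([make_stubs, make_history], "elwiki", stop_after=1))
```

The first call runs both steps; the second stops after the first step and returns the intermediate result.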

Results from the task are whatever the task explicitly returns. Redis is the standard back end; I don't know what we might do about persistence for it.

Retries appear to work as advertised, but you must build the retries into your celery 'app' right away, it's not automatic.

Preventing multiple jobs with all the same parameters from running at once: no idea yet. Likewise, I haven't checked into how you would remove a job from the worker queue or if it is even possible.
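One generic way to approach the duplicate-job question, independent of Celery, is a lock keyed on the job parameters. The following is a minimal sketch, assuming all workers can see one shared lock directory; the function names and the lock directory are hypothetical.

```python
import hashlib
import os


def lock_path(params, lock_dir):
    """Derive a lock file name from the job parameters (a dict)."""
    key = repr(sorted(params.items())).encode("utf-8")
    return os.path.join(lock_dir, hashlib.sha1(key).hexdigest() + ".lock")


def try_acquire(params, lock_dir):
    """Return the lock path if acquired, or None if an identical job holds it."""
    path = lock_path(params, lock_dir)
    try:
        # O_EXCL makes the creation fail if the lock file already exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return None
    os.close(fd)
    return path


def release(path):
    """Remove the lock so the same job can be submitted again."""
    os.remove(path)
```

A task would call try_acquire() on startup, exit immediately if it gets None, and call release() when done. A crashed task would leave a stale lock behind, so a real setup would also need some expiry mechanism.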

Adding new tasks seems pretty straightforward, so there's that.

Any sort of map/reduce functionality would be done in the scripts that call the apps (tasks).

Notes from a first test of airflow are now available at https://github.com/apergos/scheduler_eval, see the airflow directory.

Airflow is built on top of celery so it can execute jobs on different remote workers out of the box.

When testing, all output and messages from the dump script go to the console; when running a 'production' run, the messages go to the logs subdirectory of AIRFLOW_HOME, in subdirs by dag and task name, filed by date.

I have not tested retries yet, but the number of retries and the delay are both configurable and are passed as arguments to the dag which in turn is passed to the task definition.
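For reference, the retry settings would look roughly like this. "retries" and "retry_delay" are the standard operator argument names in Airflow; the values (3 retries, 5 minutes) and the dag/task names in the comments are assumptions for illustration, and this is untested like the rest of the retry behaviour.

```python
from datetime import timedelta

# Retry settings shared by every task in the DAG.
# The specific values are illustrative, not taken from the test scripts.
default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}

# In a DAG file these would be handed to the DAG, and each task is then
# attached to that DAG, roughly as follows (names are hypothetical):
#   dag = DAG("dumps_test", default_args=default_args, ...)
#   task = BashOperator(task_id="run_worker", bash_command="...", dag=dag)
```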

Jobs can be removed from the queue via the web api; no idea how this could be done from the command line. Jobs that have been queued are all listed in the web ui. You can filter via regexp, which is handy. You can't filter based on params to the task though (i.e. language name or wiki type).

Map-reduce functionality would have to be added outside of airflow AFAICT.

As with celery, this is not intended to be a comprehensive eval but a first rough overview.

Notes and sorry-ass test scripts moved to https://github.com/wikimedia/dump-scheduler-eval thanks to Chad.

Notes added on luigi eval, to the https://github.com/apergos/scheduler_eval repo again unfortunately. There's something funky with the invite to me for the wikimedia repo, as soon as that's worked out I'll move the files over.

ArielGlenn mentioned this in T143205: Draft usage scenarios for job/workflow manager. Sep 19 2016, 12:03 PM

Halfak mentioned this in T146070: Review progress of dumps rewrite. Sep 19 2016, 6:19 PM

Adding oozie to the eval list, as it's already in use in-house. Even if not selected, we may be able to feed some things to it for retrieval of certain dump-related information from the analytics cluster, so it merits an in-depth look. I'm checking into software for creating mini hadoop clusters with hdfs out of docker containers right now.

For those keeping track, I bailed on the hadoop docker containers. After much hair-pulling and chest-beating I finally have hadoop and yarn set up appropriately for local single-node testing and have successfully run a job on it, oozie is yet to come. However I have been hacking together the workflow.xml and properties files plus some wrapper scripts that would be needed for this test case. I will have extensive review notes soon, there's a lot to comment on here.

oozie is set up locally in pseudo-node mode. I have a lot to say about Apache's documentation, not all of it flattering. Nonetheless, I've successfully run one of the examples without it being just a matter of copy-paste. There are a few links still broken with the yarn web ui, which I need to fix up. Apache Tomcat (oozie ui) is whining, probably a classpath issue. I should be able to get my test case going soon though, since jobs run properly to completion.

The apache tomcat issue turns out to be a known bug with openjdk 1.8.0 and cdh 5; the solution is to downgrade to openjdk 1.7.0. (Why is this not advertised in the install instructions?)

The yarn web ui is still a problem. I actually looked at the code out of desperation, and it looks like it should do what it says: get the value for yarn.resourcemanager.webapp.address from the config file or, if there is none, fall back to the default. But it's not doing so. I've checked and quadruple-checked the yarn-site.xml file and changed everything related to make sure every possible setting points to localhost; no joy. Officially giving up. Nonetheless, the rest seems to work as advertised, and the oozie web console is nice for an admin. It's less nice for an end user who is trying to keep track of, say, the next dump for their project.

Now off to finish up my test case (worker.py dump script) and run it.

At last github has sorted out whatever issues it had with some admin things. The invite ostriches sent to me to admin https://github.com/wikimedia/dump-scheduler-eval finally stopped being a 404 and showed up today. The luigi notes and test scripts are now available there.

Hallelujah, two O'Reilly books and a lot of grief later, I got a script referring to another script by absolute local path (and not hdfs) to run. This should mean that we don't have to push out all the copies of mediawiki to all the HDFS nodes.

Yes, it took several days before there was any new update here. If one doesn't happen to find via Google an example of exactly what one is trying to run, there's no hope for it in the docs. I quote from "Hadoop: The Definitive Guide":

"In general, you can tell the component where a property should be set by its name, so the fact that
yarn.nodemanager.resource.memory-mb starts with yarn.nodemanager gives you a clue that it can
be set only for the node manager daemon. This is not a hard and fast rule, however, so in some cases
you may need to resort to trial and error, or even to reading the source."

That is just to figure out where a certain configuration property ought to be set, let alone how to use it. I now need to wrap my python script so that there's an appropriate cwd and stdout is redirected to a file (I can't seem to get oozie to put it anywhere unless it's provided to the workflow as Java Properties formatted output with max 2KB), and then see if we can't finally wrap this test up.
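The kind of wrapper meant here might look like the sketch below: run the command in a chosen working directory, send its output to a file, and hand back only a short key=value summary small enough for the 2KB capture limit. The function names and the summary keys are hypothetical, and the summary does no escaping of special characters in values.

```python
import subprocess

MAX_CAPTURE_BYTES = 2048  # the 2KB limit on captured output mentioned above


def run_wrapped(command, cwd, stdout_path):
    """Run command in cwd, sending stdout and stderr to stdout_path."""
    with open(stdout_path, "wb") as out:
        result = subprocess.run(
            command, cwd=cwd, stdout=out, stderr=subprocess.STDOUT
        )
    return result.returncode


def properties_summary(exit_code, stdout_path):
    """Format a short key=value summary in Java Properties style."""
    text = "exit_code=%d\nstdout_file=%s\n" % (exit_code, stdout_path)
    if len(text.encode("utf-8")) > MAX_CAPTURE_BYTES:
        raise ValueError("summary exceeds the capture limit")
    return text
```

The full output stays in the file on local disk; only the summary would be printed for the workflow to pick up.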

At last! I have a successful run of one job for one date for one wiki via oozie, which is all that I have done for any of the other evals. I'll be writing up a full set of notes to go along with my horrid test scripts into the eval repo.

I've added the sample scripts for oozie testing at https://github.com/wikimedia/dump-scheduler-eval
More notes on the actual testing, various gotchas, general observations on Oozie and the Hadoop infrastructure, and some unresolved issues, coming soon.

Notes on installation, configuration, setup, unsolved issues, and wmf-specific questions have been added. Still to come: notes on job configuration and testing, general observations on Hadoop/Oozie.

Notes on job configuration and testing, general architecture and interesting tidbits added for oozie/hadoop. Still more to do.

Note that the priority values in this list have been shifted around a little from T143206, as I added a few more things, and combined in the list at the bottom of that ticket which had no priorities. After the values were set, I filled in the grid. Here we go:

feature	priority	luigi	airflow	oozie+hadoop	celery
User auth for web status info if sensitive data/operations	3	-	X	-	X("flower")
Centralized logging	4	-	- (can be sent to S3, Google cloud. eyeroll)	-	-
Encryption between hosts for sensitive data	4	no cross-host data	only for ldap/auth	X	-
Has packages for ubuntu/debian	4	-	-	X	X
Stable code base/apis	4	X	- (apache incubator)	X	X
Email alert on task failure	5	X	X	X	X
Remove prev output and rerun	5	-	-	-	-
Multiple queues with distinct priorities	5	no notion of central queues	X	X	X
Chained jobs (output of one job as input to the next, entire chain may be run at once)	5	X	X	X	X
Task/job deletion	5	-	- (but a running DAG can be paused/resumed)	- (but can be suspended/resumed/killed)	you can "revoke" a task
No duplicate jobs/tasks running at the same time	5	X	X ("max active runs")	X (via oozie coordinator)	locking: http://docs.celeryproject.org/en/latest/tutorials/task-cookbook.html#ensuring-a-task-is-only-executed-one-at-a-time
Responsive upstream developers/maintainers	6	X	X	X	X
Healthy support community	6	small but responsive	small but active	X	X
Map/reduce hooks	6	X via hadoop	- (unless one can do this via hive/pig/other similar)	X	-
Task history available	6	X	X	X	X
Support for tasks other than python scripts	7	X	X	X	with python wrapper
Job concurrency limit across all workers	7	X (via arbitrary resource tag)	X (slot pool)	X	X
Configurable retries	7	X	X	- (only for action start failure; if it starts successfully, the executor (hadoop or whatever) must retry)	X
Task priorities	8	X	X	- (only by queues)	-
Task dependencies	8	X	X	X	X ("chords")
Written in a language ops knows and can support (Python preferred)	8	X	X	Java	X
Resource specification/management (at least CPU cores per job)	8	X	depends on task type	X	- (can set number of workers per node but no notion of cores or memory per task)
FIFO-ish (queue processing mostly in order submitted)	8	arbitrary from all jobs with dependencies met	- (sorted by priority and desired start date/time)	X	X
Parallelization/recombine hooks	9	-	-	-	X ("chords")
Monitoring (see table below)	9	16/23	0/23	5/23 *	0/23
Task arguments	9	X	X	X	X
Distributed jobs across workers per task	9	-	X	X	X
API for extending functionality (Python preferred)	9	X	X	X(java)	X
State recovery after crash (implies persistent storage of job statuses)	9	scheduler does not keep state	X (mysql supported)	X (mariadb supported)	X (mysql supported)
API or cli for job status checks	9	X	X	X	X

* use of "HUE" may add some capabilities

Monitoring gets its own section here, weights from 1 to 5. None of these are blockers but the greater the total, the better.

available for monitoring	weight	luigi	airflow	oozie+hadoop	celery
Estimated start time for a job or task	1	-	-	-	-
Estimated completion time for a job or task	1	-	-	-	-
Show completed jobs or tasks for a given time frame (restrict on regexp)	3	no regexp but has param matching	-	-	-
Current status of jobs/tasks, restricting on regexp	3	no regexp but has param matching	-	-	-
Temporary/permanent failure of job/task, number of retries done/left, error output, start/end time	5	- (afaik)	-	-	-
Jobs/tasks on given worker(s) (restrict on regexp)	5	no regexp but has param matching	-	-	-
Jobs/tasks claimed but not run	5	no notion of 'claimed' jobs, jobs submitted to a worker can be shown	- (they can be "prefetched" but this isn't a status that is available afaik)	not exactly, has "pending"	-
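The monitoring scores quoted in the feature grid (16/23, 0/23, 5/23, 0/23) appear to be the sum of the weights of every row where a package has at least partial support. That reading is inferred from the numbers rather than stated anywhere, and the shortened feature names below are mine. A small sketch that reproduces the figures under that assumption:

```python
# Weights from the monitoring table, with shortened feature names.
WEIGHTS = {
    "estimated start time": 1,
    "estimated completion time": 1,
    "completed jobs in time frame": 3,
    "current status": 3,
    "failure details and retries": 5,
    "jobs on given workers": 5,
    "jobs claimed but not run": 5,
}

# Rows counted as supported (fully or partially) for each package.
# Treating partial entries as full credit is an assumption.
SUPPORTED = {
    "luigi": {
        "completed jobs in time frame",
        "current status",
        "jobs on given workers",
        "jobs claimed but not run",
    },
    "airflow": set(),
    "oozie+hadoop": {"jobs claimed but not run"},
    "celery": set(),
}


def monitoring_score(tool):
    """Sum the weights of the monitoring features the tool supports."""
    return sum(WEIGHTS[feature] for feature in SUPPORTED[tool])


def max_score():
    """Highest possible monitoring score (all features supported)."""
    return sum(WEIGHTS.values())


if __name__ == "__main__":
    for tool in SUPPORTED:
        print("%s %d/%d" % (tool, monitoring_score(tool), max_score()))
```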

Sum totals are:
luigi 121, oozie/hadoop 130, airflow 133, celery 150

Now a few comments on each of these.

Oozie/hadoop is intended for data that has been added as records to HDFS, is going to be massively parallel processed via some query and a result generated. That's not really what we are doing with the dumps. Nonetheless, its capability list is impressive. The learning curve is quite high, but we do have some in-house expertise. As far and away the most mature project, it has the benefit of packages for all platforms and a large support community.

Celery is quite flexible but it's really a library meant for you to build your own platform; only if the other choices were well off the mark would we resort to this, even though there is in-house celery experience.

Airflow uses celery under the hood, so it gets the benefit of some of those features. If needed, we could probably submit upstream patches to expose functionality celery has and Airflow does not, such as deletion of tasks from a queue. Note that Airflow is an immature project, though with a lot of users jumping on the bandwagon. It's still in the apache incubator, so I would expect API and other changes more often than with the other options.

Luigi was the easiest to set up and get going, but it's almost too lightweight for what we want. There's no central scheduler that stuffs jobs in a queue; the user must start workers directly on worker nodes and submit tasks to them, typically via cron. Additionally, all jobs in a task run on the same worker. These issues could be worked around but it would really only be worth it if there was a slam dunk on the rest of the requirements.

I'll be commenting shortly on the other alternatives in the eval list.

Notes on the other contenders in the eval list, below. Most were designed for specialized uses and so not suitable for us, or depended on particular back ends such as AWS or Grid Engine. A couple were simply very inactive and thus not investigated further.

Pinball - NO DOCS. thread-based python app (hmm), no substantial commits since April, user mailing list is pretty dead for the past year
Spiff - very inactive, a couple small changes in repo 3 and 10 months ago, after that activity 2 years ago
TaskFlow - OpenStack product, designed for orchestration, monitoring and management of OpenStack projects and resources, pretty specific to that infrastructure
Fireworks - intended for calculation workflows, requires MongoDB and a backend of Torque, GridEngine, etc. Job queueing seems wonky, see https://pythonhosted.org/FireWorks/queue_tutorial.html and https://pythonhosted.org/FireWorks/queue_tutorial_pt2.html
Cosmos2 - developed for genomics, front end intended for use with AWS, StarCluster, Grid Engine, etc.
Toil - developed for genomics, uses CWL for workflow definitions, depends on back end for resource management etc, intended for use with AWS, Google Cloud Storage, etc.
Soma-workflow - has a nice GUI and an API but... designed as front end for submission/monitoring of tasks to GridEngine, Torque, etc. Depends on third party computing cluster back end.
Dask - very active and widely used but intended for workflows for NumPy, DataFrame, etc, not general purpose
Workflow - last release Oct 2014. state-machine-based front end to celery, requires coding celery "apps", much less functionality than airflow, developed as part of invenio (managing digital bibliographic data)

Hydriz subscribed. Oct 25 2016, 6:51 AM

Volans subscribed. Oct 27 2016, 10:55 AM

@ArielGlenn regarding the Debian packages: luigi has a debian/ directory in the repo, so it might be straightforward to build it.

This evaluation (as far as I can see) doesn't seem to take into account that oozie is deployed and working at scale at WMF to do a job very similar to what the dump reconstruction would need. Oozie used in conjunction with hadoop, that is.

Argh! I had in my notes to add 'in use in the org' (celery, oozie) and forgot to add it. Good catch!

Ottomata mentioned this in T237361: Discuss common needs in a job manager/scheduler. Nov 5 2019, 3:55 PM

This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you still realistically work or plan to work on this task - it would be welcome!

For tips how to manage individual work in Phabricator (noisy notifications, lists of task, etc.), see https://phabricator.wikimedia.org/T228575#6237124 for available options.
(For the records, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)

Aklapper added a parent task: T146070: Review progress of dumps rewrite. Nov 3 2020, 7:13 PM

nshahquinn-wmf subscribed. Dec 14 2020, 5:20 PM
