Page MenuHomePhabricator

Draft usage scenarios for job/workflow manager
Open, HighPublic


To help generate a complete list of requirements for a job/task/workflow manager, we should have some typical use scenarios in mind. Let's collect some here.

Event Timeline

ArielGlenn moved this task from Backlog to active on the Dumps-Rewrite board.Aug 17 2016, 1:38 PM
ArielGlenn added a comment.EditedAug 17 2016, 2:30 PM

A note on terminology:
I use 'task' for a dump step (i.e. 'stubs', 'current page contents'), and 'job' for a piece of that step (i.e. stubs.xml3.gz). Think of jobs as subunits of tasks.

Task priorities:
Analysts want the metadata about all pages early in the month for stats generation, so it's great to be able to schedule tasks that generate those (currently "stubs") on a given date via cron and have them get top priority over anything else that might be submitted by other cron jobs (currently wikidata dumps, cirrus search dumps, etc.)

Task dependencies:
The current dump of page contents requires that we have the metadata dump for those pages completed first. Then content for each revision is retrieved either from disk or from the database. While this mechanism might change in the rewrite, we'll likely still want to be able to specifify that one task depends on another having completed successfully.

Map/reduce hooks:
We currently want dumps of e.g. revision content for all revisions. This gets done in little pieces. The way we split it into little pieces is quite horrendous and requires rebalancing of the numbers every so often. We should be able to specify a callback that will create a bunch of subjobs for a task, all of which take some reasonable length of time to run such that a rerun of any one of them is "cheap enough". We should also be able to specify a callback that will combine output from completed jobs into files to be published to the public for download/consumption.

Users like to know:

  • when will the next dump for project X start?
  • when will the current dump for project X finish?
  • what tasks are complete for the current run of project X?
  • what is the dump run for project X doing right now?
  • tell me everything about all the latest dumps for language Y
  • tell me everything about the latest dumps for project Z on all languages

Opsen (like me) want to know:

  • what's broken?
  • what errors were generated?
  • when did the broken task(s) start?
  • when did they stop/die?
  • what's running on worker W?
  • which jobs did worker W claim but they aren't running and aren't done either?
  • which jobs have failed and how many retries have been made?

Job concurrency limit across all workers
Recently jcrespo noticed slave lag on the vslow/dumps database for s1 (en wp). It turned out that other tasks run at the beginning of the month in combination with the 27 page metadata dump jobs (stubs) was enough to push the server over the edge. When I configured the stubs to run in two batches with a max concurrency of 14 jobs, there was no lag. We'll want to be able to specify such a limit.

Forced reruns
Most of the time, an output file for a job means that the job ran successfully, and we would want a rerun to do nothing. But sometimes the output is bad because the job was interrupted in the middle, the box rebooted, there was a network glitch, etc. On these cases we want to be able to force a rerun which would regenerate all the output. It would be nice to not have to go in and remove output files manually for that to happen.

Task arguments
We use one script for all the runs. I sure don't want to have to write a separate class for each task for each of 800-something projects, so we ought to be able to pass in parameters easily.

Support for tasks other than python scripts
Right now we run a number of bash scripts out of cron, e.g. wikidata dumps, cirrus search dumps, etc. I imagine that with the dumps reqrite we'll have even more folks wanting to generate weekly datasets of one sort or another. It would be nice to be able to easily convert these into tasks.

Configurable retries
One of the things that saves us with the current dump setup is that the end 'stage' of the dump run generates dumps for 'all steps that have not completed successfully. This provides a chance for broken tasks to rerun. We'll want some sort of retry ability as well as the ability to bail after some max retries, to wait some length of time between retries, etc.

Distributed jobs across workers
Currently each task runs on one and only one server (snapshot host). For example, all 27 jobs that generate revision content for all revisions on en wikipedia run on the same snapshot host. It would be very nice indeed if any worker in the pool that had available resources could claim and run a job. In particular, we usually wind up waiting for 4 wikidata tasks to complete for a few days after everything else is done. Imagine if these could be split up properly into jobs and passed around to the other snapshot hosts to complete within a few hours instead. This would also allow us to take worker downtime or allow security updates with minimal impact to rump production.

I get asked for new dumps or one sort or another all the time. It would be nice if it were relatively easy for me and for others to add these new dumps as tasks. This biases me towards python as the platform language right away, or at the least that there should be a complete API for the platform in python. If other folks are happier in other languages, we might want support for them too (which ones?)

Task/job deletion
Sometimes I notice that a task is stuck and I just want to make it go away and reschedule it later. I've encountered this 3 out of the last 3 months in fact. This usually has to do with a bug in the code, but guess what, there will always be bugs. So the ability to have the manager shoot an existing task/jobs of that task, clean up the broken files and (potentially) leave any output files from completed jobs of that task, and (potentially) reschedule for a time of my chosing later, would sure be nice.

No duplicate jobs/tasks running at the same time
This should go without saying, but just in case. Right now we do a bunch of annoying locking etc to avoid this. The granularity isn't good enough to run multiple jobs for a task (or different tasks) at the same time, something that again is silly in a distributed execution framework. We shouldn't need locks at all.

Resource specification/management
A given job may take more resources than another. For example, the wikidata json dumps run several processes concurrently I believe. (Check this!) We should be able to specify the resources any job will use (CPU cores at least) so that workers will only claim jobs when they have sufficient resources available.

State recovery after crash
If the host goes down, the process, dies etc., it's a waste of time to rerun everything from the start. Although the XML/sql dumps are set up so that this can be done without harm, with completed tasks being skipped, it would be better if the manager picked up where it left off. This is especially important for dumps run weekly or daily out of cron, where they need to complete within a day.

We add tasks to run based on the order we want them to get done. This means we want metadata done early, we want current page content done pretty soon after that for all wikis, and page content history for all wikis can wait til near the end. In particular, we don't want to be locked into a situation where we specify tasks with dependencies for a wiki (DAG) and all elements in the DAG must run one after another.

Multiple queues
Misc jobs that used to be run entirely out of cron could go in one queue, en wikipedia jobs in another, big wikis in a third, small wikis in a 4th (for example), so that dumps of small wikis don't get backed up behind everything else, so that the misc jobs are completed on the date they are submitted to the queue, etc.

API for job status checks
We don't want to rely on scaprig the output of some web interface in order to provide status updates to e.g. irc channels. Rather, we want an API so that we can generate RSS feeds, html files, irc channel or other updates automatically when a job or a task completes. See also Monitoring

Chained jobs
The output of some tasks may be used as input to the next immediately, Example: page content production -> format conversion -> compression of various types.

Centralized logging
It would be great if all error output or progress reports could be centrally collected rather than stashed in various places on each worker host. This might be out of scope i.e. perhaps the right solution to this is that every script for a dump task sends its logs via rsyslog to a logging host.

Aklapper renamed this task from Draft usage scanarios for job/workflow manager to Draft usage scenarios for job/workflow manager.Aug 17 2016, 5:52 PM

I've gone through all the past notes on the rewrite project, my notes in various places for the scheduler/workflow manager, and various other tickets on dumps improvement. Everything I could pull out of those is in the above list. If ANYONE HAS SOMETHING TO ADD please don't be shy. I expect to settle on a solution in mid-November, so there's plenty of time to get this right. Adding a couple projects with folks that might have some thoughts on this!

@ArielGlen : I think we shouldn't need to choose a scheduling solution when we already have one that works in a distributed computing setup: oozie.

Maybe you can get in touch with @Ottomata at the ops meeting about using hadoop oozie for your purposes. We just wrapped up our prototype to reconstruct edit history from the mediwaiki database, dumps can benefit from all that work. There is no need to deploy another parallel computing solution when computations can be dump in hadoop.

cc @mark

Hey Nuria!

I'm very interested in this, though we dump lots of other data. What formats are available for export? Where can I look at code and data? Is there a playground in labs perhaps?

Also, I'll check in with Ottomata for sure, but maybe you would want to look at the list of requirements (T143206), or, alternatively, the list of packages already under consideration (T143207), and add/comment on oozie's capabilities?

Additionally, is there other work analytics plans to do that could be used perhaps for generating other dumps? Maybe it makes sense to move more of these things into the analytics platform and eventually hand them over to the team?

Nuria added a comment.Sep 19 2016, 4:19 PM

Is there a playground in labs perhaps?

no, there is not. We have tried to this before with little success.

What formats are available for export?

oozie is an scheduling system that runs code, you can make formats available for export to your code as needed be.

/comment on oozie's capabilities?

oozie can do pretty much everything you have outlined on this ticket. I think it will help if you get familiar with oozie, see how it is used and play with it for a while. There is plenty documentation online but here are some docs for our setup:

It requires access to 1002/2004 which you already must have

ggellerman moved this task from Backlog to Radar on the Research-Backlog board.
Aklapper removed ArielGlenn as the assignee of this task.Jun 19 2020, 4:18 PM

This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you still realistically work or plan to work on this task - it would be welcome!

For tips how to manage individual work in Phabricator (noisy notifications, lists of task, etc.), see for available options.
(For the records, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)