To help generate a complete list of requirements for a job/task/workflow manager, we should have some typical use scenarios in mind. Let's collect some here.
Related tasks:
- Open: T128513 Dumps 2.0 Platform design questions
- Open: T128520 What already available software can we build on for a job scheduler?
- Stalled: T146070 Review progress of dumps rewrite
- Open: T143205 Draft usage scenarios for job/workflow manager
- Open: T143206 List requirements needed for task/job/workflow manager.
- Open: T143207 Evaluate software packages for job/task/workflow management
A note on terminology:
I use 'task' for a dump step (e.g. 'stubs', 'current page contents'), and 'job' for a piece of that step (e.g. stubs.xml3.gz). Think of jobs as subunits of tasks.
Analysts want the metadata about all pages early in the month for stats generation, so it's great to be able to schedule tasks that generate those (currently "stubs") on a given date via cron and have them get top priority over anything else that might be submitted by other cron jobs (currently wikidata dumps, cirrus search dumps, etc.)
The current dump of page contents requires that we have the metadata dump for those pages completed first. Then content for each revision is retrieved either from disk or from the database. While this mechanism might change in the rewrite, we'll likely still want to be able to specify that one task depends on another having completed successfully.
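To make the dependency requirement concrete, here is a minimal sketch of declaring that one task may only start once another has completed successfully. All names (`Task`, `depends_on`, `ready`) are hypothetical, not part of any existing design:

```python
# Hypothetical sketch: one dump task depends on another having completed.

class Task:
    def __init__(self, name, depends_on=None):
        self.name = name
        self.depends_on = depends_on or []
        self.done = False

    def ready(self):
        """A task may start only when all its dependencies have completed."""
        return all(dep.done for dep in self.depends_on)

stubs = Task("stubs")
content = Task("page-content", depends_on=[stubs])

assert not content.ready()   # metadata dump not finished yet
stubs.done = True
assert content.ready()       # now content retrieval may begin
```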
We currently want dumps of e.g. revision content for all revisions. This gets done in little pieces. The way we split it into little pieces is quite horrendous and requires rebalancing of the numbers every so often. We should be able to specify a callback that will create a bunch of subjobs for a task, all of which take some reasonable length of time to run such that a rerun of any one of them is "cheap enough". We should also be able to specify a callback that will combine output from completed jobs into files to be published to the public for download/consumption.
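The split/combine idea above could be sketched as a pair of callbacks: one partitions a task into subjobs of bounded size (so any single rerun is cheap), the other merges completed job outputs into a publishable artifact. Function names and the revision-range scheme are illustrative assumptions:

```python
# Illustrative split/combine callbacks; names and partitioning scheme
# are hypothetical, not the current dump code.

def split_into_jobs(total_revisions, max_per_job):
    """Partition a revision range into cheap-to-rerun subjobs."""
    jobs = []
    start = 1
    while start <= total_revisions:
        end = min(start + max_per_job - 1, total_revisions)
        jobs.append((start, end))
        start = end + 1
    return jobs

def combine_outputs(job_outputs):
    """Merge per-job output into one artifact for publication."""
    return "".join(job_outputs)

jobs = split_into_jobs(10, 4)
assert jobs == [(1, 4), (5, 8), (9, 10)]
```

The point is that the manager calls these per task, so rebalancing becomes a parameter change rather than hand-editing split points.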
Users like to know:
- when will the next dump for project X start?
- when will the current dump for project X finish?
- what tasks are complete for the current run of project X?
- what is the dump run for project X doing right now?
- tell me everything about all the latest dumps for language Y
- tell me everything about the latest dumps for project Z on all languages
Opsen (like me) want to know:
- what's broken?
- what errors were generated?
- when did the broken task(s) start?
- when did they stop/die?
- what's running on worker W?
- which jobs has worker W claimed that are neither running nor done?
- which jobs have failed and how many retries have been made?
Job concurrency limit across all workers
Recently jcrespo noticed slave lag on the vslow/dumps database for s1 (en wp). It turned out that other tasks run at the beginning of the month, in combination with the 27 page metadata dump jobs (stubs), were enough to push the server over the edge. When I configured the stubs to run in two batches with a max concurrency of 14 jobs, there was no lag. We'll want to be able to specify such a limit.
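A minimal sketch of such a cap, modeled here with a process-local semaphore; in a real distributed manager this would be a shared counter or lease service. The limit of 14 is the value from the incident above:

```python
# Sketch of a global job concurrency cap (assumption: a semaphore stands
# in for whatever shared mechanism the real manager would use).
import threading

MAX_CONCURRENT = 14
slots = threading.BoundedSemaphore(MAX_CONCURRENT)

def try_claim_job():
    """A worker claims a job slot only if the global cap allows it."""
    return slots.acquire(blocking=False)

# 20 jobs submitted, only 14 may run at once; the rest must wait.
claimed = sum(1 for _ in range(20) if try_claim_job())
assert claimed == 14
```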
Most of the time, an output file for a job means that the job ran successfully, and we would want a rerun to do nothing. But sometimes the output is bad because the job was interrupted in the middle, the box rebooted, there was a network glitch, etc. In these cases we want to be able to force a rerun which would regenerate all the output. It would be nice not to have to go in and remove output files manually for that to happen.
We use one script for all the runs. I sure don't want to have to write a separate class for each task for each of 800-something projects, so we ought to be able to pass in parameters easily.
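One way to picture the parameterization: a single task factory that builds a task spec from a wiki name, step, and date, so one definition serves all 800-odd projects. Everything here (the function, field names, `dump.py`) is a hypothetical illustration:

```python
# Sketch: one parameterized task definition reused across all wikis,
# instead of one class per task per project. All names are hypothetical.

def make_dump_task(wiki, step, date):
    """Build a task spec from parameters."""
    return {
        "name": f"{wiki}-{step}-{date}",
        "command": ["python", "dump.py", "--wiki", wiki, "--step", step],
        "date": date,
    }

tasks = [make_dump_task(w, "stubs", "20161001") for w in ("enwiki", "frwiki")]
assert tasks[0]["name"] == "enwiki-stubs-20161001"
assert tasks[1]["command"][3] == "frwiki"
```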
Support for tasks other than python scripts
Right now we run a number of bash scripts out of cron, e.g. wikidata dumps, cirrus search dumps, etc. I imagine that with the dumps rewrite we'll have even more folks wanting to generate weekly datasets of one sort or another. It would be nice to be able to easily convert these into tasks.
One of the things that saves us with the current dump setup is that the end 'stage' of the dump run generates dumps for all steps that have not completed successfully. This provides a chance for broken tasks to rerun. We'll want some sort of retry ability, as well as the ability to bail after some max retries, to wait some length of time between retries, etc.
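The retry behavior could look like the sketch below: rerun a failed job up to `max_retries` times, waiting between attempts. The function and its parameters are assumptions for illustration; the sleep is injectable (and stubbed here) so the example runs instantly:

```python
# Hedged sketch of retry-with-backoff; names and defaults are illustrative.

def run_with_retries(job, max_retries=3, wait_seconds=60, sleep=lambda s: None):
    attempts = 0
    while True:
        attempts += 1
        if job():
            return attempts          # success: report how many tries it took
        if attempts > max_retries:
            raise RuntimeError("giving up after max retries")
        sleep(wait_seconds)          # wait before the next attempt

# A job that fails twice, then succeeds on the third try:
outcomes = iter([False, False, True])
assert run_with_retries(lambda: next(outcomes)) == 3
```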
Distributed jobs across workers
Currently each task runs on one and only one server (snapshot host). For example, all 27 jobs that generate revision content for all revisions on en wikipedia run on the same snapshot host. It would be very nice indeed if any worker in the pool that had available resources could claim and run a job. In particular, we usually wind up waiting for 4 wikidata tasks to complete for a few days after everything else is done. Imagine if these could be split up properly into jobs and passed around to the other snapshot hosts to complete within a few hours instead. This would also let us take worker downtime or apply security updates with minimal impact to dump production.
I get asked for new dumps of one sort or another all the time. It would be nice if it were relatively easy for me and for others to add these new dumps as tasks. This biases me towards python as the platform language right away, or at the least that there should be a complete API for the platform in python. If other folks are happier in other languages, we might want support for them too (which ones?)
Sometimes I notice that a task is stuck and I just want to make it go away and reschedule it later. I've encountered this 3 out of the last 3 months in fact. This usually has to do with a bug in the code, but guess what, there will always be bugs. So the ability to have the manager kill an existing task and its jobs, clean up the broken files, (potentially) leave any output files from completed jobs of that task, and (potentially) reschedule for a time of my choosing later, would sure be nice.
No duplicate jobs/tasks running at the same time
This should go without saying, but just in case. Right now we do a bunch of annoying locking etc. to avoid this. The granularity isn't good enough to run multiple jobs for a task (or different tasks) at the same time, something that again is silly in a distributed execution framework. We shouldn't need locks at all.
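One lock-free way to guarantee no duplicates: make claiming a job a single atomic check-and-set, so the second worker to try simply loses. A real manager would use a database row insert or compare-and-swap; the dict here is just a stand-in:

```python
# Sketch of duplicate prevention via atomic claim (assumption: a dict
# stands in for a real atomic store such as a unique DB row).

claims = {}

def claim(job_id, worker):
    """Atomically claim a job; returns False if another worker has it."""
    return claims.setdefault(job_id, worker) == worker

assert claim("stubs.xml3.gz", "worker-a") is True
assert claim("stubs.xml3.gz", "worker-b") is False   # duplicate refused
```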
A given job may take more resources than another. For example, the wikidata json dumps run several processes concurrently I believe. (Check this!) We should be able to specify the resources any job will use (CPU cores at least) so that workers will only claim jobs when they have sufficient resources available.
State recovery after crash
If the host goes down, the process dies, etc., it's a waste of time to rerun everything from the start. Although the XML/sql dumps are set up so that this can be done without harm, with completed tasks being skipped, it would be better if the manager picked up where it left off. This is especially important for dumps run weekly or daily out of cron, where they need to complete within a day.
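Picking up where it left off could be as simple as checkpointing a completion record after each task, then skipping recorded tasks on restart. The file path and JSON format below are assumptions, not the real design:

```python
# Sketch of crash recovery via a persisted completion record
# (assumption: a JSON state file stands in for the manager's store).
import json, os, tempfile

def run_all(tasks, state_path):
    done = set()
    if os.path.exists(state_path):
        with open(state_path) as f:
            done = set(json.load(f))          # recover prior progress
    for name in tasks:
        if name in done:
            continue                          # skip already-completed work
        # ... actually run the task here ...
        done.add(name)
        with open(state_path, "w") as f:
            json.dump(sorted(done), f)        # checkpoint after each task
    return done

path = os.path.join(tempfile.mkdtemp(), "state.json")
run_all(["stubs", "content"], path)           # first run, then "crash"
resumed = run_all(["stubs", "content", "history"], path)
assert resumed == {"stubs", "content", "history"}
```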
We add tasks to run based on the order we want them to get done. This means we want metadata done early, we want current page content done pretty soon after that for all wikis, and page content history for all wikis can wait until near the end. In particular, we don't want to be locked into a situation where we specify tasks with dependencies for a wiki (DAG) and all elements in the DAG must run one after another.
Misc jobs that used to be run entirely out of cron could go in one queue, en wikipedia jobs in another, big wikis in a third, small wikis in a 4th (for example), so that dumps of small wikis don't get backed up behind everything else, so that the misc jobs are completed on the date they are submitted to the queue, etc.
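The multi-queue idea could be sketched as below: separate queues per job class, scanned in a fixed priority order, so small-wiki work is never stuck behind en wikipedia. Queue names match the example in the text; the scan policy is an assumption (it could equally be round-robin or weighted):

```python
# Sketch of per-class job queues; the fixed scan order is an assumed
# policy, not a settled design.
from collections import deque

queues = {"misc": deque(), "enwiki": deque(), "big": deque(), "small": deque()}

def submit(queue_name, job):
    queues[queue_name].append(job)

def next_job():
    """Scan queues in a fixed order; return the first available job."""
    for name, q in queues.items():
        if q:
            return name, q.popleft()
    return None

submit("enwiki", "stubs-1")
submit("small", "aawiki-stubs")
assert next_job() == ("enwiki", "stubs-1")
assert next_job() == ("small", "aawiki-stubs")   # not blocked behind enwiki
```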
API for job status checks
We don't want to rely on scraping the output of some web interface in order to provide status updates to e.g. irc channels. Rather, we want an API so that we can generate RSS feeds, html files, irc channel or other updates automatically when a job or a task completes. See also Monitoring.
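A minimal sketch of the kind of machine-readable status call feed generators could consume; the field names and JSON shape are hypothetical, not an actual API design:

```python
# Sketch of a job-status API returning JSON (all fields are assumed).
import json

RUNS = {
    "enwiki": {"task": "stubs", "state": "running",
               "jobs_done": 12, "jobs_total": 27},
}

def status(project):
    """Return run status as JSON for RSS/IRC/html generators to consume."""
    return json.dumps(RUNS.get(project, {"state": "unknown"}))

assert json.loads(status("enwiki"))["jobs_done"] == 12
assert json.loads(status("dewiki"))["state"] == "unknown"
```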
The output of some tasks may be used as input to the next immediately. Example: page content production -> format conversion -> compression of various types.
It would be great if all error output or progress reports could be centrally collected rather than stashed in various places on each worker host. This might be out of scope i.e. perhaps the right solution to this is that every script for a dump task sends its logs via rsyslog to a logging host.
I've gone through all the past notes on the rewrite project, my notes in various places for the scheduler/workflow manager, and various other tickets on dumps improvement. Everything I could pull out of those is in the above list. If ANYONE HAS SOMETHING TO ADD please don't be shy. I expect to settle on a solution in mid-November, so there's plenty of time to get this right. Adding a couple projects with folks that might have some thoughts on this!
@ArielGlen : I think we shouldn't need to choose a scheduling solution when we already have one that works in a distributed computing setup: oozie.
Maybe you can get in touch with @Ottomata at the ops meeting about using hadoop oozie for your purposes. We just wrapped up our prototype to reconstruct edit history from the mediawiki database; dumps can benefit from all that work. There is no need to deploy another parallel computing solution when computations can be done in hadoop.
I'm very interested in this, though we dump lots of other data. What formats are available for export? Where can I look at code and data? Is there a playground in labs perhaps?
Also, I'll check in with Ottomata for sure, but maybe you would want to look at the list of requirements (T143206), or, alternatively, the list of packages already under consideration (T143207), and add/comment on oozie's capabilities?
Additionally, is there other work analytics plans to do that could be used perhaps for generating other dumps? Maybe it makes sense to move more of these things into the analytics platform and eventually hand them over to the team?
Is there a playground in labs perhaps?
no, there is not. We have tried this before with little success.
What formats are available for export?
oozie is a scheduling system that runs code; your code can make whatever formats you need available for export.
add/comment on oozie's capabilities?
oozie can do pretty much everything you have outlined on this ticket. I think it will help if you get familiar with oozie, see how it is used, and play with it for a while. There is plenty of documentation online, but here are some docs for our setup: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Oozie.
It requires access to 1002/2004, which you must already have.
This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you still realistically work or plan to work on this task - it would be welcome!
For tips on how to manage individual work in Phabricator (noisy notifications, lists of tasks, etc.), see https://phabricator.wikimedia.org/T228575#6237124 for available options.
(For the record, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)