This includes everything from 'able to accept one-off command-line requests to run a batch script' to 'can delete the en wp stubs part X output and schedule it for rerun' to 'limit en wp jobs to no more than 14 running at a time during phase Y' to 'show the status of an arbitrary group of jobs via HTTP(S) in JSON or HTML format', etc.
Description
Status | Subtype | Assigned | Task
---|---|---|---
Open | | None | T128513 Dumps 2.0 Platform design questions
Open | | None | T128520 What already available software can we build on for a job scheduler?
Stalled | | None | T146070 Review progress of dumps rewrite
Open | | None | T143205 Draft usage scenarios for job/workflow manager
Open | | None | T143206 List requirements needed for task/job/workflow manager.
Open | | None | T143207 Evaluate software packages for job/task/workflow management
Event Timeline
I've carried over the list of features identified so far from the usage scenarios (T143205) and have taken a stab at prioritizing them here.
Ratings are from 1 to 10, where 10 is "I would really really hate to give this up. Missing it is probably a blocker." and 1 is "Ah, doesn't have it? Shrug"
priority | feature |
---|---|
3 | Multiple queues with priorities assigned to each |
4 | Centralized logging |
5 | Chained jobs (output of one job as input to the next, entire chain may be run at once) |
5 | Rerun after removing all previous output |
5 | Task/job deletion |
5 | No duplicate jobs/tasks running at the same time |
6 | Support for tasks other than python scripts |
7 | Job concurrency limit across all workers |
7 | Configurable retries |
8 | Task priorities |
8 | Task dependencies |
8 | Written in a language ops knows and can support (Python preferred) |
8 | Resource specification/management (at least CPU cores per job) |
8 | FIFO-ish (queue processing mostly in order submitted) |
10 | Map/reduce hooks |
10 | Monitoring (see table below) |
10 | Task arguments |
10 | Distributed jobs across workers per task |
10 | API for extending functionality (Python preferred) |
10 | State recovery after crash (implies persistent storage of job statuses) |
10 | API for job status checks |
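As a rough aid for the package evaluation (T143207), the ratings above could be encoded and checked against a candidate's feature set. This is only a sketch: the function names and the example candidate are hypothetical, and summing the ratings into a single total is an assumption of this sketch rather than something the table prescribes. The only rule taken from the ratings themselves is that a missing feature rated 10 is probably a blocker.

```python
# Ratings copied from the feature table above (1-10; 10 = probably a blocker if missing).
FEATURE_RATINGS = {
    "Multiple queues with priorities assigned to each": 3,
    "Centralized logging": 4,
    "Chained jobs": 5,
    "Rerun after removing all previous output": 5,
    "Task/job deletion": 5,
    "No duplicate jobs/tasks running at the same time": 5,
    "Support for tasks other than python scripts": 6,
    "Job concurrency limit across all workers": 7,
    "Configurable retries": 7,
    "Task priorities": 8,
    "Task dependencies": 8,
    "Written in a language ops knows and can support": 8,
    "Resource specification/management": 8,
    "FIFO-ish": 8,
    "Map/reduce hooks": 10,
    "Monitoring": 10,
    "Task arguments": 10,
    "Distributed jobs across workers per task": 10,
    "API for extending functionality": 10,
    "State recovery after crash": 10,
    "API for job status checks": 10,
}

BLOCKER_RATING = 10


def missing_blockers(supported):
    """Return the rating-10 features that a candidate does not support, sorted by name."""
    return sorted(
        name
        for name, rating in FEATURE_RATINGS.items()
        if rating == BLOCKER_RATING and name not in supported
    )


def rating_total(supported):
    """Sum the ratings of the listed features a candidate supports (assumed metric)."""
    return sum(
        rating for name, rating in FEATURE_RATINGS.items() if name in supported
    )


if __name__ == "__main__":
    # Hypothetical candidate: supports everything except two features.
    candidate = set(FEATURE_RATINGS) - {"Map/reduce hooks", "Centralized logging"}
    print("missing blockers:", missing_blockers(candidate))
    print("rating total:", rating_total(candidate))
```

A candidate with a non-empty `missing_blockers` result would need a closer look before the total is worth comparing at all.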
Monitoring gets its own section here, with ratings from 1 to 5. None of these are blockers, but the greater the total, the better.
priority | available for monitoring |
---|---|
1 | Estimated start time for a job or task |
1 | Estimated completion time for a job or task |
3 | Show completed jobs or tasks for a given time frame (restrict on language or wiki type) |
3 | Current status of jobs/tasks for a wiki/all projects of given language, etc |
5 | Temporary/permanent failure of job/task, number of retries done/left, error output, start/end time |
5 | Jobs/tasks on given worker(s) (restrict on language/wiki type) |
5 | Jobs/tasks claimed but not run |
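The "greater the total, the better" rule for monitoring can be made concrete with a small scoring helper. This is a sketch under the assumption that the total is simply the sum of the ratings of the monitoring items a package provides; the function name and the shortened item labels are illustrative and not taken from any existing tool.

```python
# Ratings copied from the monitoring table above (1-5; none are blockers).
MONITORING_RATINGS = {
    "Estimated start time": 1,
    "Estimated completion time": 1,
    "Completed jobs/tasks for a given time frame": 3,
    "Current status of jobs/tasks per wiki/language": 3,
    "Failure details, retries, error output, start/end time": 5,
    "Jobs/tasks on given worker(s)": 5,
    "Jobs/tasks claimed but not run": 5,
}


def monitoring_total(available):
    """Sum the ratings of the monitoring items a candidate makes available.

    Raises ValueError for names that are not in the table, so that typos
    do not silently lower a candidate's score.
    """
    items = set(available)
    unknown = items - MONITORING_RATINGS.keys()
    if unknown:
        raise ValueError(f"unknown monitoring items: {sorted(unknown)}")
    return sum(MONITORING_RATINGS[item] for item in items)


if __name__ == "__main__":
    # Hypothetical candidate exposing only the three items rated 5.
    candidate = [
        "Failure details, retries, error output, start/end time",
        "Jobs/tasks on given worker(s)",
        "Jobs/tasks claimed but not run",
    ]
    print(monitoring_total(candidate), "of", sum(MONITORING_RATINGS.values()))
```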
Additional wants listed here without explicit priorities:
- stable code base/apis
- responsive upstream developers/maintainers
- healthy support community
- user auth for web status info if there's anything sensitive
- encryption between hosts for anything sensitive
This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you still realistically work or plan to work on this task - it would be welcome!
For tips on how to manage individual work in Phabricator (noisy notifications, lists of tasks, etc.), see https://phabricator.wikimedia.org/T228575#6237124 for available options.
(For the record, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)