
List requirements needed for task/job/workflow manager.
Open, High, Public

Description

This includes everything from the simplest request to fine-grained control, for example:

  • accept one-off command-line requests for runs of a batch script
  • delete the en wp stubs part X output and schedule it for rerun
  • limit the run of en wp jobs to no more than 14 at a time for phase Y
  • show the status of an arbitrary group of jobs via http(s) in json/html format
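To make the "no more than 14 at a time" and "status in json" examples concrete, here is a minimal sketch using only the Python standard library. It is not a proposal for the actual manager: the `CappedRunner` class, the job names, and the status strings are all hypothetical, and a real tool would enforce the cap across worker hosts rather than across threads in one process.

```python
import json
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT = 14  # the cap used as an example in the description


class CappedRunner:
    """Hypothetical runner: executes a group of jobs under a concurrency cap."""

    def __init__(self, cap):
        self._slots = threading.Semaphore(cap)  # one slot per allowed job
        self._lock = threading.Lock()           # guards the counters below
        self._running = 0
        self.peak = 0                           # highest concurrency observed
        self.status = {}                        # job name -> status string

    def _run(self, name, func):
        with self._slots:                       # blocks while the cap is reached
            with self._lock:
                self._running += 1
                self.peak = max(self.peak, self._running)
                self.status[name] = "running"
            try:
                func()
                outcome = "done"
            except Exception:
                outcome = "failed"
            with self._lock:
                self._running -= 1
                self.status[name] = outcome

    def run_all(self, jobs, pool_size=32):
        """Run every job in `jobs` (name -> callable) and wait for completion."""
        with self._lock:
            for name in jobs:
                self.status[name] = "queued"
        # The pool is deliberately larger than the cap, so the semaphore is
        # what actually limits concurrency.
        with ThreadPoolExecutor(max_workers=pool_size) as pool:
            for name, func in jobs.items():
                pool.submit(self._run, name, func)

    def status_json(self):
        """Return the status of the whole group as a JSON string."""
        with self._lock:
            return json.dumps(self.status, sort_keys=True)


# Hypothetical job names; each job just sleeps briefly.
jobs = {f"enwiki-stubs-part{i}": (lambda: time.sleep(0.01)) for i in range(1, 41)}
runner = CappedRunner(MAX_CONCURRENT)
runner.run_all(jobs)
report = runner.status_json()
```

The same JSON string could be served over http(s) by whatever web layer the chosen tool provides.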

Event Timeline

I've carried over the list of features identified so far from the usage scenarios (T143205) and have made a stab at prioritizing them here.

Ratings are from 1 to 10, where 10 is "I would really really hate to give this up. Missing it is probably a blocker." and 1 is "Ah, doesn't have it? Shrug"

priority  feature
3         Multiple queues with priorities assigned to each
4         Centralized logging
5         Chained jobs (output of one job as input to the next, entire chain may be run at once)
5         Rerun after removing all previous output
5         Task/job deletion
5         No duplicate jobs/tasks running at the same time
6         Support for tasks other than python scripts
7         Job concurrency limit across all workers
7         Configurable retries
8         Task priorities
8         Task dependencies
8         Written in a language ops knows and can support (Python preferred)
8         Resource specification/management (at least CPU cores per job)
8         FIFO-ish (queue processing mostly in order submitted)
10        Map/reduce hooks
10        Monitoring (see table below)
10        Task arguments
10        Distributed jobs across workers per task
10        API for extending functionality (Python preferred)
10        State recovery after crash (implies persistent storage of job statuses)
10        API for job status checks
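As an illustration of what "Task dependencies" and "Configurable retries" could mean in practice, here is a minimal sketch built on the standard library's `graphlib` module (Python 3.9+). The function name, the status strings, and the rule that dependents of a failed task are skipped are assumptions made for the example, not behaviour taken from any candidate tool. It also assumes every dependency named in `depends_on` is itself a key of `tasks`.

```python
from graphlib import TopologicalSorter


def run_with_dependencies(tasks, depends_on, max_retries=2):
    """Run tasks in dependency order, retrying each one on failure.

    tasks:       task name -> callable taking no arguments
    depends_on:  task name -> set of task names that must finish first
    max_retries: extra attempts allowed after the first failure
    """
    results = {}
    # TopologicalSorter expects a mapping of node -> predecessors.
    graph = {name: set(depends_on.get(name, ())) for name in tasks}
    for name in TopologicalSorter(graph).static_order():
        # Assumed policy: do not run a task whose prerequisite did not finish.
        if any(results.get(dep) != "done" for dep in graph[name]):
            results[name] = "skipped"
            continue
        for _attempt in range(1 + max_retries):
            try:
                tasks[name]()
                results[name] = "done"
                break
            except Exception:
                results[name] = "failed"
    return results
```

A task that raises on every attempt ends up as "failed", and everything depending on it ends up as "skipped", which is the kind of state the monitoring requirements below would need to expose.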

Monitoring gets its own section here, ratings from 1 to 5. None of these are blockers but the greater the total, the better.

priority  available for monitoring
1         Estimated start time for a job or task
1         Estimated completion time for a job or task
3         Show completed jobs or tasks for a given time frame (restrict on language or wiki type)
3         Current status of jobs/tasks for a wiki/all projects of given language, etc
5         Temporary/permanent failure of job/task, number of retries done/left, error output, start/end time
5         Jobs/tasks on given worker(s) (restrict on language/wiki type)
5         Jobs/tasks claimed but not run
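One way to apply these ratings when comparing candidate tools is sketched below: features rated 10 are treated as potential blockers, and monitoring ratings are summed, since a greater total is better. The shortened item names and the `evaluate` helper are assumptions for illustration, and the example candidate in the usage lines is entirely hypothetical.

```python
# Features rated 10 in the main table (names shortened for the sketch).
BLOCKER_FEATURES = [
    "Map/reduce hooks",
    "Monitoring",
    "Task arguments",
    "Distributed jobs across workers per task",
    "API for extending functionality",
    "State recovery after crash",
    "API for job status checks",
]

# Monitoring items with their 1 to 5 ratings (names shortened for the sketch).
MONITORING_RATINGS = {
    "Estimated start time": 1,
    "Estimated completion time": 1,
    "Completed jobs/tasks for a time frame": 3,
    "Current status per wiki/language": 3,
    "Failure details and retries": 5,
    "Jobs/tasks on given workers": 5,
    "Jobs/tasks claimed but not run": 5,
}


def evaluate(supported):
    """Return (missing blocker features, monitoring total) for one candidate.

    supported: set of feature and monitoring item names the candidate offers.
    """
    missing_blockers = [f for f in BLOCKER_FEATURES if f not in supported]
    monitoring_total = sum(
        rating for item, rating in MONITORING_RATINGS.items() if item in supported
    )
    return missing_blockers, monitoring_total


# Hypothetical candidate: everything rated 10 except map/reduce hooks,
# plus the three monitoring items rated 5.
candidate = (set(BLOCKER_FEATURES) - {"Map/reduce hooks"}) | {
    "Failure details and retries",
    "Jobs/tasks on given workers",
    "Jobs/tasks claimed but not run",
}
missing, total = evaluate(candidate)
```

With these ratings the highest possible monitoring total is 23.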

Additional wants listed here without explicit priorities:

  • stable code base/apis
  • responsive upstream developers/maintainers
  • healthy support community
  • user auth for web status info if there's anything sensitive
  • encryption between hosts for anything sensitive

This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you still realistically work or plan to work on this task - it would be welcome!

For tips on how to manage individual work in Phabricator (noisy notifications, lists of tasks, etc.), see https://phabricator.wikimedia.org/T228575#6237124 for available options.
(For the record, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)