
Discuss common needs in a job manager/scheduler
Open, Medium, Public


Folks are looking into job manager/scheduler software for use with analytics, search, dumps, CI. Can we find a solution good enough for us all? We won't know until we discuss our common needs.

Let's each summarize in a couple paragraphs our basic needs. An IRC discussion will follow, the content of which will also be summarized on this task.

Related tasks:

Also: CI Seakeeper Proposal

Event Timeline

ArielGlenn triaged this task as Medium priority. Nov 5 2019, 8:52 AM
ArielGlenn created this task.

For RelEng and CI, is long, but the most important parts are introduction and requirements, which aren't all that long.

Shamelessly stealing @EBernhardson's email:

I have not fully documented these anywhere; truly, I was intending to simply install airflow for our use cases and let other people solve theirs in whatever way seems most fit to purpose. In our case a compute cluster already exists that runs our jobs (hadoop/yarn), and that compute cluster has a workflow orchestrator (oozie). The need to run our tasks on pre-existing compute essentially eliminates Argo from our possible ways forward, which is why I haven't investigated it too closely.

Tasks that we currently schedule:

  • Wait for source data to become available from external process (eventgate, camus, another DAG task, etc.), run some computation over it, write outputs somewhere (hdfs, druid, swift, etc). This primarily works through expected directory locations parameterized on the run schedule, and monitoring hdfs for files to appear where expected. These tasks typically operate over all wikis at the same time.
  • Orchestrate an ML workflow. This involves primarily chaining inputs and outputs for a number of different algorithms to build up a variety of intermediate data structures on hdfs. Some outputs are reused, and some computations may not be required depending on the configuration supplied. Some tasks operate on all projects at once, while others need to run a task per-project (eswiki, plwiki, etc). Additionally, later jobs in the pipeline, particularly model training, need to look at the input data sizes when submitting jobs to the compute cluster, so they can request an appropriate amount of memory to load the feature vector matrices. Lots of flexibility is needed here to allow quick experimentation. Algorithms invoked should be easily replaceable with others that emit the same outputs regardless of inputs.
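To make the two task types above a bit more concrete, here is a minimal sketch in plain Python. It is not taken from the email: the partition layout, the `_SUCCESS` flag file name, the 1.5x overhead factor and all function names are assumptions chosen for illustration, and a local directory stands in for hdfs.

```python
from datetime import datetime
from pathlib import Path
import math


def expected_partition(base: Path, run_time: datetime) -> Path:
    """Build the directory a scheduled run expects its hourly input to land in.

    The year=/month=/day=/hour= layout is an assumed convention.
    """
    return (
        base
        / f"year={run_time.year}"
        / f"month={run_time.month}"
        / f"day={run_time.day}"
        / f"hour={run_time.hour}"
    )


def is_ready(partition: Path, flag: str = "_SUCCESS") -> bool:
    """Treat a partition as available once its completion flag file exists.

    The flag file name is an assumption; a real check would query hdfs.
    """
    return (partition / flag).is_file()


def memory_request_mb(input_bytes: int, overhead: float = 1.5, floor_mb: int = 1024) -> int:
    """Size a memory request from the input size, with headroom and a floor.

    The overhead factor and floor are placeholder values, not measured ones.
    """
    needed = math.ceil(input_bytes * overhead / (1024 * 1024))
    return max(floor_mb, needed)
```

A scheduler task would call `is_ready(expected_partition(base, run_time))` repeatedly until it returns true, then size the training job with `memory_request_mb` before submitting it to the cluster.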

The first set of tasks is reasonably solved by the existing system, oozie; the second not so much. Oozie uses static XML templates for its jobs, allowing templating of the content/attributes of the XML but not the structure of the XML itself. To me this looks very similar to argo's YAML, and this is pretty much the exact opposite of the flexibility we need with the ML workflow. The only way to, for example, perform per-wiki jobs in oozie is to define a per-wiki workflow and then copy/paste an include of that workflow, parameterized for each wiki you want. This can't even use a for loop, so there is no guarantee that the list of wikis passed into early tasks to use as a filter is the same as the set of per-wiki workflows created. We investigated writing a layer over the top that wrote out oozie XML files instead of writing XML by hand, but the additional complexity (especially as compared to airflow) is a big downside.
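As an illustration of the for-loop point, the sketch below builds a per-wiki workflow from one list, so the filter task and the per-wiki tasks cannot drift apart. It uses only the Python standard library (`graphlib`, Python 3.9+) rather than airflow or oozie; the task names and the `build_workflow` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical wiki list; the single source of truth for the whole workflow.
WIKIS = ["eswiki", "plwiki"]


def build_workflow(wikis):
    """Return {task: set of upstream tasks} for a filter -> per-wiki -> merge flow."""
    # The filter task is derived from the same list as the per-wiki tasks.
    filter_task = "filter:" + ",".join(wikis)
    graph = {filter_task: set()}
    per_wiki = []
    for wiki in wikis:
        task = f"train:{wiki}"
        graph[task] = {filter_task}  # each per-wiki task waits on the filter
        per_wiki.append(task)
    graph["merge"] = set(per_wiki)  # merge waits on every per-wiki task
    return graph


# One valid execution order: the filter first, the merge last.
order = list(TopologicalSorter(build_workflow(WIKIS)).static_order())
```

Adding a wiki means changing `WIKIS` only; the structure of the graph follows from it, which is the kind of structural templating that static XML does not allow.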

Additional notes:

The system needs to be testable (one of the limitations we have with Oozie at the moment).

ArielGlenn updated the task description.

Sorry for that poor task editing. And now:

In two paragraphs, main wants/needs for the dumps

We need to be able to run subjobs across multiple worker hosts, check on their progress via an api, allow community members to check on their progress (ideally with an ETA), and recombine results of these subjobs into larger output files while keeping the subjob results around too. An API that is flexible enough for that, accessible via Python, or that can run arbitrary commands (MediaWiki scripts, for example) is pretty much a necessity.

If any process dies, we want the ability to reuse intermediate files as desired, rather than retrying a job from the beginning. Maintaining multiple queues, with some sort of prioritizing of which things get run in which order, is also pretty much a hard requirement, so that everything gets done on time even when things break.
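A rough sketch of what those two paragraphs ask for, in plain Python: subjobs that skip work when their intermediate file already exists, a priority queue deciding run order, and a recombine step that leaves the parts in place. This is not the dumps code; every name here is hypothetical, and a single local process stands in for multiple worker hosts.

```python
import heapq
from pathlib import Path


def run_subjob(out: Path, produce) -> bool:
    """Run produce() only if the intermediate file is missing.

    Returns True if work was done, False if an existing file was reused.
    """
    if out.exists():
        return False
    tmp = out.with_name(out.name + ".tmp")
    tmp.write_text(produce())
    tmp.rename(out)  # publish only complete files, so a crash leaves no partial output
    return True


def run_queue(jobs):
    """Run jobs given as (priority, name, out_path, produce); lower numbers run first.

    Returns the names of the jobs that actually did work, in run order.
    """
    # The sequence number breaks ties so paths and callables are never compared.
    heap = [(prio, seq, name, out, produce)
            for seq, (prio, name, out, produce) in enumerate(jobs)]
    heapq.heapify(heap)
    ran = []
    while heap:
        _, _, name, out, produce = heapq.heappop(heap)
        if run_subjob(out, produce):
            ran.append(name)
    return ran


def recombine(parts, combined: Path) -> None:
    """Concatenate part files into one output, keeping the parts around."""
    combined.write_text("".join(p.read_text() for p in parts))
```

Rerunning `run_queue` with the same job list after a failure only redoes the subjobs whose output files are missing, which is the reuse behaviour described above.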

A longer list is available on T143206#2626989 but you don't need to wade through that; if we get our top requirements down and compare them, that should be enough to tell us if a common solution is a go or no-go.

Unauthorized summary of Thursday's IRC chat (please poke me to edit anything important I missed, or mischaracterized)

What we need, what we'll look at, timeframe

  • We are all interested in more than just a job manager/scheduler; we need something that handles workflows
  • The two top contenders are Argo (for CI) and Airflow (for everyone else)
  • Airflow does not have any CI component; its use for CI would require building that from scratch (prohibitive)
  • Analytics and Search both need to use data in HDFS (and store results there?); Dumps mid-term wants this too
  • Releng will move ahead with Argo; the rest of us will look at Argo and see whether it's close enough to work for us, compared to Airflow
  • Analytics implementation of whichever choice probably won't happen until next fiscal year; this sets the joint timeframe for preliminary work

Other notes

  • Argo and Airflow both support DAGs and k8s
  • Argo requires k8s, Airflow does not although it can use it
  • If we do all use k8s (either with Airflow or Argo), we likely want more than one cluster
  • Custom resource definition examples for Argo:
  • marxarelli volunteered to be a resource person for Argo and has submitted patches upstream

It is possible that Argo will not be the CI choice, see this discussion: That is, it is possible that a third party service might be chosen that is not necessarily Argo. I don't know how likely that is, and there are no further docs or tasks about this (yet). I'd like to wait to see this settle a bit before we do an in-depth Argo eval for other purposes.