
Set up a system for team-managed command-line jobs
Closed, Resolved (Public)

Description

Thanks to T258970 and T267940, we have a system for running team-managed Oozie jobs: everyone on the team receives alerts and has privileges to manage them, and no work is necessary if the job's main author leaves the Foundation.

However, while Oozie is relatively well suited to ETL operations on the Data Lake, it is not a good fit for automating operations that run on the Analytics clients, like executing and publishing a Jupyter notebook. For those, we would normally set up a cron job (or possibly, as Luca suggested, a systemd timer). We should come up with a similar system for making such jobs ("command-line jobs", for lack of a better term) team-managed.
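As a concrete illustration, here is a minimal sketch of what one of these jobs could look like as a systemd service/timer pair. The unit names, paths, and the shared analytics-product user are assumptions for illustration, not an agreed-on design:

```
# publish-report.service (hypothetical unit; names and paths are illustrative)
[Unit]
Description=Run and publish a Product Analytics Jupyter notebook

[Service]
Type=oneshot
# Run as the shared team user so no individual account owns the job
User=analytics-product
ExecStart=/usr/bin/jupyter nbconvert --to html --execute \
    /srv/analytics-product/report.ipynb --output-dir /srv/published

# publish-report.timer (hypothetical)
[Unit]
Description=Daily trigger for publish-report.service

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
```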

That said, it may be better to wait for T267940, since whatever replaces Oozie (likely Airflow) will probably work well for command-line jobs.
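If Airflow is indeed the eventual replacement, one of these jobs might be expressed as a small DAG. This is a hedged sketch assuming Airflow 2's BashOperator and a hypothetical notebook path, not a confirmed design:

```python
# Hypothetical Airflow DAG wrapping the same notebook-publishing command.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="publish_report",  # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    BashOperator(
        task_id="run_notebook",
        bash_command=(
            "jupyter nbconvert --to html --execute "
            "/srv/analytics-product/report.ipynb --output-dir /srv/published"
        ),
    )
```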

A Slack conversation that provides useful background:

@nshahquinn-wmf: What do you think about us just picking one analytics client and deciding that we'll run all our crons as analytics-product on that machine? There might be some risk of overload on that machine, but I don't expect us to have that many crons and the ones we have will mostly just be querying the Data Lake and then publishing the output to the web. We should wait for @mpopov's take, though.
@elukey: This is a very good thing to discuss; I think it is worth a task. I have been thinking about this for a while: after the big stat100x re-unification project, stat1007 was left running some team-specific crons/tasks (see profile::statistics::explorer::misc_jobs) that I always wanted to get rid of, since stat boxes should (in my mind) be only client nodes. For Analytics-owned systemd timers (our version of cron, with more logging, alarming, etc., which I recommend) we use a dedicated host, an-launcher1002, so we could do the same if your team needs a place to run timers too. For example, we could create a small VM running profile::statistics::explorer::misc_jobs and add a basic structure of the things your team needs to run to puppet. I know that puppet is not appealing, but it would avoid losing everything if an OS reinstall happens or a host needs to be replaced. What do you think?
@mpopov: The biggest problem with puppet for us is package management. We rely heavily on R and Python packages, so for us it largely comes down to: what is the process for installing and using new packages? Any solution/process we employ has to support free & flexible package management, whether the packages are ones we develop & maintain ourselves and host on Gerrit or open-source ones that need to be installed from places like GitHub. Hence the idea to pick one stat100x host and just run everything from there as the analytics-product user, since package management at that point is no different from what we already do with our individual user accounts.
@elukey: For package management, I think we could use a dedicated Gerrit repository, deployed via scap when needed. Or we could have puppet deploy the latest version to a node and have the systemd timers rely on it. That way everything would be "movable" from one host to another, and Analytics would be aware of your needs. We can also think about the stat100x approach, but it will be less visible to us when we do maintenance. If this is fine I think we can proceed, and review it in the future if needed 🙂
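As a concrete illustration of the single-host option discussed above, the team's jobs could live in the shared user's crontab on one stat100x host. The host, script paths, and schedule here are hypothetical:

```
# Hypothetical crontab for the shared analytics-product user
# (edited with something like: sudo -u analytics-product crontab -e)

# Refresh and publish the daily report at 06:00 UTC
0 6 * * * /srv/analytics-product/jobs/publish_report.sh >> /srv/analytics-product/logs/publish_report.log 2>&1
```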

Event Timeline

LGoto triaged this task as Medium priority. Jan 12 2021, 6:13 PM
LGoto moved this task from Triage to Backlog on the Product-Analytics board.

We discussed this in a meeting: we currently have at least 5 different jobs being run by different team members, so there would be real value in getting this done.