
Toolforge: beta phase for the new jobs framework
Open, Medium, Public

Description

The initial build out of the new Toolforge jobs framework has been deployed.

We move into the BETA phase, where we expect to:

  • collect feedback from beta testers
  • identify & correct flaws in the design
  • identify & correct general bugs
  • add new or missing features that may be interesting to the community

How to use the framework during this beta stage:

$ ssh dev-buster.toolforge.org
$ become $mytool
$ toolforge-jobs containers
Short name     Docker container image
-------------  ----------------------------------------------------------------------
tf-buster-std  docker-registry.tools.wmflabs.org/toolforge-buster-standalone:latest
tf-golang111   docker-registry.tools.wmflabs.org/toolforge-golang111-sssd-base:latest
tf-jdk11       docker-registry.tools.wmflabs.org/toolforge-jdk11-sssd-base:latest
tf-node10      docker-registry.tools.wmflabs.org/toolforge-node10-sssd-base:latest
tf-php73       docker-registry.tools.wmflabs.org/toolforge-php73-sssd-base:latest
tf-python37    docker-registry.tools.wmflabs.org/toolforge-python37-sssd-base:latest
tf-ruby25      docker-registry.tools.wmflabs.org/toolforge-ruby25-sssd-base:latest
tf-tcl86       docker-registry.tools.wmflabs.org/toolforge-tcl86-sssd-base:latest
wm-buster      docker-registry.tools.wmflabs.org/wikimedia-buster:latest
wm-stretch     docker-registry.tools.wmflabs.org/wikimedia-stretch:latest
[..]

$ # running a normal job:
$ toolforge-jobs run myjob --command ./mycommand.sh --image tf-buster-std

$ # running a normal job and waiting for it to complete:
$ toolforge-jobs run myotherjob --command ./myothercommand.sh --image tf-buster-std --wait

$ # running a continuous job:
$ toolforge-jobs run myalwaysrunningjob --command ./myendlesscommand.sh --image tf-buster-std --continuous

$ # running a scheduled job:
$ toolforge-jobs run myscheduledjob --command ./everyminute.sh --image tf-buster-std --schedule "* * * * *"

$ toolforge-jobs list
Job name:           Command:                 Job type:            Container:     Status:
------------------  -----------------------  -------------------  -------------  ---------------------------
myscheduledjob      ./everyminute.sh         schedule: * * * * *  tf-buster-std  Last schedule time: 2021-06-30T10:26:00Z
myalwaysrunningjob  ./myendlesscommand.sh    continuous           tf-buster-std  Running
myjob               ./mycommand.sh           normal               tf-buster-std  Completed

$ toolforge-jobs show myscheduledjob
+------------+-----------------------------------------------------------------+
| Job name:  | myscheduledjob                                                  |
+------------+-----------------------------------------------------------------+
| Command:   | ./everyminute.sh                                                |
+------------+-----------------------------------------------------------------+
| Job type:  | schedule: * * * * *                                             |
+------------+-----------------------------------------------------------------+
| Container: | tf-buster-std                                                   |
+------------+-----------------------------------------------------------------+
| Status:    | Last schedule time: 2021-06-30T10:26:00Z                        |
+------------+-----------------------------------------------------------------+
| Hints:     | Last run at 2021-06-30T10:26:08Z. Pod in 'Pending' phase. State |
|            | 'waiting' for reason 'ContainerCreating'.                       |
+------------+-----------------------------------------------------------------+

$ toolforge-jobs delete myscheduledjob
$ toolforge-jobs flush
$ toolforge-jobs list
[.. nothing ..]


Event Timeline

aborrero triaged this task as Medium priority. Jul 1 2021, 3:33 PM

I’d love to use the new jobs framework for Wikidata Shape Expressions Inference, which currently runs as a Grid webservice so that it can schedule Grid jobs.

If toolforge-jobs is only available on dev-buster.toolforge.org – is the SSH fingerprint for that host available somewhere? (I don’t see it on Help:SSH Fingerprints yet, nor in the draft announcement.)

If toolforge-jobs is only available on dev-buster.toolforge.org – is the SSH fingerprint for that host available somewhere? (I don’t see it on Help:SSH Fingerprints yet, nor in the draft announcement.)

https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/dev-buster.toolforge.org

Thanks for the fingerprint.

  • Are standard output and error captured automatically, like on the Grid? I ran a short job, and I don’t see any corresponding .out/.err files, nor does the output appear in toolforge-jobs show.
  • It looks like it’s not possible to pass arguments to the command being executed; I assume we’re expected to put any nontrivial commands into a shell script and use the shell script as the command?
  • toolforge-jobs -h describes list and delete in terms of running jobs, but it looks like jobs stay in toolforge-jobs list after completing; are we expected to toolforge-jobs delete each job when we no longer care about it (or run toolforge-jobs flush)?
  • toolforge-jobs without arguments doesn’t seem to do anything useful (as far as I can tell, it’s more or less equivalent to sleep 1s). I would expect it to either default to one of the available subcommands (list seems like the most natural choice), or print a help message. (I guess that’s T285961 and the new version wasn’t rolled out yet?)
  • It looks like it’s not possible to pass arguments to the command being executed; I assume we’re expected to put any nontrivial commands into a shell script and use the shell script as the command?

You can add it in quotes, like --command "./my-command.py foo bar".

  • toolforge-jobs -h describes list and delete in terms of running jobs, but it looks like jobs stay in toolforge-jobs list after completing; are we expected to toolforge-jobs delete each job when we no longer care about it (or run toolforge-jobs flush)?

Sounds like a bug, since leaving them hanging around indefinitely will create problems for the cluster as a whole.

  • toolforge-jobs without arguments doesn’t seem to do anything useful (as far as I can tell, it’s more or less equivalent to sleep 1s). I would expect it to either default to one of the available subcommands (list seems like the most natural choice), or print a help message. (I guess that’s T285961 and the new version wasn’t rolled out yet?)

Correct.

  • It looks like it’s not possible to pass arguments to the command being executed; I assume we’re expected to put any nontrivial commands into a shell script and use the shell script as the command?

You can add it in quotes, like --command "./my-command.py foo bar".

I tried that and it didn’t work:

starting container process caused: exec: "echo hi": executable file not found in $PATH: unknown'.
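
(For reference, the wrapper-script approach mentioned above should sidestep this, since only the script path is passed as the command; the file and job names below are just placeholders.)

$ cat > wrapper.sh <<'EOF'
#!/bin/bash
# placeholder wrapper: put the real command line, with its arguments, here
echo hi
EOF
$ chmod +x wrapper.sh
$ toolforge-jobs run echo-hi --command ./wrapper.sh --image tf-buster-std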

Is there a way to dump/export the toolforge-jobs config so it can be committed in Git? And then a way to import that config back. I tend to commit crontabs for documentation and transparency purposes.

Is there a way to dump/export the toolforge-jobs config so it can be committed in Git? And then a way to import that config back. I tend to commit crontabs for documentation and transparency purposes.

see T286492: toolforge-jobs: load jobs from a file
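
(That task is still open, so there is no file format yet; purely as an illustration of the idea, such a committed file might look roughly like the following. The field names are hypothetical, not a committed interface.)

# hypothetical sketch only; the real format is being designed in T286492
- name: myjob
  command: ./mycommand.sh
  image: tf-buster-std
- name: myscheduledjob
  command: ./everyminute.sh
  image: tf-buster-std
  schedule: "* * * * *"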

Mentioned in SAL (#wikimedia-cloud) [2021-07-15T15:55:46Z] <arturo> added toolforge-jobs-framework-cli_2_all.deb to buster-{tools,toolsbeta} (T285944)

Hello,

one of the things that I need for my normal Toolforge work is -cwd equivalent -- to run the job in the current directory, to avoid specifying the full path. Can that be done using the new framework?

Also, why doesn't it work under personal accounts? :-)

one of the things that I need for my normal Toolforge work is -cwd equivalent -- to run the job in the current directory, to avoid specifying the full path. Can that be done using the new framework?

Not yet. Can you create a subtask for that please?

Also, why doesn't it work under personal accounts? :-)

Because we don't provision Kubernetes credentials for normal users. You shouldn't be running any non-interactive processes with your personal account anyways (per rule 3).

Field report from trying to use toolforge-jobs for the wd-shex-infer tool:

  • I’m assuming that tools are expected to create jobs via the REST API, rather than using the toolforge-jobs command (which may or may not be available in their container). I eventually managed to figure out how to do this, mostly by reading the server source code (the Wikitech page is currently outdated), but I hope that this will be packaged up in nice, easy-to-use libraries for at least a few of the most popular programming languages on Toolforge (Python, PHP?, Rust?). The HTTP API itself isn’t too bad; the problem is more in the connection / session setup: in my current code I resorted to simply hard-coding various pieces of information (paths to the client certificate and key, API base URL) which the toolforge-jobs command apparently meticulously pieces together from multiple config files (/etc/toolforge-jobs-framework-cli.cfg, ~/.kube/config), and I don’t think every tool should have to implement the same config parsing (a sketch of such a call follows after this report).
  • Also, I can only assume that clients will not be required to turn off certificate verification to talk to the API by the time it goes out of beta.
  • The endpoint to create a job responds with the HTTP status code 200 OK and the string “201” in the body. It should probably return the status code 201 Created?
  • The information you can get about a job from the “show” or “list” endpoint doesn’t seem very useful. Most of the JSON fields are parameters that I sent when I created the job (i.e. I already knew those); the remaining ones look like free-form text that might be nice to show in toolforge-jobs output, but isn’t very pleasant to parse.
{"name": "echo-hi", "cmd": "echo hi", "image": "tf-buster-std", "user": "wdshexinfer", "namespace": "tool-wd-shex-infer", "filelog": "True", "status_short": "Running", "status_long": "Last run at 2021-07-23T19:03:04Z. Pod in 'Pending' phase. State 'waiting'. Reason 'ContainerCreating'."}
{"name": "echo-hi", "cmd": "echo hi", "image": "tf-buster-std", "user": "wdshexinfer", "namespace": "tool-wd-shex-infer", "filelog": "True", "status_short": "Completed", "status_long": "Last run at 2021-07-23T19:03:04Z. Pod in 'Succeeded' phase. State 'terminated'. Reason 'Completed'. Started at '2021-07-23T19:03:06Z'. Finished at '2021-07-23T19:03:07Z'. Exit code '0'."}
  • (cont. because ReMarkup doesn’t support multi-paragraph lists like Markdown does) The status_short field looks more useful, but according to the source code, for cronjobs that’s still a formatted string, not just one of several “enum” values.
    • (Side note: "user": "wdshexinfer"? Where did the hyphens go?)
  • On the other hand, if there was more information in that JSON, it wouldn’t necessarily be very useful, because it gets thrown away (garbage collected, cleaned up, T286108) after a pretty short time – one or two minutes, I think. So if my tool only checks the job status when a web request comes in, any completed job is probably long gone by the time the tool tries to get its status. (But this is pretty much like on the Grid – this tool is only interested in “is the job finished or not”, so that’s okay.)
  • Ultimately, I probably can’t use toolforge-jobs for wd-shex-infer because no single image has all the programs needed by the tool. The tool is essentially a wrapper around a giant house of cards combining make, sed, jq, lsof, several Java programs, some Node programs, and possibly other runtimes I’ve forgotten. I managed to make it work on the Grid a few years ago, but it would require some more work to make it work on toolforge-jobs (possibly coordinating several jobs for the different parts), if it’s possible at all. But that’s not a shortcoming of toolforge-jobs, of course.

I’ve pushed my code to run and show jobs at R2390:af76aea06251: WIP: Run jobs on toolforge-jobs, if anyone’s interested.
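
(For anyone curious what that hard-coded setup amounts to, here is a rough sketch; the base URL, endpoint paths, payload field names and certificate paths below are placeholders inferred from the discussion above, not a documented interface.)

# Rough sketch only: base URL, endpoint paths, payload fields and cert/key
# paths are placeholders, not a documented interface.
import requests

API_BASE = "https://jobs.example.invalid/api/v1"        # placeholder base URL
CERT = "/data/project/mytool/.toolskube/client.crt"     # placeholder path
KEY = "/data/project/mytool/.toolskube/client.key"      # placeholder path

session = requests.Session()
session.cert = (CERT, KEY)   # client certificate normally pieced together from ~/.kube/config
session.verify = False       # hopefully unnecessary once the beta has trusted certificates

# create a job (field names mirror the JSON shown above; the create endpoint may differ)
session.post(API_BASE + "/run/", data={"name": "echo-hi", "cmd": "echo hi", "image": "tf-buster-std"})

# poll its status
print(session.get(API_BASE + "/show/echo-hi").json())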

  • I’m assuming that tools are expected to create jobs via the REST API, rather than using the toolforge-jobs command (which may or may not be available in their container).

I think the API is actually considered an implementation detail and users are expected to use the toolforge-jobs command from a bastion (currently only from the dev-buster.toolforge.org bastion). This assumption seems reasonable to me, as this is seen as a replacement for both jsub and crontab, and that functionality is only available from the bastions today for the grid as well. (It is technically possible to use jsub from a process running on the grid, but that is honestly so gross that I try and block it out from my mind.)

(It is technically possible to use jsub from a process running on the grid, but that is honestly so gross that I try and block it out from my mind.)

Then I’ve completely misunderstood this project, because my assumption was that it was supposed to provide a modern replacement for this functionality. (I also assumed that this was intentionally supported, but I don’t remember if that assumption was ever backed up by something. Current documentation on Wikitech doesn’t mention submitting grid jobs from webservices, as far as I can tell.)

I would suggest the opposite. The reason for the design decision of using an API is to allow multiple frontends in the future. As for whether or not you want to submit jobs from tool pods, that is dependent on having k8s credentials available for now since I don't think the API has implemented using the service account credentials. I would think that tickets about things like correcting http status codes during the beta will improve the API whether or not it ends up widely used by end users directly (though I think I know why that code happens).

Honestly, the ability to authenticate with service account credentials inside k8s isn't a bad feature request, though I don't think it's in the cards in the near future.

As for submitting grid jobs from web services, when I made that impossible, I was quickly asked to fix that by users of the functionality :)

@LucasWerkmeister I'd think it is totally reasonable to assume that the first thing we are concerned with is a viable replacement for cronjobs on the grid and one-off jobs submitted with jsub. If you really want to submit a job from k8s with APIs directly, you might find it easier to simply use an in-language k8s client and submit the job directly to k8s (since most clients can automatically handle the auth-by-config vs. auth-by-serviceaccount-in-pod dance). It requires more definition of the jobs, but once you are already going after the REST API directly, that's what k8s is anyway.
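
(For illustration, a minimal sketch of that direct approach with the official Python Kubernetes client; the job name, command and namespace are placeholders, and a real tool's pod spec would likely need more, e.g. volume mounts and resource limits.)

# Sketch of submitting a one-off Job straight to the Kubernetes API with the
# official Python client, as suggested above. Names, namespace and command are
# placeholders for whatever the tool actually uses.
from kubernetes import client, config

# On a bastion this reads ~/.kube/config; inside a pod you would call
# config.load_incluster_config() to use the service account instead.
config.load_kube_config()

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="echo-hi"),
    spec=client.V1JobSpec(
        backoff_limit=0,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="echo-hi",
                        image="docker-registry.tools.wmflabs.org/toolforge-buster-standalone:latest",
                        command=["/bin/sh", "-c", "echo hi"],
                    )
                ],
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="tool-wd-shex-infer", body=job)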

If we reach the point of wanting to strongly support direct API calls from webservices, we'd likely do well to set up a client library. Right now, I think it's reasonable to try it from your end and provide feedback, but I don't think you should expect that API to remain stable for now. The k8s API is, however, stable and versioned.