
Toolforge: beta phase for the new jobs framework
Open, Medium, Public

Description

The initial build out of the new Toolforge jobs framework has been deployed.

We move into the BETA phase, where we expect to:

  • collect feedback from beta testers
  • identify & correct flaws in the design
  • identify & correct general bugs
  • add new or missing features that may be interesting to the community

See live documentation at https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework

How to operate with the framework in this beta stage:

$ ssh dev-buster.toolforge.org
$ become $mytool
$ toolforge-jobs containers
Short name     Docker container image
-------------  ----------------------------------------------------------------------
tf-buster-std  docker-registry.tools.wmflabs.org/toolforge-buster-standalone:latest
tf-golang111   docker-registry.tools.wmflabs.org/toolforge-golang111-sssd-base:latest
tf-jdk11       docker-registry.tools.wmflabs.org/toolforge-jdk11-sssd-base:latest
tf-node10      docker-registry.tools.wmflabs.org/toolforge-node10-sssd-base:latest
tf-php73       docker-registry.tools.wmflabs.org/toolforge-php73-sssd-base:latest
tf-python37    docker-registry.tools.wmflabs.org/toolforge-python37-sssd-base:latest
tf-ruby25      docker-registry.tools.wmflabs.org/toolforge-ruby25-sssd-base:latest
tf-tcl86       docker-registry.tools.wmflabs.org/toolforge-tcl86-sssd-base:latest
wm-buster      docker-registry.tools.wmflabs.org/wikimedia-buster:latest
wm-stretch     docker-registry.tools.wmflabs.org/wikimedia-stretch:latest
[..]

$ # running a normal job:
$ toolforge-jobs run myjob --command ./mycommand.sh --image tf-buster-std

$ # running a normal job and waiting for it to complete:
$ toolforge-jobs run myotherjob --command ./myothercommand.sh --image tf-buster-std --wait

$ # running a continuous job:
$ toolforge-jobs run myalwaysrunningjob --command ./myendlesscommand.sh --image tf-buster-std --continuous

$ # running a scheduled job:
$ toolforge-jobs run myscheduledjob --command ./everyminute.sh --image tf-buster-std --schedule "* * * * *"

$ toolforge-jobs list
Job name:           Command:               Job type:            Container:     Status:
------------------  ---------------------  -------------------  -------------  -----------------------------------------
myscheduledjob      ./everyminute.sh       schedule: * * * * *  tf-buster-std  Last schedule time: 2021-06-30T10:26:00Z
myalwaysrunningjob  ./myendlesscommand.sh  continuous           tf-buster-std  Running
myjob               ./mycommand.sh         normal               tf-buster-std  Completed

$ toolforge-jobs show myscheduledjob
+------------+-----------------------------------------------------------------+
| Job name:  | myscheduledjob                                                  |
+------------+-----------------------------------------------------------------+
| Command:   | ./everyminute.sh                                                |
+------------+-----------------------------------------------------------------+
| Job type:  | schedule: * * * * *                                             |
+------------+-----------------------------------------------------------------+
| Container: | tf-buster-std                                                   |
+------------+-----------------------------------------------------------------+
| Status:    | Last schedule time: 2021-06-30T10:26:00Z                        |
+------------+-----------------------------------------------------------------+
| Hints:     | Last run at 2021-06-30T10:26:08Z. Pod in 'Pending' phase. State |
|            | 'waiting' for reason 'ContainerCreating'.                       |
+------------+-----------------------------------------------------------------+

$ toolforge-jobs delete myscheduledjob
$ toolforge-jobs flush
$ toolforge-jobs list
[.. nothing ..]


Event Timeline


Is there a way to dump/export the toolforge-jobs config so it can be committed to Git, and then a way to import that config back? I tend to commit crontabs for documentation and transparency purposes.

See T286492: toolforge-jobs: load jobs from a file.

Mentioned in SAL (#wikimedia-cloud) [2021-07-15T15:55:46Z] <arturo> added toolforge-jobs-framework-cli_2_all.deb to buster-{tools,toolsbeta} (T285944)

Hello,

One of the things that I need for my normal Toolforge work is a -cwd equivalent -- to run the job in the current directory, to avoid specifying the full path. Can that be done using the new framework?

Also, why doesn't it work under personal accounts? :-)

> One of the things that I need for my normal Toolforge work is a -cwd equivalent -- to run the job in the current directory, to avoid specifying the full path. Can that be done using the new framework?

Not yet. Can you create a subtask for that please?

> Also, why doesn't it work under personal accounts? :-)

Because we don't provision Kubernetes credentials for normal users. You shouldn't be running any non-interactive processes with your personal account anyway (per rule 3).

Field report from trying to use toolforge-jobs for the wd-shex-infer tool:

  • I’m assuming that tools are expected to create jobs via the REST API, rather than using the toolforge-jobs command (which may or may not be available in their container). I eventually managed to figure out how to do this, mostly by reading the server source code (the Wikitech page is currently outdated), but I hope that this will be packaged up in nice, easy-to-use libraries for at least a few of the most popular programming languages on Toolforge (Python, PHP?, Rust?). The HTTP API itself isn’t too bad; the problem is more in the connection / session setup. In my current code I resorted to simply hard-coding various pieces of information (paths to the client certificate and key, API base URL) which the toolforge-jobs command meticulously pieces together from multiple config files (/etc/toolforge-jobs-framework-cli.cfg, ~/.kube/config), and I don’t think every tool should have to implement the same config parsing (see the sketch at the end of this comment).
  • Also, I can only assume that clients will not be required to turn off certificate verification to talk to the API by the time it goes out of beta.
  • The endpoint to create a job responds with the HTTP status code 200 OK and the string “201” in the body. It should probably return the status code 201 Created?
  • The information you can get about a job from the “show” or “list” endpoint doesn’t seem very useful. Most of the JSON fields are parameters that I sent when I created the job (i.e. I already knew those); the remaining ones look like free-form text that might be nice to show in toolforge-jobs output, but isn’t very pleasant to parse (see the parsing sketch after this list):
{"name": "echo-hi", "cmd": "echo hi", "image": "tf-buster-std", "user": "wdshexinfer", "namespace": "tool-wd-shex-infer", "filelog": "True", "status_short": "Running", "status_long": "Last run at 2021-07-23T19:03:04Z. Pod in 'Pending' phase. State 'waiting'. Reason 'ContainerCreating'."}
{"name": "echo-hi", "cmd": "echo hi", "image": "tf-buster-std", "user": "wdshexinfer", "namespace": "tool-wd-shex-infer", "filelog": "True", "status_short": "Completed", "status_long": "Last run at 2021-07-23T19:03:04Z. Pod in 'Succeeded' phase. State 'terminated'. Reason 'Completed'. Started at '2021-07-23T19:03:06Z'. Finished at '2021-07-23T19:03:07Z'. Exit code '0'."}
  • (cont. because ReMarkup doesn’t support multi-paragraph lists like Markdown does) The status_short field looks more useful, but according to the source code, for cronjobs that’s still a formatted string, not just one of several “enum” values.
    • (Side note: "user": "wdshexinfer"? Where did the hyphens go?)
  • On the other hand, if there was more information in that JSON, it wouldn’t necessarily be very useful, because it gets thrown away (garbage collected, cleaned up, T286108) after a pretty short time – one or two minutes, I think. So if my tool only checks the job status when a web request comes in, any completed job is probably long gone by the time the tool tries to get its status. (But this is pretty much like on the Grid – this tool is only interested in “is the job finished or not”, so that’s okay.)
  • Ultimately, I probably can’t use toolforge-jobs for wd-shex-infer because no single image has all the programs needed by the tool. The tool is essentially a wrapper around a giant house of cards combining make, sed, jq, lsof, several Java programs, some Node programs, and possibly other runtimes I’ve forgotten. I managed to make it work on the Grid a few years ago, but it would require some more work to make it work on toolforge-jobs (possibly coordinating several jobs for the different parts), if it’s possible at all. But that’s not a shortcoming of toolforge-jobs, of course.
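
To illustrate the parsing point above, here is a quick sketch (Python; the regexes are guesses at the current free-form format, which could change at any time):

import re

status_long = (
    "Last run at 2021-07-23T19:03:04Z. Pod in 'Succeeded' phase. "
    "State 'terminated'. Reason 'Completed'. Exit code '0'."
)

# scrape individual fields out of the free-form status text
phase = re.search(r"Pod in '([^']+)' phase", status_long)
exit_code = re.search(r"Exit code '(\d+)'", status_long)
print(phase.group(1) if phase else None)               # Succeeded
print(int(exit_code.group(1)) if exit_code else None)  # 0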

I’ve pushed my code to run and show jobs at R2390:af76aea06251: WIP: Run jobs on toolforge-jobs, if anyone’s interested.
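
To make the session-setup point concrete, here is roughly what the hard-coding boils down to (the certificate paths, base URL, and payload keys are assumptions pieced together from the server source and the JSON above, not a documented interface):

import requests

# assumed location of the tool's Kubernetes client certificate and key
CERT = ("/data/project/wd-shex-infer/.toolskube/client.crt",
        "/data/project/wd-shex-infer/.toolskube/client.key")
# placeholder; the real base URL is what toolforge-jobs reads from
# /etc/toolforge-jobs-framework-cli.cfg
API = "https://jobs.example.wmcloud.org/api/v1"

resp = requests.post(
    API + "/run/",
    # payload keys here just mirror the response JSON above; the actual
    # parameter names may differ
    data={"name": "echo-hi", "cmd": "echo hi", "image": "tf-buster-std"},
    cert=CERT,
    verify=False,  # see the certificate verification point above
)
print(resp.status_code, resp.text)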

> I’m assuming that tools are expected to create jobs via the REST API, rather than using the toolforge-jobs command (which may or may not be available in their container).

I think the API is actually considered an implementation detail, and users are expected to use the toolforge-jobs command from a bastion (currently only from the dev-buster.toolforge.org bastion). This assumption actually seems reasonable to me, as this is seen as a replacement for both jsub and crontab, and that functionality is only available from the bastions today for the grid as well. (It is technically possible to use jsub from a process running on the grid, but that is honestly so gross that I try and block it out from my mind.)

> (It is technically possible to use jsub from a process running on the grid, but that is honestly so gross that I try and block it out from my mind.)

Then I’ve completely misunderstood this project, because my assumption was that it was supposed to provide a modern replacement for this functionality. (I also assumed that this was intentionally supported, but I don’t remember if that assumption was ever backed up by something. Current documentation on Wikitech doesn’t mention submitting grid jobs from webservices, as far as I can tell.)

I would suggest the opposite. The reason for the design decision of using an API is to allow multiple frontends in the future. As for whether you want to submit jobs from tool pods, that depends on having k8s credentials available for now, since I don't think the API has implemented using service account credentials. I would think that tickets about things like correcting HTTP status codes during the beta will improve the API whether or not it ends up widely used by end users directly (though I think I know why that status code happens).

Honestly, the ability to authenticate with service account credentials inside k8s isn't a bad feature request, though I don't think it's in the cards in the near future.

As for submitting grid jobs from web services: when I made that impossible, I was quickly asked by users of the functionality to fix it :)

@LucasWerkmeister I'd think it is totally reasonable to assume that the first thing we are concerned with is a viable replacement for cronjobs on the grid and one-off jobs submitted with jsub. If you really want to submit a job from k8s with APIs directly, you might find it easier to simply use an in-language k8s client and submit the job directly to k8s (since most clients automatically can handle the auth-by-config vs. auth-by-serviceaccount-in-pod dance). It requires more definition of the jobs, but once you are already going after the rest API directly, that's what k8s is anyway.

If we reach the point of wanting to strongly support direct API calls from webservices, we'd likely do well to set up a client library. Right now, I think it's reasonable to try it from your end and provide feedback, but you shouldn't expect that API to remain stable. The k8s API is, however, stable and versioned.
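
For example, a minimal sketch of that direct route with the official Python client (the job name, namespace, image, and command are illustrative; adjust them for your tool):

from kubernetes import client, config

# picks up the tool's client certificate and server address from ~/.kube/config
config.load_kube_config()

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="myjob"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="myjob",
                        image="docker-registry.tools.wmflabs.org/toolforge-buster-standalone:latest",
                        command=["./mycommand.sh"],
                    )
                ],
            )
        )
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="tool-mytool", body=job)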

> Field report from trying to use toolforge-jobs for the wd-shex-infer tool:

Honestly @LucasWerkmeister thanks for your detailed report. Really appreciated :-)

Yes, the new API still requires some work to make it more robust before we leave the beta phase.

I don't have a lot to add to the conversation, just +1 to everything @bd808 and @Bstorm said. Yes, the toolforge-jobs API can be used directly, but if you are into using REST APIs then I would suggest you use the k8s one directly.

The new jobs thing provides a nicely abstracted API that then creates problems because it is too abstracted :-P

I should probably take the time to write some docs that help tool maintainers understand how to interact with the service, something like:

| Kind of interaction | What to use        | From where            | Notes                                         | Docs link |
| ------------------- | ------------------ | --------------------- | --------------------------------------------- | --------- |
| human               | toolforge-jobs CLI | bastions              | primary human interface                       | TBD       |
| programmatically    | k8s API            | anywhere in Toolforge | primary code interface                        | TBD       |
| programmatically    | toolforge-jobs API | anywhere in Toolforge | recommended only for Toolforge administrators | TBD       |

Mentioned in SAL (#wikimedia-cloud) [2021-09-06T12:16:00Z] <arturo> creating an empty k8s job to debug naming issues reported in T285944

Change 719102 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-api@main] jobs: correctly parse user name if it contains hypens

https://gerrit.wikimedia.org/r/719102

Change 719113 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-api@main] k8sclient: refresh comment about cronjob versions

https://gerrit.wikimedia.org/r/719113

Change 719102 merged by jenkins-bot:

[cloud/toolforge/jobs-framework-api@main] job: correctly parse user name if it contains hypens

https://gerrit.wikimedia.org/r/719102

Change 719113 merged by jenkins-bot:

[cloud/toolforge/jobs-framework-api@main] k8sclient: refresh comment about cronjob versions

https://gerrit.wikimedia.org/r/719113

With help from Cloud Services Support, I succeeded in getting https://github.com/dpriskorn/ItemSubjector/tree/prepare-batch working.

(screenshot attachment: bild.png)

The commands I ran were:
$ python3 itemsubjector/itemsubjector.py -l Q1148337 --prepare-jobs
$ toolforge-jobs run job5 --image tf-python39 --command "./setup.sh && python3 ~/itemsubjector/itemsubjector.py -r"

where setup.sh contains:

$ cat setup.sh
cd ~/itemsubjector
pip install -r requirements.txt

> (It is technically possible to use jsub from a process running on the grid, but that is honestly so gross that I try and block it out from my mind.)

I think I have a legit usecase for this. I have a continuous process that writes to a log at least once every 15 mins. So to check if it's working alright, I use a cronjob that every 30 mins checks if the log looks as expected. If not, it restarts the job (code – here npm restart invokes jstart).

> I think I have a legit usecase for this. I have a continuous process that writes to a log at least once every 15 mins. So to check if it's working alright, I use a cronjob that every 30 mins checks if the log looks as expected. If not, it restarts the job (code – here npm restart invokes jstart).

This will be covered out of the box with the new jobs framework. Just create a job with --continuous and Kubernetes will make sure there is always a pod running with your code. If it dies for whatever reason, the pod will respawn.

> This will be covered out of the box with the new jobs framework. Just create a job with --continuous and Kubernetes will make sure there is always a pod running with your code. If it dies for whatever reason, the pod will respawn.

Restarting on process failure is good, but a bit different than checking for a hung process that has not yet exited. A nice way to do monitoring on a live job in a Kubernetes context would be a "liveness" probe in the pod's configuration that Kubernetes itself would check and, when it fails, trigger a pod replacement. This is the sort of thing that Kubernetes is pretty good at, but something that we have not made easy to configure yet with our support tooling.
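
To make that concrete, such a probe would look something like this, expressed with the Python Kubernetes client (illustrative only: the jobs framework does not expose probes yet, and the probe command and log path are made up):

from kubernetes import client

liveness_probe = client.V1Probe(
    _exec=client.V1ExecAction(
        # fail the probe if the log has not been touched in the last 30 minutes
        command=["sh", "-c",
                 "find /data/project/mytool/job.log -mmin -30 | grep -q ."],
    ),
    initial_delay_seconds=60,
    period_seconds=300,
)
# this would be attached to the job's container as
# client.V1Container(..., liveness_probe=liveness_probe)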

> Restarting on process failure is good, but a bit different than checking for a hung process that has not yet exited. A nice way to do monitoring on a live job in a Kubernetes context would be a "liveness" probe in the pod's configuration that Kubernetes itself would check and, when it fails, trigger a pod replacement. This is the sort of thing that Kubernetes is pretty good at, but something that we have not made easy to configure yet with our support tooling.

Ok I see. I think adding support for that in the framework should be relatively easy. Just another option users can enable on demand.

Perhaps something we can introduce on a later iteration.

> I should probably take the time to write some docs that help tool maintainers understand how to interact with the service, …

Yes, definitely. I've inherited a moderately big/complex (collection of) community tools that have been essentially unmaintained for half a decade, and among the many needed modernisations is the fact that everywhere I look I'm being told the old job grid is old and I should use k8s, but k8s doesn't support running grid-ish jobs yet. Since I'm in "modernisation mode", migrating to the New Way™ is high up my list, and the docs aren't exactly helping me just now.

Incidentally, it was sheer random chance that I happened across this task. If we're serious that this is now at beta maturity level and k8s-backed job running is roughly on par with the Job Grid for reliability, Wikitech's Toolforge pages should be hawking the beta prominently anywhere the job grid is mentioned: "Please start testing. This will be the one true way once it's done."

> Ultimately, I probably can’t use toolforge-jobs for wd-shex-infer because no single image has all the programs needed by the tool.

Some general thoughts on this, based partially on running into the same kind of issue for a tool:

Image design tends to automatically lean towards being very DevOps-y, where you want the images to be lean, optimised, and specialised for a pre-defined role or workload, of which the assumption is that there will be a finite and, relatively speaking, small number. That approach is somewhat mismatched with a general-purpose platform like Toolforge. For that I would rather expect a rich base platform with various more abstract axes of variation: primarily "long term support", "stable", "bleeding edge"; possibly "long running computationally expensive" vs. "low latency" (i.e. an "OLAP" vs. "OLTP" type distinction); or maybe even "needs Internet access", "needs wiki access", and "needs no access" (call it "security posture" or "capability" or whatever). Etc.

A more purely DevOps-y approach will tend to encourage lots and lots of locally built deps living in tool account home dirs, with duplication between tools, no real security patching, and far less robustness over time (only the maintainer(s) of the tool can patch any deps). I would assert that from a sustainability and resilience perspective, the best practice should be to use as many platform-provided facilities as possible; which in turn means the platform should prioritise richness over leanness.

If your particular need can be met with a very lean image, so much the better. But I think that for a lot of tools they'd really want to choose among "toolforge-compat", "toolforge-stable", and "toolforge-edge"; rather than picking between node, python, and php images ("What if my tool needs all three?").

I haven't checked what's actually included in, say, the buster image listed in the task description, so it may in fact be a completely reasonable tradeoff, but until I get a chance to test them out I'll be worried about whether I'll end up having to maintain a metric boatload of locally-built deps for my tool if I migrate it from the old job grid.

> I haven't checked what's actually included in, say, the buster image listed in the task description, so it may in fact be a completely reasonable tradeoff, but until I get a chance to test them out I'll be worried about whether I'll end up having to maintain a metric boatload of locally-built deps for my tool if I migrate it from the old job grid.

The current containers are "lean" and will likely remain that way. Many, many years ago we dreamed of a container that was functionally the full supported runtime from grid engine, but gave that idea up when we hit the (at that time) 2 GiB container size limit with hundreds of packages left to add. Modern container build and run tools no longer have the 2 GiB size limit, but a 4+ GiB image is really not reasonable for widespread use in our cluster.

The future fix for per-tool image customization is likely to be an outcome of T267374: Set up a Toolforge buildpack CI pipeline as a POC and related work. Until we have something like buildpacks or another per-tool custom container solution, we will continue to maintain the grid engine environment to provide support for more "complex" jobs and webservices which need specialized libraries and other binaries that would be container bloat for the majority of tools.

So yes, you will hear WMCS admins and others say "eww, can you use Kubernetes instead" when asked about a new tool adopting grid engine, but in reality the grid will remain for the currently foreseeable future. We will have loud announcements (and possibly dancing in the streets by SREs) when we reach a point of broad feature parity and can announce a grid engine removal date, but that date will also very likely include 6-12 months for everyone to migrate their remaining grid engine workloads. Nobody really wants to be responsible for operating a grid engine deployment in 2021, but it allows y'all, the Wikimedia technical community, to do important work for the projects, and that is something we do extra work to preserve.

Switching to Kubernetes jobs is a good idea for the tools that fit its constraints today: a single language runtime and limited compiled library needs. If your tool is a PHP or Python script that does some bot work on the wikis, it will likely work for you. If your tool needs specialized image processing libraries, large numbers of fonts, esoteric runtime languages, etc., then you are probably not a good fit for migrating to Kubernetes until we have some more core platform features in place.

> [lots of sensible things]

That's a nice and clear summation. Could we stash that on Wikitech somewhere visible? Regardless of whether one is particularly happy about all details of it, it is a clearly articulated policy / strategy (take your pick) to use as a foundation for decisions. I can plan based on that (and grumble at its inconveniences elsewhere).

Or put another way, that sort of "here's where we are, here's where we're going" is somewhat a necessary prerequisite for the "Here's how you adjust the mem size in jsub" type docs; and every time I've gone searching for the former I've only found the latter.

Change 762811 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] toolforge: automated-tests: introduce some TJF checks

https://gerrit.wikimedia.org/r/762811

Change 762811 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] toolforge: automated-tests: introduce some TJF checks

https://gerrit.wikimedia.org/r/762811