
Toolforge: beta phase for the new jobs framework
Closed, ResolvedPublic

Authored By: aborrero
Jul 1 2021, 12:38 PM

Description

The initial build out of the new Toolforge jobs framework has been deployed.

We move into the BETA phase, where we expect to:

  • collect feedback from beta testers
  • identify & correct flaws in the design
  • identify & correct general bugs
  • add new or missing features that may be interesting to the community

See live documentation at https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework

How to use the framework in this beta stage:

$ ssh dev-buster.toolforge.org
$ become $mytool
$ toolforge-jobs containers
Short name     Docker container image
-------------  ----------------------------------------------------------------------
tf-buster-std  docker-registry.tools.wmflabs.org/toolforge-buster-standalone:latest
tf-golang111   docker-registry.tools.wmflabs.org/toolforge-golang111-sssd-base:latest
tf-jdk11       docker-registry.tools.wmflabs.org/toolforge-jdk11-sssd-base:latest
tf-node10      docker-registry.tools.wmflabs.org/toolforge-node10-sssd-base:latest
tf-php73       docker-registry.tools.wmflabs.org/toolforge-php73-sssd-base:latest
tf-python37    docker-registry.tools.wmflabs.org/toolforge-python37-sssd-base:latest
tf-ruby25      docker-registry.tools.wmflabs.org/toolforge-ruby25-sssd-base:latest
tf-tcl86       docker-registry.tools.wmflabs.org/toolforge-tcl86-sssd-base:latest
wm-buster      docker-registry.tools.wmflabs.org/wikimedia-buster:latest
wm-stretch     docker-registry.tools.wmflabs.org/wikimedia-stretch:latest
[..]

$ # running a normal job:
$ toolforge-jobs run myjob --command ./mycommand.sh --image tf-buster-std

$ # running a normal job and waiting for it to complete:
$ toolforge-jobs run myotherjob --command ./myothercommand.sh --image tf-buster-std --wait

$ # running a continuous job:
$ toolforge-jobs run myalwaysrunningjob --command ./myendlesscommand.sh --image tf-buster-std --continuous

$ # running a scheduled job:
$ toolforge-jobs run myscheduledjob --command ./everyminute.sh --image tf-buster-std --schedule "* * * * *"
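The --schedule argument takes a standard five-field cron expression (minute, hour, day-of-month, month, day-of-week). A quick sketch of how the fields read (the example schedules are illustrative):

```shell
# Five cron fields: minute hour day-of-month month day-of-week
# "* * * * *"  -> every minute
# "1 * * * *"  -> at minute 1 of every hour
# "0 3 * * 1"  -> at 03:00 every Monday
schedule="* * * * *"
set -f                  # disable globbing so "*" stays literal
set -- $schedule        # split the expression on whitespace
echo "fields: $#"       # a valid schedule always has exactly 5 fields
```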

$ toolforge-jobs list
Job name:           Command:               Job type:            Container:     Status:
------------------  ---------------------  -------------------  -------------  ----------------------------------------
myscheduledjob      ./everyminute.sh       schedule: * * * * *  tf-buster-std  Last schedule time: 2021-06-30T10:26:00Z
myalwaysrunningjob  ./myendlesscommand.sh  continuous           tf-buster-std  Running
myjob               ./mycommand.sh         normal               tf-buster-std  Completed

$ toolforge-jobs show myscheduledjob
+------------+-----------------------------------------------------------------+
| Job name:  | myscheduledjob                                                  |
+------------+-----------------------------------------------------------------+
| Command:   | ./everyminute.sh                                                |
+------------+-----------------------------------------------------------------+
| Job type:  | schedule: * * * * *                                             |
+------------+-----------------------------------------------------------------+
| Container: | tf-buster-std                                                   |
+------------+-----------------------------------------------------------------+
| Status:    | Last schedule time: 2021-06-30T10:26:00Z                        |
+------------+-----------------------------------------------------------------+
| Hints:     | Last run at 2021-06-30T10:26:08Z. Pod in 'Pending' phase. State |
|            | 'waiting' for reason 'ContainerCreating'.                       |
+------------+-----------------------------------------------------------------+

$ toolforge-jobs delete myscheduledjob
$ toolforge-jobs flush
$ toolforge-jobs list
[.. nothing ..]

Event Timeline


With help from Cloud Services Support, I succeeded in getting https://github.com/dpriskorn/ItemSubjector/tree/prepare-batch working.

(screenshot attached: bild.png)

The commands I ran were:
$ python3 itemsubjector/itemsubjector.py -l Q1148337 --prepare-jobs
$ toolforge-jobs run job5 --image tf-python39 --command "./setup.sh && python3 ~/itemsubjector/itemsubjector.py -r"

where setup.sh contains:

$ cat setup.sh
cd ~/itemsubjector
pip install -r requirements.txt
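Since --command is passed as a single string and executed by a shell inside the container, chaining steps with && (as above) works, but a small wrapper script keeps the invocation short. A minimal sketch with the real steps replaced by echo stand-ins (run.sh and its contents are illustrative, not from this task):

```shell
#!/bin/bash
# run.sh - hypothetical wrapper, so the job could be launched as e.g.:
#   toolforge-jobs run job5 --image tf-python39 --command ./run.sh
set -e                        # abort at the first failing step
echo "step 1: install deps"   # stand-in for: pip install -r requirements.txt
echo "step 2: run bot"        # stand-in for: python3 ~/itemsubjector/itemsubjector.py -r
```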

(It is technically possible to use jsub from a process running on the grid, but that is honestly so gross that I try and block it out from my mind.)

I think I have a legit usecase for this. I have a continuous process that writes to a log at least once every 15 mins. So to check if it's working alright, I use a cronjob that every 30 mins checks if the log looks as expected. If not, it restarts the job (code – here npm restart invokes jstart).
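The freshness check described above can be sketched with find -mmin; the function name, threshold, and demo file are illustrative, not taken from the linked code:

```shell
# Return success if the file was modified within the last N minutes.
log_is_fresh() {   # $1 = log path, $2 = max age in minutes
  [ -n "$(find "$1" -mmin "-$2" 2>/dev/null)" ]
}

# Demo: a file created just now counts as fresh.
demo_log=$(mktemp)
if log_is_fresh "$demo_log" 15; then
  echo "fresh"
else
  echo "stale"     # a real watchdog would restart the job here (npm restart -> jstart)
fi
rm -f "$demo_log"
```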


This will be covered out of the box by the new jobs framework. Just create a job with --continuous and Kubernetes will make sure there is always a pod running your code. If it dies for whatever reason, the pod will respawn.
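Under --continuous, Kubernetes keeps one pod running the command and respawns it if it exits, so the command itself should be a long-lived process. A minimal sketch of such a process (bounded to three iterations here only so it terminates; a real continuous job would loop forever):

```shell
# myendlesscommand.sh - sketch of a long-lived worker loop.
# If the process dies, the --continuous job's pod is respawned by Kubernetes.
i=0
while [ "$i" -lt 3 ]; do      # a real job would use: while true; do ... done
  echo "tick $i"
  i=$((i + 1))
done
```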


Restarting on process failure is good, but a bit different from checking for a hung process that has not yet exited. A nice way to monitor a live job in a Kubernetes context would be a "liveness" probe in the pod's configuration, which Kubernetes itself would check, triggering a pod replacement when it fails. This is the sort of thing that Kubernetes is pretty good at, but something we have not yet made easy to configure with our support tooling.


Ok I see. I think adding support for that in the framework should be relatively easy. Just another option users can enable on demand.

Perhaps something we can introduce on a later iteration.
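For reference, this is roughly the kind of pod-spec fragment Kubernetes understands for liveness checking. None of it is exposed by toolforge-jobs today, and the command, path, and timings below are purely illustrative:

```shell
# Hypothetical only: a sketch of what the framework could generate if it
# grew a liveness-probe option. Printed here as plain YAML via a heredoc.
cat <<'EOF'
livenessProbe:
  exec:
    command: ["sh", "-c", "find /data/project/mytool/bot.log -mmin -15 | grep -q ."]
  periodSeconds: 300
  failureThreshold: 3
EOF
```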

I should probably take the time to write some docs that helps tool maintainers understand how to interact with the service, …

Yes, definitely. I've inherited a moderately big/complex (collection of) community tools that have been essentially unmaintained for half a decade, and among the many needed modernisations is the fact that everywhere I look I'm told the old job grid is obsolete and I should use k8s, but k8s doesn't support running grid-ish jobs yet. Since I'm in "modernisation mode", migrating to the New Way™ is high up my list, and the docs aren't exactly helping me just now.

Incidentally, it was sheer random chance that I happened across this task. If we're serious that this is now at beta maturity and k8s-backed job running is roughly on par with the job grid for reliability, Wikitech's Toolforge pages should be hawking the beta prominently anywhere the job grid is mentioned: "Please start testing. This will be the one true way once it's done."

  • Ultimately, I probably can’t use toolforge-jobs for wd-shex-infer because no single image has all the programs needed by the tool.

Some general thoughts on this, based partially on running into the same kind of issue for a tool:

Image design tends to automatically lean towards being very DevOps-y, where you want the images to be lean, optimised, and specialised for a pre-defined role or workload, the assumption being that there will be a finite and, relatively speaking, small number of them. That approach is somewhat mismatched with a general-purpose platform like Toolforge. There I would rather expect a rich base platform with various more abstract axes of variation. Primarily "long term support", "stable", "bleeding edge"; and possibly "long running computationally expensive" vs. "low latency" (i.e. an "OLAP" vs. "OLTP" type distinction). Or maybe even "Needs Internet access", "Needs wiki access", and "Needs no access" (call it "security posture" or "capability" or whatever). Etc.

A more purely DevOps-y approach will tend to encourage lots and lots of locally built deps living in tool account home dirs, with duplication between tools, no real security patching, and far less robustness over time (only the maintainer(s) of the tool can patch any deps). I would assert that from a sustainability and resilience perspective, the best practice should be to use as many platform-provided facilities as possible; which in turn means the platform should prioritise richness over leanness.

If your particular need can be met with a very lean image, so much the better. But I think that for a lot of tools they'd really want to choose among "toolforge-compat", "toolforge-stable", and "toolforge-edge"; rather than picking between node, python, and php images ("What if my tool needs all three?").

I haven't checked what's actually included in, say, the buster image listed in the task description, so it may in fact be a completely reasonable tradeoff, but until I get a chance to test them out I'll be worried about whether I'll end up having to maintain a metric boatload of locally-built deps for my tool if I migrate it from the old job grid.


The current containers are "lean" and will likely remain that way. Many, many years ago we dreamed of a container that was functionally the full supported runtime from grid engine, but gave that idea up when we hit the (at that time) 2 GiB container size limit with hundreds of packages left to add. Modern container build and run tools no longer have the 2 GiB size limit, but a 4+ GiB image is really not reasonable for widespread use in our cluster.

The future fix for per-tool image customization is likely to be an outcome of T267374: [tbs.beta] Create a toolforge build service beta release and related work. Until we have something like buildpacks or another per-tool custom container solution, we will continue to maintain the grid engine environment to support more "complex" jobs and webservices which need specialized libraries and other binaries that would be container bloat for the majority of tools.

So yes, you will hear WMCS admins and others say "eww, can you use Kubernetes instead" when asked about a new tool adopting grid engine, but in reality the grid will remain for the currently foreseeable future. We will have loud announcements (and possibly dancing in the streets by SREs) when we reach a point of broad feature parity and can announce a grid engine removal date, but that date will also very likely include 6-12 months for everyone to migrate their remaining grid engine workloads. Nobody really wants to be responsible for operating a grid engine deployment in 2021, but it allows y'all, the Wikimedia technical community, to do important work for the projects, and that is something we do extra work to preserve.

Switching to Kubernetes jobs is a good idea for the tools that fit their constraints today: single language runtime and limited compiled libraries needed. If your tool is a php or python script that does some bot work on the wikis it will likely work for you. If your tool needs specialized image processing libraries, large amounts of fonts, esoteric runtime languages, etc then you are probably not a good fit for migrating to Kubernetes until we have some more core platform features in place.

[lots of sensible things]

That's a nice and clear summation. Could we stash that on Wikitech somewhere visible? Regardless of whether one is particularly happy about all details of it, it is a clearly articulated policy / strategy (take your pick) to use as a foundation for decisions. I can plan based on that (and grumble at its inconveniences elsewhere).

Or put another way, that sort of "here's where we are, here's where we're going" is somewhat a necessary prerequisite for the "Here's how you adjust the mem size in jsub" type docs; and every time I've gone searching for the former I've only found the latter.

Change 762811 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] toolforge: automated-tests: introduce some TJF checks

https://gerrit.wikimedia.org/r/762811

Change 762811 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] toolforge: automated-tests: introduce some TJF checks

https://gerrit.wikimedia.org/r/762811