
Toolforge: beta phase for the new jobs framework
Closed, ResolvedPublic

Authored By: aborrero
Jul 1 2021, 12:38 PM

Description

The initial build out of the new Toolforge jobs framework has been deployed.

We move into the BETA phase, where we expect to:

  • collect feedback from beta testers
  • identify & correct flaws in the design
  • identify & correct general bugs
  • add new or missing features that may be interesting to the community

See live documentation at https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework

How to use the framework in this beta stage:

$ ssh dev-buster.toolforge.org
$ become $mytool
$ toolforge-jobs containers
Short name     Docker container image
-------------  ----------------------------------------------------------------------
tf-buster-std  docker-registry.tools.wmflabs.org/toolforge-buster-standalone:latest
tf-golang111   docker-registry.tools.wmflabs.org/toolforge-golang111-sssd-base:latest
tf-jdk11       docker-registry.tools.wmflabs.org/toolforge-jdk11-sssd-base:latest
tf-node10      docker-registry.tools.wmflabs.org/toolforge-node10-sssd-base:latest
tf-php73       docker-registry.tools.wmflabs.org/toolforge-php73-sssd-base:latest
tf-python37    docker-registry.tools.wmflabs.org/toolforge-python37-sssd-base:latest
tf-ruby25      docker-registry.tools.wmflabs.org/toolforge-ruby25-sssd-base:latest
tf-tcl86       docker-registry.tools.wmflabs.org/toolforge-tcl86-sssd-base:latest
wm-buster      docker-registry.tools.wmflabs.org/wikimedia-buster:latest
wm-stretch     docker-registry.tools.wmflabs.org/wikimedia-stretch:latest
[..]

$ # running a normal job:
$ toolforge-jobs run myjob --command ./mycommand.sh --image tf-buster-std

$ # running a normal job and waiting for it to complete:
$ toolforge-jobs run myotherjob --command ./myothercommand.sh --image tf-buster-std --wait

$ # running a continuous job:
$ toolforge-jobs run myalwaysrunningjob --command ./myendlesscommand.sh --image tf-buster-std --continuous

$ # running a scheduled job:
$ toolforge-jobs run myscheduledjob --command ./everyminute.sh --image tf-buster-std --schedule "* * * * *"
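The --schedule argument takes a standard five-field cron expression (minute, hour, day-of-month, month, day-of-week). A quick sketch of how the fields read (the example schedules are illustrative):

```shell
# Five cron fields: minute hour day-of-month month day-of-week
# "* * * * *"  -> every minute
# "1 * * * *"  -> at minute 1 of every hour
# "0 3 * * 1"  -> at 03:00 every Monday
schedule="* * * * *"
set -f                  # disable globbing so "*" stays literal
set -- $schedule        # split the expression on whitespace
echo "fields: $#"       # a valid schedule always has exactly 5 fields
```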

$ toolforge-jobs list
Job name:           Command:               Job type:            Container:     Status:
------------------  ---------------------  -------------------  -------------  ----------------------------------------
myscheduledjob      ./everyminute.sh       schedule: * * * * *  tf-buster-std  Last schedule time: 2021-06-30T10:26:00Z
myalwaysrunningjob  ./myendlesscommand.sh  continuous           tf-buster-std  Running
myjob               ./mycommand.sh         normal               tf-buster-std  Completed

$ toolforge-jobs show myscheduledjob
+------------+-----------------------------------------------------------------+
| Job name:  | myscheduledjob                                                  |
+------------+-----------------------------------------------------------------+
| Command:   | ./everyminute.sh                                                |
+------------+-----------------------------------------------------------------+
| Job type:  | schedule: * * * * *                                             |
+------------+-----------------------------------------------------------------+
| Container: | tf-buster-std                                                   |
+------------+-----------------------------------------------------------------+
| Status:    | Last schedule time: 2021-06-30T10:26:00Z                        |
+------------+-----------------------------------------------------------------+
| Hints:     | Last run at 2021-06-30T10:26:08Z. Pod in 'Pending' phase. State |
|            | 'waiting' for reason 'ContainerCreating'.                       |
+------------+-----------------------------------------------------------------+

$ toolforge-jobs delete myscheduledjob
$ toolforge-jobs flush
$ toolforge-jobs list
[.. nothing ..]

Event Timeline


With help from Cloud Services Support, I succeeded in getting https://github.com/dpriskorn/ItemSubjector/tree/prepare-batch working.

(screenshot attached: bild.png)

The commands I ran were:
$ python3 itemsubjector/itemsubjector.py -l Q1148337 --prepare-jobs
$ toolforge-jobs run job5 --image tf-python39 --command "./setup.sh && python3 ~/itemsubjector/itemsubjector.py -r"

where setup.sh contains:

$ cat setup.sh
cd ~/itemsubjector
pip install -r requirements.txt
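Since --command is passed as a single string and executed by a shell inside the container, chaining steps with && (as above) works, but a small wrapper script keeps the invocation short. A minimal sketch with the real steps replaced by echo stand-ins (run.sh and its contents are illustrative, not from this task):

```shell
#!/bin/bash
# run.sh - hypothetical wrapper, so the job could be launched as e.g.:
#   toolforge-jobs run job5 --image tf-python39 --command ./run.sh
set -e                        # abort at the first failing step
echo "step 1: install deps"   # stand-in for: pip install -r requirements.txt
echo "step 2: run bot"        # stand-in for: python3 ~/itemsubjector/itemsubjector.py -r
```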

(It is technically possible to use jsub from a process running on the grid, but that is honestly so gross that I try and block it out from my mind.)

I think I have a legit usecase for this. I have a continuous process that writes to a log at least once every 15 mins. So to check if it's working alright, I use a cronjob that every 30 mins checks if the log looks as expected. If not, it restarts the job (code – here npm restart invokes jstart).
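The freshness check described above can be sketched with find -mmin; the function name, threshold, and demo file are illustrative, not taken from the linked code:

```shell
# Return success if the file was modified within the last N minutes.
log_is_fresh() {   # $1 = log path, $2 = max age in minutes
  [ -n "$(find "$1" -mmin "-$2" 2>/dev/null)" ]
}

# Demo: a file created just now counts as fresh.
demo_log=$(mktemp)
if log_is_fresh "$demo_log" 15; then
  echo "fresh"
else
  echo "stale"     # a real watchdog would restart the job here (npm restart -> jstart)
fi
rm -f "$demo_log"
```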


This will be covered out of the box by the new jobs framework. Just create a job with --continuous and Kubernetes will make sure there is always a pod running your code. If it dies for whatever reason, the pod will respawn.
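Under --continuous, Kubernetes keeps one pod running the command and respawns it if it exits, so the command itself should be a long-lived process. A minimal sketch of such a process (bounded to three iterations here only so it terminates; a real continuous job would loop forever):

```shell
# myendlesscommand.sh - sketch of a long-lived worker loop.
# If the process dies, the --continuous job's pod is respawned by Kubernetes.
i=0
while [ "$i" -lt 3 ]; do      # a real job would use: while true; do ... done
  echo "tick $i"
  i=$((i + 1))
done
```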


Restarting on process failure is good, but a bit different from checking for a hung process that has not yet exited. A nice way to monitor a live job in a Kubernetes context would be a "liveness" probe in the pod's configuration, which Kubernetes itself would check, triggering a pod replacement when it fails. This is the sort of thing that Kubernetes is pretty good at, but something we have not yet made easy to configure with our support tooling.


Ok I see. I think adding support for that in the framework should be relatively easy. Just another option users can enable on demand.

Perhaps something we can introduce on a later iteration.
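For reference, this is roughly the kind of pod-spec fragment Kubernetes understands for liveness checking. None of it is exposed by toolforge-jobs today, and the command, path, and timings below are purely illustrative:

```shell
# Hypothetical only: a sketch of what the framework could generate if it
# grew a liveness-probe option. Printed here as plain YAML via a heredoc.
cat <<'EOF'
livenessProbe:
  exec:
    command: ["sh", "-c", "find /data/project/mytool/bot.log -mmin -15 | grep -q ."]
  periodSeconds: 300
  failureThreshold: 3
EOF
```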

I should probably take the time to write some docs that helps tool maintainers understand how to interact with the service, …

Yes, definitely. I've inherited a moderately big/complex (collection of) community tools that have been essentially unmaintained for half a decade, and among the many needed modernisations is the fact that everywhere I look I'm told the old job grid is obsolete and I should use k8s, but k8s doesn't support running grid-ish jobs yet. Since I'm in "modernisation mode", migrating to the New Way™ is high up my list, and the docs aren't exactly helping me just now.

Incidentally, it was sheer random chance that I happened across this task. If we're serious that this is now at beta maturity and k8s-backed job running is roughly on par with the job grid for reliability, Wikitech's Toolforge pages should be hawking the beta prominently anywhere the job grid is mentioned: "Please start testing. This will be the one true way once it's done."

  • Ultimately, I probably can’t use toolforge-jobs for wd-shex-infer because no single image has all the programs needed by the tool.

Some general thoughts on this, based partially on running into the same kind of issue for a tool:

Image design tends to automatically lean towards being very DevOps-y, where you want the images to be lean, optimised, and specialised for a pre-defined role or workload, the assumption being that there will be a finite and, relatively speaking, small number of them. That approach is somewhat mismatched with a general-purpose platform like Toolforge. There I would rather expect a rich base platform with various more abstract axes of variation. Primarily "long term support", "stable", "bleeding edge"; and possibly "long running computationally expensive" vs. "low latency" (i.e. an "OLAP" vs. "OLTP" type distinction). Or maybe even "Needs Internet access", "Needs wiki access", and "Needs no access" (call it "security posture" or "capability" or whatever). Etc.

A more purely DevOps-y approach will tend to encourage lots and lots of locally built deps living in tool account home dirs, with duplication between tools, no real security patching, and far less robustness over time (only the maintainer(s) of the tool can patch any deps). I would assert that from a sustainability and resilience perspective, the best practice should be to use as many platform-provided facilities as possible; which in turn means the platform should prioritise richness over leanness.

If your particular need can be met with a very lean image, so much the better. But I think that for a lot of tools they'd really want to choose among "toolforge-compat", "toolforge-stable", and "toolforge-edge"; rather than picking between node, python, and php images ("What if my tool needs all three?").

I haven't checked what's actually included in, say, the buster image listed in the task description, so it may in fact be a completely reasonable tradeoff, but until I get a chance to test them out I'll be worried about whether I'll end up having to maintain a metric boatload of locally-built deps for my tool if I migrate it from the old job grid.


The current containers are "lean" and will likely remain that way. Many, many years ago we dreamed of a container that was functionally the full supported runtime from grid engine, but gave that idea up when we hit the (at that time) 2 GiB container size limit with hundreds of packages left to add. Modern container build and run tools no longer have the 2 GiB size limit, but a 4+ GiB image is really not reasonable for widespread use in our cluster.

The future fix for per-tool image customization is likely to be an outcome of T267374: [tbs.beta] Create a toolforge build service beta release and related work. Until we have something like buildpacks or another per-tool custom container solution, we will continue to maintain the grid engine environment to support more "complex" jobs and webservices which need specialized libraries and other binaries that would be container bloat for the majority of tools.

So yes, you will hear WMCS admins and others say "eww, can you use Kubernetes instead" when asked about a new tool adopting grid engine, but in reality the grid will remain for the currently foreseeable future. We will have loud announcements (and possibly dancing in the streets by SREs) when we reach a point of broad feature parity and can announce a grid engine removal date, but that date will also very likely include 6-12 months for everyone to migrate their remaining grid engine workloads. Nobody really wants to be responsible for operating a grid engine deployment in 2021, but it allows y'all, the Wikimedia technical community, to do important work for the projects, and that is something we do extra work to preserve.

Switching to Kubernetes jobs is a good idea for the tools that fit their constraints today: single language runtime and limited compiled libraries needed. If your tool is a php or python script that does some bot work on the wikis it will likely work for you. If your tool needs specialized image processing libraries, large amounts of fonts, esoteric runtime languages, etc then you are probably not a good fit for migrating to Kubernetes until we have some more core platform features in place.

[lots of sensible things]

That's a nice and clear summation. Could we stash that on Wikitech somewhere visible? Regardless of whether one is particularly happy about all details of it, it is a clearly articulated policy / strategy (take your pick) to use as a foundation for decisions. I can plan based on that (and grumble at its inconveniences elsewhere).

Or put another way, that sort of "here's where we are, here's where we're going" is somewhat a necessary prerequisite for the "Here's how you adjust the mem size in jsub" type docs; and every time I've gone searching for the former I've only found the latter.

Change 762811 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] toolforge: automated-tests: introduce some TJF checks

https://gerrit.wikimedia.org/r/762811

Change 762811 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] toolforge: automated-tests: introduce some TJF checks

https://gerrit.wikimedia.org/r/762811