
Migrate wd-shex-infer from Toolforge GridEngine to Toolforge Kubernetes
Closed, ResolvedPublic

Description

Kindly migrate your tool (https://grid-deprecation.toolforge.org/t/wd-shex-infer) from Toolforge GridEngine to Toolforge Kubernetes.

Toolforge GridEngine is getting deprecated.
See: https://techblog.wikimedia.org/2022/03/14/toolforge-and-grid-engine/

Please note that a volunteer may perform this migration if this has not been done after some time.
If you have already migrated this tool, kindly mark this as resolved.

If you would rather shut down this tool, kindly do so and mark this as resolved.

Useful Resources:
Migrating Jobs from GridEngine to Kubernetes
https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework#Grid_Engine_migration
Migrating Web Services from GridEngine to Kubernetes
https://wikitech.wikimedia.org/wiki/News/Toolforge_Stretch_deprecation#Move_a_grid_engine_webservice
Python
https://wikitech.wikimedia.org/wiki/News/Toolforge_Stretch_deprecation#Rebuild_virtualenv_for_python_users

Details

Title: Migrate from Grid Engine to Kubernetes
Reference: toolforge-repos/wd-shex-infer!1
Author: lucaswerkmeister
Source Branch: k8s
Dest Branch: main


Event Timeline


My apologies if this ticket comes as a surprise to you. In order to ensure WMCS can provide a stable, secure and supported platform, it’s important we migrate away from GridEngine. I want to assure you that while it is WMCS’s intention to shut down GridEngine as outlined in the blog post https://techblog.wikimedia.org/2022/03/14/toolforge-and-grid-engine/, a shutdown date for GridEngine has not yet been set. The goal of the migration is to move as many tools as possible onto Kubernetes and ensure as smooth a transition as possible for everyone. Once the majority of tools have migrated, discussion of a shutdown date will be more appropriate. See T314664: [infra] Decommission the Grid Engine infrastructure.

As noted in https://techblog.wikimedia.org/2022/03/16/toolforge-gridengine-debian-10-buster-migration/ some use cases are already supported by kubernetes and should be migrated. If your tool can migrate, please do plan a migration. Reach out if you need help or find you are blocked by missing features. Most of all, WMCS is here to support you.

However, it’s possible your tool needs a mixed runtime environment or some other features that aren't yet present in https://techblog.wikimedia.org/2022/03/18/toolforge-jobs-framework/. We’d love to hear of this or any other blocking issues so we can work with you once a migration path is ready. Thanks for your hard work as volunteers and help in this migration!

LucasWerkmeister changed the task status from Open to Stalled. Oct 7 2022, 8:10 AM

I tried migrating this tool to toolforge-jobs over a year ago (see T285944#7233602), and found that it wasn’t possible: I need a container image with several language runtimes at once. Once this is possible (probably using buildpacks), I’ll happily try to migrate the tool – but as I haven’t heard any recent news about that, I frankly do not appreciate being spammed by this task that I can’t do anything about yet. (I use the term “spammed” because, while the number of tasks I’ve been subscribed to myself is manageable, I see you’ve created a huge amount of tasks in total.)

LucasWerkmeister changed the task status from Stalled to Open. Dec 6 2023, 7:34 PM

I suppose this is unstalled now, or at least the cloud maintainers think so (given that there’s now a hard deadline for this task)?

@LucasWerkmeister I would encourage you to try using apt to fulfill missing dependencies: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Build_Service#Installing_Apt_packages. If you instead are looking for multi-language, see https://wikitech.wikimedia.org/wiki/Help:Toolforge/Build_Service#Using_Node.js_in_addition_to_another_language.

Do let us know if neither of these approaches work for you. Thanks!

Okay, I think the general plan is:

  • Build an image that contains all the needed software. Java, Node, Python 3, whatever else I forgot.
  • Run the webservice in that image.
  • Have the webservice launch toolforge-jobs jobs (or directly k8s jobs?) using that same image.

I’ll have to check whether the image build process allows build scripts in addition to apt package lists (to build the Java software). If not, I guess I can just build the Java software elsewhere once and then download and extract it in each job.

  • Build an image that contains all the needed software. Java, Node, Python 3, whatever else I forgot.

Well, I tried to do this (wd-shex-infer k8s branch), but building the image fails:

[step-export] 2023-12-17T15:48:07.157830449Z Setting default process type 'web'                                                                                                                                                                                                                       
[step-export] 2023-12-17T15:48:07.158420696Z Saving tools-harbor.wmcloud.org/tool-lucaswerkmeister-test/tool-lucaswerkmeister-test:latest...                                                                                                                                                          
[step-export] 2023-12-17T15:48:15.496301411Z *** Images (sha256:40aa16faef03b53007f6c8d727beb2fe86625d654632799062afc3519d304144):                                                                                                                                                                    
[step-export] 2023-12-17T15:48:15.496379093Z       tools-harbor.wmcloud.org/tool-lucaswerkmeister-test/tool-lucaswerkmeister-test:latest - PATCH https://tools-harbor.wmcloud.org/v2/tool-lucaswerkmeister-test/tool-lucaswerkmeister-test/blobs/uploads/3432fdd6-a17e-40b0-ab35-90daeb037862?_state=REDACTED: unexpected status code 413 Request Entity Too Large: <html>
[step-export] 2023-12-17T15:48:15.496397506Z <head><title>413 Request Entity Too Large</title></head>
[step-export] 2023-12-17T15:48:15.496410030Z <body>
[step-export] 2023-12-17T15:48:15.496420575Z <center><h1>413 Request Entity Too Large</h1></center>
[step-export] 2023-12-17T15:48:15.496432807Z <hr><center>nginx/1.18.0</center>
[step-export] 2023-12-17T15:48:15.496443700Z </body>
[step-export] 2023-12-17T15:48:15.496453732Z </html>
[step-export] 2023-12-17T15:48:15.496463386Z 
[step-export] 2023-12-17T15:48:15.567426025Z ERROR: failed to export: failed to write image to the following tags: [tools-harbor.wmcloud.org/tool-lucaswerkmeister-test/tool-lucaswerkmeister-test:latest: PATCH https://tools-harbor.wmcloud.org/v2/tool-lucaswerkmeister-test/tool-lucaswerkmeister-test/blobs/uploads/3432fdd6-a17e-40b0-ab35-90daeb037862?_state=REDACTED: unexpected status code 413 Request Entity Too Large: <html>
[step-export] 2023-12-17T15:48:15.567491279Z <head><title>413 Request Entity Too Large</title></head>
[step-export] 2023-12-17T15:48:15.567505566Z <body>
[step-export] 2023-12-17T15:48:15.567538225Z <center><h1>413 Request Entity Too Large</h1></center>
[step-export] 2023-12-17T15:48:15.567549078Z <hr><center>nginx/1.18.0</center>
[step-export] 2023-12-17T15:48:15.567559686Z </body>
[step-export] 2023-12-17T15:48:15.567568449Z </html>
[step-export] 2023-12-17T15:48:15.567577351Z ]

Is there a size limit for buildpack container images? Can I request a quota increase somewhere? (I already tried clearing out older builds and it didn’t help.)

T353698 solved building the image (thanks!); now I’m stuck on T353847.

T353698 solved building the image (thanks!); now I’m stuck on T353847.

Just deployed a fix for that, it should be pulling all the needed dependencies, can you try again?

Alright, building RDFSimpleCon works now, but there’s an error when building RDF2Graph (which depends on RDFSimpleCon):

tools.lucaswerkmeister-test@tools-sgebastion-10:~$ toolforge jobs logs setup-build-rdf2graph
2024-01-27T14:06:02+00:00 [setup-build-rdf2graph-q6ckh] [INFO] Scanning for projects...
2024-01-27T14:06:02+00:00 [setup-build-rdf2graph-q6ckh] [INFO] 
2024-01-27T14:06:02+00:00 [setup-build-rdf2graph-q6ckh] [INFO] -------------------< nl.wur.ssb.RDF2Graph:RDF2Graph >-------------------
2024-01-27T14:06:02+00:00 [setup-build-rdf2graph-q6ckh] [INFO] Building Recovers the structure of a RDF resource 0.1
2024-01-27T14:06:02+00:00 [setup-build-rdf2graph-q6ckh] [INFO] --------------------------------[ jar ]---------------------------------
2024-01-27T14:06:08+00:00 [setup-build-rdf2graph-q6ckh] [WARNING] The POM for nl.wur.ssb.RDFSimpleCon:RDFSimpleCon:jar:0.1 is missing, no dependency information available
2024-01-27T14:06:08+00:00 [setup-build-rdf2graph-q6ckh] [INFO] ------------------------------------------------------------------------
2024-01-27T14:06:08+00:00 [setup-build-rdf2graph-q6ckh] [INFO] BUILD FAILURE
2024-01-27T14:06:08+00:00 [setup-build-rdf2graph-q6ckh] [INFO] ------------------------------------------------------------------------
2024-01-27T14:06:08+00:00 [setup-build-rdf2graph-q6ckh] [INFO] Total time:  6.598 s
2024-01-27T14:06:08+00:00 [setup-build-rdf2graph-q6ckh] [INFO] Finished at: 2024-01-27T14:06:08Z
2024-01-27T14:06:08+00:00 [setup-build-rdf2graph-q6ckh] [INFO] ------------------------------------------------------------------------
2024-01-27T14:06:08+00:00 [setup-build-rdf2graph-q6ckh] [ERROR] Failed to execute goal on project RDF2Graph: Could not resolve dependencies for project nl.wur.ssb.RDF2Graph:RDF2Graph:jar:0.1: Failure to find nl.wur.ssb.RDFSimpleCon:RDFSimpleCon:jar:0.1 in https://repo.maven.apache.org/maven2 was cached in the local repository, resolution will not be reattempted until the update interval of central has elapsed or updates are forced -> [Help 1]
2024-01-27T14:06:08+00:00 [setup-build-rdf2graph-q6ckh] [ERROR] 
2024-01-27T14:06:08+00:00 [setup-build-rdf2graph-q6ckh] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
2024-01-27T14:06:08+00:00 [setup-build-rdf2graph-q6ckh] [ERROR] Re-run Maven using the -X switch to enable full debug logging.
2024-01-27T14:06:08+00:00 [setup-build-rdf2graph-q6ckh] [ERROR] 
2024-01-27T14:06:08+00:00 [setup-build-rdf2graph-q6ckh] [ERROR] For more information about the errors and possible solutions, please read the following articles:
2024-01-27T14:06:08+00:00 [setup-build-rdf2graph-q6ckh] [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException

I feel like that’s most likely something that changed in Maven in the several years since I last worked on these builds; I’ll have to look into it further.

Hmph, if I try it out locally, I can still build RDFSimpleCon and RDF2Graph. I had to change the Java source and target version from 1.7 to 1.8 in both pom.xml files (apparently Java 20 dropped support for compiling Java 7?), but that’s surely just because I have a newer Java locally, and it shouldn’t affect how Maven looks for dependencies.

Time to add some of those logging flags, I guess.

Ohhh, I bet I know what’s going on… when I locally run mvn install in RDFSimpleCon, Maven installs it into my ~/.m2, and then the RDF2Graph build gets it from there. (I assumed RDF2Graph got it from the RDFSimpleCon subdirectory, but I can’t find any configuration for that in its pom.xml.) And in toolforge-jobs images, the home directory is temporary (IIRC?), so the job to build RDF2Graph starts out with an empty ~/.m2 that doesn’t contain RDFSimpleCon.

Should be easy enough to fix if I can figure out how to use a custom path for that Maven directory.

Yup, adding -Dmaven.repo.local="$TOOL_DATA_DIR/k8sroot/mvnrepo" to the mvn invocations fixed the Maven part \o/

Alright, with a few further fixes, all the setup- commands in the Procfile work now. (The image also takes 15 minutes to build; the main culprit seems to be the npm apt package, with its tons of tiny dependencies.)

Next big question is whether running make actually does the right thing…

Hm, make is erroring out and I don’t understand it yet.

I have this in my Procfile:

make: cd "$TOOL_DATA_DIR/k8sroot/RDF2Graph" && make PATH="$PATH:$TOOL_DATA_DIR/k8sroot/bin"

And when I run

toolforge jobs run --wait --image tool-lucaswerkmeister-test/tool-lucaswerkmeister-test:latest --mount all --command 'make random-item.shex' make-random-item

The logs of the job show the error:

2024-01-27T17:51:00+00:00 [make-random-item-lr7cs] cd "$TOOL_DATA_DIR/k8sroot/RDF2Graph" && make PATH="$PATH:$TOOL_DATA_DIR/k8sroot/bin": line 1: cd /data/project/lucaswerkmeister-test/k8sroot/RDF2Graph && make PATH=/cnb/process:/cnb/lifecycle:/layers/heroku_python/python/bin:/layers/heroku_python/dependencies/bin:/layers/fagiani_apt/apt/usr/bin:/layers/fagiani_apt/apt/usr/lib/jvm/java-11-openjdk-amd64/bin:/layers/fagiani_apt/apt/usr/lib/x86_64-linux-gnu/guile/3.0/bin:/layers/fagiani_apt/apt/usr/bin:/layers/fagiani_apt/apt/usr/share/maven/bin:/layers/fagiani_apt/apt/usr/share/nodejs/@npmcli/arborist/bin:/layers/fagiani_apt/apt/usr/share/nodejs/which/bin:/layers/fagiani_apt/apt/usr/share/nodejs/qrcode-terminal/bin:/layers/fagiani_apt/apt/usr/share/nodejs/mkdirp/bin:/layers/fagiani_apt/apt/usr/share/nodejs/npm-packlist/bin:/layers/fagiani_apt/apt/usr/share/nodejs/semver/bin:/layers/fagiani_apt/apt/usr/share/nodejs/npm/bin:/layers/fagiani_apt/apt/usr/share/nodejs/nopt/bin:/layers/fagiani_apt/apt/usr/share/nodejs/node-gyp/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/data/project/lucaswerkmeister-test/k8sroot/bin: No such file or directory

And I don’t know which part the “No such file or directory” refers to. The “line 1: cd …” suggests we’re still in the surrounding shell, not in make itself; but the directory to cd into seems to exist as far as I can tell, and make is installed and in the path too.

It’s strange that the error message lists the command once with environment variables substituted and once without; but in other Procfile commands, I got the impression that shell syntax and quoting are fully supported, so I don’t think it should be treating the whole line as a single command that’s not found…

I tried moving the whole command into the Procfile (i.e. with random-item.shex as the make target) and it didn’t change anything:

Procfile (abridged)
make: cd "$TOOL_DATA_DIR/k8sroot/RDF2Graph" && make PATH="$PATH:$TOOL_DATA_DIR/k8sroot/bin" random-item.shex
tools.lucaswerkmeister-test@tools-sgebastion-10:~$ toolforge jobs run --wait --image tool-lucaswerkmeister-test/tool-lucaswerkmeister-test:latest --mount all --command 'make' make-random-item
tools.lucaswerkmeister-test@tools-sgebastion-10:~$ toolforge-jobs logs --follow make-random-item
2024-01-27T18:23:44+00:00 [make-random-item-thq6g] cd "$TOOL_DATA_DIR/k8sroot/RDF2Graph" && make PATH="$PATH:$TOOL_DATA_DIR/k8sroot/bin" random-item.shex: line 1: cd /data/project/lucaswerkmeister-test/k8sroot/RDF2Graph && make PATH=/cnb/process:/cnb/lifecycle:/layers/heroku_python/python/bin:/layers/heroku_python/dependencies/bin:/layers/fagiani_apt/apt/usr/bin:/layers/fagiani_apt/apt/usr/lib/jvm/java-11-openjdk-amd64/bin:/layers/fagiani_apt/apt/usr/lib/x86_64-linux-gnu/guile/3.0/bin:/layers/fagiani_apt/apt/usr/bin:/layers/fagiani_apt/apt/usr/share/maven/bin:/layers/fagiani_apt/apt/usr/share/nodejs/@npmcli/arborist/bin:/layers/fagiani_apt/apt/usr/share/nodejs/which/bin:/layers/fagiani_apt/apt/usr/share/nodejs/qrcode-terminal/bin:/layers/fagiani_apt/apt/usr/share/nodejs/mkdirp/bin:/layers/fagiani_apt/apt/usr/share/nodejs/npm-packlist/bin:/layers/fagiani_apt/apt/usr/share/nodejs/semver/bin:/layers/fagiani_apt/apt/usr/share/nodejs/npm/bin:/layers/fagiani_apt/apt/usr/share/nodejs/nopt/bin:/layers/fagiani_apt/apt/usr/share/nodejs/node-gyp/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/layers/heroku_python/python/bin:/layers/heroku_python/dependencies/bin:/layers/fagiani_apt/apt/usr/bin:/layers/fagiani_apt/apt/usr/lib/jvm/java-11-openjdk-amd64/bin:/layers/fagiani_apt/apt/usr/lib/x86_64-linux-gnu/guile/3.0/bin:/layers/fagiani_apt/apt/usr/bin:/layers/fagiani_apt/apt/usr/share/maven/bin:/layers/fagiani_apt/apt/usr/share/nodejs/@npmcli/arborist/bin:/layers/fagiani_apt/apt/usr/share/nodejs/which/bin:/layers/fagiani_apt/apt/usr/share/nodejs/qrcode-terminal/bin:/layers/fagiani_apt/apt/usr/share/nodejs/mkdirp/bin:/layers/fagiani_apt/apt/usr/share/nodejs/npm-packlist/bin:/layers/fagiani_apt/apt/usr/share/nodejs/semver/bin:/layers/fagiani_apt/apt/usr/share/nodejs/npm/bin:/layers/fagiani_apt/apt/usr/share/nodejs/nopt/bin:/layers/fagiani_apt/apt/usr/share/nodejs/node-gyp/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/sbin:/usr/local/bi
n:/usr/sbin:/usr/bin:/sbin:/bin:/data/project/lucaswerkmeister-test/k8sroot/bin random-item.shex: No such file or directory

Though I did notice one thing in the image build output earlier:

[step-build] 2024-01-27T18:06:05.314583365Z -----> Fetching .debs for dependency make (pulled by dpkg-dev)
[step-build] 2024-01-27T18:06:06.044712336Z        Choosing make-guile for virtual package make
[step-build] 2024-01-27T18:06:06.054104423Z        Skipping make-guile, already downloaded.

Why is it picking make-guile instead of make, which seems to be a real package? And perhaps make-guile is broken in some way that causes the error? No idea.

After renaming the Procfile entry from make to run-make (after a conversation with @Soda on Telegram that also resulted in https://gitlab.wikimedia.org/toolforge-repos/link-dispenser/-/commit/6564d2e648da4c0c0317577337ed1afa0a4e0d95 – it turns out Procfile entries end up in the $PATH and shadow other commands by that name!), I have a version that (almost) works:

Procfile
run-make: cd "$TOOL_DATA_DIR/k8sroot/RDF2Graph" && make PATH="$PATH:$TOOL_DATA_DIR/k8sroot/bin"
run-make-random-item: cd "$TOOL_DATA_DIR/k8sroot/RDF2Graph" && make PATH="$PATH:$TOOL_DATA_DIR/k8sroot/bin" random-item.shex

--command 'run-make random-item.shex' still results in a confusing error, but --command run-make-random-item works. The build service documentation currently says that

You could also pass additional arguments, for example --command "migrate --production" would run the script specified in Procfile with the --production argument.

but, based on this experience, I’m starting to doubt whether that’s true. But be that as it may, I should be able to generalize the run-make-random-item command by just putting make $MAKETARGET in the Procfile and setting that environment variable, or something like that.

Alright, after a bunch more tweaks, it’s working \o/ \o/ \o/

Specifically,

echo 'SELECT (wd:Q104625404 AS ?entity) {}' > k8sroot/RDF2Graph/random-item.entities.sparql
toolforge envvars create MAKETARGET random-item.shex
toolforge jobs run --mem 6G --cpu 3 --wait --image tool-lucaswerkmeister-test/tool-lucaswerkmeister-test:latest --mount all --command run-make{,}

creates a valid (if short – I picked a small example item, after all ^^) file in k8sroot/RDF2Graph/random-item.shex.

(6G mem and 3 CPU is the per-job limit according to toolforge jobs quota. The Grid version used to run jobs with -mem 8g, so I might want to request a minor quota increase there; I think that number was already chosen based on experience.)

Next steps:

  • Continue working on the webservice code to actually create these jobs (there’s some old WIP in the toolforge-jobs branch)
  • Investigate why --command 'run-make random-item.shex' didn’t work, and possibly file a task about that – the envvars workaround works for now, but I’m not sure if it’s even available from the webservice, and it also has an inherent race condition so I’d prefer to avoid it

Investigated a bit more and filed T356016. If necessary, I should be able to work around it by creating a new setup-* command that creates a run-make shell script, and then add a Procfile command that just runs this shell script. (The error happens if the Procfile command has arguments, so moving those arguments into the shell script should work.)

If necessary, I should be able to work around it by creating a new setup-* command that creates a run-make shell script, and then add a Procfile command that just runs this shell script.

Although, if I directly create Kubernetes jobs instead of Toolforge jobs (based on discussion in T285944, especially T285944#7233967 and T285944#7334210), then I think I don’t really need any Procfile command for make? I can just run make directly, with custom environment variables and working directory as part of the k8s job spec.

Continue working on the webservice code to actually create these jobs (there’s some old WIP in the toolforge-jobs branch)

I pushed a newer version of this to the k8s branch now (this commit), directly using Kubernetes instead of Toolforge Jobs. But I haven’t tested it at all yet, I’ll do that later (in the lucaswerkmeister-test tool).

Alright, some more debugging and hacking later, I have a working version of the webservice. Job #10 in the lucaswerkmeister-test tool successfully ran to completion and produced valid ShExC. It’s still using the k8s branch, specifically this commit. Next I have to address all the TODOs in the k8s code there.

I’m also wondering again whether I should create the jobs via k8s directly or using toolforge-jobs after all. Right now I’m using k8s directly, and I eventually got it working, but it’s kinda tedious:

  • I have to manually set $TOOL_DATA_DIR
  • I have to manually source /layers/fagiani_apt/apt/.profile.d/000_apt.sh (Apt build pack)
  • I have to manually mount /data/project
  • I have to manually redirect stdout/stderr to filelog-like files

If T356377: [toolforge] simplify calling the different toolforge apis from within the containers happens soon, then maybe it would be easier to use. I think the main thing I’m missing from the toolforge-jobs CLI (I don’t know if it’s planned for the API or not) is the ability to specify a custom working directory… though that could be worked around with a cd command, I suppose.
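The manual steps above can be sketched as a Kubernetes Job manifest, written here as a plain Python dict (as you would hand it to the k8s API). This is illustrative only: the exact volume/NFS wiring is what the toolforge: tool label would normally add for you, and the data-dir path is an assumption based on the paths seen in this task.

```python
def job_manifest(job_name, make_target):
    """Sketch of a Kubernetes Job running make inside the buildpack image.

    Mirrors the manual steps listed above; not the tool's actual code.
    """
    shell_command = (
        # manually source the apt buildpack profile
        ". /layers/fagiani_apt/apt/.profile.d/000_apt.sh && "
        # run make with a custom working directory
        'cd "$TOOL_DATA_DIR/k8sroot/RDF2Graph" && '
        # manually redirect stdout/stderr to filelog-like files
        f"make {make_target} "
        f">> $TOOL_DATA_DIR/{job_name}.out 2>> $TOOL_DATA_DIR/{job_name}.err"
    )
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": job_name, "labels": {"toolforge": "tool"}},
        "spec": {
            "template": {
                "metadata": {"labels": {"toolforge": "tool"}},
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": job_name,
                        "image": "tool-wd-shex-infer/tool-wd-shex-infer:latest",
                        "command": ["sh", "-c", shell_command],
                        "env": [{
                            # manually set $TOOL_DATA_DIR (assumed path)
                            "name": "TOOL_DATA_DIR",
                            "value": "/data/project/wd-shex-infer",
                        }],
                    }],
                },
            },
        },
    }
```

Wrapping everything in a single sh -c command is what makes the custom working directory and output redirection possible without any Procfile entry.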

Alright, and here’s the cleaned-up pull request to migrate from Grid Engine to Kubernetes.

It’s working on the lucaswerkmeister-test tool, as far as I can tell; the last thing I need is just a quota increase – on lucaswerkmeister-test I configured it to request 3Gi of memory for the jobs, but I’d like 8Gi like on the Grid. I’ll file a separate task for that.

And I might need to tweak something about how the pods are created so they don’t linger around forever. (Unless there’s a limit and it’s just not been hit yet? I know for replicasets it keeps the last 8 or so by default.)

tools.lucaswerkmeister-test@tools-sgebastion-10:~/www/python/src$ kubectl get pods
NAME                                     READY   STATUS      RESTARTS   AGE
lucaswerkmeister-test-8565f6dcb7-fnk28   1/1     Running     0          56m
wd-shex-infer-10-hth9h                   0/1     Completed   0          20h
wd-shex-infer-12-hwgcw                   0/1     Completed   0          173m
wd-shex-infer-13-g2srt                   0/1     Completed   0          107m
wd-shex-infer-14-64zrx                   0/1     Completed   0          56m

In theory these would probably let me determine when the job actually finished, but I’m not sure I’m motivated to put that extra code together ^^
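For the record, reading the finish time off a completed Job is not much code; a sketch, assuming the JSON a kubectl get job … -o json would return (the sample status here is made up):

```python
import json
from datetime import datetime, timezone

def job_completion_time(job):
    """Return a Job's completion time as a datetime, or None if unfinished.

    Kubernetes sets .status.completionTime once a Job succeeds.
    """
    finished = job.get("status", {}).get("completionTime")
    if finished is None:
        return None
    # Kubernetes serializes timestamps as RFC 3339, e.g. "2024-02-17T15:46:01Z"
    return datetime.strptime(
        finished, "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)

# hypothetical output of `kubectl get job wd-shex-infer-10 -o json`
sample = json.loads(
    '{"status": {"succeeded": 1, "completionTime": "2024-02-17T15:46:01Z"}}'
)
```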

As the migration deadline approaches, and I’m still blocked on T357209, I request that you don’t shut down my tool tomorrow until I can actually migrate it to Kubernetes.

Noted!


Mentioned in SAL (#wikimedia-cloud) [2024-02-17T15:46:01Z] <wmbot~lucaswerkmeister@tools-sgebastion-10> for setup in setup-mkdir setup-install-jena setup-install-fuseki setup-install-jena-binaries setup-install-fuseki-server setup-clone-rdf2graph setup-clone-rdfsimplecon setup-build-rdfsimplecon setup-build-rdf2graph setup-build-shex-exporter; do toolforge jobs run --wait --image tool-wd-shex-infer/tool-wd-shex-infer:latest --command $setup{,} || break; done # T320140

Mentioned in SAL (#wikimedia-cloud) [2024-02-17T15:47:43Z] <wmbot~lucaswerkmeister@tools-sgebastion-10> for setup in setup-mkdir setup-install-jena setup-install-fuseki setup-install-jena-binaries setup-install-fuseki-server setup-clone-rdf2graph setup-clone-rdfsimplecon setup-build-rdfsimplecon setup-build-rdf2graph setup-build-shex-exporter; do toolforge jobs run --wait --mount all --image tool-wd-shex-infer/tool-wd-shex-infer:latest --command $setup{,} || break; done # T320140

Mentioned in SAL (#wikimedia-cloud) [2024-02-17T17:16:29Z] <wmbot~lucaswerkmeister@tools-sgebastion-10> for setup in setup-mkdir setup-install-jena setup-install-fuseki setup-install-jena-binaries setup-install-fuseki-server setup-clone-rdf2graph setup-clone-rdfsimplecon setup-build-rdfsimplecon setup-build-rdf2graph setup-build-shex-exporter; do toolforge jobs run --wait --mount all --image tool-wd-shex-infer/tool-wd-shex-infer:latest --command $setup{,} || break; done # T320140

Setup’s looking good so far…

tools.wd-shex-infer@tools-sgebastion-10:~$ for setup in setup-mkdir setup-install-jena setup-install-fuseki setup-install-jena-binaries setup-install-fuseki-server setup-clone-rdf2graph setup-clone-rdfsimplecon setup-build-rdfsimplecon setup-build-rdf2graph setup-build-shex-exporter; do toolforge jobs run --wait --mount all --image tool-wd-shex-infer/tool-wd-shex-infer:latest --command $setup{,} || break; done # T320140
INFO: job 'setup-mkdir' completed
INFO: job 'setup-install-jena' completed
INFO: job 'setup-install-fuseki' completed
INFO: job 'setup-install-jena-binaries' completed
INFO: job 'setup-install-fuseki-server' completed
INFO: job 'setup-clone-rdf2graph' completed
INFO: job 'setup-clone-rdfsimplecon' completed
INFO: job 'setup-build-rdfsimplecon' completed
INFO: job 'setup-build-rdf2graph' completed
INFO: job 'setup-build-shex-exporter' completed
tools.wd-shex-infer@tools-sgebastion-10:~$ du -sh k8sroot/
284M    k8sroot/

Mentioned in SAL (#wikimedia-cloud) [2024-02-17T23:00:57Z] <wmbot~lucaswerkmeister@tools-sgebastion-10> deployed 2de1752de4 (T320140: move from Grid Engine to Kubernetes; includes fresh venv)

Hm, the job ran now but something didn’t work:

$ kubectl logs pod/wd-shex-infer-101-mgxk9                                                                                                                                                                                                                   
sh: 1: cannot create /data/project/wd-shex-infer/wd-shex-infer-101.out: Permission denied                                                                                                                                                                                                             

I previously got this error when the /data/project/ mount was missing, but it’s there according to kubectl get pod/wd-shex-infer-101-mgxk9 -o yaml

You seem to be manually adding the volume mounts instead of relying on the admission controller, and the code is not adding the kubernetes.wmcloud.org/nfs-mounted=true nodeSelector to only run on nodes where NFS is actually mounted.

I see… so the difference to the successful jobs in the test tool is just that I got unlucky with the placement this time. Thanks!

One more reason to use T356377 once it’s available, I guess. And in the meantime, if I understand correctly, I should probably add toolforge: tool to various labels? (That’s what seems to do the trick in the QuickCategories tool’s background runner deployment, at least.)

toolforge: tool will automatically mount all volumes and add the required config for that. It’s not strictly required, as you can add the mounts manually, but without it you indeed have a much higher chance of missing something subtle like this. (The reason you haven’t seen this before is that currently all of our normal workers have NFS mounted. However, an ingress node I was provisioning was accidentally left without the taint to only enable ingress workloads, and yours got scheduled there, since without the NFS node selector that was the node with the most free resources available.)
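In concrete terms, the two alternatives described here boil down to these fragments of the pod template (written as Python dicts to match the job-creation code; both keys come straight from this discussion):

```python
# Option 1 (preferred): label the pod template so the admission
# controller wires up NFS mounts, config, etc. automatically.
pod_template_labels = {"toolforge": "tool"}

# Option 2: keep adding the mounts manually, but then also pin the pod
# to NFS-capable nodes yourself via a nodeSelector.
pod_spec_fragment = {
    "nodeSelector": {"kubernetes.wmcloud.org/nfs-mounted": "true"},
}
```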

Mentioned in SAL (#wikimedia-cloud) [2024-02-19T19:02:31Z] <wmbot~lucaswerkmeister@tools-sgebastion-10> deployed 85f97fad9c (add toolforge: tool label, T320140)

Looks like the required config also includes the TOOL_DATA_DIR env variable, so I can probably stop setting that explicitly. (Right now it actually shows up twice in kubectl get pod wd-shex-infer-102-dxrlz -o yaml.) Thanks!

Mentioned in SAL (#wikimedia-cloud) [2024-02-21T19:24:38Z] <wmbot~lucaswerkmeister@tools-sgebastion-10> deployed 8074aaeeef (mounts optional in config, T320140)

Mentioned in SAL (#wikimedia-cloud) [2024-02-21T20:31:22Z] <wmbot~lucaswerkmeister@tools-sgebastion-10> deployed f5f0149b33 (don’t set TOOL_DATA_DIR explicitly, T320140)

Hi @LucasWerkmeister, I don't see any more jobs running on the grid for this tool, is there anything left? Can we close this task if not?

Cheers!

I’d still like to be able to increase the requests (T357209 / T357881), but otherwise this is done. (Well, eventually I should remove the grid engine code, I guess.)

Mentioned in SAL (#wikimedia-cloud) [2024-03-08T15:15:54Z] <wmbot~lucaswerkmeister@tools-sgebastion-10> bump requests.memory to 8G (T357209 / T320140)

Mentioned in SAL (#wikimedia-cloud) [2024-03-08T15:38:17Z] <wmbot~lucaswerkmeister@tools-sgebastion-10> deployed bbe49a5ff9 (fix job limits, remove grid engine code, T320140)

It looks like the tool is still working with the new limits / requests, so I think we can call this done. Thanks everyone!