Migrate enwikt-translations from Toolforge GridEngine to Toolforge Kubernetes
Closed, Declined · Public

Description

Kindly migrate your tool (https://grid-deprecation.toolforge.org/t/enwikt-translations) from Toolforge GridEngine to Toolforge Kubernetes.

Toolforge GridEngine is being deprecated.
See: https://techblog.wikimedia.org/2022/03/14/toolforge-and-grid-engine/

Please note that a volunteer may perform this migration if this has not been done after some time.
If you have already migrated this tool, kindly mark this as resolved.

If you would rather shut down this tool, kindly do so and mark this as resolved.

Useful Resources:
Migrating Jobs from GridEngine to Kubernetes
https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework#Grid_Engine_migration
Migrating Web Services from GridEngine to Kubernetes
https://wikitech.wikimedia.org/wiki/News/Toolforge_Stretch_deprecation#Move_a_grid_engine_webservice
Python
https://wikitech.wikimedia.org/wiki/News/Toolforge_Stretch_deprecation#Rebuild_virtualenv_for_python_users

Event Timeline

My apologies if this ticket comes as a surprise to you. In order to ensure WMCS can provide a stable, secure and supported platform, it’s important that we migrate away from GridEngine. I want to assure you that while it is WMCS’s intention to shut down GridEngine as outlined in the blog post https://techblog.wikimedia.org/2022/03/14/toolforge-and-grid-engine/, a shutdown date for GridEngine has not yet been set. The goal is to move as many tools as possible onto Kubernetes and ensure as smooth a transition as possible for everyone. Once the majority of tools have migrated, a discussion about a shutdown date will be more appropriate. See T314664: [infra] Decommission the Grid Engine infrastructure.

As noted in https://techblog.wikimedia.org/2022/03/16/toolforge-gridengine-debian-10-buster-migration/ some use cases are already supported by Kubernetes and should be migrated. If your tool can migrate, please do plan a migration. Reach out if you need help or find you are blocked by missing features. Most of all, WMCS is here to support you.

However, it’s possible your tool needs a mixed runtime environment or some other feature that isn’t yet present in the Jobs framework (https://techblog.wikimedia.org/2022/03/18/toolforge-jobs-framework/). We’d love to hear of this or any other blocking issues so we can work with you once a migration path is ready. Thanks for your hard work as volunteers and for your help in this migration!

I got an email about this months ago, but didn't understand which parts of my work on Toolforge needed to be changed. I do jsub jobs and a webservice on this tool. The jsub part clearly needs to use toolforge-jobs, but how do I determine whether the webservice needs to be changed somehow?

If you run webservice status, it should tell you whether the webservice is running on GridEngine or Kubernetes.
If it is on GridEngine, take a look at this use-case continuity table to see whether a Kubernetes-equivalent command works for you:
https://wikitech.wikimedia.org/wiki/News/Toolforge_Grid_Engine_deprecation#Use_case_continuity

My jsub command looks like this:

jsub -N db -cwd -mem 4G -m e -v PAGES_ARTICLES=/public/dumps/public/enwiktionary/20220720/*pages-articles.xml.bz2 -v MODULE_DB=modules.sqlite -v TRANSLATION_DB=translations.sqlite ./run.sh

I don't see a toolforge-jobs equivalent to -v, for setting environment variables for the shell script, or to -cwd. I suppose I could use cd in the shell script in place of -cwd, but is there a preferred way to set environment variables in the new framework? They are basically arguments, but I haven't done argument parsing in Bash.
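For reference, Bash itself can supply fallback values with parameter expansion, so run.sh could default these variables when the invocation doesn't pass them. A minimal sketch; the variable names just mirror the jsub invocation above:

```shell
#!/usr/bin/env bash
# Sketch: default the variables when the caller has not exported them
# (names mirror the jsub invocation above).
MODULE_DB="${MODULE_DB:-modules.sqlite}"
TRANSLATION_DB="${TRANSLATION_DB:-translations.sqlite}"
echo "${MODULE_DB} ${TRANSLATION_DB}"
```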

> If you run webservice status, it should tell you whether the webservice is running on GridEngine or Kubernetes.

Good, I'm using Kubernetes for the webservice already.

If you don't want to put it in run.sh, I think --command "cd foo && PAGES_ARTICLES=/public/dumps/public/enwiktionary/20220720/*pages-articles.xml.bz2 MODULE_DB=modules.sqlite TRANSLATION_DB=translations.sqlite ./run.sh" should work.

@JJMC89 Thank you. That looks like it'll work, and I'll try that when the pages-articles.xml comes out.

I'm so far unable to use the Rust toolchain in /data/project/rustup/rustup/.cargo/bin in the toolforge-jobs environment from inside my run.sh script.

The first error was that it couldn't find cargo, because apparently /data/project/rustup/rustup/.cargo/bin wasn't in the $PATH of toolforge-jobs, though it is in the $PATH of the enwikt-translations tool account. I fixed that by setting PATH=$PATH:

toolforge-jobs run db --image tf-bullseye-std --mem 4Gi --emails onfinish --command "cd $HOME/db && PATH=$PATH PAGES_ARTICLES=/public/dumps/public/enwiktionary/20221020/*pages-articles.xml.bz2 MODULE_DB=modules.sqlite TRANSLATION_DB=translations.sqlite ./run.sh"

I then got the error error: rustup could not choose a version of cargo to run, because one wasn't specified explicitly, and no default is configured., so I selected the stable Rust toolchain in run.sh by replacing cargo with cargo +stable.

Then I started getting bizarre errors from cargo: error: toolchain 'stable-x86_64-unknown-linux-gnu' is not installed. So I double-checked that it was using the right cargo. When I put which cargo and which rustup into run.sh, it prints the paths of rustup and cargo in /data/project/rustup/rustup/.cargo/bin as expected.

But when I put rustup toolchain list into run.sh, my stdout file says no installed toolchains. When I run the same command in the enwikt-translations shell, it provides a list that includes the latest toolchain (1.64), which I want to use:

stable-x86_64-unknown-linux-gnu (default)
1.55-x86_64-unknown-linux-gnu
1.56-x86_64-unknown-linux-gnu
1.57-x86_64-unknown-linux-gnu
1.58-x86_64-unknown-linux-gnu
1.59-x86_64-unknown-linux-gnu
1.60-x86_64-unknown-linux-gnu
1.61-x86_64-unknown-linux-gnu
1.62-x86_64-unknown-linux-gnu
1.63-x86_64-unknown-linux-gnu
1.64-x86_64-unknown-linux-gnu

Somehow, when run from toolforge-jobs, cargo and rustup don't see all these installed toolchains. The toolchains are located in directories under /data/project/rustup/rustup/.rustup/toolchains. I tried adding /data/project/rustup/rustup/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/bin to the PATH variable and got /bin/sh: 1: /data/project/rustup/rustup/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/bin: Permission denied. I looked at the file permissions, and they should allow all users to view that directory and execute cargo within it.

I tried running cargo directly with toolforge-jobs run test --image tf-bullseye-std --wait --command 'PATH=/data/project/rustup/rustup/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/bin:$PATH cargo --version' and it succeeded. I then tried a minimized version with toolforge-jobs run test-cargo --image tf-bullseye-std --emails onfinish --wait --command "PATH=$PATH ~/test.sh", where test.sh was as follows, and got the toolchain 'stable-x86_64-unknown-linux-gnu' is not installed error twice.

#!/usr/bin/env bash

which cargo
cargo +stable --version

I spent a few hours off and on trying to figure this out, and I have no more ideas. Odd that Cargo works when it's in the command submitted to toolforge-jobs, but not when it's in a script file. Pinging @Legoktm because he started the rustup project that I'm trying to use in toolforge-jobs.

I've run the task with jsub for now so that the website is up-to-date with the latest dump. I'll try it on toolforge-jobs again if anyone has an idea of how to get cargo to work.

Sorry @Erutuon, I missed your ping! (Feel free to poke me on Matrix/IRC in the future.) I think it's related to ~/.profile not taking full effect under Kubernetes, while it does on the grid.
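If ~/.profile isn't being sourced under Kubernetes, one low-tech workaround is for the script to set PATH itself instead of relying on login-shell initialization. A sketch; the rustup path is copied from the commands above:

```shell
#!/usr/bin/env bash
# Sketch: do not rely on ~/.profile being sourced; prepend the shared
# rustup bin directory (path copied from the commands above) explicitly.
export PATH="/data/project/rustup/rustup/.cargo/bin:${PATH}"
# The prepended directory is now first in PATH lookup order:
echo "${PATH%%:*}"
```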

In any case, I would recommend not running cargo inside the container anyway; the binary should be built ahead of time, and the container should invoke it directly. I am waiting on T194332 before migrating any of my Rust tools; the idea is that you push -> build the container, which compiles your Rust binary -> deploy, which executes that binary directly.

@Erutuon Hi! We added some (admittedly simple) support for Rust in the Toolforge build service (https://wikitech.wikimedia.org/wiki/Help:Toolforge/Build_Service/My_first_Buildpack_Rust_tool); maybe that will help you get going with the migration. This way of building the image doesn't require you to pre-install the toolchain or similar; instead, it installs Rust and runs cargo etc. while building the new image.

If you try this and face any issues, feel free to ping me and I'll try to help.

I looked back at my first post and got test.sh to run in toolforge-jobs once I set PATH=/data/project/rustup/rustup/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/bin:$PATH and removed +stable from the command in test.sh. Maybe that means I can actually compile with toolforge-jobs now.

When I last tried to run my Rust program with jsub, it failed, so the tool has been running off old data for months. The error sounds to me like SQLite encountered an error in a filesystem operation:

Error: disk I/O error

Caused by:
    Error code 3850: I/O error in the advisory file locking layer
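Error code 3850 is SQLITE_IOERR_LOCK, and POSIX advisory locks are historically unreliable on NFS, which backs Toolforge home directories. One possible workaround, an untested sketch not suggested anywhere in this thread, is to open the database with SQLite's unix-dotfile locking VFS, which replaces POSIX locks with lock files:

```shell
# Sketch (untested workaround, not from this thread): open the database
# with the unix-dotfile VFS, which uses lock files instead of the POSIX
# advisory locks that misbehave on NFS. Requires the sqlite3 CLI, which
# accepts URI filenames.
DB="$(mktemp -u /tmp/translations-XXXXXX.sqlite)"
check="$(sqlite3 "file:${DB}?vfs=unix-dotfile" 'CREATE TABLE t(x); PRAGMA quick_check;')"
echo "${check}"
rm -f "${DB}" "${DB}.lock"
```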

That page describes a different process than I use for enwikt-translations. I have two Rust programs, one to generate a database and one to run the web server. I build enwiktionary-translations-db and run ~/db/run.sh, which runs three enwiktionary-translations-db subcommands to generate translations.sqlite from the latest pages-articles.xml.bz2 in about 30 minutes. Then I build enwiktionary-translations-server (if it has changed), copy translations.sqlite from the previous step into the production directory, and run enwiktionary-translations-server with service.template, which reads translations.sqlite generated in the previous step.

I don't know how to translate this into building an image, but I guess the first question is how to build the two Rust programs. The Buildpack Rust Tool instructions have you supply a repository URL to toolforge build start. That would be possible, but at the moment I have the repositories cloned in directories (db and server). Is there a way to build them in those directories rather than cloning the repositories? If it's not possible, would supplying the repository URLs mean I'm rebuilding the programs from scratch every time (which would take a long time)?

Sorry for the delay, I missed the comment.

> I don't know how to translate this into building an image, but I guess the first question is how to build the two Rust programs. The Buildpack Rust Tool instructions have you supply a repository URL to toolforge build start. That would be possible, but at the moment I have the repositories cloned in directories (db and server). Is there a way to build them in those directories rather than cloning the repositories?

Not really, no; the build process happens without access to NFS, so it clones the repositories anew with git.

> If it's not possible, would supplying the repository URLs mean I'm rebuilding the programs from scratch every time (which would take a long time)?

Yes, currently every build happens from scratch, so it sometimes takes a few minutes to build. We will add some caching and such in the future, though.

> That page describes a different process than I use for enwikt-translations. I have two Rust programs, one to generate a database and one to run the web server. I build enwiktionary-translations-db and run ~/db/run.sh, which runs three enwiktionary-translations-db subcommands to generate translations.sqlite from the latest pages-articles.xml.bz2 in about 30 minutes. Then I build enwiktionary-translations-server (if it has changed), copy translations.sqlite from the previous step into the production directory, and run enwiktionary-translations-server with service.template, which reads translations.sqlite generated in the previous step.

This sounds like it would map to one scheduled job and one webservice, using different build-service images.

The scheduled job can use a Procfile with the entry generate-translations: ./run.sh and an image built from the db repository (where the run.sh script resides, if I understood correctly), probably with something like --image-name generate-translations.

When running, the cronjob mounts the home directory (under $TOOL_DATA_DIR), so you can generate the translations.sqlite file there and restart the webservice to pick it up (toolforge webservice restart).

Then the webservice uses a different image, with the Procfile entry web: ./path/to/web/binary, built from the server repository, probably with the default image name (tool-enwikt-translations) or something like --image-name webservice.
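Concretely, the two Procfile entries suggested above might look like this. A sketch: the entries live in separate repositories, the comment lines are only labels here, and ./path/to/web/binary is a placeholder from the comment, not the tool's real layout:

```
# Procfile in the db repository (image for the scheduled job):
generate-translations: ./run.sh

# Procfile in the server repository (image for the webservice):
web: ./path/to/web/binary
```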

If/when you have the code published in a public git repository, I can try to play with it and give better advice.

Hi @Erutuon! I don't see any more processes running on the grid for this tool; were you able to migrate? If so, you can close this task :)
If not, do you need any more help or guidance with the process? (I see you have a git repo, awesome :), that will help me a lot in guiding you if you need it!)

Cheers!

taavi subscribed.

The grid engine has been shut down, so I'm closing any remaining migration tasks as Declined. If you're still planning to migrate this tool, please re-open this task and add one or more active project tags to it. (If you need a project tag for your tool, those can be created via the Toolforge admin console.)