
Migrate heritage from Toolforge GridEngine to Toolforge Kubernetes
Open, Needs Triage, Public

Description

Kindly migrate your tool (https://grid-deprecation.toolforge.org/t/heritage) from Toolforge GridEngine to Toolforge Kubernetes.

Toolforge GridEngine is being deprecated.
See: https://techblog.wikimedia.org/2022/03/14/toolforge-and-grid-engine/

Please note that a volunteer may perform this migration if this has not been done after some time.
If you have already migrated this tool, kindly mark this as resolved.

If you would rather shut down this tool, kindly do so and mark this as resolved.

Useful Resources:
Migrating Jobs from GridEngine to Kubernetes
https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework#Grid_Engine_migration
Migrating Web Services from GridEngine to Kubernetes
https://wikitech.wikimedia.org/wiki/News/Toolforge_Stretch_deprecation#Move_a_grid_engine_webservice
Python
https://wikitech.wikimedia.org/wiki/News/Toolforge_Stretch_deprecation#Rebuild_virtualenv_for_python_users

Event Timeline

My apologies if this ticket comes as a surprise to you. In order to ensure WMCS can provide a stable, secure and supported platform, it’s important that we migrate away from GridEngine. I want to assure you that while it is WMCS’s intention to shut down GridEngine as outlined in the blog post https://techblog.wikimedia.org/2022/03/14/toolforge-and-grid-engine/, a shutdown date has not yet been set. The goal is to migrate as many tools as possible onto Kubernetes and to make the transition as smooth as possible for everyone. Once the majority of tools have migrated, a discussion about a shutdown date will be more appropriate. See T314664: [infra] Decommission the Grid Engine infrastructure.

As noted in https://techblog.wikimedia.org/2022/03/16/toolforge-gridengine-debian-10-buster-migration/, some use cases are already supported by Kubernetes and should be migrated. If your tool can migrate, please do plan a migration. Reach out if you need help or find you are blocked by missing features. Most of all, WMCS is here to support you.

However, it’s possible your tool needs a mixed runtime environment or some other feature that isn't yet available in the Jobs framework (https://techblog.wikimedia.org/2022/03/18/toolforge-jobs-framework/). We’d love to hear about this or any other blocking issues so we can work with you once a migration path is ready. Thanks for your hard work as volunteers and for your help in this migration!

This is a reminder that the tool for which this ticket is created is still running on the Grid.
The grid is deprecated and all remaining tools need to migrate to Toolforge Kubernetes.

We've sent several emails to maintainers as we continue to make the move away from the Grid. Many of the issues that have held users back from moving away from the Grid have been addressed in the latest updates to the Build Service. See: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Changelog

You might find the following resources helpful in migrating your tool:

  1. https://wikitech.wikimedia.org/wiki/Help:Toolforge/Build_Service#Migrating_an_existing_tool
  2. https://wikitech.wikimedia.org/wiki/Help:Toolforge/Build_Service#Tutorials_for_popular_languages

Don't hesitate to reach out to us using this ticket or via any of our support channels.

If you have already migrated this tool, kindly mark this ticket as 'resolved'.
To do this, click on the 'Add Action' dropdown above the comment text box, select 'Change Status', then 'Resolved', and click 'Submit'.

Thank you!

Don't sabotage the grid jobs. That wouldn't be appreciated.

@Multichill we will definitely not sabotage your grid jobs! If there are any issues stopping the migration for a specific tool, kindly let us know under each ticket and the team will take a look at it.

Please extend the deadline for this tool past the winter holidays to give us a chance at figuring out the issues around the remaining cron jobs.

The blocker here seems to be (@JeanFred please chime in if I misrepresented something):

  • We have a long-running job (image categorization) which needs to be paused when a second job (data update) is run, since the latter locks the database.
  • This was managed in GridEngine but does not seem to be supported in Kubernetes.

So support would be appreciated in figuring out a way to work around this.

What does "pausing" mean here in practice? How is that implemented on the grid?

As far as I can see it's jstop/jsub of that other job.
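If I read it right, the grid setup does roughly this (a sketch of my understanding, not the actual scripts):

## grid-era flow (sketch): stop the long-running categorisation job before the
## update locks the database, then resubmit it afterwards
jstop categorize_images
./update_monuments.sh
jsub -once -N categorize_images ./categorize_images.sh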

I would recommend joining both jobs into the same one, k8s jobs will not overlap each other (@taavi correct me if I'm wrong, looking at https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/blob/main/tjf/job.py?ref_type=heads#L376), so you can be sure that there's only one instance running:

## all_in_one.sh
# simplified here, use whichever arguments you need, you can remove the jsub/jstop commands from the scripts too
./update_monuments.sh
./categorize_images.sh

This might run less often than you want if it takes longer to run than the scheduled period, but maybe that's ok.
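You can then register the combined script as a single scheduled job, something along these lines (the job name, image and schedule are placeholders; use whatever matches your current cron setup):

## registering the combined script as one cron-style job (name, image and
## schedule are placeholders)
toolforge jobs run update-and-categorize \
    --command "$HOME/bin/all_in_one.sh" \
    --image python3.9 \
    --schedule "0 * * * *"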

Thanks.

Would that run them sequentially? The issue is that update_monuments.sh needs to be run daily(ish) during the competition, whereas categorize_images.sh normally ends up with a backlog which it slowly chews through. So we cannot wait for categorize_images.sh to complete before running update_monuments.sh again.

Would that run them sequentially?

It would run them sequentially, yes.

The issue is that update_monuments.sh needs to be run daily(ish) during the competition, whereas categorize_images.sh normally ends up with a backlog which it slowly chews through. So we cannot wait for categorize_images.sh to complete before running update_monuments.sh again.

I see, so one way to work around that would be using timeout. It's a bit ugly, and you'll have to handle return codes to get the right behaviour; something like:

## all_in_one.sh
set -o errexit
# simplified here, use whichever arguments you need, you can remove the jsub/jstop commands from the scripts too
./update_monuments.sh
# timeout in a bit less than 1 hour, to let the next job trigger
timeout 3500 ./categorize_images.sh || {
    rc="$?"
    if [[ "$rc" == "124" ]]; then
        echo "stopped categorize items"
        exit 0
    fi
    echo "ERROR: categorize_images failed to run"
    exit "$rc"
}

You would need to keep the timeout there in sync with the job's schedule. The nicer alternative (imo) would be to synchronize both jobs through the db or similar (redis maybe), or to join them together in a continuous job somehow. Those last two options require a bit of coding and maybe some refactoring of the current code.
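For the continuous-job option, a very rough sketch (untested; the timings are only illustrative) could be a single script registered with --continuous:

## continuous_runner.sh - rough sketch of the "join them in a continuous job"
## option, registered with `toolforge jobs run ... --continuous`
while true; do
    # refresh the monuments data, then spend the rest of the day on the
    # categorisation backlog, stopping in time for the next refresh
    ./update_monuments.sh
    timeout 23h ./categorize_images.sh || true
done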

Hi! Have you had a chance to look into migrating already? I still see some jobs running on the grid.

If you want any more guidance or help, could you please publish the code for your tool in a public repository (like gitlab.wikimedia.org)? That would greatly help me know how to guide you, and would also make your tool compliant with the Toolforge Terms of Use (TOU).

Remember that the grid is being turned off next week (14th of March).

Cheers!

The grid engine has been shut down, so I'm closing any remaining migration tasks as Declined. If you're still planning to migrate this tool, please re-open this task and add one or more active project tags to it. (If you need a project tag for your tool, those can be created via the Toolforge admin console.)

could you please publish the code for your tool in a public repository (like gitlab.wikimedia.org)? That would greatly help me know how to guide you, and would also make your tool compliant with the Toolforge Terms of Use (TOU).

Heritage’s source code has been available on Gerrit, mirrored on both GitHub and Phabricator (rTHER), for a decade, and the repo is even linked from the Wiki-Loves-Monuments-Database project which this task is tagged with. Frankly, I could have done without the insinuation that this project would not be following the Toolforge Terms of Use.

Thanks @JeanFred, good to know :)
Sorry if I was mistaken; I had checked both Toolhub and toolsadmin for information on where to find the code repository and there was nothing there. I'll add it to the toolsadmin toolinfo so it's easy to find next time.

I don't see any license information in the repository (or in toolsadmin/toolhub), what is the license that applies to that code?

Thanks for the answer!

Here’s the toolhub record: https://toolhub.wikimedia.org/tools/heritage-api

I don't see any license information in the repository (or in toolsadmin/toolhub), what is the license that applies to that code?

Well well well, perhaps I should have been a bit less mouthy about the ToU compliance here − we never followed up on T174633: Set license on heritage repo >_>

Here’s the toolhub record: https://toolhub.wikimedia.org/tools/heritage-api

I see where my confusion came from: the IDs in Toolhub and Toolforge don't match (heritage on Toolforge, heritage-api on Toolhub). That breaks some assumptions I had about the toolforge<->toolhub mapping (there are even two Toolhub tools coming from that one Toolforge tool, https://toolhub.wikimedia.org/tools/heritage-daily-uploads).

This gives me a completely new view of the relationship between toolforge and toolhub, interesting 🤔

Status report (see also https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.heritage/SAL)

  • Recreated the venv under k8s
  • Tried to run a one-off job using toolforge jobs run update-monuments-min --command /data/project/heritage/bin/update_monuments_min.sh --image python3.7

This fails with

    import erfgoedbot.common as common
ModuleNotFoundError: No module named 'erfgoedbot'

Not too sure what sorts of import shenanigans I need to change here >_> Let’s see
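Presumably the fix is to make sure the repo root (which contains erfgoedbot/) is on the module path before the scripts run; untested, but something like this at the top of the shell wrapper:

## untested guess: put the repo root on PYTHONPATH so `import erfgoedbot.common`
## resolves (ERFGOED_PATH is the existing variable pointing at the checkout)
export PYTHONPATH="${ERFGOED_PATH}:${PYTHONPATH:-}"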

Also, I found another fundamental problem: update_monuments_min.sh (as well as update_monuments.sh, for that matter) does two things: running the Python process to harvest the wikis into the MySQL database; and then updating the monuments_all table using $MYSQL_BIN -h $DB_SERVER $DATABASE < $ERFGOED_PATH/sql/fill_table_monuments_all.sql. That second step fails with mysql: command not found.

Some options I can think of:

  • running some custom image that has both Python and mysql
  • reimplement/wrap the call to fill_table_monuments_all.sql in Python
  • run the mysql step in a separate job in a mariadb container − but not sure how we would orchestrate both jobs together (a rough sketch of such a job below).

(Also, yes, Py37 is deprecated; will look into upgrading to 3.9)
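For the third option, I imagine something along these lines (very much a sketch; the exact image name would need checking against toolforge jobs images):

#!/bin/bash
## fill_monuments_all.sh - hypothetical wrapper so the SQL step can run in an
## image that ships the mysql client; DB_SERVER, DATABASE and ERFGOED_PATH come
## from the existing configuration
set -o errexit
mysql -h "$DB_SERVER" "$DATABASE" < "$ERFGOED_PATH/sql/fill_table_monuments_all.sql"

and then run it with something like toolforge jobs run fill-monuments-all --command $HOME/bin/fill_monuments_all.sh --image mariadb (the image name being the part I am least sure about). That still leaves the question of how to trigger it after the harvest, though.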

I got something working, will wait overnight to see if it went well, and will then send to Gerrit the necessary changes.

There's still the fundamental problem that harvesting is two sequential operations, which cannot run in the same (default) image (because of different runtimes). It does not seem possible to set things up so that the successful end of a job triggers the start of another job? (which sounds like a pretty basic thing, tbh).

Reading the docs, the closest thing might be https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework#Configuring_internal_domain_names_for_your_jobs − starting both jobs at the same time, but having the latter communicate with the former?

Or perhaps more simply − scrap the communication − have the latter job wait for some signal, say the presence of some file on disk, which is created at the end of the former job? Sounds fiddly but like it could work.
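Roughly, a marker file on the shared tool home could do it (a sketch only; the file name is illustrative):

## at the very end of the first job (the Python harvest):
touch "$HOME/harvest.done"

## at the start of the second job (the SQL step), wait for the marker:
while [ ! -e "$HOME/harvest.done" ]; do
    sleep 60
done
rm -f "$HOME/harvest.done"
# ...then run the actual second step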

Change #1065124 had a related patch set uploaded (by Jean-Frédéric; author: Jean-Frédéric):

[labs/tools/heritage@master] Use toolforge-jobs to install requirements during deployment

https://gerrit.wikimedia.org/r/1065124

It does not seem possible to set things up so that the successful end of a job triggers the start of another job?

A couple more options:

Change #1065124 merged by jenkins-bot:

[labs/tools/heritage@master] Use toolforge-jobs to install requirements during deployment

https://gerrit.wikimedia.org/r/1065124

That sounds promising, trying that out. Quick question: the jobsNewJob schema does not seem to have a wait parameter − that means starting a new job via the API is necessarily fire-and-forget, correct?

I was hoping to avoid learning even more new Toolforge stuff, but alas, that might be the best bet ^_^'

I simplified my requirements, and now have a prototype job running which finishes by API-triggering another job, which in turn finishes by triggering two more jobs. Let’s see how that goes :)

That sounds promising, trying that out. Quick question: the jobsNewJob schema does not seem to have a wait parameter − that means starting a new job via the API is necessarily fire-and-forget, correct?

Yep, the CLI just does polling after creating the job (https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-cli/-/blob/main/tjf_cli/cli.py?ref_type=heads#L633)
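So if you end up starting the jobs from a shell (e.g. from the deploy script) rather than via the API, you can lean on that blocking behaviour; something like this (the --wait flag is from memory and the job/script/image names are illustrative, double-check with toolforge jobs run --help and toolforge jobs images):

## chaining two one-off jobs from a shell session: the first call blocks until
## the job finishes, then the second one is created
toolforge jobs run harvest-step --command "$HOME/bin/update_monuments.sh" --image python3.9 --wait
toolforge jobs run sql-step --command "$HOME/bin/fill_monuments_all.sh" --image mariadb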

Change #1069284 had a related patch set uploaded (by Jean-Frédéric; author: Jean-Frédéric):

[labs/tools/heritage@master] Rewrite jobs shell scripts for Toolforge Jobs framework

https://gerrit.wikimedia.org/r/1069284

Change #1069284 merged by jenkins-bot:

[labs/tools/heritage@master] Rewrite jobs shell scripts for Toolforge Jobs framework

https://gerrit.wikimedia.org/r/1069284