Migrate heritage from Toolforge GridEngine to Toolforge Kubernetes
Open, Needs Triage, Public

Description

Kindly migrate your tool (https://grid-deprecation.toolforge.org/t/heritage) from Toolforge GridEngine to Toolforge Kubernetes.

Toolforge GridEngine is getting deprecated.
See: https://techblog.wikimedia.org/2022/03/14/toolforge-and-grid-engine/

Please note that a volunteer may perform this migration if this has not been done after some time.
If you have already migrated this tool, kindly mark this as resolved.

If you would rather shut down this tool, kindly do so and mark this as resolved.

Useful Resources:
Migrating Jobs from GridEngine to Kubernetes
https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework#Grid_Engine_migration
Migrating Web Services from GridEngine to Kubernetes
https://wikitech.wikimedia.org/wiki/News/Toolforge_Stretch_deprecation#Move_a_grid_engine_webservice
Python
https://wikitech.wikimedia.org/wiki/News/Toolforge_Stretch_deprecation#Rebuild_virtualenv_for_python_users

Event Timeline

My apologies if this ticket comes as a surprise to you. In order to ensure WMCS can provide a stable, secure and supported platform, it’s important we migrate away from GridEngine. I want to assure you that while it is WMCS’s intention to shut down GridEngine as outlined in the blog post https://techblog.wikimedia.org/2022/03/14/toolforge-and-grid-engine/, a shutdown date for GridEngine has not yet been set. The goal of the migration is to move as many tools as possible onto Kubernetes and ensure as smooth a transition as possible for everyone. Once the majority of tools have migrated, discussion of a shutdown date will be more appropriate. See T314664: [infra] Decommission the Grid Engine infrastructure.

As noted in https://techblog.wikimedia.org/2022/03/16/toolforge-gridengine-debian-10-buster-migration/, some use cases are already supported by Kubernetes and should be migrated. If your tool can migrate, please do plan a migration. Reach out if you need help or find you are blocked by missing features. Most of all, WMCS is here to support you.

However, it’s possible your tool needs a mixed runtime environment or some other features that aren't yet present in https://techblog.wikimedia.org/2022/03/18/toolforge-jobs-framework/. We’d love to hear of this or any other blocking issues so we can work with you once a migration path is ready. Thanks for your hard work as volunteers and help in this migration!

This is a reminder that the tool for which this ticket is created is still running on the Grid.
The grid is deprecated and all remaining tools need to migrate to Toolforge Kubernetes.

We've sent several emails to maintainers as we continue to make the move away from the Grid.
Many of the issues that have held users back from moving away from the Grid have been addressed in
the latest updates to Build Service. See: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Changelog

You might find the following resources helpful in migrating your tool:

  1. https://wikitech.wikimedia.org/wiki/Help:Toolforge/Build_Service#Migrating_an_existing_tool
  2. https://wikitech.wikimedia.org/wiki/Help:Toolforge/Build_Service#Tutorials_for_popular_languages

Don't hesitate to reach out to us using this ticket or via any of our support channels.

If you have already migrated this tool, kindly mark this ticket as 'resolved'.
To do this, click on the 'Add Action' dropdown above the comment text box, select 'Change Status', then 'Resolved'.
Click 'Submit'.

Thank you!

Don't sabotage the grid jobs. That wouldn't be appreciated.

@Multichill we will definitely not sabotage your grid jobs! If there are any issues stopping the migration for a specific tool, kindly let us know under each ticket and the team will take a look at it.

Please extend the deadline for this tool past the winter holidays to give us a chance to figure out the issues around the remaining cron jobs.

The blocker here seems to be (@JeanFred please chime in if I misrepresented something):

  • We have a long running job (image categorization) which needs to be paused when a second job (data update) gets run since the latter locks the database.
  • This was managed in GridEngine but does not seem to be supported in Kubernetes.

So support would be appreciated in figuring out a way to work around this.

What does "pausing" mean here in practice? How is that implemented on the grid?

As far as I can see it's jstop/jsub of that other job.

I would recommend joining both jobs into the same one. k8s jobs will not overlap each other (@taavi correct me if I'm wrong, looking at https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/blob/main/tjf/job.py?ref_type=heads#L376), so you can be sure that there's only one instance running:

## all_in_one.sh
# simplified here, use whichever arguments you need, you can remove the jsub/jstop commands from the scripts too
./update_monuments.sh
./categorize_images.sh

This might run less often than you want if it takes longer to run than the scheduled period, but maybe that's ok.

Thanks.

Would that run them sequentially? The issue is that update_monuments.sh needs to be run daily(ish) during the competition whereas categorize_images.sh normally ends up with a backlog which it slowly chews through. So we can not wait on categorize_images.sh to complete before running update_monuments.sh again.

Would that run them sequentially?

Yes, it would run sequentially.

The issue is that update_monuments.sh needs to be run daily(ish) during the competition whereas categorize_images.sh normally ends up with a backlog which it slowly chews through. So we can not wait on categorize_images.sh to complete before running update_monuments.sh again.

I see. One way to work around that would be to use timeout. It's a bit ugly, and you'll have to handle return codes to get the right result, something like:

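#!/bin/bash
## all_in_one.sh
# bash is needed for the [[ ]] test below
# stop immediately if any command fails
set -o errexit
# simplified here, use whichever arguments you need, you can remove the jsub/jstop commands from the scripts too
./update_monuments.sh
# timeout in a bit less than 1 hour, to let the next job trigger
timeout 3500 ./categorize_images.sh || {
    rc="$?"
    # timeout exits with 124 when it had to stop the command
    if [[ "$rc" == "124" ]]; then
        echo "stopped categorize_images"
        exit 0
    fi
    # any other non-zero code is a real failure of categorize_images.sh
    echo "ERROR: categorize_images failed to run"
    exit "$rc"
}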

You would need to keep the timeout there in sync with the schedule of the job. The nicer alternative (imo) would be to synchronize both jobs through the db or similar (redis maybe), or to join them together in a continuous job somehow. Those last two options require a bit of coding and maybe refactoring the current code.
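
To make the "synchronize both jobs through something shared" option a bit more concrete, here is a minimal sketch. It assumes both jobs can see the same directory (for example the tool's home directory) and that the categorization work can be split into small batches. All names in it (PAUSE_FLAG, POLL_SECONDS, request_pause, release_pause, pause_requested, process_one_batch, categorize_loop) are hypothetical and not taken from the heritage code; a row in the db or a redis key could play the same role as the flag directory.

```shell
#!/bin/bash
# Hypothetical sketch: two jobs coordinate through a flag directory on shared storage.
# All names here are illustrative assumptions, not taken from the heritage tool.
PAUSE_FLAG="${PAUSE_FLAG:-$HOME/update_running.flag}"
POLL_SECONDS="${POLL_SECONDS:-60}"

# Update job: announce that an update is starting.
# mkdir fails if the directory already exists, so only one caller succeeds.
request_pause() {
    mkdir "$PAUSE_FLAG" 2>/dev/null
}

# Update job: clear the flag once the update has finished.
release_pause() {
    rmdir "$PAUSE_FLAG" 2>/dev/null || true
}

# Categorization job: is an update currently announced?
pause_requested() {
    [ -d "$PAUSE_FLAG" ]
}

# Placeholder for one small unit of categorization work.
process_one_batch() {
    echo "processing one batch"
}

# Categorization job: process N batches, waiting whenever the flag is set.
categorize_loop() {
    local batches="$1"
    local completed=0
    while [ "$completed" -lt "$batches" ]; do
        if pause_requested; then
            sleep "$POLL_SECONDS"
            continue
        fi
        process_one_batch
        completed=$(( completed + 1 ))
    done
}
```

In this sketch the update job would call request_pause before running update_monuments.sh and release_pause afterwards (ideally from a trap, so the flag is also cleared on failure), while the categorization job would run categorize_loop. Note that the flag only stops new batches from starting: a batch that is already in progress still has to finish, so the update job may need to wait or retry before it takes the database lock.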

Hi! Have you had a chance to look into migrating already? I still see some jobs running on the grid.

If you want any more guidance or help, can you please publish the code for your tool on a public repository (like gitlab.wikimedia.org)? That would greatly help me know how to guide you, and would make your tool compliant with the Toolforge TOU.

Remember that the grid is being turned off next week (14th of March).

Cheers!

The grid engine has been shut down, so I'm closing any remaining migration tasks as Declined. If you're still planning to migrate this tool, please re-open this task and add one or more active project tags to it. (If you need a project tag for your tool, those can be created via the Toolforge admin console.)