
[jobs-api,infra] upgrade all the existing toolforge jobs to the latest job version
Open, In Progress, High, Public

Description

This will allow us to get rid of a bunch of code that tries to handle all the different versions the job spec has been through, and help us continue evolving it.

List of affected tools:

actrial
adamant
admin
ahechtbot
air7538tools
alertlive
arkivbot
aswnbot
aw-gerrit-gitlab-bridge
bothasava
botorder
brandonbot
contribstats
croptool
csp-report
danmicholobot
dannys712-bot
deployment-calendar
dewikinews-rss
dexbot
dow
dykautobot
earwigbot
emijrpbot
erwin85
featured-content-bot
ffbot
fist
fontcdn
forrestbot
galobot
gerakibot
gerrit-reviewer-bot
h78c67c-bot
hay
hewiki-tools
highly-agitated-pages
itwiki
itwiki-scuola-italiana
jackbot
jarry-common
jorobot
kian
lists
logoscope
magnustools
maintgraph
map-of-monuments
mitmachen
mjolnir
most-wanted
nlwiki-herhaalbot
non-robot
openstack-browser
pagepile
pangolinbot1
patrocle
phabbot
phabsearchemail
phansearch
phpcs
pickme
quest
random-featured
rembot
sdbot
search-filters
sergobot-statistics
shex-simple
socksfinder
sourcemd
spur
status
svbot2
svgcheck
sz-iwbot
technischewuensche
tf-image-bot
thanatos
thanks
thesandbot
tnt-dev
toolhub-extension-demo
toolhunt-api
tools-edit-count
top25reportbot
topicmatcher
trainbow
tutor
typo-fixer
update-1lib1ref
vicbot2
video2commons
wd-flaw-finder
wdumps
welcomebot
wgmc
wiki-patrimonio
wiki-stat-portal
wikicup
wikidata-game
wikidata-todo
wikijournalbot
wikilinkbot
wikiloves
wikiprojectlist
wikivoyage
wm-domains
wmch
wmde-access
ws-cat-browser
zhmrtbot
zhwiki-teleirc

How to recreate your jobs

me@mylaptop$ ssh login.toolforge.org 


<myuser>@tools-bastion-15:~$ become <mytool>

tools.<mytool>@tools-bastion-15:~$ toolforge jobs dump recreating_jobs.yaml

tools.<mytool>@tools-bastion-15:~$ cat recreating_jobs.yaml
## Double check that all your jobs are there and correctly configured


## Now in the order you want
tools.<mytool>@tools-bastion-15:~$ toolforge jobs delete <myjob>
tools.<mytool>@tools-bastion-15:~$ toolforge jobs load recreating_jobs.yaml --job <myjob>

## Repeat for the rest of the jobs, or, to load all the remaining jobs in whichever order
tools.<mytool>@tools-bastion-15:~$ toolforge jobs load recreating_jobs.yaml


tools.<mytool>@tools-bastion-15:~$ toolforge jobs list -o long
## Double check everything looks ok

What we will do automatically

This deletes all the jobs and recreates them in no specific order:

tools.<mytool>@tools-bastion-15:~$ toolforge jobs dump recreating_jobs.yaml
tools.<mytool>@tools-bastion-15:~$ toolforge jobs flush
tools.<mytool>@tools-bastion-15:~$ toolforge jobs load recreating_jobs.yaml
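A slightly safer variant of the same sequence is possible (a sketch only; the added guard aborts before the flush if the dump came out empty, so a failed dump cannot wipe the jobs):

```shell
## Sketch: abort before flushing if the dump is empty or missing
toolforge jobs dump recreating_jobs.yaml
[ -s recreating_jobs.yaml ] || { echo "dump failed or empty, aborting" >&2; exit 1; }
toolforge jobs flush
toolforge jobs load recreating_jobs.yaml
```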

Related Objects

Event Timeline

dcaro triaged this task as High priority.

We did a bit of research today about the size/scope of the problem:

aborrero@tools-sgebastion-11:~$ kubectl sudo get cronjobs --all-namespaces --selector=app.kubernetes.io/managed-by=toolforge-jobs-framework | wc -l
1767
aborrero@tools-sgebastion-11:~$ kubectl sudo get cronjobs --all-namespaces --selector=app.kubernetes.io/version="1" | wc -l
402
aborrero@tools-sgebastion-11:~$ kubectl sudo get cronjobs --all-namespaces --selector=app.kubernetes.io/version="2" | wc -l
1366

aborrero@tools-sgebastion-11:~$ kubectl sudo get deploy --all-namespaces --selector=app.kubernetes.io/managed-by=toolforge-jobs-framework | wc -l
129
aborrero@tools-sgebastion-11:~$ kubectl sudo get deploy --all-namespaces --selector=app.kubernetes.io/version="1" | wc -l
22
aborrero@tools-sgebastion-11:~$ kubectl sudo get deploy --all-namespaces --selector=app.kubernetes.io/version="2" | wc -l
108

The most recent changes made to toolforge jobs load * make this a bit easier to do.
Basically, since the jobs load operation now does its comparison at the k8s level, we can simply dump and load each job that needs updating, and it will be updated automatically (or dump and load all jobs of the tool that contains the job that needs updating, and jobs load * will only update the jobs that need to be migrated to the latest version).
Listing detailed steps below:

Step 1

Create create_tools_migrations_list.sh. This script gets the names of the tools that have jobs requiring migration (both scheduled and continuous jobs):

#!/bin/bash

LOG_FILE="/tmp/tools-migration/tools_migration.log"

exec > >(tee -a "$LOG_FILE") 2>&1

OUTPUT_FILE="/tmp/tools-migration/tools_migration_list.txt"

mkdir -p "$(dirname "$OUTPUT_FILE")"

CRONJOBS=$(kubectl get cronjobs -A -l app.kubernetes.io/managed-by=toolforge-jobs-framework -l app.kubernetes.io/version=1 -o jsonpath='{range .items[*]}{.metadata.namespace}{"\n"}{end}')
DEPLOYMENTS=$(kubectl get deployments -A -l app.kubernetes.io/managed-by=toolforge-jobs-framework -l app.kubernetes.io/version=1 -o jsonpath='{range .items[*]}{.metadata.namespace}{"\n"}{end}')

ALL_NAMESPACES="$CRONJOBS"$'\n'"$DEPLOYMENTS"

TOOLS=$(echo "$ALL_NAMESPACES" | sed 's/^tool-//' | sort | uniq)

echo "$TOOLS" > "$OUTPUT_FILE"
chmod a+r "$OUTPUT_FILE"

echo "Names of tools that require migration have been written to $OUTPUT_FILE"
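The namespace-to-tool-name step above can be exercised on its own with made-up namespaces, to see what ends up in the list:

```shell
# Sample namespaces (hypothetical): the tool- prefix is stripped and duplicates collapse
printf 'tool-foo\ntool-bar\ntool-foo\n' | sed 's/^tool-//' | sort | uniq
```

This prints `bar` and `foo`, one per line, regardless of how many cronjobs and deployments a tool had.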

Step 2
Create create_tools_jobs_dump.sh. This script creates a dump of the jobs of every tool found in Step 1 by becoming the respective tool and running toolforge jobs dump -f <dump_file_path>.

#!/bin/bash

LOG_FILE="/tmp/tools-migration/tools_migration.log"

exec > >(tee -a "$LOG_FILE") 2>&1

TOOLS_MIGRATION_LIST="/tmp/tools-migration/tools_migration_list.txt"
DUMPS_DIR="/tmp/tools-migration/dumps/"

mkdir -p "$DUMPS_DIR"
chmod a+rw "$DUMPS_DIR"

if [[ ! -f "$TOOLS_MIGRATION_LIST" ]]; then
    echo "Error: $TOOLS_MIGRATION_LIST does not exist."
    exit 1
fi

while IFS= read -r tool_name; do
    if [[ -z "$tool_name" ]]; then
        continue
    fi

    dump_file="${DUMPS_DIR}${tool_name}.yaml"

    sudo -i -u "tools.${tool_name}" -- bash -c "toolforge jobs dump -f $dump_file"

    if [[ $? -eq 0 ]]; then
        echo "Successfully dumped jobs for tool_name: $tool_name"
    else
        echo "Failed to dump jobs for tool_name: $tool_name"
    fi
done < "$TOOLS_MIGRATION_LIST"

echo "All operations completed."
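The read loop's blank-line handling can be checked in isolation (tool names here are invented):

```shell
# Blank lines are skipped; every other line is treated as a tool name
printf 'foo\n\nbar\n' | while IFS= read -r tool_name; do
    [ -z "$tool_name" ] && continue
    echo "would dump jobs for: $tool_name"
done
```

Only `foo` and `bar` produce output; the empty line is silently skipped.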

Step 3
Create the migrate_tool_jobs_to_latest_version.sh script. This script takes the dumps created in Step 2, becomes the respective tool, and runs toolforge jobs load <dump_file_path>. This is really all we need to do to migrate all the jobs.

#!/bin/bash

LOG_FILE="/tmp/tools-migration/tools_migration.log"

exec > >(tee -a "$LOG_FILE") 2>&1

DUMPS_DIR="/tmp/tools-migration/dumps/"

if [[ ! -d "$DUMPS_DIR" ]]; then
    echo "Error: $DUMPS_DIR does not exist."
    exit 1
fi
chmod a+r "$DUMPS_DIR"

FILES=$(ls "$DUMPS_DIR" | grep '\.yaml$')

if [[ -z "$FILES" ]]; then
    echo "No .yaml files found in $DUMPS_DIR."
    exit 0
fi

for file in $FILES; do
    tool_name="${file%.yaml}"

    file_path="${DUMPS_DIR}${file}"

    sudo -i -u "tools.${tool_name}" -- bash -c "toolforge jobs load $file_path"

    if [[ $? -eq 0 ]]; then
        echo "Successfully loaded jobs for tool: $tool_name"
    else
        echo "Failed to load jobs for tool: $tool_name"
    fi
done

echo "All operations completed."
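The tool name is recovered from the dump filename with a suffix-stripping parameter expansion; a quick check with a hypothetical filename:

```shell
file="some-tool.yaml"
tool_name="${file%.yaml}"   # strip the .yaml suffix
echo "$tool_name"           # prints: some-tool
```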

To test the scripts:

  1. ssh login.toolforge.org and create these scripts in your home dir. Note that you need to be a Toolforge admin for this to work.
  2. sudo -i -u tools.maintain-harbor
  3. toolforge jobs flush to make sure no job exists.
  4. kubectl create -f test_cronjob_migration.yaml -f test_cont_job_migration.yaml. These files have already been created on the maintain-harbor tool. I specifically made them use the old version 1 format, so that's already done for you (you can inspect the files to make sure)
  5. now toolforge jobs list should show that these jobs have been created.
  6. write the name maintain-harbor into the file /tmp/tools-migration/tools_migration_list.txt and save it. We do this part manually because running create_tools_migrations_list.sh would populate /tmp/tools-migration/tools_migration_list.txt with the names of all the tools that need migration, which we don't want for a test.
  7. execute ./create_tools_jobs_dump.sh. You must have created this file in step 1.
  8. execute ./migrate_tool_jobs_to_latest_version.sh. You must have created this file in step 1.
  9. You can check the k8s spec of these jobs to verify that they have been updated to the latest job version.
  10. check /tmp/tools-migration/tools_migration.log for logs if there is a need.
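For step 9, one quick way to read the version label is kubectl's label-columns flag (a sketch; run after becoming the maintain-harbor tool):

```shell
# The VERSION column comes from the app.kubernetes.io/version label
kubectl get cronjobs,deployments -L app.kubernetes.io/version
```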
Raymond_Ndibe changed the task status from Open to In Progress. Apr 14 2025, 2:21 AM

btw we need to merge the patch submitted for https://phabricator.wikimedia.org/T391786 before applying this migration or else some tools will be left out.

@dcaro advised we email the tool owners and give them a chance to do this themselves, then wait about a month before doing it ourselves for the tools that didn't respond

Thanks David for that. I was a bit unsure about where to put them given that I'm not sure it's something we'd like to keep around for long after the migration is done.

I think it's ok to keep it there indefinitely, it's useful in the future to be able to review what process was used to do certain migration in case there's any issues and also helps as an example for future ones. And it does not really use much space :)

Rethinking this a bit, I think 1 month might be too much, one week might be enough, restarting a cronjob/continuous job should not be painful (would be like moving it from one worker to another).

So sending an email to cloud-announce kinda like:

We have to migrate old toolforge job definitions to the newer format, for that we will have to recreate some of the jobs for some of the tools.

This will happen automatically on <insert date> for all the tools still using the old version, note that it will force continuous jobs to restart and stop running cronjobs (they will retrigger on the next run).

If you want to manually restart your jobs in a controlled manner, you can do it earlier running:

> toolforge jobs dump > myjobs.yaml
> toolforge jobs flush
> toolforge jobs load myjobs.yaml

Then your tool will not be automatically migrated as that will already recreate it in the newer version format.

Wdyt?

I have the same opinion too. You are correct, it's almost like moving to a different node. We are not messing with their environment or anything.
We probably don't need to flush. That way jobs-api will completely ignore jobs that are up to date and only recreate those that require it.
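The no-flush path would then be just the following (a sketch; relies on load's k8s-level comparison described earlier):

```shell
## load compares against what's running, leaves up-to-date jobs alone,
## and recreates only the outdated ones
toolforge jobs dump myjobs.yaml
toolforge jobs load myjobs.yaml
```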

If we don't need to flush we might not be stopping the cronjobs at all :), that's good, let me test

  • tested -- the scripts don't actually flush, but the cronjobs still get stopped if they are running, well, good enough

Unconcerned crons will definitely not be touched at all, which is what we want. But the crons that need updating will be removed and recreated (pods and all), as we expect

Job version upgrade email draft:

[Cloud-announce] Old Toolforge Jobs Upgrade To V2 on 2025-11-20

To whom it may concern, this is to let you know that we will be upgrading all old toolforge jobs to version 2 on/after 2025-11-20.

Toolforge jobs are versioned. Our future vision for Toolforge requires that all jobs use the V2 spec, and as a result, we are dropping support for V1.

If your tool is in the attached document, we advise that you upgrade your job versions to V2. This is as simple as:
1. Dumping your jobs into a yaml with `toolforge jobs dump jobs.yaml`
2. Verifying that all your jobs have been dumped into jobs.yaml
3. Running `toolforge jobs flush` to delete the jobs from toolforge
4. Loading them back into toolforge with `toolforge jobs load jobs.yaml`

You have a grace period until 2025-11-20, after which we'll upgrade all the affected jobs ourselves.

_______________________________________________

Cloud-announce mailing list -- cloud-announce@lists.wikimedia.org

List information: https://lists.wikimedia.org/postorius/lists/cloud-announce.lists.wikimedia.org/

Affected tools:

actrial
adamant
ahechtbot
air7538tools
alertlive
arkivbot
aswnbot
aw-gerrit-gitlab-bridge
bothasava
botorder
brandonbot
contribstats
croptool
csp-report
danmicholobot
dannys712-bot
deployment-calendar
dewikinews-rss
dexbot
dow
dykautobot
earwigbot
emijrpbot
erwin85
featured-content-bot
ffbot
fist
fontcdn
forrestbot
galobot
gerakibot
gerrit-reviewer-bot
h78c67c-bot
hay
hewiki-tools
highly-agitated-pages
itwiki
itwiki-scuola-italiana
jackbot
jarry-common
jorobot
kian
lists
logoscope
magnustools
maintgraph
map-of-monuments
mitmachen
mjolnir
most-wanted
nlwiki-herhaalbot
non-robot
openstack-browser
pagepile
pangolinbot1
patrocle
phabbot
phabsearchemail
phansearch
phpcs
pickme
quest
random-featured
rembot
sdbot
search-filters
sergobot-statistics
shex-simple
socksfinder
sourcemd
spur
status
svbot2
svgcheck
sz-iwbot
technischewuensche
tf-image-bot
thanatos
thanks
thesandbot
tnt-dev
toolhub-extension-demo
toolhunt-api
tools-edit-count
top25reportbot
topicmatcher
trainbow
tutor
typo-fixer
update-1lib1ref
vicbot2
video2commons
wd-flaw-finder
wdumps
welcomebot
wgmc
wiki-patrimonio
wiki-stat-portal
wikicup
wikidata-game
wikidata-todo
wikijournalbot
wikilinkbot
wikiloves
wikiprojectlist
wikivoyage
wm-domains
wmch
wmde-access
ws-cat-browser
zhmrtbot
zhwiki-teleirc

Job version upgrade email draft:

Immediate questions based on this:

  • How is the v2 job config format different from the v1 format? (This should be documented at https://wikitech.wikimedia.org/wiki/Help:Toolforge/Running_jobs and summarized here.)
  • I also see no difference between the format of the config file generated by toolforge jobs dump and the file checked into my version control. How do I check what exactly needs changing in my config file?

This has to do with the way the job is created in Kubernetes, so a difference will not be reflected in the dumps. You know all the fields that are there as a result of the legacy k8s specs? We want to get rid of those. If you want to go digging, the easiest way is to get the k8s spec of a job and check the version number in the label.

Maybe we do need to explain what exactly will be changing, but the average user need not care about the change, since it's more on the k8s side than in the actual job spec they submit
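A concrete way to go digging (a sketch; $myjob is a placeholder for one of your job names):

```shell
# $myjob: a name taken from 'toolforge jobs list'
kubectl get cronjob "$myjob" -o yaml | grep 'app.kubernetes.io/version'
```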

Upgrade notification to individual maintainer draft

Upgrade Your Old Toolforge Jobs Version to V2 <name>

Hello <name>, you are getting this email because <toolname> - a tool you maintain - came up in the list of affected tools whose jobs need to be upgraded to V2.

To upgrade:
1. Dump your jobs into a yaml with `toolforge jobs dump jobs.yaml`
2. Verify that all your jobs have been dumped into jobs.yaml
3. Run `toolforge jobs flush` to delete the jobs from toolforge
4. Load them back into Toolforge with `toolforge jobs load jobs.yaml`

You have a grace period until 2025-11-20, after which we'll upgrade all the affected jobs ourselves.

We have scripts to do the upgrade right now if we wanted, but we thought it'd be nice to give users the chance to do it themselves. The exception is users that created jobs directly using kubectl or the k8s API. For such users, all we can do is notify them of the change, and if they fail to upgrade, then Toolforge will no longer support their specific jobs (i.e. toolforge jobs list will return nothing for them)

The original message draft talked about the "v2 job spec", which is why I assumed this was about the job configuration and not some internal implementation detail.

But if this is just about internal implementation details, why are we asking tool maintainers to care about it in the first place? IMHO in that case we should just handle it internally, like we've handled similar migrations in the past.

I personally have no preference either way; I'd just go ahead and handle it

Moved the draft to Etherpad for easy collaborative editing:
https://etherpad.wikimedia.org/p/2025-10-jobs-recreate-announce

I think we should focus on the fact that we have to restart the jobs; that might be a non-trivial thing for some of the tools (e.g. if you use an internal continuous job from a different job, you'll currently need to recreate them in order).

So giving users the opportunity to do the job recreation themselves allows them a 'no-downtime' path.

So I propose:

  • Give some time for users to recreate themselves if needed
  • Handle the leftover tools ourselves after the deadline
  • Rephrase the announce to focus less on the details of k8s manifest versions, see my proposal there

We could try to reduce the list of tools there by filtering out the ones that have no port specified, as those will have no issues with services when being recreated by us, though there's still a slight chance that there's some special startup needed for them or some other unexpected issue, so I'm ok leaving the full list as is.

fnegri changed the task status from In Progress to Open. Jan 13 2026, 5:20 PM
fnegri subscribed.

Is this something that we still want to do?

fnegri added a subscriber: Raymond_Ndibe.

Yes. It's pretty important. There seems to be more than one opinion about it though, and that kind of paralyzes me.
It also needs a review of the entire script, since we changed how the job load operation works (the previous implementation would have handled the version change automatically, but I'm not sure the current implementation will)

raymond-ndibe opened https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/259

runtime::diff_with_running_job: temp conditional to force job version upgrade from v1 -> v2

Discussed in the team meeting today, @taavi suggested we could patch the k8s objects instead of deleting and recreating the jobs. In this way we could avoid having to inform the users, and just bulk-migrate all the jobs.

Discussed with @Raymond_Ndibe, we reviewed the announcement draft and it looks like the only concern with bulk-migrating jobs without informing user was the following one:

If your tool needs some special ordering in the job creation (for example, if you have a continuous job listening on a port, and connect to it from a different job), please recreate the jobs yourself following these instructions (link to the instructions).

@Raymond_Ndibe thinks this is no longer an issue, because the update_job method has since been changed and no longer deletes and recreates a job if it's a continuous or scheduled job. Raymond is going to test this scenario in lima-kilo. If everything works as expected, we can proceed with bulk-migrating all jobs.

@Raymond_Ndibe tested this in lima-kilo and apparently even with the new version, update_job tries to update the job in-place, but fails and falls back to deleting and recreating. However, even when deleting and recreating the job, I could not reproduce a scenario where recreating two jobs in the wrong order causes them to misbehave (full discussion in GitLab).

I think that we can proceed and bulk-migrate all jobs, but to stay on the safe side we can wait until @dcaro is back (in 2 weeks).