
[jobs-cli] jobs are being updated (deleted/created) when no changes present
Closed, Resolved | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  • Create job definitions in jobs.yaml
  • Run toolforge jobs load jobs.yaml - observe jobs being created/updated
  • Run toolforge jobs load jobs.yaml - observe jobs updated (though no changes are present)

What happens?:

Example job definitions:

tools.cluebotng@tools-bastion-13:~$ sha256sum jobs.yaml
0237c91a7e7b4208307256b9b8af519cb26e66226e27df968481813b2c2ec7b6  jobs.yaml

tools.cluebotng@tools-bastion-13:~$ toolforge jobs load jobs.yaml
INFO: loading job 'core'...
Job core updated
INFO: loading job 'bot'...
Job bot updated
INFO: loading job 'grafana-alloy'...
Job grafana-alloy updated
INFO: loading job 'irc-relay'...
Job irc-relay updated
INFO: loading job 'redis'...
Job redis updated
INFO: loading job 'report-review.import'...
Job report-review.import updated
INFO: loading job 'backup-database'...
Job backup-database updated
INFO: loading job 'prune-backups'...
Job prune-backups updated
INFO: 8 job(s) loaded successfully

Same exact config again:

tools.cluebotng@tools-bastion-13:~$ sha256sum jobs.yaml
0237c91a7e7b4208307256b9b8af519cb26e66226e27df968481813b2c2ec7b6  jobs.yaml

tools.cluebotng@tools-bastion-13:~$ toolforge jobs load jobs.yaml
INFO: loading job 'core'...
Job core updated
INFO: loading job 'bot'...
Job bot updated
INFO: loading job 'grafana-alloy'...
Job grafana-alloy updated
INFO: loading job 'irc-relay'...
Job irc-relay updated
INFO: loading job 'redis'...
Job redis updated
INFO: loading job 'report-review.import'...
Job report-review.import updated
INFO: loading job 'backup-database'...
Job backup-database updated
INFO: loading job 'prune-backups'...
Job prune-backups updated
INFO: 8 job(s) loaded successfully

What should have happened instead?:

Previously, if there were no changes present for a job, the job would not be updated (deleted/re-created). This behaviour should continue to exist today.

I also note that jobs load is very slow, taking > 1.5 min for 9 jobs.

Event Timeline

DamianZaremba renamed this task from jobs are being updated (deleted/created) when no changes present to [jobs-cli] jobs are being updated (deleted/created) when no changes present. Sep 4 2025, 4:56 PM

The slowness might be related to it recreating the jobs, since it has to wait for the pods to go away.

This might be a regression, looking

Btw. we have tests for this, so if it's a regression the tests have a bug, but still, worth checking.

What's in your jobs.yaml?

https://github.com/cluebotng/utilities/blob/main/jobs/cluebotng.yaml

or

https://github.com/cluebotng/reviewer/blob/main/configs/jobs.yaml

We generally change the underlying image and then call restart, so load is not done very often these days, but it used to be done quite a lot (on every single deploy).

It does not happen for every job, looking:

tools.wm-lol@tools-bastion-13:~$ toolforge jobs load myjobs.yaml 
INFO: loading job 'logrotate-myjob'...
Job logrotate-myjob is already up to date
INFO: 1 job(s) loaded successfully

There is no change in the runtime config:

tools.cluebotng-review@tools-bastion-13:~$ sha256sum jobs.yaml; toolforge jobs load jobs.yaml; kubectl get deployment irc-relay -oyaml > pre.yaml
2ad7dc8ab187af14c6e77d8febc146957e3d34ec6221282357b22e1c8afbdb1e  jobs.yaml
INFO: loading job 'irc-relay'...
Job irc-relay updated
INFO: loading job 'backup-database'...
Job backup-database updated
INFO: loading job 'prune-backups'...
Job prune-backups updated
INFO: 3 job(s) loaded successfully

tools.cluebotng-review@tools-bastion-13:~$ sha256sum jobs.yaml; toolforge jobs load jobs.yaml; kubectl get deployment irc-relay -oyaml > post.yaml
2ad7dc8ab187af14c6e77d8febc146957e3d34ec6221282357b22e1c8afbdb1e  jobs.yaml
INFO: loading job 'irc-relay'...
Job irc-relay updated
INFO: loading job 'backup-database'...
Job backup-database updated
INFO: loading job 'prune-backups'...
Job prune-backups updated
INFO: 3 job(s) loaded successfully

tools.cluebotng-review@tools-bastion-13:~$ diff pre.yaml post.yaml 
6c6
<   creationTimestamp: "2025-09-04T17:11:51Z"
---
>   creationTimestamp: "2025-09-04T17:12:17Z"
19,20c19,20
<   resourceVersion: "3571095480"
<   uid: 69d927af-5672-45ba-a333-6db2e28137c4
---
>   resourceVersion: "3571097449"
>   uid: 9cc7a347-ba67-47b1-968e-a607416df01b
97,98c97,98
<   - lastTransitionTime: "2025-09-04T17:11:51Z"
<     lastUpdateTime: "2025-09-04T17:11:51Z"
---
>   - lastTransitionTime: "2025-09-04T17:12:17Z"
>     lastUpdateTime: "2025-09-04T17:12:17Z"
103,104c103,104
<   - lastTransitionTime: "2025-09-04T17:11:51Z"
<     lastUpdateTime: "2025-09-04T17:11:51Z"
---
>   - lastTransitionTime: "2025-09-04T17:12:17Z"
>     lastUpdateTime: "2025-09-04T17:12:18Z"
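
The diff above only touches fields that Kubernetes rewrites whenever an object is re-created. As a rough sketch (not part of any Toolforge tooling), the same check can be automated by dropping those fields before comparing. The `pre` and `post` dictionaries below are trimmed-down stand-ins for the two manifests, and the `spec` content is made up for illustration.

```python
# Fields seen changing in the diff above; they change on every re-creation
# even when the job definition itself is identical.
VOLATILE_KEYS = {
    "creationTimestamp",
    "resourceVersion",
    "uid",
    "lastTransitionTime",
    "lastUpdateTime",
}


def strip_volatile(obj):
    """Return a copy of a parsed manifest without the volatile fields."""
    if isinstance(obj, dict):
        return {k: strip_volatile(v) for k, v in obj.items() if k not in VOLATILE_KEYS}
    if isinstance(obj, list):
        return [strip_volatile(v) for v in obj]
    return obj


def same_runtime_config(pre, post):
    """True when two manifests differ only in volatile fields."""
    return strip_volatile(pre) == strip_volatile(post)


# Trimmed stand-ins for pre.yaml and post.yaml (spec content is illustrative).
pre = {
    "metadata": {
        "name": "irc-relay",
        "creationTimestamp": "2025-09-04T17:11:51Z",
        "resourceVersion": "3571095480",
        "uid": "69d927af-5672-45ba-a333-6db2e28137c4",
    },
    "spec": {"replicas": 1},
    "status": {"conditions": [{"lastUpdateTime": "2025-09-04T17:11:51Z"}]},
}
post = {
    "metadata": {
        "name": "irc-relay",
        "creationTimestamp": "2025-09-04T17:12:17Z",
        "resourceVersion": "3571097449",
        "uid": "9cc7a347-ba67-47b1-968e-a607416df01b",
    },
    "spec": {"replicas": 1},
    "status": {"conditions": [{"lastUpdateTime": "2025-09-04T17:12:18Z"}]},
}

print(same_runtime_config(pre, post))
```

A changed `uid` with an otherwise identical result is what indicates the deployment was deleted and re-created rather than patched.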

@DamianZaremba I think I have the fix: the issue is that the cpu and memory defaults are not being stripped correctly. Sending a patch.

Entry in jobs.yaml:

- name: irc-relay
  image: tool-cluebotng-review/irc-relay:latest
  command: run-relay
  continuous: true
  cpu: '0.25'
  mem: 0.5Gi
  port: 3334/udp
  replicas: 1
  mount: none

Entry produced from dump:

- command: run-relay
  continuous: true
  cpu: '0.25'
  image: tool-cluebotng-review/irc-relay:latest
  mem: 0.5Gi
  name: irc-relay
  port: 3334/udp
  replicas: 1

The only difference is mount: none (the default for build-pack), but removing that from jobs.yaml doesn't change anything.
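
For illustration only, the comparison between the jobs.yaml entry and the dumped entry can be sketched as below. This is not the jobs-api implementation: `ASSUMED_DEFAULTS`, `normalize` and `is_up_to_date` are hypothetical names, and the only default listed is the `mount: none` one mentioned above.

```python
# Hypothetical defaults table; only `mount: none` comes from this thread,
# and the real jobs-api may track other defaults (e.g. cpu and memory).
ASSUMED_DEFAULTS = {"mount": "none"}


def normalize(job, defaults=ASSUMED_DEFAULTS):
    """Drop keys whose value equals the assumed default."""
    return {
        key: value
        for key, value in job.items()
        if not (key in defaults and defaults[key] == value)
    }


def is_up_to_date(local, dumped):
    """Compare two job definitions after stripping defaults from both."""
    return normalize(local) == normalize(dumped)


# Entry from jobs.yaml.
local = {
    "name": "irc-relay",
    "image": "tool-cluebotng-review/irc-relay:latest",
    "command": "run-relay",
    "continuous": True,
    "cpu": "0.25",
    "mem": "0.5Gi",
    "port": "3334/udp",
    "replicas": 1,
    "mount": "none",
}

# Entry produced from dump (no `mount` key).
dumped = {
    "command": "run-relay",
    "continuous": True,
    "cpu": "0.25",
    "image": "tool-cluebotng-review/irc-relay:latest",
    "mem": "0.5Gi",
    "name": "irc-relay",
    "port": "3334/udp",
    "replicas": 1,
}

print(is_up_to_date(local, dumped))
```

If defaults are stripped on only one side of the comparison, or with a different representation of the same value, two equivalent definitions compare as different and the job is re-created on every load.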

Great, I'll leave it with you... let me know if you need more info

group_203_bot_f4d95069bb2675e4ce1fff090c1c1620 opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/956

jobs-api: bump to 0.0.412-20250904183922-18515223

@DamianZaremba So this should be fixed now, can you test? (I also added the missing test to the functional tests for next time.)

Unfortunately it still appears to be happening:

tools.cluebotng-review@tools-bastion-13:~$ cat jobs.yaml; echo; for x in {1..3}; do toolforge jobs load jobs.yaml; echo; done
---
# Until `port_protocol` is supported in components (T401994)
- name: irc-relay
  image: tool-cluebotng-review/irc-relay:latest
  command: run-relay
  continuous: true
  cpu: '0.25'
  mem: 0.5Gi
  port: 3334/udp
  replicas: 1
  mount: none

# Extra jobs we run, beyond those directly related to the interface
- name: backup-database
  command: bash -c 'mkdir -p /data/project/cluebotng-review/mysql_backups/ && mysqldump --disable-ssl -h $TOOL_DB_HOST -u $TOOL_DB_USER -p$TOOL_DB_PASSWORD $TOOL_DB_SCHEMA | gzip -9 > /data/project/cluebotng-review/mysql_backups/$(date +"%Y%m%dT%H%M%S").sql.gz'
  image: mariadb
  schedule: '45 */2 * * *'
  no-filelog: true
  mount: all

- name: prune-backups
  command: bash -c 'find "/data/project/cluebotng-review/mysql_backups" -mtime +3 -delete'
  image: bookworm
  schedule: '30 5 * * *'
  no-filelog: true
  mount: all

INFO: loading job 'irc-relay'...
Job irc-relay updated
INFO: loading job 'backup-database'...
Job backup-database updated
INFO: loading job 'prune-backups'...
Job prune-backups updated
INFO: 3 job(s) loaded successfully

INFO: loading job 'irc-relay'...
Job irc-relay updated
INFO: loading job 'backup-database'...
Job backup-database updated
INFO: loading job 'prune-backups'...
Job prune-backups updated
INFO: 3 job(s) loaded successfully

INFO: loading job 'irc-relay'...
Job irc-relay updated
INFO: loading job 'backup-database'...
Job backup-database updated
INFO: loading job 'prune-backups'...
Job prune-backups updated
INFO: 3 job(s) loaded successfully

(Confirmed objects in the runtime are being re-created)

Hmm, I think the filelog option might also not be parsed correctly, looking.

I'm out of time for now, but feel free to load the jobs on cluebotng-review as above for testing; it will cause the bot to disconnect and re-connect to Libera Chat (if 'broken'), but there's nothing critical being sent there currently.

Once cluebotng-review looks fixed then testing cluebotng (same path) is also fine (preferably we minimise the amount of restarting/downtime there otherwise people notice and alert emails get sent).

fnegri triaged this task as Medium priority. Sep 5 2025, 8:38 AM

group_203_bot_f4d95069bb2675e4ce1fff090c1c1620 opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/957

jobs-api: bump to 0.0.413-20250905132653-0a493ae5

Looks like this is working as expected now

tools.cluebotng-review@tools-bastion-13:~$ toolforge jobs load jobs.yaml 
INFO: loading job 'irc-relay'...
Job irc-relay is already up to date
INFO: loading job 'backup-database'...
Job backup-database is already up to date
INFO: loading job 'prune-backups'...
Job prune-backups is already up to date
INFO: 3 job(s) loaded successfully
tools.cluebotng@tools-bastion-13:~$ toolforge jobs load jobs.yaml
INFO: loading job 'core'...
Job core is already up to date
INFO: loading job 'bot'...
Job bot is already up to date
INFO: loading job 'grafana-alloy'...
Job grafana-alloy is already up to date
INFO: loading job 'irc-relay'...
Job irc-relay is already up to date
INFO: loading job 'redis'...
Job redis is already up to date
INFO: loading job 'report-review.import'...
Job report-review.import is already up to date
INFO: loading job 'backup-database'...
Job backup-database is already up to date
INFO: loading job 'prune-backups'...
Job prune-backups is already up to date
INFO: 8 job(s) loaded successfully
dcaro assigned this task to Raymond_Ndibe.
dcaro edited projects, added Toolforge (Toolforge iteration 24); removed Toolforge.
dcaro moved this task from Next Up to Done on the Toolforge (Toolforge iteration 24) board.

Awesome :)