Tool Name: anchor-corrector
Quota increase requested: 10 continuous jobs, 3 one time jobs, 8+24Gi
Reason: Because the tool has to traverse the entire history of each article and run Longest Common Subsequence comparisons, larger wikis such as en, zh, ja, and de require 4 GiB of memory. Smaller wikis require 2 GiB. Checking the archives as a one-off task requires 6 GiB for large wiki projects. These memory requirements work fine with the old Grid Engine.
Description
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | Kanashimi | T319555 Migrate anchor-corrector from Toolforge GridEngine to Toolforge Kubernetes |
| Resolved | | komla | T350484 Request increased quota for anchor-corrector Toolforge tool |
Event Timeline
Hi! The jobs anchor-corrector is currently running on the grid engine are cron jobs, not continuous jobs. Are you planning to change from cron jobs to continuous ones at the same time, or what will the continuous jobs quota be used for?
8+24Gi
Just to confirm: you mean up to 32Gi of total memory at the same time, with up to 6Gi for an individual job, right?
This is my current configuration, maybe you can see if it fits?
# toolforge-jobs load toolforge-jobs-cewbot.yml
# Fix broken section headings (fix broken anchor)
- name: k8s-tools.anchor-corrector-20201008.fix-anchor.zh
command: "node ./wikibot/routine/20201008.fix_anchor.js use_language=zh"
image: node18
mem: 4Gi
# a continuous job
continuous: true
#no-filelog: true
schedule: "3 * * * *"
# wait for this normal job before loading the next
#wait: true
#retry: 5
#filelog-stdout: /data/project/user/custom.out
#filelog-stderr: /data/project/user/custom.err
emails: onfailure
#emails: all
#emails: onfinish
#emails: none
- name: k8s-tools.anchor-corrector-20201008.fix-anchor.en
command: "node ./wikibot/routine/20201008.fix_anchor.js use_language=en"
image: node18
mem: 4Gi
continuous: true
schedule: "3 * * * *"
emails: onfailure
- name: k8s-tools.anchor-corrector-20201008.fix-anchor.de
command: "node ./wikibot/routine/20201008.fix_anchor.js use_language=de"
image: node18
mem: 4Gi
continuous: true
schedule: "3 * * * *"
emails: onfailure
- name: k8s-tools.anchor-corrector-20201008.fix-anchor.simple
command: "node ./wikibot/routine/20201008.fix_anchor.js use_language=simple"
image: node18
mem: 4Gi
continuous: true
schedule: "3 * * * *"
emails: onfailure
- name: k8s-tools.anchor-corrector-20201008.fix-anchor.ja
command: "node ./wikibot/routine/20201008.fix_anchor.js use_language=ja"
image: node18
mem: 4Gi
continuous: true
schedule: "3 * * * *"
emails: onfailure
- name: k8s-tools.anchor-corrector-20201008.fix-anchor.zh.wiktionary
command: "node ./wikibot/routine/20201008.fix_anchor.js use_project=zh.wiktionary"
image: node18
mem: 4Gi
continuous: true
schedule: "3 * * * *"
emails: onfailure
- name: k8s-tools.anchor-corrector-20201008.fix-anchor.zh.wikibooks
command: "node ./wikibot/routine/20201008.fix_anchor.js use_project=zh.wikibooks"
image: node18
mem: 4Gi
continuous: true
schedule: "3 * * * *"
emails: onfailure
- name: cron-tools.anchor-corrector-20201008.fix_anchor.archives.en
command: "node ./wikibot/routine/20201008.fix_anchor.js archives use_language=en"
image: node18
mem: 6Gi
schedule: "29 0 * * 6"
wait: true
emails: onfailure
- name: cron-tools.anchor-corrector-20201008.fix_anchor.archives.simple
command: "node ./wikibot/routine/20201008.fix_anchor.js archives use_language=simple"
image: node18
mem: 6Gi
schedule: "29 0 * * 6"
wait: true
emails: onfailure
- name: cron-tools.anchor-corrector-20201008.fix_anchor.archives.zh
command: "node ./wikibot/routine/20201008.fix_anchor.js archives use_language=zh"
image: node18
mem: 6Gi
schedule: "29 0 * * 6"
wait: true
emails: onfailure
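For reference, the `mem` values in the configuration above can be added up and compared with the 32Gi total and 6Gi per-job figures mentioned earlier. The following is only an arithmetic sketch: the helper `parse_gi` is hypothetical, the values are copied by hand from the configuration, and how Toolforge actually accounts for memory quota (for example whether scheduled jobs that are not running count against it) is not stated in this thread.

```python
def parse_gi(value):
    """Convert a string such as '4Gi' to an integer number of GiB (Gi suffix only)."""
    if not value.endswith("Gi"):
        raise ValueError(f"unsupported memory unit: {value!r}")
    return int(value[:-2])


# 'mem' values copied from the configuration above:
# seven fix-anchor jobs at 4Gi each and three archive jobs at 6Gi each.
hourly = ["4Gi"] * 7
archives = ["6Gi"] * 3

hourly_total = sum(parse_gi(m) for m in hourly)
archive_total = sum(parse_gi(m) for m in archives)
largest_single_job = max(parse_gi(m) for m in hourly + archives)

print("fix-anchor jobs:", hourly_total, "Gi")
print("archive jobs:", archive_total, "Gi")
print("everything at once:", hourly_total + archive_total, "Gi")
print("largest single job:", largest_single_job, "Gi")
```

The last sum is a worst case that assumes every job, including the weekly archive jobs, is running at the same moment.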
@Kanashimi I've increased the quota for this tool to max 32Gi of memory at once, with up to 6Gi for a single pod.
However, there are two changes you need to make to your jobs configuration file:
- First, continuous: true should not be set when using schedule. Please remove the continuous: true entries from your jobs.
- Second, instead of running everything as schedule: "3 * * * *", please pick a different minute for each job to run on. This helps us manage the cluster load as things will be more spread out.
Can you please confirm everything works as expected after making those changes? Thanks!
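The two requested changes can also be checked mechanically before loading a jobs file. This is a minimal sketch, assuming the file has already been parsed into a list of dicts; the function `check_jobs` and the shortened sample names are hypothetical and not part of toolforge-jobs.

```python
from collections import Counter


def check_jobs(jobs):
    """Return a list of problems found in a parsed list of job definitions."""
    problems = []
    # Rule 1: 'continuous: true' should not be set together with 'schedule'.
    for job in jobs:
        if job.get("continuous") and "schedule" in job:
            problems.append(f"{job['name']}: continuous and schedule are both set")
    # Rule 2: scheduled jobs should not all start on the same minute.
    # The minute is the first field of the cron expression.
    minutes = Counter(job["schedule"].split()[0] for job in jobs if "schedule" in job)
    for minute, count in sorted(minutes.items()):
        if count > 1:
            problems.append(f"{count} jobs share minute {minute}")
    return problems


# Shortened, hypothetical sample mirroring the situation described above.
sample = [
    {"name": "fix-anchor.zh", "continuous": True, "schedule": "3 * * * *"},
    {"name": "fix-anchor.en", "continuous": True, "schedule": "3 * * * *"},
    {"name": "archives.en", "schedule": "29 0 * * 6"},
]

for problem in check_jobs(sample):
    print(problem)
```

A file that passes both checks yields an empty list, for example one continuous job without `schedule` plus scheduled jobs that each use a different minute.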
Mentioned in SAL (#wikimedia-cloud) [2023-11-08T16:18:23Z] <taavi> bump quotas per T350484#9316930
Thank you! I changed the settings so it now looks like I need to increase the number of continuous jobs.
# toolforge-jobs load ~/wikibot/wikitech/toolforge-jobs-${USER/tools./}.yml
# man toolforge-jobs
# Fix broken section headings (fix broken anchor)
- name: k8s-tools.anchor-corrector-20201008.fix-anchor.zh
command: "node ./wikibot/routine/20201008.fix_anchor.js use_language=zh"
image: node18
mem: 4Gi
#cpu: 2
# a continuous job
continuous: true
# https://phabricator.wikimedia.org/T350484
# continuous: true should not be set when using schedule.
#schedule: "3 * * * *"
# wait for this normal job before loading the next
#wait: true
#retry: 5
#no-filelog: true
#filelog-stdout: /data/project/user/custom.out
#filelog-stderr: /data/project/user/custom.err
#emails: all
#emails: none
emails: onfailure
#emails: onfinish
- name: k8s-tools.anchor-corrector-20201008.fix-anchor.en
command: "node ./wikibot/routine/20201008.fix_anchor.js use_language=en"
image: node18
mem: 4Gi
continuous: true
emails: onfailure
- name: k8s-tools.anchor-corrector-20201008.fix-anchor.de
command: "node ./wikibot/routine/20201008.fix_anchor.js use_language=de"
image: node18
mem: 4Gi
continuous: true
emails: onfailure
- name: k8s-tools.anchor-corrector-20201008.fix-anchor.simple
command: "node ./wikibot/routine/20201008.fix_anchor.js use_language=simple"
image: node18
mem: 2Gi
continuous: true
emails: onfailure
- name: k8s-tools.anchor-corrector-20201008.fix-anchor.ja
command: "node ./wikibot/routine/20201008.fix_anchor.js use_language=ja"
image: node18
mem: 4Gi
continuous: true
emails: onfailure
- name: k8s-tools.anchor-corrector-20201008.fix-anchor.zh.wiktionary
command: "node ./wikibot/routine/20201008.fix_anchor.js use_project=zh.wiktionary"
image: node18
mem: 2Gi
continuous: true
emails: onfailure
- name: k8s-tools.anchor-corrector-20201008.fix-anchor.zh.wikibooks
command: "node ./wikibot/routine/20201008.fix_anchor.js use_project=zh.wikibooks"
image: node18
mem: 2Gi
continuous: true
emails: onfailure
- name: cron-tools.anchor-corrector-20201008.fix_anchor.archives.en
command: "node ./wikibot/routine/20201008.fix_anchor.js archives use_language=en"
image: node18
mem: 6Gi
schedule: "29 0 * * 6"
#wait: true
emails: onfailure
- name: cron-tools.anchor-corrector-20201008.fix_anchor.archives.simple
command: "node ./wikibot/routine/20201008.fix_anchor.js archives use_language=simple"
image: node18
mem: 6Gi
#schedule: "29 0 * * 6"
wait: true
emails: onfailure
- name: cron-tools.anchor-corrector-20201008.fix_anchor.archives.zh
command: "node ./wikibot/routine/20201008.fix_anchor.js archives use_language=zh"
image: node18
mem: 6Gi
#schedule: "29 0 * * 6"
wait: true
emails: onfailure
@Kanashimi the continuous: true directive is still present in your YAML config file.
Remove or comment them out. Try this out and let us know if it works for you.
Remember to spread out the scheduled time for the jobs to run as suggested above.
@komla I want this task to run continuously, so I removed the schedule setting and kept continuous: true. The last three tasks are set to run on a schedule. Does this seem to be an appropriate setting for my purposes?
@Kanashimi So you want only this job to be a continuous job? Then you are good to go.
Thank you for your help! However, I'm getting this error when I run it.
I thought the CPU quota limit was supposed to be a single task limit, maybe I'm wrong?
# toolforge-jobs load ~/wikibot/wikitech/toolforge-jobs-${USER/tools./}.yml
ERROR: An internal error occured while executing this command.
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/toolforge_weld/api_client.py", line 117, in _make_request
response.raise_for_status()
File "/usr/lib/python3/dist-packages/requests/models.py", line 940, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: INTERNAL SERVER ERROR for url: https://api.svc.tools.eqiad1.wikimedia.cloud:30003/jobs/api/v1/jobs/
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/tjf_cli/cli.py", line 642, in op_load
_load_job(api, job, n)
File "/usr/lib/python3/dist-packages/tjf_cli/cli.py", line 610, in _load_job
mount=mount,
File "/usr/lib/python3/dist-packages/tjf_cli/cli.py", line 464, in op_run
api.post("/jobs/", json=payload)
File "/usr/lib/python3/dist-packages/toolforge_weld/api_client.py", line 150, in post
return self._make_request("POST", url, **kwargs).json()
File "/usr/lib/python3/dist-packages/toolforge_weld/api_client.py", line 130, in _make_request
raise self.exception_handler(e) from e
tjf_cli.api.TjfCliHttpError: Failed to create a job, likely an internal bug in the jobs framework.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/tjf_cli/cli.py", line 774, in main
run_subcommand(args=args, api=api)
File "/usr/lib/python3/dist-packages/tjf_cli/cli.py", line 721, in run_subcommand
op_load(api, args.file, args.job)
File "/usr/lib/python3/dist-packages/tjf_cli/cli.py", line 646, in op_load
raise TjfCliError(f"Failed to load job {name}") from e
tjf_cli.errors.TjfCliError: Failed to load job k8s-tools.anchor-corrector-20201008.fix-anchor.archives.en
ERROR: Please report this issue to the Toolforge admins: https://w.wiki/6Zuu
# toolforge-jobs list
| Job name: | Job type: | Status: |
|---|---|---|
| k8s-update.wikibot | schedule: 2 17 * * * | Waiting for scheduled time |
| k8s-tools.anchor-corrector-20201008.fix-anchor.de | continuous | Running |
| k8s-tools.anchor-corrector-20201008.fix-anchor.en | continuous | Running |
| k8s-tools.anchor-corrector-20201008.fix-anchor.ja | continuous | Unable to start, out of quota for cpu |
| k8s-tools.anchor-corrector-20201008.fix-anchor.simple | continuous | Running |
| k8s-tools.anchor-corrector-20201008.fix-anchor.zh | continuous | Running |
| k8s-tools.anchor-corrector-20201008.fix-anchor.zh.wikibooks | continuous | Unable to start, out of quota for cpu |
| k8s-tools.anchor-corrector-20201008.fix-anchor.zh.wiktionary | continuous | Unable to start, out of quota for cpu |
This is a bug in the error handling apparently, but the error seems to be:
The CronJob "k8s-tools.anchor-corrector-20201008.fix-anchor.archives.en" is invalid: metadata.name: Invalid value: "k8s-tools.anchor-corrector-20201008.fix-anchor.archives.en": must be no more than 52 characters
Kubernetes does not support names that long.
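Job names can be checked against this limit before loading the file. A minimal sketch, assuming the 52-character limit quoted in the error message applies to every scheduled job name; the helper `too_long` is hypothetical, and the second name is the shortened form that appears later in this thread.

```python
MAX_NAME_LENGTH = 52  # limit quoted in the CronJob error message above


def too_long(names, limit=MAX_NAME_LENGTH):
    """Return a dict mapping each over-long name to its length."""
    return {name: len(name) for name in names if len(name) > limit}


names = [
    "k8s-tools.anchor-corrector-20201008.fix-anchor.archives.en",
    "k8s-20201008.fix-anchor.archives.en",
]

for name, length in too_long(names).items():
    print(f"{name}: {length} characters, limit is {MAX_NAME_LENGTH}")
```

Only the first name is reported; dropping the `tools.anchor-corrector-` part brings the second one well under the limit.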
@taavi Can you help me increase the limit for continuous jobs? The current quota is clearly not enough.
# toolforge-jobs load ~/wikibot/wikitech/toolforge-jobs-${USER/tools./}.yml
ERROR: Error: Invalid job k8s-20201008.fix-anchor.simple: Out of quota for this kind of job. Please see https://w.wiki/6YLP for details.
taavi opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/140
maintain-kubeusers: bump deployment quota for anchor-corrector
taavi merged https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/140
maintain-kubeusers: bump deployment quota for anchor-corrector
@taavi Thank you! I can now run all of these jobs.
Also, I would like to ask: when I set up a job to run on a schedule and have other jobs run after it, I find that the other job runs first. For example, in https://github.com/kanasimi/wikibot/blob/master/wikitech/toolforge-jobs-anchor-corrector.yml, "k8s-20201008.fix-anchor.archives.simple" doesn't wait for "k8s-20201008.fix-anchor.archives.en" to finish, but starts running immediately. I then get:
ERROR: timed out 300 seconds waiting for job 'k8s-20201008.fix-anchor.archives.simple' to complete:
| Job name: | k8s-20201008.fix-anchor.archives.simple |
|---|---|
| Command: | node ./wikibot/routine/20201008.fix_anchor.js archives use_language=simple |
| Job type: | normal |
| Image: | node18 |
| File log: | yes |
| Output log: | k8s-20201008.fix-anchor.archives.simple.out |
| Error log: | k8s-20201008.fix-anchor.archives.simple.err |
| Emails: | onfailure |
| Resources: | mem: 6Gi, cpu: default |
| Mounts: | all |
| Retry: | no |
| Status: | Running for 5m2s |
| Hints: | Last run at 2023-11-27T21:54:13Z. Pod in 'Running' phase. State 'running'. Started at '2023-11-27T21:54:15Z'. |
How should I change the configuration file to get what I want?
@taavi Sorry, could you please check whether my configuration is the reason that k8s-20201008.fix-anchor.archives.simple does not wait for k8s-20201008.fix-anchor.archives.en?
@taavi I think the reason is that wait has a timeout of 5 minutes, so it can't be used to chain long-running tasks. Please help me set up 8 cron jobs at the same time, thank you.
wait does nothing for cron jobs. So I'm not quite sure what you're trying to do here?
@taavi I initially intended to execute the three tasks sequentially. However, I later realized that `wait` is not suitable for this purpose. Consequently, I had to request an increase in the cron job quota, because I ran into the limit on the number of cron jobs allowed.
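For illustration only, one alternative way to get sequential execution is a single job whose command runs the three archive tasks one after another. Nobody in this thread proposed it, and it is not a toolforge-jobs feature: the wrapper below is a hypothetical sketch, `run_sequentially` and `ARCHIVE_COMMANDS` are made-up names, and the commands are copied from the configuration above. It assumes `node` and the script path are available where the wrapper runs.

```python
import subprocess
import sys


def run_sequentially(commands):
    """Run each command in order; stop at the first failure and return its exit code."""
    for argv in commands:
        result = subprocess.run(argv)
        if result.returncode != 0:
            print(f"command failed: {' '.join(argv)}", file=sys.stderr)
            return result.returncode
    return 0


SCRIPT = "./wikibot/routine/20201008.fix_anchor.js"

# The three archive tasks from the configuration, in the intended order.
ARCHIVE_COMMANDS = [
    ["node", SCRIPT, "archives", f"use_language={lang}"]
    for lang in ("en", "simple", "zh")
]

# Usage (not executed here, since it needs node and the bot checkout):
#   exit_code = run_sequentially(ARCHIVE_COMMANDS)
```

Because the tasks would run one at a time, such a job would only ever hold one task's memory at once, at the cost of a longer total run time.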