
Request increased quota for anchor-corrector Toolforge tool
Closed, Resolved, Public

Description

Tool Name: anchor-corrector
Quota increase requested: 10 continuous jobs, 3 one time jobs, 8+24Gi
Reason: Due to the need to traverse the entire history of the article and do Longest Common Subsequence comparisons, larger wikis such as en, zh, ja, de require 4GiB of memory. Smaller wikis require 2GiB of memory. As for checking the archive as a one-off task, for large wiki projects 6 GiB of memory is required. These memory requirements work fine with the old Grid Engine.

Event Timeline

JJMC89 renamed this task from Request increased quota for <Replace Me> Toolforge tool to Request increased quota for anchor-corrector Toolforge tool. Nov 3 2023, 1:55 PM
taavi subscribed.

Hi! The jobs anchor-corrector is currently running on the grid engine are cron jobs, not continuous jobs. Are you planning to change from cron jobs to continuous ones at the same time, or what will the continuous jobs quota be used for?

8+24Gi

Just to confirm: you mean up to 32Gi of total memory at the same time, with up to 6Gi for an individual job, right?

This is my current configuration, maybe you can see if it fits?

# toolforge-jobs load toolforge-jobs-cewbot.yml

# Fix broken section headings (Fix broken anchor)
- name: k8s-tools.anchor-corrector-20201008.fix-anchor.zh
  command: "node ./wikibot/routine/20201008.fix_anchor.js use_language=zh"
  image: node18
  mem: 4Gi
  # a continuous job
  continuous: true
  #no-filelog: true
  schedule: "3 * * * *"
  # wait for this normal job before loading the next
  #wait: true
  #retry: 5
  #filelog-stdout: /data/project/user/custom.out
  #filelog-stderr: /data/project/user/custom.err
  emails: onfailure
  #emails: all
  #emails: onfinish
  #emails: none
- name: k8s-tools.anchor-corrector-20201008.fix-anchor.en
  command: "node ./wikibot/routine/20201008.fix_anchor.js use_language=en"
  image: node18
  mem: 4Gi
  continuous: true
  schedule: "3 * * * *"
  emails: onfailure
- name: k8s-tools.anchor-corrector-20201008.fix-anchor.de
  command: "node ./wikibot/routine/20201008.fix_anchor.js use_language=de"
  image: node18
  mem: 4Gi
  continuous: true
  schedule: "3 * * * *"
  emails: onfailure
- name: k8s-tools.anchor-corrector-20201008.fix-anchor.simple
  command: "node ./wikibot/routine/20201008.fix_anchor.js use_language=simple"
  image: node18
  mem: 4Gi
  continuous: true
  schedule: "3 * * * *"
  emails: onfailure
- name: k8s-tools.anchor-corrector-20201008.fix-anchor.ja
  command: "node ./wikibot/routine/20201008.fix_anchor.js use_language=ja"
  image: node18
  mem: 4Gi
  continuous: true
  schedule: "3 * * * *"
  emails: onfailure
- name: k8s-tools.anchor-corrector-20201008.fix-anchor.zh.wiktionary
  command: "node ./wikibot/routine/20201008.fix_anchor.js use_project=zh.wiktionary"
  image: node18
  mem: 4Gi
  continuous: true
  schedule: "3 * * * *"
  emails: onfailure
- name: k8s-tools.anchor-corrector-20201008.fix-anchor.zh.wikibooks
  command: "node ./wikibot/routine/20201008.fix_anchor.js use_project=zh.wikibooks"
  image: node18
  mem: 4Gi
  continuous: true
  schedule: "3 * * * *"
  emails: onfailure
- name: cron-tools.anchor-corrector-20201008.fix_anchor.archives.en
  command: "node ./wikibot/routine/20201008.fix_anchor.js archives use_language=en"
  image: node18
  mem: 6Gi
  schedule: "29 0 * * 6"
  wait: true
  emails: onfailure
- name: cron-tools.anchor-corrector-20201008.fix_anchor.archives.simple
  command: "node ./wikibot/routine/20201008.fix_anchor.js archives use_language=simple"
  image: node18
  mem: 6Gi
  schedule: "29 0 * * 6"
  wait: true
  emails: onfailure
- name: cron-tools.anchor-corrector-20201008.fix_anchor.archives.zh
  command: "node ./wikibot/routine/20201008.fix_anchor.js archives use_language=zh"
  image: node18
  mem: 6Gi
  schedule: "29 0 * * 6"
  wait: true
  emails: onfailure

Yes, please help me to increase the memory capacity limit, thank you.

+1 anything to move jobs off the grid :)

@Kanashimi I've increased the quota for this tool to max 32Gi of memory at once, with up to 6Gi for a single pod.

However, there are two changes you need to make to your jobs configuration file:

  • First, continuous: true should not be set when using schedule. Please remove the continuous: true entries from your jobs.
  • Second, instead of running everything as schedule: "3 * * * *", please pick a different minute for each job to run on. This helps us manage the cluster load as things will be more spread out.

Can you please confirm everything works as expected after making those changes? Thanks!
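
The two requested changes can be checked mechanically before a config is loaded. The following is only a minimal sketch in Python and is not part of the jobs framework: it assumes the YAML has already been parsed into a list of dicts using the same keys as the config above (name, continuous, schedule), and lint_jobs is a hypothetical helper name. It compares only the minute field of each schedule, which is a simplification.

```python
from collections import defaultdict


def lint_jobs(jobs):
    """Return warnings for jobs that mix 'continuous' with 'schedule',
    and for scheduled jobs that share the same minute field."""
    warnings = []
    by_minute = defaultdict(list)
    for job in jobs:
        name = job["name"]
        schedule = job.get("schedule")
        if job.get("continuous") and schedule:
            warnings.append(f"{name}: 'continuous' and 'schedule' are both set")
        if schedule:
            # The first cron field is the minute.
            by_minute[schedule.split()[0]].append(name)
    for minute, names in by_minute.items():
        if len(names) > 1:
            warnings.append(f"minute {minute!r} is shared by: {', '.join(names)}")
    return warnings


# Shortened, hypothetical job names; the settings mirror the config above.
jobs = [
    {"name": "fix-anchor.zh", "continuous": True, "schedule": "3 * * * *"},
    {"name": "fix-anchor.en", "continuous": True, "schedule": "3 * * * *"},
    {"name": "archives.en", "schedule": "29 0 * * 6"},
]

for warning in lint_jobs(jobs):
    print(warning)
```

An empty result means neither of the two problems was found; it does not guarantee that the jobs framework will accept the file.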

Thank you! I changed the settings, and it now looks like I need a higher quota for continuous jobs.

# toolforge-jobs load ~/wikibot/wikitech/toolforge-jobs-${USER/tools./}.yml
# man toolforge-jobs

# Fix broken section headings (Fix broken anchor)
- name: k8s-tools.anchor-corrector-20201008.fix-anchor.zh
  command: "node ./wikibot/routine/20201008.fix_anchor.js use_language=zh"
  image: node18
  mem: 4Gi
  #cpu: 2
  # a continuous job
  continuous: true
  # https://phabricator.wikimedia.org/T350484
  # continuous: true should not be set when using schedule.
  #schedule: "3 * * * *"
  # wait for this normal job before loading the next
  #wait: true
  #retry: 5
  #no-filelog: true
  #filelog-stdout: /data/project/user/custom.out
  #filelog-stderr: /data/project/user/custom.err
  #emails: all
  #emails: none
  emails: onfailure
  #emails: onfinish
- name: k8s-tools.anchor-corrector-20201008.fix-anchor.en
  command: "node ./wikibot/routine/20201008.fix_anchor.js use_language=en"
  image: node18
  mem: 4Gi
  continuous: true
  emails: onfailure
- name: k8s-tools.anchor-corrector-20201008.fix-anchor.de
  command: "node ./wikibot/routine/20201008.fix_anchor.js use_language=de"
  image: node18
  mem: 4Gi
  continuous: true
  emails: onfailure
- name: k8s-tools.anchor-corrector-20201008.fix-anchor.simple
  command: "node ./wikibot/routine/20201008.fix_anchor.js use_language=simple"
  image: node18
  mem: 2Gi
  continuous: true
  emails: onfailure
- name: k8s-tools.anchor-corrector-20201008.fix-anchor.ja
  command: "node ./wikibot/routine/20201008.fix_anchor.js use_language=ja"
  image: node18
  mem: 4Gi
  continuous: true
  emails: onfailure
- name: k8s-tools.anchor-corrector-20201008.fix-anchor.zh.wiktionary
  command: "node ./wikibot/routine/20201008.fix_anchor.js use_project=zh.wiktionary"
  image: node18
  mem: 2Gi
  continuous: true
  emails: onfailure
- name: k8s-tools.anchor-corrector-20201008.fix-anchor.zh.wikibooks
  command: "node ./wikibot/routine/20201008.fix_anchor.js use_project=zh.wikibooks"
  image: node18
  mem: 2Gi
  continuous: true
  emails: onfailure

- name: cron-tools.anchor-corrector-20201008.fix_anchor.archives.en
  command: "node ./wikibot/routine/20201008.fix_anchor.js archives use_language=en"
  image: node18
  mem: 6Gi
  schedule: "29 0 * * 6"
  #wait: true
  emails: onfailure
- name: cron-tools.anchor-corrector-20201008.fix_anchor.archives.simple
  command: "node ./wikibot/routine/20201008.fix_anchor.js archives use_language=simple"
  image: node18
  mem: 6Gi
  #schedule: "29 0 * * 6"
  wait: true
  emails: onfailure
- name: cron-tools.anchor-corrector-20201008.fix_anchor.archives.zh
  command: "node ./wikibot/routine/20201008.fix_anchor.js archives use_language=zh"
  image: node18
  mem: 6Gi
  #schedule: "29 0 * * 6"
  wait: true
  emails: onfailure
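
As a rough check of the mem values in this configuration against the quota mentioned earlier in the thread (32Gi in total, 6Gi for a single pod), the arithmetic can be sketched in Python. This is an illustration under stated assumptions: it takes the worst case where every job runs at the same moment, it assumes every value is written as a whole number followed by Gi, and gib and check_memory are hypothetical helper names. How the cluster actually accounts for memory may differ.

```python
def gib(value):
    """Parse a memory string such as '4Gi' into a whole number of GiB."""
    if not value.endswith("Gi"):
        raise ValueError(f"unsupported memory value: {value!r}")
    return int(value[:-2])


def check_memory(mems, total_limit, per_job_limit):
    """Sum the memory values and compare them with both limits (in GiB)."""
    sizes = [gib(m) for m in mems]
    total = sum(sizes)
    return {
        "total": total,
        "over_total": total > total_limit,
        "over_per_job": [m for m, s in zip(mems, sizes) if s > per_job_limit],
    }


# mem values copied from the configuration above, in file order.
CONTINUOUS = ["4Gi", "4Gi", "4Gi", "2Gi", "4Gi", "2Gi", "2Gi"]
ARCHIVES = ["6Gi", "6Gi", "6Gi"]

report = check_memory(CONTINUOUS + ARCHIVES, total_limit=32, per_job_limit=6)
print(report)
```

Under these assumptions the seven continuous jobs add up to 22Gi and the three archive jobs to 18Gi, so the sum stays within 32Gi only while at most one archive job runs alongside the continuous ones.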

@Kanashimi the continuous: true directive is still present in your YAML config file.
Remove those entries or comment them out. Try this out and let us know if it works for you.

Remember to spread out the scheduled time for the jobs to run as suggested above.

@komla I want this task to run continuously, so I removed the schedule setting and left continuous: true. As for the last three tasks, they are set to run on a schedule. For my purposes, does this seem to be an appropriate setting?

@Kanashimi So you want only this job to be a continuous job? Then you are good to go.

@taavi So I think I need to increase the quota for running continuous tasks...

Ok, I bumped that quota as well. You should be all set now.

@Kanashimi once you test and this works for you, mark this ticket as resolved.

Thank you for your help! However, I'm getting this error when I run it.
I thought the CPU quota was a per-task limit; maybe I'm wrong?

# toolforge-jobs load ~/wikibot/wikitech/toolforge-jobs-${USER/tools./}.yml
ERROR: An internal error occured while executing this command.
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/toolforge_weld/api_client.py", line 117, in _make_request
    response.raise_for_status()
  File "/usr/lib/python3/dist-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: INTERNAL SERVER ERROR for url: https://api.svc.tools.eqiad1.wikimedia.cloud:30003/jobs/api/v1/jobs/

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/tjf_cli/cli.py", line 642, in op_load
    _load_job(api, job, n)
  File "/usr/lib/python3/dist-packages/tjf_cli/cli.py", line 610, in _load_job
    mount=mount,
  File "/usr/lib/python3/dist-packages/tjf_cli/cli.py", line 464, in op_run
    api.post("/jobs/", json=payload)
  File "/usr/lib/python3/dist-packages/toolforge_weld/api_client.py", line 150, in post
    return self._make_request("POST", url, **kwargs).json()
  File "/usr/lib/python3/dist-packages/toolforge_weld/api_client.py", line 130, in _make_request
    raise self.exception_handler(e) from e
tjf_cli.api.TjfCliHttpError: Failed to create a job, likely an internal bug in the jobs framework.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/tjf_cli/cli.py", line 774, in main
    run_subcommand(args=args, api=api)
  File "/usr/lib/python3/dist-packages/tjf_cli/cli.py", line 721, in run_subcommand
    op_load(api, args.file, args.job)
  File "/usr/lib/python3/dist-packages/tjf_cli/cli.py", line 646, in op_load
    raise TjfCliError(f"Failed to load job {name}") from e
tjf_cli.errors.TjfCliError: Failed to load job k8s-tools.anchor-corrector-20201008.fix-anchor.archives.en
ERROR: Please report this issue to the Toolforge admins: https://w.wiki/6Zuu
# toolforge-jobs list
Job name:                                                     Job type:             Status:
------------------------------------------------------------  --------------------  -------------------------------------
k8s-update.wikibot                                            schedule: 2 17 * * *  Waiting for scheduled time
k8s-tools.anchor-corrector-20201008.fix-anchor.de             continuous            Running
k8s-tools.anchor-corrector-20201008.fix-anchor.en             continuous            Running
k8s-tools.anchor-corrector-20201008.fix-anchor.ja             continuous            Unable to start, out of quota for cpu
k8s-tools.anchor-corrector-20201008.fix-anchor.simple         continuous            Running
k8s-tools.anchor-corrector-20201008.fix-anchor.zh             continuous            Running
k8s-tools.anchor-corrector-20201008.fix-anchor.zh.wikibooks   continuous            Unable to start, out of quota for cpu
k8s-tools.anchor-corrector-20201008.fix-anchor.zh.wiktionary  continuous            Unable to start, out of quota for cpu

This is a bug in the error handling apparently, but the error seems to be:

The CronJob "k8s-tools.anchor-corrector-20201008.fix-anchor.archives.en" is invalid: metadata.name: Invalid value: "k8s-tools.anchor-corrector-20201008.fix-anchor.archives.en": must be no more than 52 characters

Kubernetes does not support names that long.
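
A quick way to spot such names before loading a config is to compare each scheduled job's name with the 52-character limit quoted in the error message. The sketch below is an illustration in Python, not part of the jobs framework: too_long is a hypothetical helper, and applying the limit only to scheduled jobs is an assumption based on the error referring to a CronJob.

```python
MAX_CRONJOB_NAME = 52  # limit quoted in the error message above


def too_long(names, limit=MAX_CRONJOB_NAME):
    """Map each name longer than the limit to its length."""
    return {name: len(name) for name in names if len(name) > limit}


scheduled_names = [
    "k8s-tools.anchor-corrector-20201008.fix-anchor.archives.en",
    "k8s-20201008.fix-anchor.archives.en",  # a shortened variant
]

for name, length in too_long(scheduled_names).items():
    print(f"{name}: {length} characters, limit is {MAX_CRONJOB_NAME}")
```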

@taavi Will we accept longer IDs? Or do I have to shorten it?

@taavi Can you help me increase the limit for continuous jobs? The current quota is clearly not enough.

# toolforge-jobs load ~/wikibot/wikitech/toolforge-jobs-${USER/tools./}.yml
ERROR: Error: Invalid job k8s-20201008.fix-anchor.simple: Out of quota for this kind of job. Please see https://w.wiki/6YLP for details.

Yep, bumped that to 10.

@taavi Will we accept longer IDs? Or do I have to shorten it?

Unfortunately you'll have to shorten the names.

@taavi Thank you! I can now run all of these jobs.

Also I would like to ask: when I set up a job to run on a schedule and have other jobs run after it, I find that the other job gets run first. For example, in https://github.com/kanasimi/wikibot/blob/master/wikitech/toolforge-jobs-anchor-corrector.yml, "k8s-20201008.fix-anchor.archives.simple" doesn't wait for "k8s-20201008.fix-anchor.archives.en" to finish running, but starts running directly. And then I get:

ERROR: timed out 300 seconds waiting for job 'k8s-20201008.fix-anchor.archives.simple' to complete:
+-------------+----------------------------------------------------------------------------+
| Job name:   | k8s-20201008.fix-anchor.archives.simple                                    |
+-------------+----------------------------------------------------------------------------+
| Command:    | node ./wikibot/routine/20201008.fix_anchor.js archives use_language=simple |
+-------------+----------------------------------------------------------------------------+
| Job type:   | normal                                                                     |
+-------------+----------------------------------------------------------------------------+
| Image:      | node18                                                                     |
+-------------+----------------------------------------------------------------------------+
| File log:   | yes                                                                        |
+-------------+----------------------------------------------------------------------------+
| Output log: | k8s-20201008.fix-anchor.archives.simple.out                                |
+-------------+----------------------------------------------------------------------------+
| Error log:  | k8s-20201008.fix-anchor.archives.simple.err                                |
+-------------+----------------------------------------------------------------------------+
| Emails:     | onfailure                                                                  |
+-------------+----------------------------------------------------------------------------+
| Resources:  | mem: 6Gi, cpu: default                                                     |
+-------------+----------------------------------------------------------------------------+
| Mounts:     | all                                                                        |
+-------------+----------------------------------------------------------------------------+
| Retry:      | no                                                                         |
+-------------+----------------------------------------------------------------------------+
| Status:     | Running for 5m2s                                                           |
+-------------+----------------------------------------------------------------------------+
| Hints:      | Last run at 2023-11-27T21:54:13Z. Pod in 'Running' phase. State            |
|             | 'running'. Started at '2023-11-27T21:54:15Z'.                              |
+-------------+----------------------------------------------------------------------------+

How should I change the configuration file to get what I want?

@taavi Sorry, could you please help me to see if it is because of my configuration that k8s-20201008.fix-anchor.archives.simple will not wait for k8s-20201008.fix-anchor.archives.en?

@taavi I think the reason is that wait has a timeout of 5 minutes, so it can't be used to sequence long-running tasks. Please help me to set up 8 cron jobs at the same time, thank you.

wait does nothing for cron jobs. So I'm not quite sure what you're trying to do here?

@taavi I initially intended to execute the three tasks sequentially. However, I later realized that `wait` is not suitable for this purpose. Consequently, I had to request an increase in the quota for cron jobs, because I ran into the limit on the number of cron jobs allowed.
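
For readers who want the sequential behaviour without relying on wait, one possible alternative (not something proposed in this thread) is a single scheduled job whose command chains the three runs with &&. The sketch below builds such a job entry in Python. It assumes the job command is interpreted by a shell, so that && works, and the job name and the chain_commands helper are hypothetical.

```python
def chain_commands(commands):
    """Join commands so each one runs only if the previous one succeeded."""
    return " && ".join(commands)


SCRIPT = "node ./wikibot/routine/20201008.fix_anchor.js archives"

job = {
    "name": "k8s-20201008.fix-anchor.archives.all",  # hypothetical name
    "command": chain_commands(
        f"{SCRIPT} use_language={lang}" for lang in ("en", "simple", "zh")
    ),
    "image": "node18",
    "mem": "6Gi",
    "schedule": "29 0 * * 6",
    "emails": "onfailure",
}

print(job["command"])
```

Because the three runs happen one after another inside one job, only a single 6Gi pod is needed at a time and only one cron job slot is used. The trade-off is that a failure in an earlier run stops the later ones.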