Request increased quota for cluebotng-review Toolforge tool
Closed, ResolvedPublic

Description

Tool Name: cluebotng-review
Quota increase requested: +8.0 cpu, +8Gi memory, +9 pod
Reason:

We have 5 continuous jobs + 1 webservice job running with the default quota. Due to the behaviour of jobs-api, reducing the requested resources consumes up to double the quota.

Standing cpu quota usage: 50m * 6 = 300m
Standing memory quota usage: 0.5 * 6 = 3.0

As these continuous jobs are Deployment objects in kubernetes, in cases such as health check failures the usage can be doubled, because a new Pod is created before the old Pod is torn down.

In addition to the standing usage, we have 14 scheduled jobs. The schedule is configured to try to minimise conflicts, but depending on how long each job takes they can overlap.

3 jobs are likely to overlap (run often), the rest less likely.

This gets us into a position where over half the quota (4.5) is in use under normal conditions, and up to 7.5 while the runtime (kubernetes) is cycling pods.
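For anyone checking the numbers, here is a rough sketch of one reading of the arithmetic that reproduces the 4.5 and 7.5 figures. It assumes the 0.5 per-job figure applies to scheduled jobs as well, that only the 6 standing jobs are doubled while pods are cycled, and that the quota is the 8.0 limit reported by `toolforge jobs quota`. The names are made up for illustration.

```python
# Figures from the description; the interpretation is an assumption.
PER_JOB = 0.5              # default per-job usage (same figure for cpu and memory)
STANDING_JOBS = 6          # 5 continuous jobs + 1 webservice
OVERLAPPING_SCHEDULED = 3  # scheduled jobs likely to overlap
QUOTA = 8.0                # limit shown by `toolforge jobs quota`


def usage(standing_jobs, scheduled_jobs, cycling=False):
    """Quota consumed; while cycling, each standing pod briefly exists twice."""
    standing = standing_jobs * PER_JOB
    if cycling:
        standing *= 2
    return standing + scheduled_jobs * PER_JOB


standing = usage(STANDING_JOBS, 0)
normal = usage(STANDING_JOBS, OVERLAPPING_SCHEDULED)
worst = usage(STANDING_JOBS, OVERLAPPING_SCHEDULED, cycling=True)

print(f"standing={standing} normal={normal} worst={worst} quota={QUOTA}")
print(f"headroom while cycling: {QUOTA - worst}")
```

Under these assumptions the headroom while cycling is a single default-sized job, which is why ad-hoc jobs or extra replicas tip the tool over the quota.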

On the occasions where ad-hoc jobs are required, more replicas are needed to scale up, or scheduled jobs overlap, we quickly run into quota issues.

Event Timeline

With the default memory/cpu:

tools.cluebotng-review@tools-bastion-13:~$ toolforge jobs list
+-------------------------------------------+------------------------+-----------------------------------------------+
|                 Job name:                 |       Job type:        |                    Status:                    |
+-------------------------------------------+------------------------+-----------------------------------------------+
|        add-dangling-edits-to-group        | schedule: 13 21 * * *  |          Waiting for scheduled time           |
|            add-edits-to-queue             |  schedule: 13 6 * * *  |          Waiting for scheduled time           |
|            add-reported-edits             |  schedule: 55 * * * *  |          Waiting for scheduled time           |
|          add-reviews-from-huggle          | schedule: 23 */2 * * * |          Waiting for scheduled time           |
|          add-reviews-from-report          |  schedule: 15 * * * *  |          Waiting for scheduled time           |
|              backup-database              | schedule: 45 */2 * * * |          Waiting for scheduled time           |
|           cleanup-user-records            |  schedule: 13 1 * * *  |          Waiting for scheduled time           |
|             export-statistics             |  schedule: 13 9 * * *  |          Waiting for scheduled time           |
| grant-review-access-from-wikipedia-rights |  schedule: 27 * * * *  |          Waiting for scheduled time           |
|           import-training-data            |  schedule: 15 2 * * *  |          Waiting for scheduled time           |
|           mark-edits-as-deleted           |  schedule: 13 4 * * *  |          Waiting for scheduled time           |
|         mark-edits-as-having-data         |  schedule: 13 3 * * *  |          Waiting for scheduled time           |
|               prune-backups               |  schedule: 30 5 * * *  |          Waiting for scheduled time           |
|        update-edit-classifications        | schedule: 30 */2 * * * |          Waiting for scheduled time           |
|               celery-flower               |       continuous       |                    Running                    |
|               celery-worker               |       continuous       |                    Running                    |
|            cluebotng-reviewer             |       continuous       | Unable to start, out of quota for cpu, memory |
|               grafana-alloy               |       continuous       |                    Running                    |
|                 irc-relay                 |       continuous       |                    Running                    |
|                   redis                   |       continuous       |                    Running                    |
+-------------------------------------------+------------------------+-----------------------------------------------+
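To see which scheduled jobs "run often", the schedules in the listing can be turned into runs per day. This is only a sketch: `runs_per_day` is a hypothetical helper that handles just the three hour-field forms that appear above (`*`, `*/N`, and a fixed hour), not general cron syntax.

```python
# Schedules copied from the `toolforge jobs list` output above.
SCHEDULES = {
    "add-dangling-edits-to-group": "13 21 * * *",
    "add-edits-to-queue": "13 6 * * *",
    "add-reported-edits": "55 * * * *",
    "add-reviews-from-huggle": "23 */2 * * *",
    "add-reviews-from-report": "15 * * * *",
    "backup-database": "45 */2 * * *",
    "cleanup-user-records": "13 1 * * *",
    "export-statistics": "13 9 * * *",
    "grant-review-access-from-wikipedia-rights": "27 * * * *",
    "import-training-data": "15 2 * * *",
    "mark-edits-as-deleted": "13 4 * * *",
    "mark-edits-as-having-data": "13 3 * * *",
    "prune-backups": "30 5 * * *",
    "update-edit-classifications": "30 */2 * * *",
}


def runs_per_day(schedule):
    """Runs per day for the simple 'minute hour * * *' forms used here."""
    _minute, hour, *_rest = schedule.split()
    if hour == "*":
        return 24
    if hour.startswith("*/"):
        return len(range(0, 24, int(hour[2:])))
    return 1


frequency = {name: runs_per_day(s) for name, s in SCHEDULES.items()}
hourly = sorted(name for name, runs in frequency.items() if runs == 24)
print(hourly)
```

Three jobs come out as hourly, which may be the three the description calls likely to overlap, though the task does not name them.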

According to the tooling everything is dandy:

tools.cluebotng-review@tools-bastion-13:~$ toolforge jobs quota
Running jobs                                  Used    Limit
--------------------------------------------  ------  -------
Total running jobs at once (Kubernetes pods)  8       16
Running one-off and cron jobs                 0       15
CPU                                           4.0     8.0
Memory                                        4.0Gi   8.0Gi

Per-job limits    Used    Limit
----------------  ------  -------
CPU                       3.0
Memory                    6.0Gi

Job definitions                             Used    Limit
----------------------------------------  ------  -------
Cron jobs                                     14       50
Continuous jobs (including web services)       7       16

But really no:

tools.cluebotng-review@tools-bastion-13:~$ kubectl describe quota
Name:                   tool-cluebotng-review
Namespace:              tool-cluebotng-review
Resource                Used    Hard
--------                ----    ----
configmaps              2       10
count/cronjobs.batch    14      50
count/deployments.apps  7       16
count/jobs.batch        0       15
limits.cpu              4       8
limits.memory           4Gi     8Gi
persistentvolumeclaims  0       0
pods                    8       16
requests.cpu            3625m   4
requests.memory         3840Mi  4Gi
secrets                 29      64
services                6       16
services.nodeports      0       0
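The gap between the two views is easier to see as remaining headroom per resource. A minimal sketch, with the used/hard values copied from the output above; `parse_quantity` is a hypothetical helper that only covers the suffixes that appear here, not the full Kubernetes quantity grammar.

```python
from decimal import Decimal

# Multipliers for the quantity suffixes seen in the output above.
SUFFIXES = {
    "Ki": Decimal(1024),
    "Mi": Decimal(1024) ** 2,
    "Gi": Decimal(1024) ** 3,
    "m": Decimal("0.001"),
}


def parse_quantity(text):
    """Convert e.g. '3625m' or '4Gi' into a Decimal (cores or bytes)."""
    for suffix, factor in SUFFIXES.items():
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * factor
    return Decimal(text)


# (used, hard) pairs copied from `kubectl describe quota`.
QUOTA_ROWS = {
    "limits.cpu": ("4", "8"),
    "limits.memory": ("4Gi", "8Gi"),
    "requests.cpu": ("3625m", "4"),
    "requests.memory": ("3840Mi", "4Gi"),
}

free = {
    name: parse_quantity(hard) - parse_quantity(used)
    for name, (used, hard) in QUOTA_ROWS.items()
}

MI = 1024 ** 2
print("limits.cpu free:", free["limits.cpu"])
print("requests.cpu free:", free["requests.cpu"])
print("requests.memory free (Mi):", free["requests.memory"] / MI)
```

The limits are half used, which is all that `toolforge jobs quota` reports, while the requests have only a fraction of a core and a few hundred Mi left. That is less than the 0.5 per-job figure used in the description, which would explain the "out of quota for cpu, memory" status.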

x-ref T403962

+1, though this should help with the memory/cpu too: https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/74

This will help a lot. For now I have increased all the cpu/mem values slightly over the defaults to trigger https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/blob/main/tjf/runtimes/k8s/jobs.py?ref_type=heads#L297, which won't be needed with the above MR.

Generally it would be useful to have some more quota for this tool, though. We could get away with slightly less than specified above if things are very tight; happy to have that discussion as needed.

We are not currently in a tough spot resource-wise, and the request is not too big, so I think we can give the extra resources anyhow.

fnegri changed the task status from Open to In Progress. Sep 9 2025, 4:48 PM
fnegri triaged this task as Medium priority.

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-10T12:30:16Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.toolforge.component.deploy for component maintain-kubeusers (T403964)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-10T12:31:09Z] <fnegri@cloudcumin1001> END (ERROR) - Cookbook wmcs.toolforge.component.deploy (exit_code=97) for component maintain-kubeusers (T403964)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-10T12:31:21Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.toolforge.component.deploy for component maintain-kubeusers (T403964)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-10T12:45:55Z] <fnegri@cloudcumin1001> END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component maintain-kubeusers (T403964)

Quotas increased:

tools.cluebotng-review@tools-bastion-13:~$ kubectl describe quota
Name:                   tool-cluebotng-review
Namespace:              tool-cluebotng-review
Resource                Used    Hard
--------                ----    ----
configmaps              2       10
count/cronjobs.batch    14      50
count/deployments.apps  7       16
count/jobs.batch        0       15
limits.cpu              4500m   16
limits.memory           4608Mi  16Gi
persistentvolumeclaims  0       0
pods                    9       25
requests.cpu            4125m   8
requests.memory         4352Mi  8Gi
secrets                 29      64
services                6       16
services.nodeports      0       0
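As a quick cross-check, the new hard limits can be compared with the old ones and with what was requested. This is a sketch with values copied from the two `kubectl describe quota` outputs in this task (memory in Gi); mapping the requested "+8.0 cpu, +8Gi memory, +9 pod" onto `limits.cpu`, `limits.memory` and `pods` is an assumption.

```python
# Hard limits before and after the change (memory in Gi).
BEFORE = {"limits.cpu": 8, "limits.memory": 8, "pods": 16,
          "requests.cpu": 4, "requests.memory": 4}
AFTER = {"limits.cpu": 16, "limits.memory": 16, "pods": 25,
         "requests.cpu": 8, "requests.memory": 8}

# Requested increase from the task description (assumed mapping).
REQUESTED = {"limits.cpu": 8, "limits.memory": 8, "pods": 9}

delta = {name: AFTER[name] - BEFORE[name] for name in BEFORE}
granted_in_full = all(delta[name] == amount for name, amount in REQUESTED.items())

for name, change in delta.items():
    print(f"{name}: +{change}")
print("request granted in full:", granted_in_full)
```

The request quotas also doubled along with the limits, so the requests-to-limits ratio of the quota itself is unchanged.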

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-10T14:08:23Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.toolforge.component.deploy for component maintain-kubeusers (T403964)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-10T14:28:40Z] <fnegri@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component maintain-kubeusers (T403964)