Request increased memory quota for wd-shex-infer Toolforge tool
Closed, Resolved (Public)

Description

Tool Name: wd-shex-infer
Quota Increase Requested: limits.memory 10Gi, requests.memory 8Gi
Reason: The Grid Engine version of the tool creates jobs with -mem 8g, and if memory serves, the jobs can actually require that much memory (i.e., I don’t think I just randomly picked that number). For feature parity, I’d like to be able to create Toolforge jobs with the same amount of memory, but the current quota is limits.memory 8Gi (some of which is taken up by the webservice already) and requests.memory 4Gi.

Event Timeline

@dcaro do we have a way to automatically handle requests like this?

Not yet, no. Feel free to try creating a cookbook :), though quotas are managed through commits to the maintain-kubeusers repo:
https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes#Quota_management

dcaro claimed this task.

Done:

root@tools-k8s-control-6:~# kubectl -n tool-wd-shex-infer get resourcequotas tool-wd-shex-infer -o json | jq '.spec.hard."limits.memory"'
"10Gi"

requests.memory is now set to 5Gi, rather than the 8Gi I requested. Is this intentional?

The limit also isn’t working properly yet; from kubectl get events:

15s         Warning   FailedCreate        job/wd-shex-infer-101               Error creating: pods "wd-shex-infer-101-b25wv" is forbidden: maximum memory usage per Container is 6Gi, but limit is 8G

(I think I’ll just kubectl edit this job to unstick it and test that the rest of T320140 works, but it would be nice to have this working in general.)

Meh, doesn’t work, Kubernetes complained that the memory limit is immutable (if I understood the error message correctly).

I guess I also need the limitrange increased? At least I can see a 6Gi max there.

tools.wd-shex-infer@tools-sgebastion-10:~/www/python/src$ kubectl describe limitrange
Name:       tool-wd-shex-infer
Namespace:  tool-wd-shex-infer
Type        Resource  Min    Max  Default Request  Default Limit  Max Limit/Request Ratio
----        --------  ---    ---  ---------------  -------------  -----------------------
Container   cpu       50m    3    250m             500m           -
Container   memory    100Mi  6Gi  256Mi            512Mi          -

(I’m leaving the job alive for now, by the way, and hope that it can successfully run once the limitrange has been increased.)
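To illustrate why the pod above was rejected, here is a minimal sketch that compares the job's 8G limit with the LimitRange maximum. The helper names `to_bytes` and `fits_limitrange` are hypothetical, and only the G/Gi/M/Mi suffixes are handled; real Kubernetes quantities accept more forms than this.

```shell
# Convert a memory quantity to bytes (assumption: whole numbers, G/Gi/M/Mi only).
to_bytes() {
    q=$1
    case "$q" in
        *Gi) echo $(( ${q%Gi} * 1024 * 1024 * 1024 )) ;;
        *Mi) echo $(( ${q%Mi} * 1024 * 1024 )) ;;
        *G)  echo $(( ${q%G} * 1000 * 1000 * 1000 )) ;;
        *M)  echo $(( ${q%M} * 1000 * 1000 )) ;;
        *)   echo "$q" ;;   # no suffix: treat as plain bytes
    esac
}

# Succeeds when a container limit ($1) does not exceed the LimitRange max ($2).
fits_limitrange() {
    limit=$(to_bytes "$1")
    max=$(to_bytes "$2")
    [ "$(( limit <= max ))" -eq 1 ]
}

for max in 6Gi 10Gi; do
    if fits_limitrange 8G "$max"; then
        echo "8G fits under a $max maximum"
    else
        echo "8G exceeds a $max maximum"
    fi
done
```

Under these assumptions, 8G (8,000,000,000 bytes) is above 6Gi (6,442,450,944 bytes) but below 10Gi, which suggests that a 10Gi LimitRange maximum would let the job's pod be created.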

https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/197 was never merged. How did this happen?

Just merged it (I had forgotten to); I had deployed it from the branch as usual.

I guess I also need the limitrange increased? At least I can see a 6Gi max there.

Yep, sorry about that, we don't usually increase the limit range (though maybe we should :/, feels weird limiting).

For the requests.memory value, we currently set it to half of limits.memory. I'll have to change our quota management scripts to allow setting it to something different.
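As a rough illustration of that default, the sketch below derives a request from a limit by halving it. The function name `half_of_gi` is hypothetical and this is not the actual maintain-kubeusers logic; it only accepts whole-number Gi values and reports the result in Mi so that odd values still divide cleanly.

```shell
# Hypothetical helper: default requests.memory as half of limits.memory.
# Assumption: the input is a whole number of Gi, e.g. "10Gi".
half_of_gi() {
    gi=${1%Gi}
    echo "$(( gi * 1024 / 2 ))Mi"
}

half_of_gi 10Gi   # prints 5120Mi, i.e. 5Gi
```

This matches the values seen in this task: a limits.memory of 10Gi led to a requests.memory of 5Gi rather than the requested 8Gi.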

Updated the limitrange:

root@tools-k8s-control-6:~# kubectl -n tool-wd-shex-infer get limitrange tool-wd-shex-infer -o json | jq '.spec.limits[].max.memory'
"10Gi"

For the requests, you can work around the default by passing --mem/--cpu when creating the jobs.
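A minimal sketch of what such a job creation might look like follows. It only prints the command instead of running it; the job name, command, and image are placeholders rather than the tool's real configuration, and the helper name `print_job_cmd` is hypothetical.

```shell
# Print (do not run) a job-creation command with explicit resources.
# Assumptions: "example-job", "./run.sh" and "python3.11" are placeholders.
print_job_cmd() {
    mem=$1
    cpu=$2
    echo "toolforge jobs run example-job --command ./run.sh --image python3.11 --mem $mem --cpu $cpu"
}

print_job_cmd 8Gi 1
```

Printing the command first makes it easy to check the requested values against the namespace's quota and limitrange before actually submitting the job.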

project_1317_bot_df3177307bed93c3f34e421e26c86e38 opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/199

maintain-kubeusers: bump to 0.0.121-20240219092902-759465a7

Thanks, the updated limitrange seems to be working!

dcaro triaged this task as Medium priority. (Feb 21 2024, 10:12 AM)

@LucasWerkmeister can this task be resolved, or is there anything missing?

Mentioned in SAL (#wikimedia-cloud) [2024-03-02T12:06:32Z] <wmbot~lucaswerkmeister@tools-sgebastion-10> update config.yaml: increase job limits.memory from 6G to 8G, should be possible now (T357209)

It’s mostly working for me, but I’d still like to be able to set requests.memory higher than at the moment (which is blocked on T357881 if I understand correctly).

starting a run
Update quota for tool wd-shex-infer from version '2-T357209-2' to version '2-T357209-3'
finished run, wrote 0 new accounts, disabled 0 accounts, cleaned up 0 accounts, renewed 0 accounts, updated 1 quotas

This is live now. Sorry for the delay.

Mentioned in SAL (#wikimedia-cloud) [2024-03-08T15:15:54Z] <wmbot~lucaswerkmeister@tools-sgebastion-10> bump requests.memory to 8G (T357209 / T320140)