toolforge-jobs: reject jobs with more resource requests than single pods can use
Closed, Resolved, Public

Description

We use Kubernetes LimitRanges to cap the resources a single pod can use, which prevents individual nodes from being overloaded. By default they look like this:

Type        Resource  Min    Max  Default Request  Default Limit  Max Limit/Request Ratio
----        --------  ---    ---  ---------------  -------------  -----------------------
Container   cpu       50m    1    150m             500m           -
Container   memory    100Mi  4Gi  256Mi            512Mi          -

The jobs framework should reject jobs that request more than those amounts (which are configurable per namespace), since otherwise those job objects will simply fail to create their pods.
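As a rough illustration of the requested check, here is a minimal Python sketch. It is not the code from the patch; the names `parse_quantity`, `validate_request`, and `LimitExceeded` are hypothetical, the parser handles only a small subset of Kubernetes quantity suffixes, and the maximums are passed in by hand, whereas a real implementation would presumably read them from the namespace's LimitRange.

```python
from decimal import Decimal

# Multipliers for a small subset of Kubernetes quantity suffixes (assumption:
# this subset is enough for the cpu/memory values shown in the table above).
_SUFFIXES = {
    "m": Decimal("0.001"),
    "k": Decimal(10) ** 3,
    "M": Decimal(10) ** 6,
    "G": Decimal(10) ** 9,
    "Ki": Decimal(2) ** 10,
    "Mi": Decimal(2) ** 20,
    "Gi": Decimal(2) ** 30,
}


class LimitExceeded(ValueError):
    """Hypothetical error raised when a request is above the per-container max."""


def parse_quantity(value: str) -> Decimal:
    """Convert a quantity such as '1200m' or '4Gi' into a plain number."""
    # Check two-letter suffixes first so that 'Mi' is not read as 'M'.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if value.endswith(suffix):
            return Decimal(value[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(value)


def validate_request(resource: str, requested: str, maximum: str) -> None:
    """Raise LimitExceeded if `requested` is larger than `maximum`."""
    if parse_quantity(requested) > parse_quantity(maximum):
        raise LimitExceeded(
            f"Requested {resource} {requested} is over maximum allowed "
            f"per container ({maximum})"
        )


# Maximums taken from the default LimitRange shown above.
validate_request("CPU", "700m", "1")        # within the limit, no error
try:
    validate_request("memory", "5Gi", "4Gi")
except LimitExceeded as err:
    print(err)
```

Comparing parsed numbers rather than raw strings matters here because `1200m` and `1` use different notations for the same unit.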

Event Timeline

Change 713040 had a related patch set uploaded (by Majavah; author: Majavah):

[cloud/toolforge/jobs-framework-api@main] ops: Validate per-container limits

https://gerrit.wikimedia.org/r/713040

Change 713040 merged by Arturo Borrero Gonzalez:

[cloud/toolforge/jobs-framework-api@main] ops: Validate per-container limits

https://gerrit.wikimedia.org/r/713040

hey @Majavah this is now live in both tools & toolsbeta. If you can give it a try to confirm it works, then we could probably close this task.

thanks! seems to work

[tools.majavah-test@tools-sgebastion-10 ~] $ toolforge-jobs run test --command "./sleep.sh" --image tf-bullseye-std --cpu 700m
[tools.majavah-test@tools-sgebastion-10 ~] $ toolforge-jobs run test --command "./sleep.sh" --image tf-bullseye-std --cpu 1200m
[toolforge-jobs] ERROR: unable to create job: "ERROR: Requested CPU 1200m is over maximum allowed per container (1)"
[tools.majavah-test@tools-sgebastion-10 ~] $ toolforge-jobs run test --command "./sleep.sh" --image tf-bullseye-std  --mem 5Gi
[toolforge-jobs] ERROR: unable to create job: "ERROR: Requested memory 5Gi is over maximumallowed per container (4Gi)"