Re-visit Toolforge Kubernetes default quotas (April 2023)
Closed, Resolved (Public)

Description

The Toolforge Kubernetes cluster is seeing a large number of quota errors when trying to create pods or jobs. Because these errors occur when, for example, a cronjob is launched, they are hard for tool maintainers to see.
For example, during an arbitrary 10-minute period today there were 124 errors:

taavi@tools-k8s-control-1:~$ sudo kubectl logs -n kube-system kube-controller-manager-tools-k8s-control-1 | grep quota | grep " 15:4" | wc -l
124
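
The pipeline above keeps log lines that mention "quota" and contain " 15:4" (timestamps from 15:40 to 15:49), then counts them. A minimal Python sketch of the same filter, assuming the log has already been saved as text; the function name and the sample lines are made up for illustration and are not real controller-manager output:

```python
from typing import Iterable


def count_lines_with_all(lines: Iterable[str], needles: Iterable[str]) -> int:
    """Count lines that contain every substring in `needles` (like chained greps)."""
    needles = list(needles)
    return sum(1 for line in lines if all(n in line for n in needles))


# Placeholder lines, NOT real kube-controller-manager output.
sample = [
    "x 15:41:02 exceeded quota (placeholder)",
    "x 15:41:03 unrelated message (placeholder)",
    "x 16:02:10 exceeded quota (placeholder)",
]
matches = count_lines_with_all(sample, ["quota", " 15:4"])
print(matches)
```

Only the first placeholder line satisfies both filters, so a count like this covers one 10-minute window, not the whole log.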

Most of those are tools hitting the default CPU, RAM or pod limits. T333976 exists to expose them more clearly to tool maintainers, but in the meantime we should consider raising the default quotas given how many tools are hitting them.

Details

Reference | Source Branch | Dest Branch | Author | Title
repos/cloud/toolforge/toolforge-deploy!126 | bump_maintain-kubeusers | main | taavi | maintain-kubeusers: bump to 0.0.106-20231109100226-523f62c0
repos/cloud/toolforge/maintain-kubeusers!5 | taavi/quotas | main | taavi | Make default quota configurable and increase it

Event Timeline

+1 to expand, as well as communicate this to users.

Proposal: Let's pick a reasonable default pod quota (current is 10, maybe that or 16?), and then update the default CPU and RAM quotas to match the pod quota multiplied by the jobs-api default settings for CPU and RAM.
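
The arithmetic behind this proposal can be sketched as follows. This is only an illustration: the per-pod values (500 millicores, 512 MiB) are assumed placeholders, not confirmed jobs-api defaults, and the function name is hypothetical.

```python
def default_namespace_quota(
    pod_quota: int, cpu_millicores_per_pod: int, memory_mib_per_pod: int
) -> dict:
    """Scale per-pod defaults by the pod quota to get namespace-wide totals."""
    return {
        "pods": pod_quota,
        "cpu_millicores": pod_quota * cpu_millicores_per_pod,
        "memory_mib": pod_quota * memory_mib_per_pod,
    }


# ASSUMED per-pod defaults; substitute the real jobs-api settings.
ASSUMED_CPU_MILLICORES = 500
ASSUMED_MEMORY_MIB = 512

# Compare the two pod quotas floated in the proposal.
for pods in (10, 16):
    print(pods, default_namespace_quota(pods, ASSUMED_CPU_MILLICORES, ASSUMED_MEMORY_MIB))
```

Deriving CPU and RAM from the pod quota means a tool running only default-sized pods reaches the pod limit at the same point as the CPU and RAM limits.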

Looks good to me.

Also, what about other resources, like Deployment, Service, Ingress, etc.? Do we need to refresh the default quotas for them as well?

The Deployment quota limits the webservice plus continuous jobs, so it may need a bump.
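
For reference, object-count limits like these can be expressed in a Kubernetes ResourceQuota under `spec.hard`. The sketch below builds such a manifest as a Python dict; the namespace, quota name, and all numbers are hypothetical placeholders, and it does not reflect how maintain-kubeusers actually generates quotas.

```python
import json


def object_count_quota(namespace: str, deployments: int, services: int, ingresses: int) -> dict:
    """Build a ResourceQuota manifest limiting object counts in one namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "example-object-counts", "namespace": namespace},
        "spec": {
            "hard": {
                # Object-count quota keys use the form count/<resource>.<group>.
                "count/deployments.apps": str(deployments),
                "services": str(services),
                "count/ingresses.networking.k8s.io": str(ingresses),
            }
        },
    }


# Placeholder namespace and numbers, for illustration only.
manifest = object_count_quota("tool-example", deployments=3, services=1, ingresses=1)
print(json.dumps(manifest, indent=2))
```

Because the Deployment count covers both the webservice and continuous jobs, raising it directly raises how many continuous jobs a tool can run.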

taavi changed the task status from Open to In Progress. Nov 7 2023, 2:12 PM
taavi moved this task from Next Up to In Review on the Toolforge (Toolforge iteration 02) board.