
Audit tools memory requests vs actual usage
Open, In Progress, Medium, Public

Description

In the parent task we discussed high-load alerts for the Toolforge k8s cluster. One of these is memory requests as a percentage of total allocatable memory: going over 100% means new pods can't be scheduled.

I did an audit of tools based on k8s namespace memory here: https://w.wiki/Jzk9 with this query:

(
  label_replace(
    sum by(container_label_io_kubernetes_pod_namespace) (
      container_memory_working_set_bytes{container_label_io_cri_containerd_kind!="sandbox"}
    ),
    "namespace", "$1", "container_label_io_kubernetes_pod_namespace", "(.*)"
  )
)
/ on(namespace)
sum by(namespace) (
  kube_pod_container_resource_requests{resource="memory"}
)
* 100

Today's results are at P89886. I then did a cumulative frequency distribution graph for the data:

2026-03-19-112544_1603x790_scrot.png (790×1,603 px, 73 KB)

Of immediate note: 50% of namespaces/tools use less than 20% of their requested memory, i.e. we could reduce their requests by 4-5x.
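
As a rough illustration of how one point of the cumulative distribution can be derived from the query output, the Python sketch below computes the share of namespaces at or below a given usage percentage. The sample values and the function name fraction_at_or_below are made up for illustration; the real data is in P89886.

```python
from bisect import bisect_right


def fraction_at_or_below(percentages, threshold):
    """Return the share of values <= threshold (one point of the empirical CDF)."""
    ordered = sorted(percentages)
    if not ordered:
        raise ValueError("no data points")
    return bisect_right(ordered, threshold) / len(ordered)


# Hypothetical usage/request percentages, one per namespace (not real data).
sample = [5.0, 12.0, 18.0, 19.5, 35.0, 60.0, 95.0, 140.0]

share = fraction_at_or_below(sample, 20.0)
print(f"{share:.0%} of namespaces use at most 20% of their memory request")

# A namespace at 20% usage would exactly fill a request reduced by 5x;
# reducing by 4x instead leaves it at 80% of the new request.
for factor in (4, 5):
    print(f"{factor}x reduction: usage becomes {20.0 * factor:.0f}% of the new request")
```

Values above 100 in the sample stand for namespaces that already use more than they request, which is allowed as long as they stay under their limits.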

Overview

From the audit of which tools are running with default values (P89954), it seems most webservice tools are not overriding the defaults. I have decided to focus on those first as the lowest-hanging fruit. The plan is to go from 256MB requests to 128MB first, then assess the situation both in overall cluster behaviour and for individual tools. Assuming all goes well, we'll move to requesting 64MB by default, while leaving limits untouched.

For reference: over the last 7 days (Apr 17-24), about 2200 tool containers never went above a 64MB working set size (container_memory_working_set_bytes), while about 1300 went over 64MB, about 700 over 128MB and about 340 over 256MB.
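
To put these counts in proportion, here is a small Python sketch. It assumes the "over N" counts are cumulative (the ~1300 containers over 64MB include the ~700 over 128MB, which in turn include the ~340 over 256MB). The text does not state this explicitly, so the resulting total of about 3500 containers is an assumption.

```python
# Approximate counts quoted above for Apr 17-24 (peak working set per container).
never_above_64 = 2200
over = {64: 1300, 128: 700, 256: 340}  # assumed cumulative, see the note above

total = never_above_64 + over[64]


def share_within(request_mb):
    """Share of containers whose peak working set stayed at or below request_mb."""
    return (total - over[request_mb]) / total


for request_mb in (256, 128, 64):
    print(f"request {request_mb}MB: {share_within(request_mb):.0%} of containers stay within it")
```

Exceeding a request does not by itself get a container killed, since limits stay unchanged; pods above their requests are, however, the first candidates for eviction when a node runs short on memory.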

Deployment

I was thinking of the following deployment plan:

  1. New package is available
  2. Install in toolsbeta, launch a webservice and verify memory requests
  3. Deploy to tools
  4. Restart a sample/chosen webservice with defaults, verify it comes back with adjusted memory request
  5. Roll-restart webservices using default requests listed in P89954 (i.e. at the bottom), using the script in P91435 with --cutoff-timestamp set to the start of this work
  6. Keep jobs monitored for restarts/oomkills (kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}) and kubelet evicting pods (increase(kubelet_evictions[5m])) via this dashboard https://grafana.wmcloud.org/d/fir9gwd/filippo-global-tools-stats
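
The selection logic behind step 5 can be sketched in Python as follows. This is not the script in P91435: the Webservice class, the needs_restart helper, the tool names and the cutoff value are all hypothetical, and the interpretation of --cutoff-timestamp (skip pods that started after it) is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Webservice:
    tool: str
    uses_default_requests: bool
    pod_started_at: datetime


def needs_restart(ws, cutoff):
    """Restart only default-request webservices whose pod predates the cutoff."""
    return ws.uses_default_requests and ws.pod_started_at < cutoff


# Example cutoff only; the real value would be the start of the rollout.
cutoff = datetime(2026, 5, 6, 8, 0, tzinfo=timezone.utc)

candidates = [
    Webservice("tool-a", True, datetime(2026, 5, 1, 12, 0, tzinfo=timezone.utc)),
    # Started after the cutoff, so it already runs with the new request.
    Webservice("tool-b", True, datetime(2026, 5, 6, 9, 30, tzinfo=timezone.utc)),
    # Custom requests: not affected by a change of defaults.
    Webservice("tool-c", False, datetime(2026, 4, 1, 0, 0, tzinfo=timezone.utc)),
]

to_restart = [ws.tool for ws in candidates if needs_restart(ws, cutoff)]
print(to_restart)
```

Under this interpretation a roll-restart can be interrupted and resumed safely, because anything restarted since the cutoff is skipped on the next run.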

Event Timeline

fnegri triaged this task as Medium priority. Mar 19 2026, 11:25 AM

Of immediate note the fact that 50% of namespaces/tools use less than 20%, i.e. we could be reducing their requests by 4-5x

Can we check how many of the ones that use more memory are actually setting the limits themselves vs using the default values?
It would be great if we can reduce the defaults without needing any user to add extra config to their tools.

Great question, I spent some time thinking about how we would go about answering that. I'm getting familiar with Toolforge and its interactions with k8s, so I may be off here!

I ran an audit on all Deployment objects (i.e. tjf + webservice) and compared limits (either of the pod template or the first container in the pod) with the tjf or webservice defaults. A summary looks like this:

Total: 2306 | customized: 957 | default: 1265 | no-manager: 42 | not-set: 24 | unknown-manager: 18

The interesting numbers are default and customized: deployments that use the tjf or webservice defaults (cpu/mem/replicas), and those that do not, respectively.
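
For a quick sense of proportion, the Python sketch below parses the summary line above and prints each category's share of the total. It only re-reads the numbers already shown and is not part of the audit code in P89953.

```python
summary = (
    "Total: 2306 | customized: 957 | default: 1265 | "
    "no-manager: 42 | not-set: 24 | unknown-manager: 18"
)

# Split "name: value" fields into a dict of integer counts.
counts = {}
for field in summary.split("|"):
    name, value = field.split(":")
    counts[name.strip()] = int(value)

total = counts.pop("Total")
# The five categories should account for every Deployment in the audit.
assert sum(counts.values()) == total

# Largest category first.
for name, value in sorted(counts.items(), key=lambda item: -item[1]):
    print(f"{name:>16}: {value:5d} ({value / total:.1%})")
```

Roughly 55% of the audited Deployments run with defaults, so a change of defaults reaches a bit more than half of them without any action from maintainers.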

The code (implementation by Claude, prompts by me) is in P89953, while the full report is in P89954.

Please note that I may have gotten the logic wrong on how limits/requests are set by tjf/webservice! Let me know what you think.

Thinking about this problem a little more: we would be lowering the default memory request while leaving the limit untouched, so I think it should be safe to do. Many tools already exceed their requests today without hitting the limit. I'm for testing a 128MB default memory request and taking it from there. What do you think?

I'm ok with testing a 128MB default request instead of the current 256MB. Worst case, we can revert it if we see this causes any issues. The only one I can think of is that it could lead to many more pods being scheduled on a single worker, and they would no longer be able to grow to their "limit" value.
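
The concern about pods no longer being able to grow to their limit can be made concrete with a small Python sketch. The worker size (16GiB allocatable) and the 512Mi limit are hypothetical values chosen for the arithmetic; only the 256MB to 128MB request change comes from this discussion.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def worst_case_overcommit(allocatable, request, limit):
    """Pods that fit by request, and the ratio of their summed limits to allocatable."""
    pods = allocatable // request
    return pods, pods * limit / allocatable


# Hypothetical worker and limit sizes, for illustration only.
allocatable = 16 * GIB
limit = 512 * MIB

for request in (256 * MIB, 128 * MIB):
    pods, ratio = worst_case_overcommit(allocatable, request, limit)
    print(f"request {request // MIB}Mi: up to {pods} pods, limits sum to {ratio:.0f}x allocatable")
```

Halving the request doubles how many pods the scheduler may pack onto a worker, and with it the gap between what pods are allowed to use and what the worker actually has.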

Changing the default is easy, it's in kubernetes.py#L243 for webservice and in models.py#L52 for jobs.

Updating running tools to use the new defaults is trickier, but we can think about it later. Even if we ignore running tools and only apply the new values to new jobs and webservices, we should see a decrease in the total memory requests.

Note that for jobs there is only one value, JOB_DEFAULT_MEMORY = "512Mi", which is used as the memory limit; the memory request is set to limit/2. If we want to keep the current limit at 512 we also need to change the logic in jobs.py#L324.
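
A minimal sketch of what decoupling the job request from the limit could look like, in Python. This is not the code in jobs.py: parse_mi and job_memory are hypothetical helpers, and only JOB_DEFAULT_MEMORY = "512Mi" and the request = limit/2 rule are taken from the description above.

```python
MIB = 1024 * 1024
JOB_DEFAULT_MEMORY = "512Mi"  # value quoted above; used as the memory limit


def parse_mi(quantity):
    """Parse a quantity such as '512Mi' into bytes (only the Mi suffix is handled)."""
    if not quantity.endswith("Mi"):
        raise ValueError(f"unsupported quantity: {quantity}")
    return int(quantity[:-2]) * MIB


def job_memory(limit=JOB_DEFAULT_MEMORY, request=None):
    """Return (request_bytes, limit_bytes).

    With request=None this mirrors the behaviour described above
    (request = limit / 2). An explicit request decouples the two values.
    """
    limit_bytes = parse_mi(limit)
    request_bytes = parse_mi(request) if request is not None else limit_bytes // 2
    if request_bytes > limit_bytes:
        raise ValueError("request cannot exceed limit")
    return request_bytes, limit_bytes


print(job_memory())                 # request derived from the limit
print(job_memory(request="128Mi"))  # lower request, limit still 512Mi
```

With a single constant, lowering the job request to 128Mi would also drag the limit down to 256Mi; a separate request value avoids that.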

I started from webservice-cli limits, and was thinking of the following deployment plan:

  1. New package is available
  2. Install in toolsbeta, launch a webservice and verify memory requests
  3. Deploy to tools
  4. Restart a sample/chosen webservice with defaults, verify it comes back with adjusted memory request
  5. Roll-restart webservices using default requests listed in P89954 (i.e. at the bottom), using the script in P91435 with --cutoff-timestamp set to the start of this work
  6. Keep jobs monitored for restarts/oomkills (kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}) and kubelet evicting pods (increase(kubelet_evictions[5m]))

Change #1277065 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] kubeadm: quote kubectl arguments

https://gerrit.wikimedia.org/r/1277065

Change #1277065 merged by Filippo Giunchedi:

[operations/puppet@production] kubeadm: quote kubectl arguments

https://gerrit.wikimedia.org/r/1277065

fnegri changed the task status from Open to In Progress. Tue, Apr 28, 4:26 PM
fnegri assigned this task to fgiunchedi.
fnegri moved this task from In review to In progress on the tools-platform-team board.

Assigning to @fgiunchedi as he's actively working on this. This announcement was sent today:

The first phase will involve changing requests from 256MB to 128MB on Wed May 6th starting at 8 UTC, and from 128MB to 64MB on Tue May 12th starting at 8 UTC. We will be restarting webservice tools as part of the deployment and no action is required on tool maintainers' part.

Mentioned in SAL (#wikimedia-cloud) [2026-05-13T08:46:37Z] <godog> restart sample webservices with new memory requests https://phabricator.wikimedia.org/P92497 - T420565

Mentioned in SAL (#wikimedia-cloud) [2026-05-13T12:07:02Z] <godog> resume restarting webservices using default memory requests - T420565

The first reduction in default memory requests has been deployed. As expected, we're now under the alerting threshold for memory requests (from ~88% to ~76%).

2026-05-14-084827_3780x1616_scrot.png (1,616×3,780 px, 242 KB)

I have not observed any ill effects so far, and the distribution of % memory used by pods relative to their requests has moved right (https://grafana.wmcloud.org/d/fir9gwd/filippo-tools-memory-requests-overview):

2026-05-14-084515_2710x854_scrot.png (854×2,710 px, 42 KB)

2026-05-14-084531_2704x852_scrot.png (852×2,704 px, 41 KB)