
[tools,infra,k8s] scale up the cluster, specifically CPU
Open, In Progress, High, Public

Description

We have been hitting the limit a couple of times in the last few days; we should expand the cluster a bit.

We might also consider using a bigger VM flavor for the new workers, to give bigger jobs a better chance to run.

Things to clarify:

  • Which flavor of nodes is hitting the limit
  • How many workers to add
  • What flavor/VM size to use
  • Is it only cpu, or also mem? (Should we change the cpu/memory ratio for the worker VMs?)

Limits proposal

For defaults:

  • cpu/request -> 100m (applied already)
  • cpu/limit -> 1cpu (applied already)
  • memory/request -> 512Mi (current value)
  • memory/limit -> 512Mi (current value)

For user-set values (users can only specify --cpu or --mem; see the sketch below):

  • cpu/request = cpu/limit = user-set value
  • mem/request = mem/limit = user-set value
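
As a rough sketch of what these rules translate to in a container's resources stanza (YAML form; the --cpu 2 / --mem 1Gi values below are just an example):

# defaults (no --cpu/--mem given)
resources:
  requests: {cpu: 100m, memory: 512Mi}
  limits:   {cpu: "1",  memory: 512Mi}

# user-set values, e.g. --cpu 2 --mem 1Gi
resources:
  requests: {cpu: "2", memory: 1Gi}
  limits:   {cpu: "2", memory: 1Gi}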

Alerts proposal

  • page: If users can't schedule workloads
    • measured by something like:
sum (kube_pod_status_phase{job="k8s-pods", prometheus="k8s", phase="Pending"}) / sum (kube_pod_status_phase{job="k8s-pods", prometheus="k8s"}) > 0.1
  • page: If users' workloads are being widely killed
    • measured by the kube_pod_container_status_terminated_reason increase over time (e.g. if there's a sustained peak; values to tweak with experience)
  • warning: If the overall cluster load (cpu/mem used) is very high for a long time
    • measured over the span of a day; if either of those gets over 80%, the recommendation is to double-check and scale the cluster up (a PromQL sketch follows)
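
A possible shape for that warning rule, as a sketch only (the metric names assume cadvisor and kube-state-metrics are scraped, job/prometheus labels are omitted, and memory would be analogous with container_memory_working_set_bytes and resource="memory"):

avg_over_time(
  (
    sum(rate(container_cpu_usage_seconds_total[5m]))
    / sum(kube_node_status_allocatable{resource="cpu"})
  )[1d:5m]
) > 0.8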

Details

Related Changes in GitLab:
Title | Reference | Author | Source branch | Dest branch
maintain-kubeusers: bump to 0.0.183-20250924080007-93ad9a3f | repos/cloud/toolforge/toolforge-deploy!978 | group_203_bot_f4d95069bb2675e4ce1fff090c1c1620 | bump_maintain-kubeusers | main
quota: adapt the quota to the new default cpu | repos/cloud/toolforge/maintain-kubeusers!75 | dcaro | increase_namespace_cpu_quota | main
d/changelog: bump to 16.1.21 | repos/cloud/toolforge/jobs-cli!131 | dcaro | bump_jobs-cli | main
job_prepare_for_output: strip default cpu/mem | repos/cloud/toolforge/jobs-cli!130 | dcaro | hide_default_mem_cpu | main
runtimes.k8s.jobs.get_job_from_k8s: report the default cpu as default | repos/cloud/toolforge/jobs-api!216 | dcaro | fix_cpu_resource | main
jobs-api: bump to 0.0.416-20250923131926-45b7b4f9 | repos/cloud/toolforge/toolforge-deploy!977 | group_203_bot_f4d95069bb2675e4ce1fff090c1c1620 | bump_jobs-api | main
kubernetes.capacity: don't page yet | repos/cloud/toolforge/alerts!38 | dcaro | dont_page_on_capacity | main
resources: reduce the default cpu k8s request | repos/cloud/toolforge/jobs-api!215 | dcaro | reduce_default_cpu_request | main

Event Timeline

I added a couple new graphs to the toolforge global overview dashboard:
https://grafana-rw.wmcloud.org/d/8GiwHDL4k/infra-kubernetes-cluster-overview

(screenshot: image.png, 200 KB)

And it seems we are over-requesting cpu by >6x, so I'm thinking of lowering the default cpu request for jobs instead of scaling up the cluster.

dcaro triaged this task as High priority. Sep 18 2025, 1:34 PM
dcaro changed the task status from Open to In Progress. Sep 18 2025, 2:02 PM
dcaro moved this task from Next Up to In Progress on the Toolforge (Toolforge iteration 24) board.

In the team meeting from today we decided that we should first reduce the default cpu request according to the mean cpu usage per pod in the cluster (patch here https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/215).

I have two questions that I want to investigate:

Should we just not set cpu limits?

From what I understand (https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/ and https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/), the cpu requests are used for two things, from the docs there:

  • The CPU limit defines a hard ceiling on how much CPU time the container can use. During each scheduling interval (time slice), the Linux kernel checks to see if this limit is exceeded; if so, the kernel waits before allowing that cgroup to resume execution.
  • The CPU request typically defines a weighting. If several different containers (cgroups) want to run on a contended system, workloads with larger CPU requests are allocated more CPU time than workloads with small requests.

that "typically" is a bit annoying, but if that is true, we can just not set limits and allow users to use as much cpu as they have in the worker node, and get throttled only when the node is loaded according to their request. That sound to me like a good option though I want to test if that's the case.

What about memory limits?

Similarly, when a pod is scheduled, its request value is used to find a node that has at least that amount of memory. On the other side, when it hits its limit value, the pod gets killed by the kernel OOM killer.

Open questions I need to investigate:

  • What happens if a pod has no memory limit?
  • What happens if a pod has a high limit, but it's higher than the node free memory? Does it get killed when the node is out of memory? Does the node die?

When to alert and on what

Currently we are alerting based on the total requests versus the allocatable space available in the workers (see https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/blob/main/kubernetes/capacity.yaml?ref_type=heads), but that might not be the right metric (or not the only one).
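
For reference, that check boils down to something like the following (a sketch; the real rule lives in the capacity.yaml linked above, and the kube-state-metrics metric names here are assumptions on my part):

sum(kube_pod_container_resource_requests{resource="cpu"}) / sum(kube_node_status_allocatable{resource="cpu"})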

The goal here is to get alerted when users can't use the service, but only get a warning when the cluster is getting overloaded (imo, up for discussion).

For that:

  • If the cpu requests are bigger than the allocatable space, then users will not be able to get their pods scheduled anywhere, so they will stop running -> page?
  • If the memory request is bigger than the allocatable space, same, no scheduling happens -> page?
  • If the cpu limits are bigger than the allocatable space, nothing, pods still get scheduled, pods run, get throttled.
  • If the memory limits are bigger than the allocatable space, then nothing either? will the node throttle the pods? (pending open questions above)
  • If the worker nodes are getting CPU/load spikes, pods will still run and get scheduled around, though we should consider scaling up the cluster -> warning?
  • If the worker nodes are getting memory usage spikes, pods will start dying and user workloads stopping -> page?

@akosiaris @taavi @Andrew @fgiunchedi pinging you here as we discussed in the team meet, please share ideas, opinions, corrections, etc. (thanks! :) )

Thanks for this writeup! Couple of inline replies

In the team meeting from today we decided that we should first reduce the default cpu request according to the mean cpu usage per pod in the cluster (patch here https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/215).

I have two questions that I want to investigate:

Should we just not set cpu limits?

Probably not. At least not for the type of workloads Toolforge hosts. More on that below.

From what I understand (https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/ and https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/), the cpu requests are used for two things, from the docs there:

I think you want a s/cpu requests/cpu resources/ above. Otherwise it's confusing (to me at least).

  • The CPU limit defines a hard ceiling on how much CPU time the container can use. During each scheduling interval (time slice), the Linux kernel checks to see if this limit is exceeded; if so, the kernel waits before allowing that cgroup to resume execution.

Yes. There are some details and the infamous 512ac999 kernel patch, but otherwise this is correct. The end result is that the application gets throttled. This is evident in both the repercussions (the app is able to do substantially less work. Latency and throughput both suffer significantly) as well as metrics that are exported by the cadvisor part of the kubelet (or cadvisor independently if also installed). The metrics in prometheus are named container_cpu_cfs_throttled_periods_total and container_cpu_cfs_periods_total
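
For example, a per-container throttling ratio can be derived from those two counters (a sketch; a persistently high value means the cpu limit is too low for that workload):

sum by (namespace, pod, container) (rate(container_cpu_cfs_throttled_periods_total[5m]))
  / sum by (namespace, pod, container) (rate(container_cpu_cfs_periods_total[5m]))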

  • The CPU request typically defines a weighting. If several different containers (cgroups) want to run on a contended system, workloads with larger CPU requests are allocated more CPU time than workloads with small requests.

The key word here being contended, which is not the typical case. In any system that is not struggling, requests are not taken into account by the kernel. It's when the system is under stress that this setting, kernel-wise, starts having an impact, and it works the way you describe it. Once there isn't any free CPU left to assign, THEN the cgroups are weighted and given their quota. I would say that this is a situation you almost never want to be in, at least not for a prolonged amount of time. For one thing, it shortens the lifetime of the hardware, and it makes the system less responsive overall.

that "typically" is a bit annoying,

It's not just annoying, it's misleading IMHO, as a contended system is NOT what a lot of people would call typical. It's most certainly not something they have to deal with day-in-day-out.

but if that is true, we can just not set limits and allow users to use as much cpu as they have in the worker node, and get throttled only when the node is loaded according to their request. That sounds to me like a good option, though I want to test if that's the case.

The problem with this approach is that when 1 workload misbehaves, suddenly ALL workloads on the node suffer. In some cases, e.g. MediaWiki in the WikiKube cluster, it makes total sense. When 1 of the workloads is absolutely critical and every other workload is assisting it in achieving its role, it's OK if everything suffers when the critical one suffers. It's NOT, however, OK for an assisting workload to make the main one suffer.

However, Toolforge doesn't have a user workload that is absolutely critical with everything else being secondary. So it's not OK for a random workload to make everything else suffer. We could argue about ranking tools by their importance, but that's probably going to be an exercise in frustration for everyone either involved or impacted by this.

Now, in the best case scenario, one is able to observe the workload for an amount of time under some synthetic (or real) scenarios and come up with the best numbers for this, tailored to the workload. I don't think this is even remotely feasible for Toolforge.

Which effectively means that, in order to protect the other workloads from the misbehaving ones, you need some form of a default CPU limit for everything (except the cluster components, e.g. calico; those are so critical that if they suffer, it's an incident. Ask me how I know...). It can be high enough that it triggers rarely, say when a workload consumes 50-60% of the CPU of a node.

All of this is for the kernel side. On the Kubernetes side, things are the other way around: the scheduler ignores limits and only cares about requests. The scheduler tries to solve the problem of which node is a good enough node to place this pod on. It's a form of the discrete knapsack problem, based on this information as well as some other topological stuff like affinity/anti-affinity/tolerations/taints and device availability in the general case.
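
For a quick look at what the scheduler is working with on a given node, something like this does the job (a sketch; the node name is a placeholder):

kubectl describe node <some-worker-node> | grep -A 8 'Allocated resources'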

For Toolforge, what I think we want is to find a default value for CPU requests that matches the average CPU usage across all workloads and nodes. Specific workloads that are known to deviate a lot from this average (the ones more than 1 sigma, i.e. 1 stddev, away) can be manually set to a different value, although that's probably not required.

What about memory limits?

Similarly, when a pod is scheduled, its request value is used to find a node that has at least that amount of memory. On the other side, when it hits its limit value, the pod gets killed by the kernel OOM killer.

Open questions I need to investigate:

  • What happens if a pod has no memory limit?

It depends on whether it has a requests stanza specified or not. If it doesn't, it can consume all the memory of the node. Depending on how fast that happens, it might trigger kubernetes eviction or it might trigger the OOMKiller. What the OOMKiller will decide to kill is up to the OOMKiller. In many cases it will be the app in the pod. In other cases, it might not be as clear cut and something else might get the axe. In case eviction gets triggered, pods will start being evicted gracefully until the node is no longer under memory pressure. That will probably include evicting that pod, since by the sheer lack of both resources stanzas it falls into the BestEffort QoS class and is a prime candidate.

If it does have a requests stanza, it gets an implicit limit on the capacity of the node. The story doesn't change much. On the eviction side, the pod is in the Burstable QoS class and thus is not a prime candidate for eviction. All BestEffort pods will be evicted first and then Burstable pods.
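
For reference, the QoS classes map onto the resources stanza roughly like this (a sketch of the standard Kubernetes rules, not anything Toolforge-specific; the values are arbitrary examples):

# BestEffort: no cpu/memory requests or limits at all; first candidate for eviction
resources: {}
---
# Burstable: something set, but not requests == limits for both cpu and memory
resources:
  requests:
    memory: 512Mi
---
# Guaranteed: requests == limits for both cpu and memory; evicted last
resources:
  requests: {cpu: "1", memory: 512Mi}
  limits: {cpu: "1", memory: 512Mi}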

  • What happens if a pod has a high limit, but it's higher than the node free memory? Does it get killed when the node is out of memory? Does the node die?

First off, if there is no memory request but there is a limit, then request = limit. Which means that if that limit is above the total memory of the node, it won't be schedulable. We'll see something like the following in the kubernetes events:

50s         Warning   FailedScheduling   pod/simple-pod   0/1 nodes are available: 1 Insufficient memory. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.

Now, if there is a request defined as well and it makes sense, then the node gets overcommitted, which for memory is a bad idea (unlike for CPU). We fall into the same case as the one above. Again the Burstable QoS class, same eviction rules. On the Linux kernel side, the OOMKiller can show up if memory is consumed rapidly enough for eviction to not happen.

Let me be explicit and say: Always put a requests stanza. The scheduler needs to know what to schedule where. Otherwise, it's a free for all and while things will run, experience will be suboptimal.

When to alert and on what

Currently we are alerting based on the total requests versus the allocatable space available in the workers (see https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/blob/main/kubernetes/capacity.yaml?ref_type=heads), but that might not be the right metric (or not the only one).

The goal here is to get alerted when users can't use the service, but only get a warning when the cluster is getting overloaded (imo, up for discussion).

For that:

  • If the cpu requests are bigger than the allocatable space, then users will not be able to get their pods scheduled anywhere, so they will stop running -> page?

Most workloads will continue churning along. It's the new ones that won't be able to be scheduled. I'd argue that is not a page. It's a degraded experience, sure, but a critical alert is good enough, IMHO.

  • If the memory request is bigger than the allocatable space, same, no scheduling happens -> page?

Same argument.

  • If the cpu limits are bigger than the allocatable space, nothing, pods still get scheduled, pods run, get throttled.

That's how production runs. No alert. CPU is a heavily overcommittable resource

  • If the memory limits are bigger than the allocatable space, then nothing either? will the node throttle the pods? (pending open questions above)

There is no throttling for memory. Otherwise, yes no alerts. Rely on eviction and OOMKiller instead.

  • If the worker nodes are getting CPU/load spikes, pods will still run and get scheduled around, though we should consider scaling up the cluster -> warning?

That's a very difficult alert to put in a formula. How many spikes are OK per time period (hour, day, week, month) and how many are not? What duration?

My take is to put a reminder in a shared calendar that once a quarter someone looks at the graphs and makes a determination and suggestion to the rest of the team to increase capacity.

  • If the worker nodes are getting memory usage spikes, pods will start dying and user workloads stopping -> page?

This sounds intuitively good at first glance, but it's not. You monitor a resource to proxy for the experience of users, via the workloads. Instead, it's better to watch the API, or even better, have kube-state-metrics installed and look at kube_pod_container_status_terminated_reason with a reason of OOMKilled.
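
Concretely, that could be expressed as something like the following (a sketch; the window and threshold are placeholders to be tuned with experience):

sum(increase(kube_pod_container_status_terminated_reason{reason="OOMKilled"}[1h])) > 5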

Thanks @akosiaris, this is very helpful :)

So I propose then for limits/requests to do:

  • If the user specifies --cpu/--memory, use that as both request and limit (as the expectation is to use that, no more, no less); see the CLI sketch after this list.
  • If not, then:
    • cpu/request: use a default that is the average of the cluster (currently ~7%, that'd be 70m; we can round to 100m for now)
    • cpu/limit: use a big number, smaller than a node; we can do something like 4000m (4 cores, as the workers have 8 cores right now)
    • mem/request: use a small, but not too small, number; currently it's 256Mi (the limit is 512Mi, and we use half of it). Given that we are using ~60% of the actual memory of the cluster and the requests are around 80% full, we could make it smaller, but I think it's kinda ok already.
    • mem/limit: this one is trickier, as we know that toolforge workloads are usually very spiky here, with sudden memory usage and then quiet for a long period. Setting this to the same as the request means that we will never overcommit any node memory-wise, but also that most of the time the memory will not be used, and any user with bursty workloads will have to specify the maximum memory used manually. Currently we set it to 512Mi (double the request; well, the request is set to half the limit).
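
For completeness, explicitly pinning resources instead of relying on the defaults would look roughly like this (a sketch; the job name, image and command are placeholders):

toolforge jobs run my-job --image python3.11 --command "./my-script.sh" --cpu 1 --mem 1Gi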

Then for the alerts:

If the cpu requests are bigger than the allocatable space, then users will not be able to get their pods scheduled anywhere, so they will stop running -> page?

Most workloads will continue churning along. It's the new ones that won't be able to be scheduled. I'd argue that is not a page. It's a degraded experience, sure, but a critical alert is good enough, IMHO.

An issue here is that a lot of our workloads are cronjobs, so cronjobs being unable to trigger means most of the users don't get their workloads running, so I think that not being able to schedule new workloads is a critical enough situation to warrant a page (at least in the current situation, where we can expand the cluster at will to palliate it in the short term). We can discuss it in the team meeting too.

If the worker nodes are getting CPU/load spikes, pods will still run and get scheduled around, though we should consider scaling up the cluster -> warning?

That's a very difficult alert to put in a formula. How many spikes are OK per time period (hour, day, week, month) and how many are not? What duration?

My take is to put a reminder in a shared calendar that once a quarter someone looks at the graphs and makes a determination and suggestion to the rest of the team to increase capacity.

If you can decide by looking at the graphs, you can put it in an alert imo. Better to avoid toil. On the exact values for the alert, well, experience tells :); we start with some, see how it goes, and tweak accordingly.

If the worker nodes are getting memory usage spikes, pods will start dying and user workloads stopping -> page?

This sounds intuitively good at first glance, but it's not. You monitor a resource to proxy for the experience of users, via the workloads. Instead, it's better to watch the API, or even better, have kube-state-metrics installed and look at kube_pod_container_status_terminated_reason with a reason of OOMKilled.

Summarizing in a proposal:

  • page: If users can't schedule workloads
    • measured by the amount of requested cpu/mem vs the allocatable amount
    • we might be able to find a better metric (e.g. time waiting for scheduling or similar)
  • page: If users' workloads are being widely killed
    • measured by the kube_pod_container_status_terminated_reason increase over time (e.g. if there's a sustained peak; values to tweak with experience)
  • warning: If the overall cluster load (cpu/mem used) is very high for a long time
    • measured over the span of a day; if either of those gets over 80%, with the recommendation to scale it up

Maybe @CCiufo-WMF can add a product perspective on when to page or not? (until we have proper SLAs :fingerscrossed:)

Thank you for the great summary! I can't meaningfully comment re: k8s specifics, however:

Summarizing in a proposal:

  • page: If users can't schedule workloads
    • measured by the amount of requested cpu/mem vs the allocatable amount
    • we might be able to find a better metric (e.g. time waiting for scheduling or similar)

Yes, definitely +1 to a higher-level metric for pages, i.e. to catch more problems. If scheduling-failure metrics are easily available from the toolforge jobs api (or similar) then I think we should go for that. Otherwise an "I was asked to schedule this pod and I can't" type of signal from k8s would work too.

  • page: If users' workloads are being widely killed
    • measured by the kube_pod_container_status_terminated_reason increase over time (e.g. if there's a sustained peak; values to tweak with experience)
  • warning: If the overall cluster load (cpu/mem used) is very high for a long time
    • measured over the span of a day; if either of those gets over 80%, with the recommendation to scale it up

Thanks @akosiaris, this is very helpful :)

So I propose then for limits/requests to do:

  • If the user specifies --cpu/--memory, use that as both request and limit (as the expectation is to use that, no more, no less).
  • If not, then:
    • cpu/request: use a default that is the average of the cluster (currently ~7%, that'd be 70m; we can round to 100m for now)
    • cpu/limit: use a big number, smaller than a node; we can do something like 4000m (4 cores, as the workers have 8 cores right now)
    • mem/request: use a small, but not too small, number; currently it's 256Mi (the limit is 512Mi, and we use half of it). Given that we are using ~60% of the actual memory of the cluster and the requests are around 80% full, we could make it smaller, but I think it's kinda ok already.
    • mem/limit: this one is trickier, as we know that toolforge workloads are usually very spiky here, with sudden memory usage and then quiet for a long period. Setting this to the same as the request means that we will never overcommit any node memory-wise, but also that most of the time the memory will not be used, and any user with bursty workloads will have to specify the maximum memory used manually. Currently we set it to 512Mi (double the request; well, the request is set to half the limit).

Then for the alerts:

  • If the cpu requests are bigger than the allocatable space, then users will not be able to get their pods scheduled anywhere, so they will stop running -> page?

Most workloads will continue churning along. It's the new ones that won't be able to be scheduled. I'd argue that is not a page. It's a degraded experience, sure, but a critical alert is good enough, IMHO.

An issue here is that a lot of our workloads are cronjobs, so cronjobs being unable to trigger means most of the users don't get their workloads running, so I think that not being able to schedule new workloads is a critical enough situation to warrant a page (at least in the current situation, where we can expand the cluster at will to palliate it in the short term). We can discuss it in the team meeting too.

In this scenario, are we envisioning a situation where cpu requests > allocatable space with or without the cronjobs? If with, then I argue that what we will be seeing is a delay in starting workloads, not an inability to have the workloads run. Which again, isn't paging worthy (but it definitely needs an alert, probably critical). If without, I have to ask what the scenarios are that would trigger this. Lost so many nodes that we are out of capacity? Somehow we scheduled so many non-transient tools that

If the worker nodes are getting CPU/load spikes, pods will still run and get scheduled around, though we should consider scaling up the cluster -> warning?

That's a very difficult alert to put in a formula. How many spikes are OK per time period (hour, day, week, month) and how many are not? What duration?

My take is to put a reminder in a shared calendar that once a quarter someone looks at the graphs and makes a determination and suggestion to the rest of the team to increase capacity.

If you can decide by looking at the graphs, you can put it in an alert imo. Better to avoid toil. On the exact values for the alert, well, experience tells :); we start with some, see how it goes, and tweak accordingly.

Can you though? Cause what I meant is that it's a judgement call, aka something that can not be automated easily. If you can find a way to automate a judgement call that can tell apart 10 short (a few minutes) spikes from 2 12-hour spikes in the course of a quarter while not spending more time developing this automation than cumulatively performing this action every quarter, sure.

group_203_bot_f4d95069bb2675e4ce1fff090c1c1620 opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/977

jobs-api: bump to 0.0.415-20250922160844-ff12d380

An issue here is that a lot of our workloads are cronjobs, so cronjobs being unable to trigger means most of the users don't get their workloads running, so I think that not being able to schedule new workloads is a critical enough situation to warrant a page (at least in the current situation, where we can expand the cluster at will to palliate it in the short term). We can discuss it in the team meeting too.

In this scenario, are we envisioning a situation where cpu requests > allocatable space with or without the cronjobs? If with, then I argue that what we will be seeing is a delay in starting workloads, not an inability to have the workloads run. Which again, isn't paging worthy (but it definitely needs an alert, probably critical). If without, I have to ask what the scenarios are that would trigger this. Lost so many nodes that we are out of capacity? Somehow we scheduled so many non-transient tools that

I think there's some part of the phrase missing :)

Cronjobs don't allocate quota before they start triggering jobs (afaik). One-off jobs and deployments (that is, components-api deployments, not k8s deployments) are also transient events.
In this case I'm thinking of whatever is currently running: not if we ran all the cronjobs at the same time, or all the deployments at the same time, nor if we did not run any, but a regular workload (where some pods come from deployments, some from jobs, some from cronjobs).

inability to have the workloads run

This is a subjective matter: if your cronjob does not start for the next 5h, is it just a delay, or is it considered so bad that it's effectively not running? As a data point, this weekend we had 3 users asking why their jobs did not run (right away), or took too long to run (~3-5h). With that in mind, I would lean on the user-perception side: waiting 3-5h for a job to run is not an acceptable "things are ok" state. That said, I defer to @CCiufo-WMF for the product perspective, to decide if that's page-able or not.

Note again that I'm not talking about whether any single tool has that issue, but about whether a considerable percentage of them are experiencing it.

If the worker nodes are getting CPU/load spikes, pods will still run and get scheduled around, though we should consider scaling up the cluster -> warning?

That's a very difficult alert to put in a formula. How many spikes are OK per time period (hour, day, week, month) and how many are not? What duration?

My take is to put a reminder in a shared calendar that once a quarter someone looks at the graphs and makes a determination and suggestion to the rest of the team to increase capacity.

If you can decide by looking at the graphs, you can put it in an alert imo. Better to avoid toil. On the exact values for the alert, well, experience tells :); we start with some, see how it goes, and tweak accordingly.

Can you though? Cause what I meant is that it's a judgement call, aka something that can not be automated easily. If you can find a way to automate a judgement call that can tell apart 10 short (a few minutes) spikes from 2 12-hour spikes in the course of a quarter while not spending more time developing this automation than cumulatively performing this action every quarter, sure.

Definitely :), the point of the alert is to tell us that someone has to look deeper, not to autoscale by itself. The point here is not to bypass the human, but to avoid having to remember every quarter to do a certain task that is not part of your routine, and that might not be needed. Instead, do it only when there's a high possibility that it will be useful.

This is a subjective matter: if your cronjob does not start for the next 5h, is it just a delay, or is it considered so bad that it's effectively not running? As a data point, this weekend we had 3 users asking why their jobs did not run (right away), or took too long to run (~3-5h). With that in mind, I would lean on the user-perception side: waiting 3-5h for a job to run is not an acceptable "things are ok" state.

From my perspective, a delay of more than a couple of minutes is unacceptable. Some jobs are scheduled in a time-sensitive manner.


If you are going to change the CPU/memory defaults or what the CPU and memory values translate to in k8s requests/limits, please communicate it substantially in advance of deployment.

Hi @JJMC89, thanks for your comments.

If you are going to change the CPU/memory defaults or what the CPU and memory values translate to in k8s requests/limits, please communicate it substantially in advance of deployment.

Can you elaborate on this? Specifically, how does it affect your workloads? (There are some changes that are more time sensitive than others and might be needed without much advance notice, for example the cpu request/limit defaults that are already hitting the cluster allocation space.)

Hi @JJMC89, thanks for your comments.

If you are going to change the CPU/memory defaults or what the CPU and memory values translate to in k8s requests/limits, please communicate it substantially in advance of deployment.

Can you elaborate on this? Specifically, how does it affect your workloads? (There are some changes that are more time sensitive than others and might be needed without much advance notice, for example the cpu request/limit defaults that are already hitting the cluster allocation space.)

I do have some jobs that are time sensitive, but my concern is maintainers not being available to adjust job resources before you deploy this.

As an example, if you lower the default memory, jobs could get OOM killed under the new constraints. Then they would continuously fail until a maintainer is available. This is the most critical type of situation to avoid.

Since I am already aware of your plans, I know in advance and can make any adjustments now. Others, don't have this benefit until you announce your plans and may not be immediately available to make any necessary adjustments.

An issue here is that a lot of our workloads are cronjobs, so cronjobs being unable to trigger means most of the users don't get their workloads running, so I think that not being able to schedule new workloads is a critical enough situation to warrant a page (at least in the current situation, where we can expand the cluster at will to palliate it in the short term). We can discuss it in the team meeting too.

In this scenario, are we envisioning a situation where cpu requests > allocatable space with or without the cronjobs? If with, then I argue that what we will be seeing is a delay in starting workloads, not an inability to have the workloads run. Which again, isn't paging worthy (but it definitely needs an alert, probably critical). If without, I have to ask what the scenarios are that would trigger this. Lost so many nodes that we are out of capacity? Somehow we scheduled so many non-transient tools that

I think there's some part of the phrase missing :)

Indeed. "that we no longer have capacity left". Sorry about that.

Cronjobs don't allocate quota before they start triggering jobs (afaik). One-off jobs and deployments (that is, components-api deployments, not k8s deployments) are also transient events.
In this case I'm thinking of whatever is currently running: not if we ran all the cronjobs at the same time, or all the deployments at the same time, nor if we did not run any, but a regular workload (where some pods come from deployments, some from jobs, some from cronjobs).

OK, scenario 1 then, with cronjobs included.

inability to have the workloads run

This is a subjective matter: if your cronjob does not start for the next 5h, is it just a delay, or is it considered so bad that it's effectively not running? As a data point, this weekend we had 3 users asking why their jobs did not run (right away), or took too long to run (~3-5h). With that in mind, I would lean on the user-perception side: waiting 3-5h for a job to run is not an acceptable "things are ok" state. That said, I defer to @CCiufo-WMF for the product perspective, to decide if that's page-able or not.

It definitely is subjective.

Note again that I'm not talking about whether any single tool has that issue, but about whether a considerable percentage of them are experiencing it.

The considerable percentage is something that I don't think we had touched on before (although I might be mistaken on this). OK. So we are now talking about a percentage of pods pending and not being scheduled, no longer about cpu requests > allocatable space. This is a much, much more useful metric, as it gauges the experience of end users and not the capacity of the cluster. And here we are:

sum (kube_pod_status_phase{job="k8s-pods", prometheus="k8s", phase="Pending"}) / sum (kube_pod_status_phase{job="k8s-pods", prometheus="k8s"}) > 0.1

Change 0.1 to match the percentage we feel is best to depict "considerable"

If the worker nodes are getting CPU/load spikes, pods will still run and get scheduled around, though we should consider scaling up the cluster -> warning?

That's a very difficult alert to put in a formula. How many spikes are OK per time period (hour, day, week, month) and how many are not? What duration?

My take is to put a reminder in a shared calendar that once a quarter someone looks at the graphs and makes a determination and suggestion to the rest of the team to increase capacity.

If you can decide by looking at the graphs, you can put it in an alert imo. Better to avoid toil. On the exact values for the alert, well, experience tells :); we start with some, see how it goes, and tweak accordingly.

Can you though? Cause what I meant is that it's a judgement call, aka something that can not be automated easily. If you can find a way to automate a judgement call that can tell apart 10 short (a few minutes) spikes from 2 12-hour spikes in the course of a quarter while not spending more time developing this automation than cumulatively performing this action every quarter, sure.

Definitely :), the point of the alert is to tell us that someone has to look deeper, not to autoscale by itself. The point here is not to bypass the human, but to avoid having to remember every quarter to do a certain task that is not part of your routine, and that might not be needed. Instead, do it only when there's a high possibility that it will be useful.

As well as not inundating the humans with alert noise, in order to avoid alert fatigue.

If you think something can be coded to achieve all of the above, go ahead.

Hi @JJMC89, thanks for your comments.

If you are going to change the CPU/memory defaults or what the CPU and memory values translate to in k8s requests/limits, please communicate it substantially in advance of deployment.

Can you elaborate on this? Specifically, how does it affect your workloads? (There are some changes that are more time sensitive than others and might be needed without much advance notice, for example the cpu request/limit defaults that are already hitting the cluster allocation space.)

I do have some jobs that are time sensitive, but my concern is maintainers not being available to adjust job resources before you deploy this.

What we are talking about here is changing the default requests stanza for CPU, in order to better reflect the reality of CPU usage by workloads, informing the k8s scheduler better about them and allowing it to make better decisions. It causes no change to how workloads behave or are treated after being scheduled for execution. I don't see why maintainers will need to adjust anything.

As an example, if you lower the default memory, jobs could get OOM killed under the new constraints. Then they would continuously fail until a maintainer is available. This is the most critical type of situation to avoid.

True, but a) only if we are talking about the limit (not the request) and b) the limit has not yet been discussed. Not to my knowledge at least. And there is no patch yet suggesting even remotely a memory change. If anything like that ever needs to be discussed, we should definitely be proactive in seeking input.

Since I am already aware of your plans, I know in advance and can make any adjustments now. Others, don't have this benefit until you announce your plans and may not be immediately available to make any necessary adjustments.

Up to now, what has been discussed, requires no adjustment from any maintainer.

Hi @JJMC89, thanks for your comments.

If you are going to change the CPU/memory defaults or what the CPU and memory values translate to in k8s requests/limits, please communicate it substantially in advance of deployment.

Can you elaborate on this? Specifically, how does it affect your workloads? (There are some changes that are more time sensitive than others and might be needed without much advance notice, for example the cpu request/limit defaults that are already hitting the cluster allocation space.)

I do have some jobs that are time sensitive, but my concern is maintainers not being available to adjust job resources before you deploy this.

What we are talking about here is changing the default requests stanza for CPU, in order to better reflect the reality of CPU usage by workloads, informing the k8s scheduler better about them and allowing it to make better decisions. It causes no change to how workloads behave or are treated after being scheduled for execution. I don't see why maintainers will need to adjust anything.

Limits and requests for CPU and memory are all being discussed.

Maintainers could need to increase the amount requested to ensure that it is available for their job. (Having more of the resource available than the request but less than the limit is not guaranteed, so increasing the job's request could be needed.) If you go with a 100m CPU default request, I will have to specify more to ensure adequate performance.

As an example, if you lower the default memory, jobs could get OOM killed under the new constraints. Then they would continuously fail until a maintainer is available. This is the most critical type of situation to avoid.

True, but a) only if we are talking about the limit (not the request) and b) the limit has not yet been discussed. Not to my knowledge at least. And there is no patch yet suggesting even remotely a memory change. If anything like that ever needs to be discussed, we should definitely be proactive in seeking input.

Limits and requests for CPU and memory are all being discussed. Specifically, T404726#11200892, but without saying what the default memory limit would be, with a mention of setting it the same as the (new) request, which is what piqued my concern.

Since I am already aware of your plans, I know in advance and can make any adjustments now. Others, don't have this benefit until you announce your plans and may not be immediately available to make any necessary adjustments.

Up to now, what has been discussed, requires no adjustment from any maintainer.

If a new memory default limit is lower than the current one, maintainers may need to take action to ensure jobs don't fail. Even if just the memory request is lower or CPU is lower, some action may be needed to guarantee resource availability.

Hi @JJMC89, thanks for your comments.

If you are going to change the CPU/memory defaults or what the CPU and memory values translate to in k8s requests/limits, please communicate it substantially in advance of deployment.

Can you elaborate on this? Specifically, how does it affect your workloads? (There are some changes that are more time sensitive than others and might be needed without much advance notice, for example the cpu request/limit defaults that are already hitting the cluster allocation space.)

I do have some jobs that are time sensitive, but my concern is maintainers not being available to adjust job resources before you deploy this.

What we are talking about here is changing the default requests stanza for CPU, in order to better reflect the reality of CPU usage by workloads, informing the k8s scheduler better about them and allowing it to make better decisions. It causes no change to how workloads behave or are treated after being scheduled for execution. I don't see why maintainers will need to adjust anything.

Limits and requests for CPU and memory are all being discussed.

Maintainers could need to increase the amount requested to ensure that it is available for their job. (Having more of the resource available than the request but less than the limit is not guaranteed, so increasing the job's request could be needed.) If you go with a 100m CPU default request, I will have to specify more to ensure adequate performance.

The current change on the CPU side is:

  • Change the default from 250m request/500m limit to 100m request/4000m limit -> this should not be a problem for any tool; it actually allows them to get allocated more easily, and to use more cpu if available (which it currently widely is, as the nodes are at ~7% capacity)
  • Change that, when the user specifies cpu, the limit and the request are set to the same value (instead of setting the request to 1/2 of what the user specified) -> this ensures that if the user requested X cpu for their job, they will get that, and only that (currently the request is 1/2 of that, so they might get allocated a slot that does not have enough CPU), so in this case there's no need for users to increase their limits either.

So on the CPU side I think we are ok with rolling the changes without lots of notice. This is also right now what's affecting the cluster, preventing pods from getting started, so it's also kinda urgent.

As an example, if you lower the default memory, jobs could get OOM killed under the new constraints. Then they would continuously fail until a maintainer is available. This is the most critical type of situation to avoid.

True, but a) only if we are talking about the limit (not the request) and b) the limit has not yet been discussed. Not to my knowledge at least. And there is no patch yet suggesting even remotely a memory change. If anything like that ever needs to be discussed, we should definitely be proactive in seeking input.

Limits and requests for CPU and memory are all being discussed. Specifically, T404726#11200892, but without saying what the default memory limit would be, with a mention of setting it the same as the (new) request, which is what piqued my concern.

Since I am already aware of your plans, I know in advance and can make any adjustments now. Others, don't have this benefit until you announce your plans and may not be immediately available to make any necessary adjustments.

Up to now, what has been discussed, requires no adjustment from any maintainer.

If a new memory default limit is lower than the current one, maintainers may need to take action to ensure jobs don't fail. Even if just the memory request is lower or CPU is lower, some action may be needed to guarantee resource availability.

For the memory it's a different story, yes: if we lower the limits, pods might crash where before they did not, so depending on what we do here we will have to give notice, though nothing is decided yet (it's not even clear if something has to change in this regard).

Currently we set it to the default request and limit of 512Mi if the request is less than the default, or we set the limit to whatever the user asks for and the request to 1/2 of that. One proposal is to set the request and the limit to the same value, to make sure jobs get allocated to a node that has enough memory for the limit, but that's still under discussion (or at least it's not clear to me xd).

mem/limit: this one is trickier, as we know that toolforge workloads are usually very spiky here, with sudden memory usage and then quiet for a long period. Setting this to the same as the request means that we will never overcommit any node memory-wise, but also that most of the time the memory will not be used, and any user with bursty workloads will have to specify the maximum memory used manually. Currently we set it to 512Mi (double the request; well, the request is set to half the limit).

Anyhow, yes, for changes that might affect pods (e.g. lowering the limits) we will notify beforehand, so thanks for pointing that out.

Correction for the current default cpu: we set it to 500m request / 500m limit, and it will be set to 100m request / 4000m limit (allowing bursts of usage).

  • Change the default from 250m request/500m limit to 100m request/4000m limit -> this should not be a problem for any tool; it actually allows them to get allocated more easily, and to use more cpu if available (which it currently widely is, as the nodes are at ~7% capacity)
  • Change that, when the user specifies cpu, the limit and the request are set to the same value (instead of setting the request to 1/2 of what the user specified) -> this ensures that if the user requested X cpu for their job, they will get that, and only that (currently the request is 1/2 of that, so they might get allocated a slot that does not have enough CPU), so in this case there's no need for users to increase their limits either.

So on the CPU side I think we are ok with rolling the changes without lots of notice. This is also right now what's affecting the cluster, preventing pods from getting started, so it's also kinda urgent.

Thanks for summarizing this @dcaro. And absolutely agreed that we really don't need lots of notice on this one. The end result will be that, in the default case, workloads are more likely to be scheduled and will be allowed greater CPU usage than currently. In the non-default case, that is when a user has specified what they want, they'll have a much higher chance of their workloads being scheduled and not being interrupted/evicted.

As an example, if you lower the default memory, jobs could get OOM killed under the new constraints. Then they would continuously fail until a maintainer is available. This is the most critical type of situation to avoid.

True, but a) only if we are talking about the limit (not the request) and b) the limit has not yet been discussed. Not to my knowledge at least. And there is no patch yet suggesting even remotely a memory change. If anything like that ever needs to be discussed, we should definitely be proactive in seeking input.

Limits and requests for CPU and memory are all being discussed. Specifically, T404726#11200892, but without saying what the default memory limit would be, with a mention of setting it the same as the (new) request, which is what piqued my concern.

We might have a different definition of "discussion" here? Cause up to now, all I see is a mention of limits, no concrete values suggested (yet) and no comments. Personally, I am still thinking about this and have no concrete proposal yet. If there are any helpful insights regarding values, please do provide them.

Since I am already aware of your plans, I know in advance and can make any adjustments now. Others, don't have this benefit until you announce your plans and may not be immediately available to make any necessary adjustments.

Up to now, what has been discussed, requires no adjustment from any maintainer.

If a new memory default limit is lower than the current one, maintainers may need to take action to ensure jobs don't fail.

Yes. Which means it's doubtful we will, at least not without concrete data backing that decision up and wider communication.

Even if just the memory request is lower or CPU is lower, some action may be needed to guarantee resource availability.

I find that doubtful in the default case. On the contrary, it makes it easier to allocate resources for most workloads. In the non-default case, the change summarized by @dcaro above increases the chances of proper scheduling and continued execution by moving the pods to the Guaranteed QoS class.

For the memory it's a different story, yes: if we lower the limits, pods might crash where before they did not, so depending on what we do here we will have to give notice, though nothing is decided yet (it's not even clear if something has to change in this regard).

Currently we set it to the default request and limit of 512Mi if the request is less than the default, or we set the limit to whatever the user asks for and the request to 1/2 of that. One proposal is to set the request and the limit to the same value, to make sure jobs get allocated to a node that has enough memory for the limit, but that's still under discussion (or at least it's not clear to me xd).

+1

mem/limit: this one is trickier, as we know that toolforge workloads are usually very spiky here, with sudden memory usage and then quiet for a long period. Setting this to the same as the request means that we will never overcommit any node memory-wise, but also that most of the time the memory will not be used, and any user with bursty workloads will have to specify the maximum memory used manually. Currently we set it to 512Mi (double the request; well, the request is set to half the limit).

Anyhow, yes, for changes that might affect pods (e.g. lowering the limits) we will notify beforehand, so thanks for pointing that out.

group_203_bot_f4d95069bb2675e4ce1fff090c1c1620 opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/978

maintain-kubeusers: bump to 0.0.182-20250924074622-dac7fb25

Default cpu requests have been reduced to 100m, with a 1 cpu limit. We are currently down to ~65% cpu request allocation (from >80%), and it should still go down a bit more. People should stop getting stuck for a while now, and cpu usage should increase.

I'm sorry, but what measurement unit is m?

Also, a little offtopic: is it true that the default mem limit is now 1G, the maximal limit is 4G, and I need to set -mem:4Gi in the launch string to set the highest limit for a certain one-time job?

I'm sorry, but what measurement unit is m?

That's 'milli-cpu', as in 1m = 0.001cpu (details here https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#resource-units-in-kubernetes).

Also, a little offtopic: is it true that the default mem limit is now 1G, the maximal limit is 4G, and I need to set -mem:4Gi in the launch string to set the highest limit for a certain one-time job?

So currently the default memory is 512Mi for both limit and request; this means that it will make sure you have 512Mi free when finding where to run your job, and it will not allow it to grow over 512Mi.

The maximum you can currently set is 6G for a single pod (https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/blob/main/deployment/chart/values.yaml?ref_type=heads#L13); running toolforge jobs quota should show your specific limit (you can request more if you need it and have a good use case):

tools.wm-lol@tools-bastion-15:~$ toolforge jobs quota
Running jobs                                  Used    Limit
--------------------------------------------  ------  -------
Total running jobs at once (Kubernetes pods)  1       16
Running one-off and cron jobs                 0       15
CPU                                           0.5     16.0
Memory                                        0.5Gi   8.0Gi

Per-job limits    Used    Limit
----------------  ------  -------
CPU                       3.0
Memory                    6.0Gi

Job definitions                             Used    Limit
----------------------------------------  ------  -------
Cron jobs                                     18       50
Continuous jobs (including web services)       1       16

I see the quota was bumped (https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/blob/main/components/maintain-kubeusers/values/tools.yaml?ref_type=heads#L27) but this does not appear to have been deployed:

tools.cluebotng-review@tools-bastion-15:~$ kubectl describe quota
Name:                   tool-cluebotng-review
Namespace:              tool-cluebotng-review
Resource                Used    Hard
--------                ----    ----
configmaps              2       10
count/cronjobs.batch    15      50
count/deployments.apps  7       16
count/jobs.batch        1       15
limits.cpu              8500m   16
limits.memory           4608Mi  16Gi
persistentvolumeclaims  0       0
pods                    9       25
requests.cpu            925m    16
requests.memory         4352Mi  16Gi
secrets                 29      64
services                6       16
services.nodeports      0       0

Unfortunately that means that, now that a deployment has happened, the quota has effectively halved for cluebotng-review, as the default doubled under https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/commit/f234517cd4a801fa32c79ee085e0647ff327cc5f.

tools.cluebotng-review@tools-bastion-15:~$ kubectl get pod cluebotng-reviewer-76d696dbd8-l6mvx -o json | jq .spec.containers[0].resources
{
  "limits": {
    "cpu": "1",
    "memory": "512Mi"
  },
  "requests": {
    "cpu": "100m",
    "memory": "512Mi"
  }
}

Can maintain-kubeusers get a kick?

Done, I think it probably has updated it already, but if not it will take a minute.

Done, I think it probably has updated it already, but if not it will take a minute.

Looks good, thanks

If there are no more comments, I'll start implementing the alerts and such to close this task.