
Reserve resources for system daemons on kubernetes nodes
Closed, ResolvedPublic

Description

Currently we allow pods to allocate 100% of the resources of a node, which is a bad idea.

We should reserve some CPU, memory, and maybe storage and PIDs for the kubelet (--kube-reserved) as well as for the system itself (--system-reserved). We should also add eviction thresholds.

https://v1-16.docs.kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/

Event Timeline

JMeybohm triaged this task as Medium priority. Mar 19 2021, 3:37 PM
JMeybohm created this task.

Change 524186 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):
[operations/puppet@production] kubernetes: Switch to using systemd cgroupdriver

https://gerrit.wikimedia.org/r/524186

Change 524186 abandoned by JMeybohm:

[operations/puppet@production] kubernetes: Switch to using systemd cgroupdriver

Reason:

Docker will switch automatically to systemd cgroup driver on cgroupv2 systems

https://gerrit.wikimedia.org/r/524186

From the Wayback Machine, the magic Google formula is:

Allocatable resources are calculated in the following way:

ALLOCATABLE = CAPACITY - RESERVED - EVICTION-THRESHOLD

For memory resources, GKE reserves the following:

255 MiB of memory for machines with less than 1 GiB of memory
25% of the first 4 GiB of memory
20% of the next 4 GiB of memory (up to 8 GiB)
10% of the next 8 GiB of memory (up to 16 GiB)
6% of the next 112 GiB of memory (up to 128 GiB)
2% of any memory above 128 GiB

For CPU resources, GKE reserves the following:

6% of the first core
1% of the next core (up to 2 cores)
0.5% of the next 2 cores (up to 4 cores)
0.25% of any cores above 4 cores
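
A minimal sketch of the memory tiers above (my own transcription for illustration, not anything taken from GKE):

```python
# Rough transcription of the GKE memory reservation tiers quoted above.
def gke_reserved_memory_gib(capacity_gib):
    if capacity_gib < 1:
        return 255 / 1024  # flat 255 MiB for very small machines
    reserved = 0.25 * min(capacity_gib, 4)                  # 25% of the first 4 GiB
    reserved += 0.20 * max(min(capacity_gib, 8) - 4, 0)     # 20% of the next 4 GiB
    reserved += 0.10 * max(min(capacity_gib, 16) - 8, 0)    # 10% of the next 8 GiB
    reserved += 0.06 * max(min(capacity_gib, 128) - 16, 0)  # 6% of the next 112 GiB
    reserved += 0.02 * max(capacity_gib - 128, 0)           # 2% above 128 GiB
    return reserved

print(gke_reserved_memory_gib(125.5))  # ~9.17 GiB reserved on a node with 125.5 GiB
```

The CPU tiers can be transcribed the same way.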

Summary from a realtime discussion with @JMeybohm

  • Using --system-reserved is kind of dangerous, because it uses cgroups and may lead to OOM/resource starvation for system processes (cf. the 2nd paragraph of the Kubernetes doc on general guidelines for reserving compute resources)
  • Using --kube-reserved makes sense to avoid runaway effects, but care needs to be taken that kubelet and the container runtime share the same cgroup, and the limits should only be set once metrics have been gathered. It also means that kubernetes daemons may get OOM-killed or CPU-starved.
  • Calculating the CPU resource reservation using the above magic formula turns out to be very expensive in CPU: ~20% (or ~9.5 cores) for a 48-core node. I suspect this is designed for a very active public cloud cluster where workloads are being moved or redeployed constantly. Looking at the unit CPU usage dashboard, excluding etcd (which does not run on the kubelet nodes), we are using around 1 core for critical resources.

The guidelines are pretty clear that the first thing to do is to use --enforce-node-allocatable=pods in combination with --eviction-minimum-reclaim so the scheduler actually starts using Allocatable as the available capacity for pods.

We should then monitor how often pods get evicted based on the current capacity calculation; that should be very rare given the current load on the cluster.

Then, in order to actually reserve resources for system and kubernetes daemons without the risk of starvation we could:

  • Use --reserved-cpus to set aside 2 to 5 cores (depending on server core count) for kubelet and system usage only, while still allowing them to burst beyond those cores. This would reduce the allocatable capacity, but would guarantee system functionality. Reference
  • Use --eviction-hard=[memory.available<XXXMi] to evict pods when available memory becomes tight, avoiding the possible OOM constraints of cgroup-based system resource reservation while making sure the system has ample memory to run. Reference

Both of these options will subtract resources from Allocatable, giving the scheduler a better idea of the actual available capacity without running the risk of resource starvation for system and kubernetes daemons.
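
For illustration, a minimal sketch of what the scheduler would see with those two options; the flag values here are placeholders, not a proposal:

```python
# Illustrative arithmetic only; the reservation values below are placeholders.
def advertised_allocatable(capacity_cpu, capacity_mem_gib,
                           reserved_cpu_count=2,        # e.g. --reserved-cpus=0-1 (two whole cores)
                           eviction_hard_mem_gib=1.0):  # e.g. --eviction-hard=memory.available<1Gi
    return capacity_cpu - reserved_cpu_count, capacity_mem_gib - eviction_hard_mem_gib

# A 48-core / 125 GiB worker would then be advertised as 46 cores / 124 GiB allocatable.
print(advertised_allocatable(48, 125))
```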

--enforce-node-allocatable=pods is already enabled (by default) but the design document says: "This flag will be a no-op unless --kube-reserved and/or --system-reserved has been specified."
I took a look at available vs. allocatable in the current state and allocatable memory is 100MiB less than available on all nodes, which is the default evictionHard limit for memory.
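
A quick way to compare the two values per node, in case someone wants to reproduce this (a sketch; it assumes kubectl is already configured against the cluster in question):

```python
# Print capacity vs. allocatable for every node, as reported by the API server.
import json
import subprocess

nodes = json.loads(subprocess.check_output(
    ["kubectl", "get", "nodes", "-o", "json"]))["items"]

for node in nodes:
    name = node["metadata"]["name"]
    cap, alloc = node["status"]["capacity"], node["status"]["allocatable"]
    print(f"{name}: mem {cap['memory']} -> {alloc['memory']}, cpu {cap['cpu']} -> {alloc['cpu']}")
```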

We currently lack some insight here, as T108027: Collect per-cgroup cpu/mem and other system level metrics has not yet been completed on the k8s clusters because of T337836: Cadvisor may be breaking Kubernetes worker nodes. So we currently have metrics for the wikikube staging and aux clusters but no data on the "big ones".
I put together a copy of the unit resource usage dashboard for k8s clusters at https://grafana-rw.wikimedia.org/d/yB3As9eVz/jayme-k8s-system-reserved

Also, I re-read everything and I think we got it wrong. AIUI, --system-reserved and --kube-reserved will not be enforced by default (i.e., they just inform the scheduler). Both options have a corresponding -cgroup= flag, documented as:

To optionally enforce system-reserved on system daemons, specify the parent control group for OS system daemons as the value for --system-reserved-cgroup kubelet flag.

It is recommended that the OS system daemons are placed under a top level control group (system.slice on systemd machines for example).

Note that kubelet does not create --system-reserved-cgroup if it doesn't exist. kubelet will fail if an invalid cgroup is specified.

I will double check in code (or staging), but I'm pretty sure that means we can safely reserve cpu/memory and maybe pids for kubelet and system to inform the scheduler (and not pack nodes too tight). Enforcing reserved resources is probably not what we need or want.

I ran kubelet with --system-reserved=cpu=1,memory=5Gi --kube-reserved=cpu=1,memory=5Gi in staging; the system.slice and kubelet.service cgroups were untouched, as was the cgroup hierarchy. Only the allocatable resources of the node were lowered accordingly.
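
With those flags the reduction is purely additive. A sketch of the expected Allocatable (the node capacity here is made up for illustration; the 100Mi term is the default evictionHard threshold mentioned above):

```python
# Expected Allocatable with --system-reserved=cpu=1,memory=5Gi --kube-reserved=cpu=1,memory=5Gi.
capacity_cpu, capacity_mem_gib = 48, 125.5  # made-up node capacity

system_reserved_cpu, system_reserved_mem_gib = 1, 5
kube_reserved_cpu, kube_reserved_mem_gib = 1, 5
eviction_hard_mem_gib = 100 / 1024  # default evictionHard: memory.available<100Mi

allocatable_cpu = capacity_cpu - system_reserved_cpu - kube_reserved_cpu
allocatable_mem_gib = (capacity_mem_gib - system_reserved_mem_gib
                       - kube_reserved_mem_gib - eviction_hard_mem_gib)

print(allocatable_cpu, round(allocatable_mem_gib, 2))  # 46 cores, ~115.4 GiB allocatable
```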

I looked at the data collected from the workers since yesterday and the values are (unsurprisingly) well below what the GKE formula would produce, so I tried to adapt it a bit to match our reality. In general I would say we should only set --system-reserved and not discriminate between kube services and the rest; IMHO that makes it easier to reason about.

So for memory the GKE formula seems to make some sense for our environment:

  • 25% of the first 4 GiB of memory == 1GiB, 1049000Ki
  • 20% of the next 4 GiB of memory (up to 8 GiB) == 0.8Gi, 839200Ki
  • 10% of the next 8 GiB of memory (up to 16 GiB) == 0.8Gi, 839200Ki
  • 6% of the next 112 GiB of memory (up to 128 GiB) (I've added a column with 3% as well)
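
A small sketch of those memory tiers, with the rate for the part above 16 GiB as a parameter so the 6% (GKE) and 3% variants can be compared; the node sizes are the ones from the tables below:

```python
# Memory reservation tiers from the GKE formula, limited to nodes below 128 GiB;
# upper_rate is the rate applied above 16 GiB (0.06 for GKE, 0.03 for the adapted variant).
def reserved_memory_gib(capacity_gib, upper_rate=0.06):
    reserved = 0.25 * min(capacity_gib, 4)
    reserved += 0.20 * max(min(capacity_gib, 8) - 4, 0)
    reserved += 0.10 * max(min(capacity_gib, 16) - 8, 0)
    reserved += upper_rate * max(min(capacity_gib, 128) - 16, 0)
    return reserved

for kib in (4010244, 97382188, 131612596):  # "available mem (Ki)" values from the tables below
    gib = kib / 2**20
    print(round(gib, 1), round(reserved_memory_gib(gib), 2), round(reserved_memory_gib(gib, 0.03), 2))
# 3.8   0.96 0.96
# 92.9  7.21 4.91
# 125.5 9.17 5.89
```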

For CPU I'm not so sure, as the GKE values are super high compared to what we see. I tried:

  • 8% of the first 4 CPUs
  • 0.01% of any CPUs above 4 (with the resulting percentage applied to the total core count, as in the GKE column)

The tables below contain one node per type; the max value is taken across all nodes of that type.
CPU might not catch short spikes, as those values are irate[5m].
Memory is only RSS from services (cadvisor systemd); we should also account for kernel memory and leave some headroom.
Keep in mind that kubernetes1005 and ml-serve1001 do not manage a lot of Pods as of now, so the actual values might not be realistic for the future.

| name | available mem (Ki) | available mem (Gi) | GKE reserved mem (Gi) | 3% of next 112 (Gi) | actual max used mem (Gi) |
| --- | --- | --- | --- | --- | --- |
| kubernetes1005.eqiad.wmnet | 4010244 | 3.8 | 0.96 | 0.96 | 0.81 |
| kubernetes1007.eqiad.wmnet | 97382188 | 92.9 | 7.21 | 4.91 | 4.6 |
| kubernetes1018.eqiad.wmnet | 131612596 | 125.5 | 9.17 | 5.89 | 4 |
| ml-serve1001.eqiad.wmnet | 131602944 | 125.5 | 9.17 | 5.89 | 3 |

| name | available CPU | GKE reserved CPU % | GKE reserved CPU | 0.08 first 4 CPUs + 0.01 following | actual max used CPU |
| --- | --- | --- | --- | --- | --- |
| kubernetes1005.eqiad.wmnet | 15 | 0.1075 | 1.6125 | 1.2 | 0.8 |
| kubernetes1007.eqiad.wmnet | 48 | 0.19 | 9.12 | 4.1 | 1.5 |
| kubernetes1018.eqiad.wmnet | 40 | 0.17 | 6.8 | 3.3 | 1.1 |
| ml-serve1001.eqiad.wmnet | 72 | 0.25 | 18 | 6.2 | 0.83 |

Change 949843 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] k8s: Reserve system resources on k8s workers

https://gerrit.wikimedia.org/r/949843

Just to clarify, the above patch will not lead to evictions (unless we have a worker with less than 300MB of free RAM, which I doubt).
However, it may lead to more scheduling difficulties because we are lowering the Allocatable values for the workers, so the scheduler has fewer options.

It's easy to revert or change if that's the case until we have more hardware headroom.

While that is correct, it could leave us with a cluster where we can't schedule anything - which will actually be the case with the current numbers if I'm not mistaken. Taking only the non-sessionstore nodes into account we currently have ~47 CPUs available (i.e. not requested by a container). With the implemented calculation we would reserve ~60 CPUs. I bet that a reevaluation of CPU requests vs. actual usage would give us a bit more room, but maybe not enough.

What we could do right now is to cap the maximum CPU reservation at 2 CPUs (so basically reserving 0.96 CPUs on sessionstore nodes and 2 on all the others) and switch back to the proper calculation when we have the headroom to do so.
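
A minimal sketch of that cap (hypothetical naming; the uncapped value is whatever the currently implemented calculation yields per node):

```python
# Hypothetical sketch: cap the per-node CPU reservation at 2 cores.
def capped_cpu_reservation(uncapped_reservation_cores, cap=2.0):
    return min(uncapped_reservation_cores, cap)

# e.g. if the current formula yields ~0.96 cores on a sessionstore node and well above
# 2 cores on the bigger workers, this gives 0.96 and 2.0 respectively.
print(capped_cpu_reservation(0.96), capped_cpu_reservation(4.1))
```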

Considering there's no reservation for system resources at the moment, I feel like that would be a better solution than doing nothing, especially as we increase requests for T342748

I agree. Updated the patch accordingly.

Change 949843 merged by JMeybohm:

[operations/puppet@production] k8s: Reserve system resources on k8s workers

https://gerrit.wikimedia.org/r/949843

Change 951065 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] kubernetes::node: Don't reserve CPUs for system

https://gerrit.wikimedia.org/r/951065

Change 951065 merged by JMeybohm:

[operations/puppet@production] kubernetes::node: Don't reserve CPUs for system

https://gerrit.wikimedia.org/r/951065

Change 959164 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] kubernetes::node: Reserve CPU resources for system daemons

https://gerrit.wikimedia.org/r/959164

Change 959164 merged by JMeybohm:

[operations/puppet@production] kubernetes::node: Reserve CPU resources for system daemons

https://gerrit.wikimedia.org/r/959164

JMeybohm claimed this task.

This has been active since yesterday.