
Set resource requests and limits for calico pods
Closed, ResolvedPublic

Description

Currently we lack resource requests and limits for calico pods.

Now that we have a baseline of what they need from the codfw cluster, we should add requests and limits to the chart so that all calico components run in the Guaranteed QoS class.
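For Guaranteed QoS, every container in the pod needs CPU and memory requests that equal its limits. A minimal sketch of what such a values block could look like; the figures and key layout are illustrative assumptions, not the actual chart defaults:

```yaml
# Hypothetical values fragment: requests == limits for every
# container puts the pod in the Guaranteed QoS class.
resources:
  requests:
    cpu: 200m       # illustrative figure, not the measured baseline
    memory: 400Mi
  limits:
    cpu: 200m       # must match requests for Guaranteed QoS
    memory: 400Mi
```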

Related Objects

Event Timeline

Change 677906 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] calico: Add defauls for container resources

https://gerrit.wikimedia.org/r/677906

Change 677906 merged by jenkins-bot:

[operations/deployment-charts@master] calico: Add defauls for container resources

https://gerrit.wikimedia.org/r/677906

This is not exactly looking great on the staging clusters as we can see heavy throttling. The current assumption is that this is caused by the very spiky nature of the work done by the processes here and that we don't see that properly reflected in the prometheus metrics (as the scrape interval is 60s, it's likely that we "miss" the spikes).

While this does not look like a big performance hit to the calico components, I would like to investigate further to at least strengthen that theory and/or come up with proper requests and limits (or no limits at all...) before going to production.

I tried to verify the above assumption by collecting metrics more frequently (per second) from the docker API (see P15857). This paints a clearer picture of what happens:

This also looks like a Go GC problem, as there is plenty of memory left.

Below is a brain dump from @akosiaris and me regarding the different options and their outcomes:

= calico-node daemonset without requests/limits =

* The scheduler is unimportant as far as this pod goes; it will be placed on the node anyway
* The pod cannot be evicted as it is part of a DaemonSet
* The pod will not be throttled (as long as there are enough resources on the node)
* The scheduler will not account for the calico-node pod when calculating the available resources on the node
** kubectl describe node won't list the resources for that pod in its output

= calico-node daemonset with requests but no limits =

* The scheduler is unimportant as far as this pod goes; it will be placed on the node anyway
* The pod cannot be evicted as it is part of a DaemonSet
* The pod will not be throttled (as long as there are enough resources on the node)
* The scheduler will account for the calico-node pod when calculating the available resources on the node

= calico-node daemonset with limits but no requests =

* The scheduler is unimportant as far as this pod goes; it will be placed on the node anyway
* The pod cannot be evicted as it is part of a DaemonSet
* The pod will be throttled if its usage goes above the limits
* Kubernetes will set the requests equal to the limits in this case (see the sketch below)
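
As a quick illustration of that defaulting behaviour, a container spec that only sets limits ends up with identical requests (figures are illustrative):

```yaml
# Hypothetical container spec with limits only; the API server
# defaults requests to the same values, so the pod still ends up
# in the Guaranteed QoS class.
resources:
  limits:
    cpu: 200m
    memory: 400Mi
  # effectively becomes:
  # requests:
  #   cpu: 200m
  #   memory: 400Mi
```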

= calico-node daemonset with limits and requests but limits > requests =

* The scheduler is unimportant as far as this pod goes; it will be placed on the node anyway
* The pod cannot be evicted as it is part of a DaemonSet
* The pod will be throttled if its usage goes above the limits or if there are not enough resources
* The scheduler will account for the calico-node pod when calculating the available resources on the node

= calico-node daemonset with limits and requests and limits = requests =

* The scheduler is unimportant as far as this pod goes; it will be placed on the node anyway
* The pod cannot be evicted as it is part of a DaemonSet
* The pod will be throttled if its usage goes above the limits, but is guaranteed not to be throttled for usage below the limits
* The scheduler will account for the calico-node pod when calculating the available resources on the node

We decided to remove the CPU limits from calico pods to keep them from throttling. Unfortunately that means they land in the Burstable QoS class, but that should be okay if we don't heavily overcommit nodes.
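
A minimal sketch of the resulting resource shape (CPU request without a CPU limit, memory request equal to the memory limit); the figures are illustrative assumptions, the real values live in the deployment-charts repository:

```yaml
# Hypothetical resources block after dropping the CPU limit:
# the pod becomes Burstable, but is never CPU-throttled.
resources:
  requests:
    cpu: 200m       # still accounted for by the scheduler
    memory: 400Mi
  limits:
    memory: 400Mi   # memory limit kept; no cpu limit -> no throttling
```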

See also:

I've looked into the typha and kube-controllers components as well, as they show similar patterns (of a different magnitude, though).
Unfortunately we lack Prometheus metrics for kube-controllers (they are not available in calico 3.17). Typha's throttling seems mostly related to Go GC and a bunch of goroutines pinging each other regularly. I'd assume the same for kube-controllers, but I can't be sure currently.


https://grafana.wikimedia.org/dashboard/snapshot/CZ9hSpRtQdaUCvuygZfXwZq4OyeHHTJr

Change 688332 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] calico: Remove CPU limit for calico-node, bump for typha and kube-controllers

https://gerrit.wikimedia.org/r/688332

Change 688332 merged by jenkins-bot:

[operations/deployment-charts@master] calico: Remove CPU limit for calico-node, bump for typha and kube-controllers

https://gerrit.wikimedia.org/r/688332

Change 688895 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] calico: Remove limits.cpu instead of requests.cpu for calico-node

https://gerrit.wikimedia.org/r/688895

Change 688895 merged by jenkins-bot:

[operations/deployment-charts@master] calico: Remove limits.cpu instead of requests.cpu for calico-node

https://gerrit.wikimedia.org/r/688895

Calico components are running with resource definitions in all clusters now.