
Investigate namespace limits (in addition to pod limits) for wiki environments
Closed, Resolved (Public)

Description

We can also add resource requests and limits to the namespace itself: https://kubernetes.io/docs/tasks/administer-cluster/manage-resources/quota-memory-cpu-namespace/

Let's investigate namespace limits as well as pod limits for wiki environments.

The goal would be to have more permissive pod-level limits and stricter namespace-level limits to allow for spiky load on individual pods.

Questions to be answered:

  • Is the goal of tighter namespace limits workable?
  • What happens when we hit a namespace limit?
  • Where do these limits get applied? (Currently cat-env namespace is created by the Catalyst-API)
  • Is the behavior k8s takes when you hit limits configurable in any way?

Details

Related Changes in GitLab:
Title: add namespace-level quotas for cpu and memory
Reference: repos/test-platform/catalyst/catalyst-api!99
Author: jnuche
Source Branch: T386008
Dest Branch: main

Event Timeline

Is the goal of tighter namespace limits workable?

Yes, but with a LimitRange rather than a ResourceQuota. The defaults in the LimitRange are applied to any pod that doesn't define requests/limits itself. A pod definition can then override those defaults and give itself more lenient values.
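
As a rough illustration of the LimitRange approach, the sketch below builds a manifest as a Python dict and prints it as JSON, which `kubectl apply -f -` accepts. The object name `cat-env-defaults` and all resource values are hypothetical placeholders, not values taken from the Catalyst deployment.

```python
import json

# Hypothetical LimitRange for the cat-env namespace. The name and every
# resource value below are illustrative assumptions.
limit_range = {
    "apiVersion": "v1",
    "kind": "LimitRange",
    "metadata": {"name": "cat-env-defaults", "namespace": "cat-env"},
    "spec": {
        "limits": [
            {
                "type": "Container",
                # Used as requests for containers that declare none.
                "defaultRequest": {"cpu": "20m", "memory": "250Mi"},
                # Used as limits for containers that declare none.
                "default": {"cpu": "500m", "memory": "1Gi"},
            }
        ]
    },
}

if __name__ == "__main__":
    print(json.dumps(limit_range, indent=2))
```

A container that sets its own requests or limits keeps them; the defaults only fill in what is missing.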

What happens when we hit a namespace limit?

When a limit defined in a ResourceQuota is hit, the cluster refuses to admit the new pod, i.e. the pod is rejected if its requests or limits would push the namespace's total usage above the values defined in the ResourceQuota.
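
The check can be pictured with the simplified model below. It is not how the Kubernetes admission code is implemented; the function name `fits_quota` and the sample numbers are assumptions made for illustration. Quantities are plain integers (e.g. Mi for memory).

```python
def fits_quota(used, hard, pod):
    """Return True if adding `pod` keeps every tracked resource within `hard`.

    All three arguments map a quota key such as "requests.memory" to an
    integer amount. Keys absent from `hard` are not constrained.
    """
    for key, ceiling in hard.items():
        if used.get(key, 0) + pod.get(key, 0) > ceiling:
            return False
    return True


if __name__ == "__main__":
    # Made-up numbers, in Mi.
    hard = {"requests.memory": 1000, "limits.memory": 2000}
    used = {"requests.memory": 800, "limits.memory": 1200}
    print(fits_quota(used, hard, {"requests.memory": 200, "limits.memory": 400}))  # True
    print(fits_quota(used, hard, {"requests.memory": 250, "limits.memory": 400}))  # False
```

In a real cluster, a quota on cpu or memory also causes pods that omit the corresponding requests or limits to be rejected, which is one reason to pair a ResourceQuota with LimitRange defaults.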

Where do these limits get applied? (Currently cat-env namespace is created by the Catalyst-API)

In the cat-env namespace, where the wiki envs (and any other Catalyst env) are deployed.

Is the behavior k8s takes when you hit limits configurable in any way?

Unfortunately not

Snapshot of usage in prod today:

$ kubectl -n cat-env top pod
NAME                                                CPU(cores)   MEMORY(bytes)   
vw-24-01-10-98-artifact-warehouse                   0m           0Mi             
vw-24-01-10-98-js-evaluator-56466cdd5f-92rzb        9m           99Mi            
vw-24-01-10-98-mariadb-55fb5f7485-f472n             1m           359Mi           
vw-24-01-10-98-mediawiki-865867ccd5-8p7jc           9m           365Mi           
vw-24-01-10-98-py-evaluator-788b46d75-lhrn9         8m           86Mi            
vw-25-03-06-240-artifact-warehouse                  1m           1Mi             
vw-25-03-06-240-js-evaluator-597655b6d-979dt        9m           89Mi            
vw-25-03-06-240-mariadb-6dc6b7689c-p5txg            1m           542Mi           
vw-25-03-06-240-mediawiki-7bb66dcc4f-pmr72          8m           365Mi           
vw-25-03-06-240-py-evaluator-74df798dd4-q68j7       9m           90Mi            
wiki-041e77e571-168-mariadb-5c98759c74-x97r6        1m           96Mi            
wiki-041e77e571-168-mediawiki-5d5fbb954d-6cd85      1m           347Mi           
wiki-0bb5b459a2-154-mariadb-85f57d7b4d-n7bgs        1m           124Mi           
wiki-0bb5b459a2-154-mediawiki-79c9b5d766-njjrm      1m           640Mi           
wiki-120212ac9b-153-mariadb-c6f8f6d8d-6d2bp         6m           131Mi           
wiki-120212ac9b-153-mediawiki-5bb7986497-8v8z6      129m         668Mi           
wiki-1d930405ae-150-mariadb-567cc786f6-pllsb        1m           125Mi           
wiki-1d930405ae-150-mediawiki-66f958bbc5-r8dhz      1m           681Mi           
wiki-3ac40ce6e5-173-mariadb-5fddfb6c7b-jvkbf        1m           118Mi           
wiki-3ac40ce6e5-173-mediawiki-6bc79556d-8czd5       1m           478Mi           
wiki-4073746e42-210-mariadb-767b9744f4-vh5d8        1m           136Mi           
wiki-4073746e42-210-mediawiki-5ddbd966cc-lxgjh      1m           517Mi           
wiki-52570609ab-148-mariadb-675875db5b-pctbq        6m           124Mi           
wiki-52570609ab-148-mediawiki-558bd79984-dnh5d      135m         621Mi           
wiki-5648f3da62-146-mariadb-5659b5b76d-glkz2        1m           70Mi            
wiki-5648f3da62-146-mediawiki-6d69944fb7-hwvd9      1m           49Mi            
wiki-6682c16744-99-mariadb-6d9c56bf8b-hds9r         6m           159Mi           
wiki-6682c16744-99-mediawiki-ff8ff6578-vj8rk        116m         643Mi           
wiki-6c4ff921f6-157-mariadb-cbc7c5dfb-zmm9z         1m           104Mi           
wiki-6c4ff921f6-157-mediawiki-f4d767699-rk6q8       1m           179Mi           
wiki-aa4aee753a-106-mariadb-58c999bf44-wjvhv        2m           131Mi           
wiki-aa4aee753a-106-mediawiki-6498f4c7c8-xgqwp      27m          606Mi           
wiki-afbbeb3389-132-mariadb-67764fc86b-ldr6q        1m           73Mi            
wiki-afbbeb3389-132-mediawiki-7c75dcf679-ld82k      1m           49Mi            
wiki-b7e51dec29-141-mariadb-7f4bcffb8b-59tzc        1m           130Mi           
wiki-b7e51dec29-141-mediawiki-6ff9954677-wrhpr      1m           596Mi           
wiki-c0f9f8242b-149-mariadb-5cf9574b76-c7rtm        6m           135Mi           
wiki-c0f9f8242b-149-mediawiki-74c6798546-v55vh      138m         596Mi           
wiki-c66647917e-133-mariadb-56b6cb4b5d-76kbx        1m           74Mi            
wiki-c66647917e-133-mediawiki-7f77f69b9d-ltn8g      1m           54Mi            
wiki-c675fe2606-140-mariadb-7666449c86-nf8s5        6m           134Mi           
wiki-c675fe2606-140-mediawiki-7f8c48586c-vqxvq      120m         606Mi           
wiki-eae4aa15e9-253-mariadb-545669b6c4-689wl        1m           99Mi            
wiki-eae4aa15e9-253-mediawiki-fbc57ddff-2pxff       1m           157Mi           
wiki-ecfb8a7b3a-100-mariadb-9f9f7847d-nfjvf         6m           128Mi           
wiki-ecfb8a7b3a-100-mediawiki-6686d5fd75-2w65q      110m         639Mi           
wikilambda-ci-1-172-artifact-warehouse              0m           3Mi             
wikilambda-ci-1-172-js-evaluator-57854f59f6-86f8k   9m           93Mi            
wikilambda-ci-1-172-mariadb-798dc96885-r4brr        1m           135Mi           
wikilambda-ci-1-172-mediawiki-5bcb78b776-tz5rz      10m          339Mi           
wikilambda-ci-1-172-py-evaluator-679fbdb94d-b99ld   8m           87Mi   

With averages:

$ kubectl -n cat-env top pod | awk '{print $2}' | sed 's/m//' | awk '{ sum += $1 } END { print sum/NR }'
17.6538
$ kubectl -n cat-env top pod | awk '{print $3}' | sed 's/Mi//' | awk '{ sum += $1 } END { print sum/NR }'
247.692

The prod host has 16 CPUs and 32G of memory. Relative to what the host offers, pods need far more memory than CPU, so of the two resources memory is the bottleneck.

If we assign e.g. 28G as the global memory request and assume 250M per pod and 2 pods per env (the wiki env pods), a rough calculation gives 28×1024÷(250×2) ≈ 57.3, so about 57 environments can be created before the cluster starts rejecting new pods. We can then set the memory limit to e.g. 30G. 57 environments is not a lot, but it's better to have this limit in place than to have the machine suffer OOM errors.

A similar calculation for the CPU, assuming 20m per container, yields 8000÷(20×2) = 200 environments. So in this case we can get away with using only 8 CPUs for requests and 10 CPUs for limits, to allow bursts.
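
Both estimates follow the same formula: the namespace budget divided by the usage of one environment. A small sketch, where the function name `max_environments` is hypothetical and the inputs are the figures quoted above:

```python
def max_environments(budget, per_pod, pods_per_env=2):
    """Whole environments that fit in `budget` (same unit as `per_pod`)."""
    return budget // (per_pod * pods_per_env)


if __name__ == "__main__":
    # Memory: 28G of requests expressed in M, 250M per pod.
    print(max_environments(28 * 1024, 250))  # 57
    # CPU: 8 CPUs (8000m) of requests, 20m per pod.
    print(max_environments(8000, 20))  # 200
```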

This MR uses these calculations to set the values: https://gitlab.wikimedia.org/repos/test-platform/catalyst/catalyst-api/-/merge_requests/99
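
For reference, a ResourceQuota carrying the figures from the calculations above could look like the sketch below, built as a Python dict and printed as JSON for `kubectl apply -f -`. This is not copied from the MR; the object name `cat-env-quota` is hypothetical, and the `Gi` units are an assumption about how the 28G and 30G figures would be written.

```python
import json

# Hypothetical ResourceQuota for the cat-env namespace, using the request
# and limit figures discussed in this task. Name and units are assumptions.
resource_quota = {
    "apiVersion": "v1",
    "kind": "ResourceQuota",
    "metadata": {"name": "cat-env-quota", "namespace": "cat-env"},
    "spec": {
        "hard": {
            "requests.cpu": "8",
            "limits.cpu": "10",
            "requests.memory": "28Gi",
            "limits.memory": "30Gi",
        }
    },
}

if __name__ == "__main__":
    print(json.dumps(resource_quota, indent=2))
```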

Another important factor: Patchdemo runs on the same host and is currently using about 5G of memory:

$ kubectl -n patchdemo top pod
NAME                         CPU(cores)   MEMORY(bytes)   
patchdemo-6d68cc9994-4nhh2   430m         4199Mi          
patchdemo-mariadb-0          46m          845Mi 

We can trust that once Catalyst is the default backend, the Patchdemo environments will start going away and free up more memory for Catalyst environments. Or maybe we want to be more proactive: add more memory to the host or migrate Patchdemo somewhere else.

Some very conservative estimates show that, based on this usage, Catalyst could support a maximum of 38 environments.

Deployed to prod