
Investigate resource limits for catalyst wiki environments
Closed, ResolvedPublic2 Estimated Story Points

Description

https://wikitech.wikimedia.org/wiki/Catalyst/Incidents/2025-01-29 was caused by the OOM killer locking up the instance.

We should investigate environment resource limits as a means to prevent the OOM killer from getting triggered in the first place.

  • Determine resource usage of a Catalyst wiki environment (possibly using ab)
  • Try adding resource limits to an environment (docs)
  • Ensure resource limits are working and find the failure mode (what happens if the limit is reached from the user and admin perspective)
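
For the failure-mode bullet, the memory case can be reproduced deliberately. The sketch below is adapted from the Kubernetes resource-management docs: it runs a container that tries to allocate more memory than its limit. The pod name is a hypothetical throwaway, and the `polinux/stress` image is the one the Kubernetes docs use, not part of the Catalyst charts:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: oom-test                     # hypothetical throwaway test pod
spec:
  containers:
    - name: stress
      image: polinux/stress          # assumption: any image that can allocate memory works
      resources:
        limits:
          memory: "100Mi"            # limit set below what the process allocates
      command: ["stress"]
      args: ["--vm", "1", "--vm-bytes", "150M", "--vm-hang", "1"]
```

`kubectl describe pod oom-test` should then show the container terminated with reason `OOMKilled` and restarting; that is the admin-side failure mode, while from the user side the wiki would simply stop responding until the container comes back.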

Details

Related Changes in GitLab:
Title: Add resource limits to wiki and mariadb containers
Reference: repos/test-platform/catalyst/ci-charts!45
Author: jhuneidi
Source Branch: T385320
Dest Branch: main

Event Timeline

thcipriani triaged this task as Medium priority.
thcipriani updated the task description. (Show Details)
thcipriani set the point value for this task to 2.
thcipriani moved this task from Backlog to Ready on the Catalyst (Kiwen) board.

from https://kubernetes.io/docs/concepts/configuration/manage-resources-containers:

cpu limits are enforced by CPU throttling. When a container approaches its cpu limit, the kernel will restrict access to the CPU corresponding to the container's limit. Thus, a cpu limit is a hard limit the kernel enforces. Containers may not use more CPU than is specified in their cpu limit.

memory limits are enforced by the kernel with out of memory (OOM) kills. When a container uses more than its memory limit, the kernel may terminate it. However, terminations only happen when the kernel detects memory pressure. Thus, a container that over allocates memory may not be immediately killed. This means memory limits are enforced reactively. A container may use more memory than its memory limit, but if it does, it may get killed.
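
Concretely, requests and limits are declared per container under `resources:` in the pod spec. A minimal sketch using the fpm numbers discussed below (the pod and image names are illustrative, not the actual ci-charts definitions):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-wiki                  # illustrative, not the real chart
spec:
  containers:
    - name: mw-fpm
      image: example/mediawiki-fpm    # placeholder image name
      resources:
        requests:                     # used by the scheduler when placing the pod
          memory: "400Mi"
          cpu: "1m"
        limits:                       # cpu is throttled; memory overuse risks an OOM kill
          memory: "800Mi"
          cpu: "750m"
```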

Wikis "at rest" in the cluster use ~1m CPU and under 600Mi memory. Directly after wiki creation, the job runner can use a lot of CPU (I noticed ~950m in use), but during discussion we decided it is fine to throttle this.
Current MariaDB memory usage in Catalyst wikis is not more than 150Mi, and CPU usage is also ~1m.

Results of the test with 30,000 requests at concurrency 1:

ab -n30000 -c1 https://test-wiki.catalyst-dev.wmcloud.org/
...
Document Path:          /
Document Length:        0 bytes

Concurrency Level:      1
Time taken for tests:   2482.340 seconds
Complete requests:      30000
Failed requests:        0
Non-2xx responses:      30000
Total transferred:      17613629 bytes
HTML transferred:       0 bytes
Requests per second:    12.09 [#/sec] (mean)
Time per request:       82.745 [ms] (mean)
Time per request:       82.745 [ms] (mean, across all concurrent requests)
Transfer rate:          6.93 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        2    5  10.3      4    1035
Processing:    55   78  13.0     77    1093
Waiting:       55   78  13.0     77    1093
Total:         58   83  16.8     81    1127

Percentage of the requests served within a certain time (ms)
  50%     81
  66%     85
  75%     88
  80%     91
  90%     98
  95%    104
  98%    111
  99%    116
 100%   1127 (longest request)

CPU usage spiked to above 1700m:

POD                                    NAME           CPU(cores)   MEMORY(bytes)   
test-wiki-mediawiki-849cfc77b7-52mnm   mw-fpm         1737m        264Mi           
test-wiki-mediawiki-849cfc77b7-52mnm   mw-jobrunner   3m           39Mi            
test-wiki-mediawiki-849cfc77b7-52mnm   mw-web         25m          26Mi            
test-wiki-mysql-6ff78c4784-7nwts       mariadb        61m          121Mi

Results of a smaller test with 300 requests:

ab -n300 -c1 https://test-wiki2.catalyst-dev.wmcloud.org/
...
Document Path:          /
Document Length:        0 bytes

Concurrency Level:      1
Time taken for tests:   25.246 seconds
Complete requests:      300
Failed requests:        0
Non-2xx responses:      300
Total transferred:      176419 bytes
HTML transferred:       0 bytes
Requests per second:    11.88 [#/sec] (mean)
Time per request:       84.155 [ms] (mean)
Time per request:       84.155 [ms] (mean, across all concurrent requests)
Transfer rate:          6.82 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        3    5   1.4      6      16
Processing:    59   78  15.3     78     285
Waiting:       59   78  15.3     78     284
Total:         63   84  15.6     83     291

Percentage of the requests served within a certain time (ms)
  50%     83
  66%     86
  75%     88
  80%     89
  90%     96
  95%    101
  98%    109
  99%    113
 100%    291 (longest request)

Highest CPU usage was:

POD                                     NAME           CPU(cores)   MEMORY(bytes)         
test-wiki2-mediawiki-79cb75465f-xzzgf   mw-fpm         733m         191Mi           
test-wiki2-mediawiki-79cb75465f-xzzgf   mw-jobrunner   1m           41Mi            
test-wiki2-mediawiki-79cb75465f-xzzgf   mw-web         15m          34Mi            
test-wiki2-mysql-96b6d8f7b-k582n        mariadb        17m          116Mi

I'm not sure what typical usage of the patchdemo wikis entails outside of user acceptance testing, but I think the ab test is far above what anyone is doing with the wikis at the moment.

We could have the default values as follows:

mw-fpm:
  requests:
    memory: "400Mi"
    cpu: "1m"
  limits:
    memory: "800Mi"
    cpu: "750m"

mw-web:
  requests:
    memory: "50Mi"
    cpu: "1m"
  limits:
    memory: "100Mi"
    cpu: "100m"

mw-jobrunner:
  requests:
    memory: "50Mi"
    cpu: "1m"
  limits:
    memory: "100Mi"
    cpu: "750m"

mariadb:
  requests:
    memory: "150Mi"
    cpu: "1m"
  limits:
    memory: "300Mi"
    cpu: "250m"
jeena changed the task status from Open to In Progress. Feb 6 2025, 11:07 PM