
mw-on-k8s app container CPU throttling at low average load
Closed, ResolvedPublic

Description

Problem first raised in T342252: Migrate rdf-streaming-updater to connect to mw-on-k8s

All our mw-on-k8s deployments are experiencing significant throttling at low CPU load. For instance mw-web, the main deployment, gets throttled in eqiad for up to 5 ms while using less than 1/10th of its CPU quota.

This is probably due to how the CPU quota is allocated in fixed timeslots (CFS periods), see:
https://medium.com/@betz.mark/understanding-resource-limits-in-kubernetes-cpu-time-9eff74d3161b
https://medium.com/indeed-engineering/unthrottled-how-a-valid-fix-becomes-a-regression-f61eabb2fbd9
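
A rough sketch of the mechanism in Python, with hypothetical numbers rather than our actual pod sizing (the function and its figures are illustrative only):

# Minimal sketch (hypothetical numbers) of why a bursty workload gets throttled
# even at low average usage: CFS hands out the quota in fixed periods, and a
# parallel burst of workers can exhaust it partway through a single period.

PERIOD_MS = 100                      # default CFS period used by Kubernetes
LIMIT_CPUS = 8                       # hypothetical container CPU limit
QUOTA_MS = LIMIT_CPUS * PERIOD_MS    # 800 ms of CPU time granted per period

def remaining_work_ms(concurrent_workers: int, burst_ms: int) -> float:
    """CPU time each worker still needs after the quota runs out; that work
    has to wait for the next period's refill, i.e. it shows up as throttling."""
    demanded_ms = concurrent_workers * burst_ms
    return max(0.0, (demanded_ms - QUOTA_MS) / concurrent_workers)

# 32 FPM workers all runnable for 30 ms demand 960 ms of CPU in one period,
# but only 800 ms is granted: each worker is left with ~5 ms of throttled work,
# even though the pod can be nearly idle over the rest of a 5-minute window.
print(remaining_work_ms(concurrent_workers=32, burst_ms=30))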

I propose we remove the CPU limit for the app container in our mw-on-k8s deployments.

Related Objects

Event Timeline

Clement_Goubert created this task.
Clement_Goubert moved this task from Incoming 🐫 to this.quarter 🍕 on the serviceops board.

In my opinion, we need to fix this before moving forward with migrating more traffic to mw-on-k8s.

Clement_Goubert renamed this task from "mw-on-k8s php-fpm container CPU throttling at low average load" to "mw-on-k8s app container CPU throttling at low average load". Jul 26 2023, 11:46 AM
Clement_Goubert updated the task description. (Show Details)

We were consistently throttled until we set limits == FPM worker count. Per the description (and Dan Luu's insightful foray[1] into the topic), I don't think there is much that can be done besides adjusting or removing the limits, or tweaking the CFS period that k8s uses. Removing the limits is probably fine given that the size of the worker pool is a natural upper bound on concurrency with pm = static.


[1] https://danluu.com/cgroup-throttling/

Thanks @TK-999. Indeed, for a PHP application like MediaWiki that doesn't shell out much, the number of workers is a hard limit on the number of CPUs it can use, roughly 1 CPU per worker, with typical usage around 0.1 CPU-seconds per second per worker in the mediawiki appserver cluster.

So in practice we might want to slightly raise the CPU requests for a mediawiki pod, and possibly remove the limits.

Reducing the CFS quota period from 100 ms to something like 10 ms also probably makes sense.
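
A quick illustration of why a shorter period would help, under the simplifying assumption that a throttled worker just waits until the current period ends (the CPU limit here is again hypothetical); as far as I know the kubelet exposes this via its --cpu-cfs-quota-period setting:

# Sketch: once the quota is spent, throttled threads wait for the next period,
# so the worst-case single stall is roughly bounded by the period length.
LIMIT_CPUS = 8                       # hypothetical container CPU limit
for period_ms in (100, 10):
    quota_ms = LIMIT_CPUS * period_ms
    print(f"period={period_ms:>3}ms: quota refilled in {quota_ms}ms slices, "
          f"worst-case stall per throttle event ~{period_ms}ms")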

How I evaluated the current seconds per worker:

I used the following formula in promQL:

sum(rate(container_cpu_usage_seconds_total{cluster="$cluster", id=~"/system.slice/php7\\.4-fpm.service"}[5m]))
/
sum(phpfpm_statustext_processes{site="eqiad", service="php7.4-fpm.service", cluster="$cluster"})

the values I found are:

  • ~ 0.05-0.15 for appservers
  • ~ 0.2-0.25 for apis and jobrunners
  • ~ 0.4-0.5 for parsoid

Thanks for the insight @TK-999
When you say "limits == FPM worker count", do you mean one whole CPU per worker? Did you use pinning as well?
As I understand it, even using whole CPU counts matching process count, we would still see some (but probably less) throttling due to the CFS timeslot mechanism.

@Joe So we would set the requests for the app container to something like:

  • mw-web 100m*nb_workers
  • mw-api-* 200m*nb_workers

Then set the pod request a bit higher than that (to account for the sidecars), and remove limits for the main container, the sidecars, and the whole pod?
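
For illustration, with made-up worker counts and sidecar headroom (the real values live in the helmfile values for each release), that would work out to roughly:

# Hypothetical worker counts and sidecar allowance, for illustration only.
DEPLOYMENTS = {
    "mw-web":     {"workers": 60, "cpu_per_worker_m": 100},
    "mw-api-ext": {"workers": 60, "cpu_per_worker_m": 200},
}
SIDECAR_HEADROOM_M = 500  # rough allowance for the sidecars, also made up

for name, cfg in DEPLOYMENTS.items():
    app_m = cfg["workers"] * cfg["cpu_per_worker_m"]
    print(f"{name}: app container request {app_m}m, "
          f"pod request ~{app_m + SIDECAR_HEADROOM_M}m, no CPU limit")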

@Clement_Goubert Yeah, we currently set a limit of 1 CPU per worker. We have not experimented with pinning.

In practice, this keeps throttling at < 0.25% - likely because even if a pod sees 100% process utilization, those processes might be waiting on I/O or otherwise not utilizing the CPU time budget.

We had a long but productive discussion with @JMeybohm this morning, resulting in a tentative plan of action:

  1. Graph the global latency of wikikube-hosted services. This is not useful as a raw number, but if it's not too spiky, a variation should help us spot whether mediawiki is being too noisy a neighbor.
  2. T277876: Reserve resources for system daemons on kubernetes nodes should be completed before removing limits on all mw-on-k8s deployments, so we avoid system resource starvation under spikes once the baseload has increased. It has been updated with an implementation proposal. This is not a blocker per se.
  3. Remove limits on mw-api-int. We would not raise the worker count, and would keep requests where they are for the php-fpm container (0.5 CPUs/worker).
  4. Let that run for a week or so to get actionable intel on behaviour.
  5. If it's conclusive, remove limits for the php-fpm container on all mw-on-k8s deployments.

A question we were left with is whether we can load-test against a single pod IP to check the behavior more quickly. I know a load test of mw-on-k8s has been discussed with Performance-Team; maybe we could collaborate on that?

Change 943560 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: set requests based on php.workers

https://gerrit.wikimedia.org/r/943560

Clement_Goubert changed the task status from Stalled to In Progress. Aug 8 2023, 1:37 PM

Change 943560 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: set requests based on php.workers

https://gerrit.wikimedia.org/r/943560

Change 947792 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Revert "mediawiki: set requests based on php.workers"

https://gerrit.wikimedia.org/r/947792

Change 947792 merged by jenkins-bot:

[operations/deployment-charts@master] Revert "mediawiki: set requests based on php.workers"

https://gerrit.wikimedia.org/r/947792

Change 949957 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Set requests based on php.workers

https://gerrit.wikimedia.org/r/949957

Change 950138 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Add exporter limits and requests

https://gerrit.wikimedia.org/r/950138

Change 950138 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Add exporter limits and requests

https://gerrit.wikimedia.org/r/950138

Change 949957 merged by jenkins-bot:

[operations/deployment-charts@master] mw-api-int: Set requests based on php.workers

https://gerrit.wikimedia.org/r/949957

php containers without CPU limits were deployed today on mw-api-int.
Next week we will reintroduce memory limits, and extend the removal of CPU limits to all mw-on-k8s deployments except mw-debug and mw-misc.

Change 950177 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Reduce memory request

https://gerrit.wikimedia.org/r/950177

Change 950177 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Reduce memory request

https://gerrit.wikimedia.org/r/950177

Change 951045 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Allow autocomputing the memory limit

https://gerrit.wikimedia.org/r/951045

Change 951051 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Autocompute requests and limits for all

https://gerrit.wikimedia.org/r/951051

Change 951052 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-api-int: autocompute memory limit

https://gerrit.wikimedia.org/r/951052

Change 951045 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Allow autocomputing the memory limit

https://gerrit.wikimedia.org/r/951045

Mentioned in SAL (#wikimedia-operations) [2023-08-21T10:59:49Z] <claime> Deploying memory limit autocompute for mw-on-k8s - T342748

Change 951052 merged by jenkins-bot:

[operations/deployment-charts@master] mw-api-int: autocompute memory limit

https://gerrit.wikimedia.org/r/951052

Mentioned in SAL (#wikimedia-operations) [2023-08-21T11:02:07Z] <claime> Enabling memory limit autocompute for mw-api-int - T342748

Change 951051 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Autocompute requests and limits for all

https://gerrit.wikimedia.org/r/951051

Mentioned in SAL (#wikimedia-operations) [2023-08-21T13:42:48Z] <claime> Enabling memory limit autocompute for all mw-on-k8s deployments - T342748

Change 951125 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-misc: Enforce fixed requests and limits

https://gerrit.wikimedia.org/r/951125

Change 951125 merged by jenkins-bot:

[operations/deployment-charts@master] mw-misc: Enforce fixed requests and limits

https://gerrit.wikimedia.org/r/951125

Mentioned in SAL (#wikimedia-operations) [2023-08-21T13:55:16Z] <claime> Re-enforcing limits and requests for mw-misc - T342748

All deployments of mw-on-k8s are now using:

  • Autocomputed CPU requests, no limits
  • Autocomputed Memory requests and limits

The only exception is mw-misc, which is on fixed requests and limits.
For future reference, without having to go check the chart, the resource computation is as follows (a code sketch follows the list):

  • Requests:
      • CPU: cpu_per_worker (float, unit: CPU, e.g. 0.5 is half a CPU per worker) multiplied by the number of configured workers + 1 (to take the main php-fpm process into account), with a minimum of 1 whole CPU
      • RAM: 50% of memory_per_worker multiplied by the number of workers (ignoring the main php-fpm process), plus 50% of the opcache size and the apc size (close to the average real consumption)
  • Limits:
      • CPU: none
      • RAM: memory_per_worker multiplied by the number of workers (ignoring the main php-fpm process), plus 50% of the opcache size and the apc size (close to the average real consumption)
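
A Python sketch of that computation, assuming the 50% applies to both the opcache and apc sizes; the parameter names mirror the chart values but are otherwise illustrative:

import math

def mediawiki_resources(workers: int,
                        cpu_per_worker: float,      # CPUs per FPM worker, e.g. 0.5
                        memory_per_worker_mi: int,  # MiB per FPM worker
                        opcache_mi: int,
                        apc_mi: int) -> dict:
    """Sketch of the autocomputed requests/limits described above."""
    # CPU request: cpu_per_worker times (workers + 1) to account for the
    # php-fpm master process, with a floor of 1 whole CPU. No CPU limit.
    cpu_request = max(1.0, cpu_per_worker * (workers + 1))
    # Memory request: 50% of per-worker memory times workers, plus 50% of
    # the opcache and apc allocations (close to average real consumption).
    mem_request_mi = 0.5 * memory_per_worker_mi * workers + 0.5 * (opcache_mi + apc_mi)
    # Memory limit: full per-worker memory times workers, plus the same
    # 50% of opcache + apc.
    mem_limit_mi = memory_per_worker_mi * workers + 0.5 * (opcache_mi + apc_mi)
    return {
        "requests": {"cpu": cpu_request, "memory_mi": math.ceil(mem_request_mi)},
        "limits": {"memory_mi": math.ceil(mem_limit_mi)},  # no CPU limit set
    }

# Hypothetical example: 60 workers at 0.5 CPU and 300 MiB each,
# 500 MiB opcache, 400 MiB apc.
print(mediawiki_resources(60, 0.5, 300, 500, 400))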

Everything is looking OK. We will see how it copes with doubling the incoming traffic as part of T341780: Direct 5% of all traffic to mw-on-k8s (only going to 2% for now), and resolve this task afterwards if everything stays OK.