
mw-on-k8s tls-proxy container CPU throttling at low average load
Closed, Resolved · Public

Assigned To
Authored By
Clement_Goubert
Aug 23 2023, 11:46 AM

Description

As we can see from the following screenshots, our two outward-facing mw-on-k8s deployments are experiencing throttling of the tls-proxy container while their load is nowhere near the CPU limit (similar to T342748: mw-on-k8s app container CPU throttling at low average load).

(Four screenshots: per-deployment graphs showing tls-proxy CPU throttling alongside CPU usage well below the limit.)

We may want to implement a similar solution (removing CPU limits) if we're confident envoy's CPU usage isn't going to run away from us. Thoughts?
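
For reference, throttling can also be checked from inside a container, independently of average usage. A minimal Python sketch, assuming cgroup v1 paths as typically mounted in the container (on cgroup v2 the file is /sys/fs/cgroup/cpu.stat with slightly different field names):

```
#!/usr/bin/env python3
# Minimal sketch: report what fraction of CFS periods this container was
# throttled in, which can be high even when average CPU usage is low.
def cfs_stats(path="/sys/fs/cgroup/cpu/cpu.stat"):  # cgroup v1 path (assumption)
    stats = {}
    with open(path) as f:
        for line in f:
            key, value = line.split()
            stats[key] = int(value)
    return stats

stats = cfs_stats()
if stats.get("nr_periods"):
    pct = 100 * stats["nr_throttled"] / stats["nr_periods"]
    print(f"throttled in {pct:.1f}% of CFS periods "
          f"({stats['throttled_time'] / 1e9:.1f}s total throttled time)")
```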

Event Timeline

Clement_Goubert changed the task status from Open to In Progress.Aug 23 2023, 11:47 AM
Clement_Goubert triaged this task as High priority.
Clement_Goubert removed a project: SRE.
Clement_Goubert moved this task from Incoming 🐫 to Doing 😎 on the serviceops board.

Dumping the envoy configuration in one of our containers, and confirming that no CLI flag is set for it, shows that envoy sets its number of worker threads to the number of hardware cores.
In other words, we have a CPU limit of 500m CPU on a 48-core machine, so each worker thread gets ~10m CPU (this is for illustration only; the allocation doesn't actually work like that).

In effect, we are letting envoy spawn 48 worker threads and then throttling all of them.

Istio basically avoids this by taking the CPU limit and rounding up, but in our case that would give a concurrency of 1, which is probably not what we want (unless we think envoy can handle the connections with only one worker thread).
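
To make the arithmetic concrete, a small worked sketch in Python (the 100 ms CFS period is the Kubernetes default; the 500m limit and 48 hardware threads are the figures from this task):

```
import math

CFS_PERIOD_MS = 100     # default Kubernetes/CFS accounting period
cpu_limit = 0.5         # tls-proxy CPU limit: 500m
hw_threads = 48         # hardware cores envoy detects on the node

# The whole container gets 50 ms of CPU time per 100 ms period...
quota_ms = cpu_limit * CFS_PERIOD_MS

# ...shared by 48 default worker threads, i.e. roughly 1 ms each per period
# (the "~10m CPU per thread" above). Once the shared quota is exhausted,
# every thread is throttled until the next period starts.
per_worker_ms = quota_ms / hw_threads
per_worker_mcpu = 1000 * cpu_limit / hw_threads

# Istio-style sizing instead derives concurrency from the limit, rounded up:
istio_concurrency = max(1, math.ceil(cpu_limit))   # == 1 for a 500m limit

print(quota_ms, round(per_worker_ms, 2), round(per_worker_mcpu, 1), istio_concurrency)
```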

We can:

  • Remove CPU limits, leave the concurrency at its default (all cores), and verify that envoy uses little enough CPU per connection that it doesn't matter to the rest of the workload (considering the current CPU usage even with throttling, this may be the fast and easy solution)
  • Remove CPU limits and set envoy's concurrency to some to-be-determined value
  • Leave the limit in place, set envoy's concurrency to 1, and verify that its non-blocking model can serve requests fast enough that latency beats that of many heavily throttled threads (this would require careful testing and data gathering)
  • Find a CPU limit in the single or low double digits that we're confident will allow enough threads to serve the requests, and have envoy set its concurrency to match
  • Manually set envoy's concurrency to a value we're comfortable with that is lower than the host core count, keep a reasonable CPU limit, and iterate to find the spot with minimal throttling (a sketch of deriving the concurrency from the CPU allocation follows below)

Personally, I think the first option is the easiest and requires the least fine-tuning, but it doesn't protect us against runaway CPU usage.
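
As a rough illustration of the concurrency-from-limit options above (the entrypoint, environment variable and config path here are hypothetical, not what the production image actually does), a wrapper could pick envoy's worker count from an explicit override or from the CPU allocation instead of the host core count:

```
#!/usr/bin/env python3
# Hypothetical entrypoint sketch: choose an envoy --concurrency value instead
# of letting envoy default to one worker per hardware thread.
import math
import os

def pick_concurrency(cpu_allocation: float) -> int:
    override = os.environ.get("ENVOY_CONCURRENCY")  # hypothetical override knob
    if override:
        return max(1, int(override))
    # Otherwise round the CPU allocation up to whole workers, Istio-style.
    return max(1, math.ceil(cpu_allocation))

if __name__ == "__main__":
    workers = pick_concurrency(cpu_allocation=0.5)
    # --concurrency is envoy's real flag for the worker thread count;
    # the config path below is illustrative.
    os.execvp("envoy", ["envoy", "--concurrency", str(workers),
                        "-c", "/etc/envoy/envoy.yaml"])
```

The changes that follow ended up going roughly this way: concurrency control was added to the envoy command line and the mesh chart, and the tls-proxy CPU limits were removed.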

Change 952148 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/docker-images/production-images@master] envoy: Add concurrency control to envoy cmdline

https://gerrit.wikimedia.org/r/952148

Change 952158 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mesh: Add concurrency control for envoy workers

https://gerrit.wikimedia.org/r/952158

Change 952159 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Remove limits for tls-proxy container

https://gerrit.wikimedia.org/r/952159

Change 952171 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Generalize tls-proxy limits removal

https://gerrit.wikimedia.org/r/952171

Change 952148 merged by Clément Goubert:

[operations/docker-images/production-images@master] envoy: Add concurrency control to envoy cmdline

https://gerrit.wikimedia.org/r/952148

Change 952158 merged by jenkins-bot:

[operations/deployment-charts@master] mesh: Add concurrency control for envoy workers

https://gerrit.wikimedia.org/r/952158

Change 952159 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Remove limits for tls-proxy container

https://gerrit.wikimedia.org/r/952159

Mentioned in SAL (#wikimedia-operations) [2023-08-25T08:44:24Z] <claime> mw-debug: Remove limits for tls-proxy container - T344814

Change 952314 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] admin_ng: Remove limits on mw-debug namespace

https://gerrit.wikimedia.org/r/952314

Change 952314 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: Remove limits on mw-debug namespace

https://gerrit.wikimedia.org/r/952314

Change 952316 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-debug: Fix resourcequota on namespace

https://gerrit.wikimedia.org/r/952316

Change 952316 merged by jenkins-bot:

[operations/deployment-charts@master] mw-debug: Fix resourcequota on namespace

https://gerrit.wikimedia.org/r/952316

Change 952319 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Fix revert

https://gerrit.wikimedia.org/r/952319

Change 952319 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Fix revert

https://gerrit.wikimedia.org/r/952319

Change 952812 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Switch to set worker envoy threads

https://gerrit.wikimedia.org/r/952812

Change 952812 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Switch to set worker envoy threads

https://gerrit.wikimedia.org/r/952812

Mentioned in SAL (#wikimedia-operations) [2023-08-28T10:10:21Z] <claime> Deploying 952812 for T344814 to mw-debug and mw-api-ext

Change 952823 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Global values override cpu limit removal

https://gerrit.wikimedia.org/r/952823

Change 952823 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Global values override cpu limit removal

https://gerrit.wikimedia.org/r/952823

Change 952867 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Remove tls-proxy CPU limits

https://gerrit.wikimedia.org/r/952867

Change 952171 abandoned by Clément Goubert:

[operations/deployment-charts@master] mediawiki: Generalize tls-proxy limits removal

Reason:

Superseded by I94e475715a5abda59ba48421cc61925fa83959da

https://gerrit.wikimedia.org/r/952171

Change 953230 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] kubernetes: Bump envoy image version to 1.23.10-2-s1

https://gerrit.wikimedia.org/r/953230

Change 953230 merged by Clément Goubert:

[operations/puppet@production] kubernetes: Bump envoy image version to 1.23.10-2-s2

https://gerrit.wikimedia.org/r/953230

Mentioned in SAL (#wikimedia-operations) [2023-08-29T10:30:03Z] <claime> Running puppet on deploy servers to bump envoy image version - T344814

Change 952867 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Remove tls-proxy CPU limits

https://gerrit.wikimedia.org/r/952867

Mentioned in SAL (#wikimedia-operations) [2023-08-29T10:39:31Z] <cgoubert@deploy1002> Started scap: Removing mw-on-k8s tls-proxy CPU limits - T344814

Mentioned in SAL (#wikimedia-operations) [2023-08-29T10:41:59Z] <cgoubert@deploy1002> Finished scap: Removing mw-on-k8s tls-proxy CPU limits - T344814 (duration: 02m 27s)

CPU limits have now been removed on all mw-on-k8s deployments except mw-misc. We'll wait a few days to see how the reduced concurrency impacts latency, if at all, and then resolve this task.

p50 latency increased slightly; we may want to raise the concurrency a little and see what shakes out.
Example: mw-web eqiad

(Screenshot: p50 latency graph for mw-web eqiad.)

This is just tweaking, though, so I'm considering this task Resolved.