
mw-on-k8s tls-proxy container CPU throttling at low average load
Closed, ResolvedPublic

Assigned To
Authored By
Clement_Goubert
Aug 23 2023, 11:46 AM

Description

As the following screenshots show, our two outward-facing mw-on-k8s deployments are experiencing throttling of the tls-proxy container while their load is nowhere near the CPU limit (similar to T342748: mw-on-k8s app container CPU throttling at low average load).

[Four screenshots: tls-proxy container CPU throttling and CPU usage graphs for both deployments]

We may want to implement a similar solution (removing CPU limits) if we're confident envoy's CPU usage isn't going to run away from us. Thoughts?

Event Timeline

Clement_Goubert changed the task status from Open to In Progress.Aug 23 2023, 11:47 AM
Clement_Goubert triaged this task as High priority.
Clement_Goubert removed a project: SRE.
Clement_Goubert moved this task from Incoming đŸ« to Doing 😎 on the serviceops board.

Dumping the envoy configuration in one of our containers, combined with the fact that no concurrency CLI flag is set, shows that envoy is setting its number of worker threads to the number of hardware cores.
In other words, we have a CPU limit of 500mCPU on a 48-core machine, so each worker thread gets ~10mCPU (this is just for illustration; the quota isn't actually allocated per thread like that).

In effect, we are letting envoy spawn 48 threads and then throttling all of them.
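For illustration, the arithmetic above can be sketched as follows (the numbers are from this task; the per-thread split is a simplification, since CFS enforces the quota for the whole container, not per thread):

```python
# Illustrative only: CFS does not hand out quota per thread, but this
# shows why 48 workers under a 500m limit leads to heavy throttling.

CPU_LIMIT_MILLICORES = 500   # tls-proxy container CPU limit (500m)
HARDWARE_CORES = 48          # envoy's default worker thread count here

def per_thread_share(limit_millicores: int, threads: int) -> float:
    """Naive per-thread CPU share in millicores."""
    return limit_millicores / threads

share = per_thread_share(CPU_LIMIT_MILLICORES, HARDWARE_CORES)
print(f"~{share:.1f}m of CPU per worker thread")  # ~10.4m
```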

Istio basically takes the CPU limit and rounds up to avoid this, but in our case that would give a concurrency of 1, which is probably not what we want (unless we think envoy can handle the connections with only one worker thread).
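A rough sketch of that rounding logic (a simplified illustration of the approach described above, not Istio's actual code):

```python
import math

def concurrency_from_limit(limit_millicores: int) -> int:
    """Round the CPU limit up to a whole number of worker threads,
    with a floor of 1. Illustrative sketch only."""
    return max(1, math.ceil(limit_millicores / 1000))

print(concurrency_from_limit(500))   # 1  <- our tls-proxy case
print(concurrency_from_limit(2500))  # 3
```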

We can:

  • Remove CPU limits, leave the concurrency to max, and verify that envoy will use little enough CPU for each connection that it won't matter to the rest of the workload (considering the current CPU usage even with throttling, this may be the fast and easy solution)
  • Remove CPU limits, set envoy's concurrency to some to-be-determined value
  • Not touch the limit, set envoy's concurrency to 1, and verify that its non-blocking model is able to serve requests fast enough that latency is better than with massively throttled multithreading (this would require careful testing and data gathering)
  • Find a CPU limit in the single/low double digits we're confident will allow enough threads to serve the requests and make envoy set its concurrency to that
  • Manually set envoy's concurrency to a value we're comfortable with that isn't all the cores and keep a reasonable CPU limit, iterating to find the spot where there is minimal throttling.
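Whichever option we pick, we can measure throttling directly from the container's cgroup counters while iterating. A minimal sketch, assuming cgroup v1-style cpu.stat contents (nr_periods / nr_throttled fields; on cgroup v2 the field names differ slightly):

```python
def throttle_ratio(cpu_stat_text: str) -> float:
    """Fraction of CFS periods in which the cgroup was throttled.

    Expects the text of a cgroup v1 cpu.stat file, e.g.:
        nr_periods 1000
        nr_throttled 250
        throttled_time 12000000
    """
    stats = {}
    for line in cpu_stat_text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0

sample = "nr_periods 1000\nnr_throttled 250\nthrottled_time 12000000"
print(f"{throttle_ratio(sample):.0%} of periods throttled")  # 25%
```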

Personally, I think the first solution is the easiest and requires the least fine-tuning, but it doesn't protect us against runaway effects.

Change 952148 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/docker-images/production-images@master] envoy: Add concurrency control to envoy cmdline

https://gerrit.wikimedia.org/r/952148

Change 952158 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mesh: Add concurrency control for envoy workers

https://gerrit.wikimedia.org/r/952158

Change 952159 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Remove limits for tls-proxy container

https://gerrit.wikimedia.org/r/952159

Change 952171 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Generalize tls-proxy limits removal

https://gerrit.wikimedia.org/r/952171

Change 952148 merged by Clément Goubert:

[operations/docker-images/production-images@master] envoy: Add concurrency control to envoy cmdline

https://gerrit.wikimedia.org/r/952148

Change 952158 merged by jenkins-bot:

[operations/deployment-charts@master] mesh: Add concurrency control for envoy workers

https://gerrit.wikimedia.org/r/952158

Change 952159 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Remove limits for tls-proxy container

https://gerrit.wikimedia.org/r/952159

Mentioned in SAL (#wikimedia-operations) [2023-08-25T08:44:24Z] <claime> mw-debug: Remove limits for tls-proxy container - T344814

Change 952314 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] admin_ng: Remove limits on mw-debug namespace

https://gerrit.wikimedia.org/r/952314

Change 952314 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: Remove limits on mw-debug namespace

https://gerrit.wikimedia.org/r/952314

Change 952316 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-debug: Fix resourcequota on namespace

https://gerrit.wikimedia.org/r/952316

Change 952316 merged by jenkins-bot:

[operations/deployment-charts@master] mw-debug: Fix resourcequota on namespace

https://gerrit.wikimedia.org/r/952316

Change 952319 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Fix revert

https://gerrit.wikimedia.org/r/952319

Change 952319 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Fix revert

https://gerrit.wikimedia.org/r/952319

Change 952812 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Switch to set worker envoy threads

https://gerrit.wikimedia.org/r/952812

Change 952812 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Switch to set worker envoy threads

https://gerrit.wikimedia.org/r/952812

Mentioned in SAL (#wikimedia-operations) [2023-08-28T10:10:21Z] <claime> Deploying 952812 for T344814 to mw-debug and mw-api-ext

Change 952823 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Global values override cpu limit removal

https://gerrit.wikimedia.org/r/952823

Change 952823 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Global values override cpu limit removal

https://gerrit.wikimedia.org/r/952823

Change 952867 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Remove tls-proxy CPU limits

https://gerrit.wikimedia.org/r/952867

Change 952171 abandoned by Clément Goubert:

[operations/deployment-charts@master] mediawiki: Generalize tls-proxy limits removal

Reason:

Superseded by I94e475715a5abda59ba48421cc61925fa83959da

https://gerrit.wikimedia.org/r/952171

Change 953230 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] kubernetes: Bump envoy image version to 1.23.10-2-s1

https://gerrit.wikimedia.org/r/953230

Change 953230 merged by Clément Goubert:

[operations/puppet@production] kubernetes: Bump envoy image version to 1.23.10-2-s2

https://gerrit.wikimedia.org/r/953230

Mentioned in SAL (#wikimedia-operations) [2023-08-29T10:30:03Z] <claime> Running puppet on deploy servers to bump envoy image version - T344814

Change 952867 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Remove tls-proxy CPU limits

https://gerrit.wikimedia.org/r/952867

Mentioned in SAL (#wikimedia-operations) [2023-08-29T10:39:31Z] <cgoubert@deploy1002> Started scap: Removing mw-on-k8s tls-proxy CPU limits - T344814

Mentioned in SAL (#wikimedia-operations) [2023-08-29T10:41:59Z] <cgoubert@deploy1002> Finished scap: Removing mw-on-k8s tls-proxy CPU limits - T344814 (duration: 02m 27s)

CPU limits have now been removed on all mw-on-k8s deployments except mw-misc. We'll wait a few days to see how, if at all, the reduced concurrency impacts latency, then resolve this task.

p50 latency increased slightly; we may want to raise the concurrency a little to see what shakes out.
Example: mw-web eqiad

[Screenshot: mw-web eqiad p50 latency graph]

That's just tuning, though, so I'm considering this task Resolved.