
Limit the concurrency of envoy in service mesh
Open, Needs Triage, Public

Description

From T344814: mw-on-k8s tls-proxy container CPU throttling at low average load we've learned that removing CPU limits from the service mesh while setting a fixed envoy concurrency (of 12) removes throttling (obviously) without causing runaway CPU usage in envoy.

We have other services potentially suffering from envoy throttling (T345243, T345244, T353460) where we might not want to remove limits altogether. According to research (https://wikitech.wikimedia.org/wiki/Kubernetes/Resource_requests_and_limits#envoy), Istio improves on this by setting envoy concurrency to better match the actual CPU limit of the container, namely max(ceil(<cpu-limit-in-whole-cpus>), 2).
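
For illustration, a minimal Python sketch of that formula; the function name and the CPU-limit parsing are hypothetical, only the max(ceil(...), 2) rule comes from the wikitech page above:

```python
import math

def envoy_concurrency(cpu_limit: str) -> int:
    """Istio-style envoy concurrency derived from a Kubernetes CPU limit.

    cpu_limit is a Kubernetes quantity such as "500m" or "2".
    Returns max(ceil(<cpu-limit-in-whole-cpus>), 2).
    """
    if cpu_limit.endswith("m"):
        cpus = int(cpu_limit[:-1]) / 1000.0  # millicpus -> whole cpus
    else:
        cpus = float(cpu_limit)
    return max(math.ceil(cpus), 2)

# A 500m limit yields a concurrency of 2, a 12-CPU limit yields 12.
assert envoy_concurrency("500m") == 2
assert envoy_concurrency("12") == 12
```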

In T353460 we did a small test running envoy with a 500m CPU limit, first with default concurrency (first spike) and then with concurrency set to 2 (second spike):

Screenshot_20240108_153039.png

While this paints a pretty clear picture in terms of throttling, it also seems to have increased latency quite significantly (more than doubled it). Unfortunately there is a big spread in latency on the upstream side as well (depending on the type of request), so the current data does not tell us for sure how big envoy's role in that increase is.

Event Timeline

It seems to me that trying to respond to 1k rps with a concurrency of 2 is probably the issue. Throttling is bad because it raises latencies; if any measure we take to avoid throttling increases latencies compared to throttling, why bother?

> It seems to me that trying to respond to 1k rps with a concurrency of 2 is probably the issue. Throttling is bad because it raises latencies; if any measure we take to avoid throttling increases latencies compared to throttling, why bother?

Sure, for this use case the values are not properly chosen (and that wasn't the point). But during the discussions we also had reservations that envoy could not sustain a high number of requests when limited in concurrency, which does not seem to be the case (at least not for our "usual" amount of req/s).

Another metric that changed significantly with different limits/concurrency is envoy_cluster_upstream_cx_connect_ms (connection establishment time), all while constantly serving ~100-200 req/s:

Screenshot_20240112_092336.png (397×677 px, 103 KB)
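
For reference, a minimal sketch of how one might pull the p50 of that histogram out of Prometheus; PROM_URL is a placeholder, and the exact label set depends on how the sidecars are scraped:

```python
import requests

# Placeholder for whatever Prometheus instance scrapes the envoy sidecars.
PROM_URL = "http://prometheus.example.org"

# Envoy exports the upstream_cx_connect_ms histogram as
# envoy_cluster_upstream_cx_connect_ms_bucket in Prometheus format.
QUERY = (
    "histogram_quantile(0.5, "
    "sum(rate(envoy_cluster_upstream_cx_connect_ms_bucket[5m])) by (le))"
)

resp = requests.get(
    f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10
)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    # Each result carries the evaluated p50 connect time in milliseconds.
    print(result["metric"], result["value"][1])
```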

Change #1015278 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-api-int: Double envoy concurrency

https://gerrit.wikimedia.org/r/1015278

Change #1015278 merged by jenkins-bot:

[operations/deployment-charts@master] mw-api-int: Double envoy concurrency

https://gerrit.wikimedia.org/r/1015278