
Limit the concurrency of envoy in service mesh
Open, Needs Triage, Public

Description

From T344814: mw-on-k8s tls-proxy container CPU throttling at low average load we've learned that removing CPU limits from the service mesh while setting a fixed envoy concurrency (of 12) removes throttling (obviously) without causing runaway CPU usage in envoy.

We have other services potentially suffering from envoy throttling (T345243, T345244, T353460) where we might not want to remove limits altogether. According to research (https://wikitech.wikimedia.org/wiki/Kubernetes/Resource_requests_and_limits#envoy), Istio improves on this by setting envoy concurrency to better match the actual CPU limit of the container, namely max(ceil(<cpu-limit-in-whole-cpus>), 2).
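
For illustration, a minimal Python sketch of that formula; the function name and the CPU-limit parsing are hypothetical, only the max(ceil(...), 2) rule comes from the wikitech page above:

```python
import math

def envoy_concurrency(cpu_limit: str) -> int:
    """Istio-style envoy concurrency derived from a Kubernetes CPU limit.

    cpu_limit is a Kubernetes quantity such as "500m" or "2".
    Returns max(ceil(<cpu-limit-in-whole-cpus>), 2).
    """
    if cpu_limit.endswith("m"):
        cpus = int(cpu_limit[:-1]) / 1000.0  # millicpus -> whole cpus
    else:
        cpus = float(cpu_limit)
    return max(math.ceil(cpus), 2)

# A 500m limit yields a concurrency of 2, a 12-CPU limit yields 12.
assert envoy_concurrency("500m") == 2
assert envoy_concurrency("12") == 12
```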

In T353460 we did a small test running envoy with a 500m CPU limit, first with default concurrency (first spike) and then with concurrency set to 2 (second spike):

Screenshot_20240108_153039.png

While this paints a pretty clear picture in terms of throttling, it also seems to have increased latency quite significantly (more than doubled it). Unfortunately there is a big spread in latency on the upstream side as well (depending on the type of request), so the current data does not tell us for sure how big envoy's role in that increase is.

Event Timeline

It seems to me that trying to respond to 1k rps with a concurrency of 2 is probably the issue. Throttling is bad because it raises latencies; if any measure we take to avoid throttling increases latencies compared to throttling, why bother?

> It seems to me that trying to respond to 1k rps with a concurrency of 2 is probably the issue. Throttling is bad because it raises latencies; if any measure we take to avoid throttling increases latencies compared to throttling, why bother?

Sure, for this use case the values are not properly chosen (and that wasn't the point). But during the discussions we also had reservations that envoy could not sustain a high number of requests when limited in concurrency, which does not seem to be the case (at least not for our "usual" amount of req/s).

Another metric that changed significantly with different limits/concurrency is envoy_cluster_upstream_cx_connect_ms (connection establishment time), all while constantly serving ~100-200 req/s:

Screenshot_20240112_092336.png (397×677 px, 103 KB)
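
For reference, a minimal sketch of how one might pull the p50 of that histogram out of Prometheus; PROM_URL is a placeholder, and the exact label set depends on how the sidecars are scraped:

```python
import requests

# Placeholder for whatever Prometheus instance scrapes the envoy sidecars.
PROM_URL = "http://prometheus.example.org"

# Envoy exports the upstream_cx_connect_ms histogram as
# envoy_cluster_upstream_cx_connect_ms_bucket in Prometheus format.
QUERY = (
    "histogram_quantile(0.5, "
    "sum(rate(envoy_cluster_upstream_cx_connect_ms_bucket[5m])) by (le))"
)

resp = requests.get(
    f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10
)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    # Each result carries the evaluated p50 connect time in milliseconds.
    print(result["metric"], result["value"][1])
```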

Change #1015278 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-api-int: Double envoy concurrency

https://gerrit.wikimedia.org/r/1015278

Change #1015278 merged by jenkins-bot:

[operations/deployment-charts@master] mw-api-int: Double envoy concurrency

https://gerrit.wikimedia.org/r/1015278