
Using envoy to connect from MediaWiki to restbase causes an explosion of live LVS connections.
Closed, Resolved (Public)

Description

Yesterday we switched MediaWiki to connect to restbase through envoy.

For some reason, the backend connections from LVS went from ~100 per backend in normal conditions to 2000 per backend, and rising.

Nothing else looked wrong on either side of the change, but I could see envoy keeping hundreds of connections active to the upstream envoy on restbase.

We need to figure out why this is happening, but at the same time, we have three relatively simple ways out of this:

  • Just switch mediawiki to use https for the calls to restbase directly, without envoy
  • Radically reduce the lifetime of a connection on the mw envoy side, for example by lowering the idle timeout for a connection to 1 second
  • Start working on using file-based xDS and move away from using LVS between services, by figuring out how to define the upstream cluster.

I'm quite interested in working on the last option, but that's a sizeable amount of work. I'd like someone to spend time figuring out why this is happening first.

Event Timeline

Joe triaged this task as High priority. Oct 30 2020, 9:34 AM
Joe added a subscriber: JMeybohm.

Mentioned in SAL (#wikimedia-operations) [2021-02-17T10:13:23Z] <_joe_> depooling mw1331 to perform some tests for T266855

The first observation I can make is that most requests are made by the math extension, and they usually come in pairs:

  • A POST request to /v1/media/math/check/tex
  • A GET request for /v1/media/math/render/mml/<hash>

Large pages with a lot of math formulas can require up to hundreds of such URLs in parallel (AIUI, will need to confirm from the code), so envoy allocates hundreds of connections to accommodate them.

Somehow, though, those connections never get closed, whether by an idle timeout or by anything else.

A solution that surely works is to make the connections non-persistent, which is not ideal, though.

The same effect was observed when adding an idle_timeout (which we don't usually set here, but which in practice has more or less the same effect as removing persistent connections).
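To illustrate the reasoning above, here is a rough toy model in Python of a single client-side connection pool. It is not envoy's actual pooling logic: the `Pool` class, the `simulate` helper, the burst sizes and the 5-second timeout are all made up for illustration, and it assumes one connection per in-flight request. It only shows why one large burst of parallel requests can leave a pool at its high-water mark indefinitely when nothing reaps idle connections, and why an idle timeout brings it back down.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Pool:
    """Toy model of one client-side proxy's upstream connection pool (hypothetical)."""
    idle_timeout: Optional[float] = None  # seconds; None means idle connections are never reaped
    last_used: List[float] = field(default_factory=list)  # one timestamp per open connection

    def reap(self, now: float) -> None:
        """Close connections that have been idle for at least idle_timeout."""
        if self.idle_timeout is None:
            return
        self.last_used = [t for t in self.last_used if now - t < self.idle_timeout]

    def burst(self, now: float, parallel: int) -> None:
        """Serve `parallel` simultaneous requests, one connection each."""
        self.reap(now)
        # Open new connections only if the pool is smaller than the burst.
        missing = max(0, parallel - len(self.last_used))
        self.last_used.extend([now] * missing)
        # Reuse the first `parallel` connections and mark them as just used.
        for i in range(parallel):
            self.last_used[i] = now

    def open_connections(self, now: float) -> int:
        self.reap(now)
        return len(self.last_used)


def simulate(idle_timeout: Optional[float]) -> int:
    """One burst of 100 parallel requests, then a minute of 2 requests per second."""
    pool = Pool(idle_timeout=idle_timeout)
    pool.burst(0.0, 100)           # e.g. a page with many math formulas (assumed size)
    for t in range(1, 61):
        pool.burst(float(t), 2)    # normal low-concurrency traffic (assumed rate)
    return pool.open_connections(61.0)


if __name__ == "__main__":
    print(simulate(None))  # 100: the pool stays at its high-water mark
    print(simulate(5.0))   # 2: only the connections still in use survive
```

In this model the number of open connections is set by the largest burst ever seen, not by steady-state traffic, which matches the observation that the count only rises. Multiplied across many MediaWiki hosts, that would be consistent with the per-backend numbers climbing, though the model does not explain why the real connections were not being closed.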

I am tempted to solve the issue "temporarily" by just adding an idle_timeout for now, and to defer the next level of fixes to future work introducing circuit breaking, or let the problem solve itself if we finally manage to move away from static resources and load balancers.

Change 664791 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] P:services_proxy::envoy: add keepalive to restbase-https

https://gerrit.wikimedia.org/r/664791

Change 664655 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/mediawiki-config@master] Revert "Revert "Switch restbase calls to be channeled via envoy""

https://gerrit.wikimedia.org/r/664655

Change 664791 merged by Giuseppe Lavagetto:
[operations/puppet@production] P:services_proxy::envoy: add keepalive to restbase-https

https://gerrit.wikimedia.org/r/664791

Change 664655 merged by jenkins-bot:
[operations/mediawiki-config@master] Revert "Revert "Switch restbase calls to be channeled via envoy""

https://gerrit.wikimedia.org/r/664655