Yesterday we switched MediaWiki to connect to restbase through envoy.
For some reason, the backend connections from LVS went from ~100 per backend under normal conditions to 2000 per backend, and they kept rising.
Everything looked ok on both sides of the change, but I could see envoy keeping hundreds of connections active to the upstream envoy on restbase.
We need to figure out why this is happening, but at the same time, we have three relatively simple ways out of this:
- Just switch MediaWiki to use https for the calls to restbase directly, without envoy
- Radically reduce the lifetime of a connection on the mw envoy side, for example by reducing the connection idle timeout to 1 second
- Start working on using file-based xDS and move away from using LVS between services, by figuring out how to define the upstream cluster.
I'm quite interested in working on the last option, but that's a sizeable amount of work. I'd like someone to spend time figuring out why this is happening first.
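
As a possible starting point for the investigation, here is a minimal sketch that pulls the connection counters for one upstream cluster out of the plain-text output of envoy's admin `/stats` endpoint. It assumes the admin interface is enabled on the mw hosts and that the cluster is named `restbase`; both are assumptions, as is the sample text, which is made up for illustration and is not real output. Comparing `upstream_cx_active` and `upstream_cx_idle_timeout` before and after the change might show whether connections are being opened faster than they are closed.

```python
import re

# Per-cluster connection counters that envoy publishes on its admin /stats endpoint.
CX_STATS = (
    "upstream_cx_active",
    "upstream_cx_total",
    "upstream_cx_destroy",
    "upstream_cx_idle_timeout",
)


def upstream_cx_stats(stats_text, cluster):
    """Return the connection counters for one cluster from plain-text /stats output."""
    pattern = re.compile(r"^cluster\.%s\.(\w+): (\d+)$" % re.escape(cluster))
    found = {}
    for line in stats_text.splitlines():
        match = pattern.match(line.strip())
        if match and match.group(1) in CX_STATS:
            found[match.group(1)] = int(match.group(2))
    return found


# Made-up sample, for illustration only. The cluster name "restbase" is an assumption.
SAMPLE = """\
cluster.restbase.upstream_cx_active: 412
cluster.restbase.upstream_cx_total: 98231
cluster.restbase.upstream_cx_destroy: 97819
cluster.restbase.upstream_rq_total: 1204410
cluster.other.upstream_cx_active: 3
"""

if __name__ == "__main__":
    print(upstream_cx_stats(SAMPLE, "restbase"))
```

In practice the text would come from the admin port of the envoy running next to MediaWiki rather than from a hard-coded sample.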