Create a service-to-service proxy for handling HTTP calls from services to other entities
Open, High, Public

Description

With the scalability issues we've been seeing on php-fpm when a lot of higher-latency HTTP calls are involved, the need for a proxy that can handle connections between services has become apparent.

More generally, we want a middleware that gives us the following capabilities when dealing with RPC calls to other services:

  • Allow connection pooling
  • Work well with our DNS discovery mechanism
  • Enable TLS e2e without the need for relying on every single service doing encryption the "right" way
  • Allow configuring per-endpoint timeouts
  • Global and local-only rate limiting
  • Allow monitoring RPC calls (telemetry and tracing)

We've evaluated nginx in the past, and the non-commercial version lacks even the most important of these features: it can support either DNS discovery or connection pooling, but not both. We already use envoy as a TLS terminator on most servers, so we can probably use it to implement such a middleware, which is also what envoy was designed for.

We need to do the following for each service:

  • Add TLS termination
  • Add service proxy support

Once that's done across all services, we can move each of them through the following steps:

  • Add a TLS LVS endpoint
  • Switch the service proxy to use the TLS endpoint
  • Remove the HTTP LVS endpoint

Here is the current situation across the board:

service                       tls termination  service proxy  TLS LVS  cleanup http LVS (optional)
mediawiki                     x                x              x
restbase                      x                x              x        x
ores                          x                x              x        x
blubberoid                    x                -              x        x
citoid                        x                x              x        x
echostore                     x                -              x        x
sessionstore                  x                -              x        x
termbox                       x                x              x        x
push-notifications            x                x              x        -
mobileapps                    x                x              x        x
cxserver                      x                x              x        x
apertium                      x                -              x        x
eventgate-analytics           x                -              x        x
eventgate-analytics-external  x                -              x        x
eventgate-logging-external    x                -              x        x
eventgate-main                x                -              x        x
eventstreams                  x                -              x        x
mathoid                       x                -              x        x
proton                        x                -              x        x
wikifeeds                     x                x              x        x
zotero                        x                -              x        x
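
The per-service progression above can be tracked mechanically. The following is a minimal sketch, assuming that "x" means done, "-" means not applicable, and an empty cell means pending; the `next_step` helper and the `example-service` row are hypothetical and only serve as illustration.

```python
from typing import List, Optional

# Column order follows the status table: each step is done ("x"),
# not applicable ("-"), or still pending ("").
STEPS = ["tls termination", "service proxy", "TLS LVS", "cleanup http LVS"]


def next_step(marks: List[str]) -> Optional[str]:
    """Return the first pending step, or None when nothing is left to do."""
    for step, mark in zip(STEPS, marks):
        if mark == "":
            return step
    return None


status = {
    "restbase": ["x", "x", "x", "x"],        # row copied from the table
    "blubberoid": ["x", "-", "x", "x"],      # row copied from the table
    "example-service": ["x", "x", "", ""],   # hypothetical, not in the table
}

for service, marks in status.items():
    print(service, "->", next_step(marks) or "done")
```

Treating "-" the same as "x" reflects that a step that does not apply never blocks the following ones.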

Details

Project                         Branch      Lines +/-   Subject
operations/puppet               production  +10 -1
operations/puppet               production  +19 -52
operations/puppet               production  +1 -1
operations/mediawiki-config     master      +1 -1
operations/puppet               production  +1 -1
operations/puppet               production  +0 -36
operations/puppet               production  +1 -1
operations/deployment-charts    master      +5 -5
operations/puppet               production  +42 -8
operations/puppet               production  +2 -3
operations/puppet               production  +1 -4
operations/puppet               production  +41 -2
operations/deployment-charts    master      +6 -2
operations/deployment-charts    master      +142 -51
operations/deployment-charts    master      +34 -4
mediawiki/services/ores/deploy  master      +90 -45
mediawiki/services/ores/deploy  master      +3 -1
operations/deployment-charts    master      +360 -14
operations/puppet               production  +6 -0
operations/puppet               production  +2 -2
operations/deployment-charts    master      +269 -250
operations/deployment-charts    master      +1 -1
operations/deployment-charts    master      +2 -2
operations/deployment-charts    master      +277 -0
operations/mediawiki-config     master      +1 -1
operations/mediawiki-config     master      +1 -1
operations/mediawiki-config     master      +4 -4
operations/mediawiki-config     master      +1 -1
operations/mediawiki-config     master      +2 -0
operations/mediawiki-config     master      +5 -1
operations/deployment-charts    master      +4 -4
operations/deployment-charts    master      +16 -5
operations/deployment-charts    master      +267 -166
operations/mediawiki-config     master      +1 -1
operations/mediawiki-config     master      +1 -1
operations/deployment-charts    master      +4 -4
operations/deployment-charts    master      +1 -1
operations/mediawiki-config     master      +1 -1
operations/mediawiki-config     master      +1 -1
operations/mediawiki-config     master      +2 -2
operations/puppet               production  +6 -0
operations/puppet               production  +4 -0
operations/mediawiki-config     master      +6 -6
operations/puppet               production  +2 -8
operations/puppet               production  +2 -0
operations/puppet               production  +417 -0
operations/puppet               production  +87 -77

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 578495 merged by jenkins-bot:
[operations/mediawiki-config@master] wdqs-internal: switch to use envoy

https://gerrit.wikimedia.org/r/578495

@Joe @akosiaris all deployments of eventgate and eventstreams have been updated to use tls.resources etc.

Thanks a lot! I hope to get to switch eventgate-analytics to TLS today then!

Change 578496 merged by jenkins-bot:
[operations/mediawiki-config@master] Switch ores to use envoy

https://gerrit.wikimedia.org/r/578496

Change 582777 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/deployment-charts@master] Make configuration of envoy a ConfigMap

https://gerrit.wikimedia.org/r/582777

Change 582792 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/deployment-charts@master] Add local service proxy to the tls terminator v0.2

https://gerrit.wikimedia.org/r/582792

Change 576009 merged by jenkins-bot:
[operations/mediawiki-config@master] ProductionServices: switch eventgate-main to use envoy

https://gerrit.wikimedia.org/r/576009

Mentioned in SAL (#wikimedia-operations) [2020-03-26T12:57:13Z] <oblivian@deploy1001> Synchronized wmf-config/ProductionServices.php: eventgate-main to use envoy T244843 (duration: 01m 07s)

Change 582777 merged by jenkins-bot:
[operations/deployment-charts@master] Make configuration of envoy a ConfigMap

https://gerrit.wikimedia.org/r/582777

Change 597229 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] tls_helper: qoute idle_timeout default value

https://gerrit.wikimedia.org/r/597229

Change 597229 merged by jenkins-bot:
[operations/deployment-charts@master] tls_helper: qoute idle_timeout default value

https://gerrit.wikimedia.org/r/597229

Change 597240 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] tls_helper: fix typo in template reference

https://gerrit.wikimedia.org/r/597240

Change 597240 merged by jenkins-bot:
[operations/deployment-charts@master] tls_helper: fix typo in template reference

https://gerrit.wikimedia.org/r/597240

Change 597303 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] tls_helper: fix the envoy config configmap

https://gerrit.wikimedia.org/r/597303

Change 597303 merged by jenkins-bot:
[operations/deployment-charts@master] tls_helper: fix the envoy config configmap

https://gerrit.wikimedia.org/r/597303

Change 612461 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] scb: add service proxy, use it in the applications.

https://gerrit.wikimedia.org/r/612461

Change 612462 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] maps: add the service proxy

https://gerrit.wikimedia.org/r/612462

Change 612463 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] maps: use the service proxy to connect to wdqs

https://gerrit.wikimedia.org/r/612463

Change 582792 merged by jenkins-bot:
[operations/deployment-charts@master] Add local service proxy to the tls terminator v0.2

https://gerrit.wikimedia.org/r/582792

Change 621206 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/services/ores/deploy@master] Set testwiki API extractor to use the internal endpoint instead

https://gerrit.wikimedia.org/r/621206

Change 621206 merged by Ladsgroup:
[mediawiki/services/ores/deploy@master] Set testwiki API extractor to use the internal endpoint instead

https://gerrit.wikimedia.org/r/621206

Mentioned in SAL (#wikimedia-operations) [2020-08-20T12:44:24Z] <oblivian@deploy1001> Started deploy [ores/deploy@a208a0e]: switch testwiki to use envoy as a service proxy T244843

Mentioned in SAL (#wikimedia-operations) [2020-08-20T12:51:27Z] <oblivian@deploy1001> Finished deploy [ores/deploy@a208a0e]: switch testwiki to use envoy as a service proxy T244843 (duration: 07m 03s)

Mentioned in SAL (#wikimedia-operations) [2020-08-20T13:00:20Z] <oblivian@deploy1001> Started deploy [ores/deploy@a208a0e]: switch testwiki to use envoy as a service proxy T244843

Mentioned in SAL (#wikimedia-operations) [2020-08-20T13:11:38Z] <oblivian@deploy1001> Finished deploy [ores/deploy@a208a0e]: switch testwiki to use envoy as a service proxy T244843 (duration: 11m 19s)

Mentioned in SAL (#wikimedia-operations) [2020-08-20T13:14:41Z] <oblivian@deploy1001> Started deploy [ores/deploy@74677b6]: switch testwiki to use envoy as a service proxy T244843 (take 2)

Mentioned in SAL (#wikimedia-operations) [2020-08-20T13:26:18Z] <oblivian@deploy1001> Finished deploy [ores/deploy@74677b6]: switch testwiki to use envoy as a service proxy T244843 (take 2) (duration: 11m 37s)

Change 621522 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/services/ores/deploy@master] Migrate rest of wikis to Envoy

https://gerrit.wikimedia.org/r/621522

Change 621522 merged by Ladsgroup:
[mediawiki/services/ores/deploy@master] Migrate rest of wikis to Envoy

https://gerrit.wikimedia.org/r/621522

Mentioned in SAL (#wikimedia-operations) [2020-08-20T13:39:14Z] <oblivian@deploy1001> Started deploy [ores/deploy@e860508]: switch everything to use envoy as a service proxy T244843

Mentioned in SAL (#wikimedia-operations) [2020-08-20T13:53:14Z] <oblivian@deploy1001> Finished deploy [ores/deploy@e860508]: switch everything to use envoy as a service proxy T244843 (duration: 14m 00s)

Change 622580 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/deployment-charts@master] termbox: switch to use envoy to call MediaWiki

https://gerrit.wikimedia.org/r/622580

Change 622580 merged by jenkins-bot:
[operations/deployment-charts@master] termbox: switch to use envoy to call MediaWiki

https://gerrit.wikimedia.org/r/622580

Change 624290 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] Revert "Convert proton to the new layout"

https://gerrit.wikimedia.org/r/624290

Change 624290 merged by jenkins-bot:
[operations/deployment-charts@master] Revert "Convert proton to the new layout"

https://gerrit.wikimedia.org/r/624290

Change 625839 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/deployment-charts@master] default-network-policy: allow restbase HTTPS port

https://gerrit.wikimedia.org/r/625839

Change 625839 merged by jenkins-bot:
[operations/deployment-charts@master] default-network-policy: allow restbase HTTPS port

https://gerrit.wikimedia.org/r/625839

Change 628799 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] services: add TLS encrypted endpoint for ores (1/2)

https://gerrit.wikimedia.org/r/628799

Change 628800 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] services: add TLS encrypted endpoint for ores (2/2)

https://gerrit.wikimedia.org/r/628800

Change 628801 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] services: use TLS to connect to ORES

https://gerrit.wikimedia.org/r/628801

Change 628802 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] services: retire the ORES http endpoint (1/2)

https://gerrit.wikimedia.org/r/628802

Change 628803 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] services: retire the ORES http endpoint (2/2)

https://gerrit.wikimedia.org/r/628803

Change 628799 merged by Giuseppe Lavagetto:
[operations/puppet@production] services: add TLS encrypted endpoint for ores (1/2)

https://gerrit.wikimedia.org/r/628799

Change 628800 merged by Giuseppe Lavagetto:
[operations/puppet@production] services: add TLS encrypted endpoint for ores (2/2)

https://gerrit.wikimedia.org/r/628800

Change 628801 merged by Giuseppe Lavagetto:
[operations/puppet@production] services: use TLS to connect to ORES

https://gerrit.wikimedia.org/r/628801

Change 574988 abandoned by Giuseppe Lavagetto:
[operations/puppet@production] mediawiki::common: use envoy for tls termination too in nodes using it

Reason:
Superseded

https://gerrit.wikimedia.org/r/574988

Change 630537 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/deployment-charts@master] changeprop: use https to connect to ORES, restbase

https://gerrit.wikimedia.org/r/630537

Change 630537 abandoned by Giuseppe Lavagetto:
[operations/deployment-charts@master] changeprop: use https to connect to ORES, restbase

Reason:
Already merged elsewhere

https://gerrit.wikimedia.org/r/630537

Joe updated the task description.

Change 630562 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] service::configuration: connect to restbase via TLS

https://gerrit.wikimedia.org/r/630562

Change 628802 merged by JMeybohm:
[operations/puppet@production] services: retire the ORES http endpoint (1/2)

https://gerrit.wikimedia.org/r/628802

Change 628803 merged by JMeybohm:
[operations/puppet@production] services: retire the ORES http endpoint (2/2)

https://gerrit.wikimedia.org/r/628803

Mentioned in SAL (#wikimedia-operations) [2020-10-01T14:42:42Z] <jayme> running puppet on lvs servers - T244843 T255878

Mentioned in SAL (#wikimedia-operations) [2020-10-01T14:48:36Z] <jayme> restarting pybal on lvs2010.codfw.wmnet - T244843 T255878

Mentioned in SAL (#wikimedia-operations) [2020-10-01T14:50:02Z] <jayme> restarting pybal on lvs1015.eqiad.wmnet,lvs2009.codfw.wmnet - T244843 T255878

Mentioned in SAL (#wikimedia-operations) [2020-10-01T14:53:48Z] <jayme> running ipvsadm -D -t 10.2.1.10:8081; ipvsadm -D -t 10.2.1.47:8889 on lvs2010.codfw.wmnet,lvs2009.codfw.wmnet - T244843 T255878

Mentioned in SAL (#wikimedia-operations) [2020-10-01T14:55:43Z] <jayme> running ipvsadm -D -t 10.2.2.10:8081; ipvsadm -D -t 10.2.2.47:8889 on lvs1015.eqiad.wmnet - T244843 T255878

Change 630562 merged by Giuseppe Lavagetto:
[operations/puppet@production] service::configuration: connect to restbase via TLS

https://gerrit.wikimedia.org/r/630562

Change 578497 abandoned by Giuseppe Lavagetto:
[operations/mediawiki-config@master] Switch restbase to use envoy

Reason:

https://gerrit.wikimedia.org/r/578497

In order to catch calls to MediaWiki that do not come from monitoring and go to port 80 directly, I'm checking the apache logs for requests without a request-id, for now on one appserver. The assumption here is that all the applications that use envoy in one form or another go directly to the TLS port.

I am running the following on mw1331:

$ tail -f /var/log/apache2/other_vhosts_access.log | awk '{if ($(NF-1) == "-") { print $0 }}' | fgrep -v check_http/v2.2 | fgrep -v 'Twisted PageGetter' | fgrep -v server-status/ | fgrep -v wmf-icinga/check_etcd_mw_config_lastindex.py > no-reqid-apache.log

And I found we get a good amount of malformed requests that result in log lines like:

2021-02-17T23:43:21     35      10.64.32.33     -/400   226     GET     http://-/wiki/Coat_of_arms_of_Bern      -       text/html       -       -       -       -       -       -       -       10.64.32.33     - -
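
For offline analysis of a saved log, the same filter can be sketched in Python. This is a rough equivalent of the pipeline above, not a replacement for it: it assumes, as the awk expression does, that the request-id is the second-to-last whitespace-separated field, and the function name is made up for illustration.

```python
from typing import Iterable, Iterator

# Substrings of known monitoring traffic, taken from the fgrep -v filters above.
MONITORING_MARKERS = (
    "check_http/v2.2",
    "Twisted PageGetter",
    "server-status/",
    "wmf-icinga/check_etcd_mw_config_lastindex.py",
)


def lines_without_request_id(lines: Iterable[str]) -> Iterator[str]:
    """Yield log lines whose second-to-last field is "-" and that are not monitoring."""
    for line in lines:
        fields = line.split()  # whitespace split, like awk's default
        if len(fields) < 2 or fields[-2] != "-":
            continue
        if any(marker in line for marker in MONITORING_MARKERS):
            continue
        yield line.rstrip("\n")
```

It could be used as `for line in lines_without_request_id(open("other_vhosts_access.log")): print(line)`.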

A request like this, with no data besides the URL, can be reproduced by manually sending a request *from mw1331 itself*, as follows:

$ telnet mw1331 80
GET /wiki/Coat_of_arms_of_Bern HTTP/1.1

with no Host header. I wonder what is causing this.
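
The manual reproduction can also be scripted. The sketch below uses only the Python standard library; `build_request` and `send_raw` are hypothetical helper names, and plain TCP on port 80 is assumed, as in the telnet session.

```python
import socket
from typing import Optional


def build_request(path: str, host: Optional[str] = None) -> bytes:
    """Build a minimal HTTP/1.1 GET request; omit Host when host is None."""
    lines = [f"GET {path} HTTP/1.1"]
    if host is not None:
        lines.append(f"Host: {host}")
    # An empty line terminates the header block.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def send_raw(address: str, port: int, payload: bytes, timeout: float = 5.0) -> bytes:
    """Send raw bytes over plain TCP and return the first chunk of the response."""
    with socket.create_connection((address, port), timeout=timeout) as sock:
        sock.sendall(payload)
        return sock.recv(4096)
```

Run on the appserver, `send_raw("mw1331", 80, build_request("/wiki/Coat_of_arms_of_Bern"))` should send the same bytes as the telnet session.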

Change 665089 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] services: remove restbase http LVS endpoint

https://gerrit.wikimedia.org/r/665089

Change 665090 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] restbase: remove references to the non-https LVS

https://gerrit.wikimedia.org/r/665090

Change 665089 merged by Giuseppe Lavagetto:
[operations/puppet@production] services: remove restbase http LVS endpoint

https://gerrit.wikimedia.org/r/665089

I confirmed that such requests don't go through envoy, and they're all either:

  • completely empty log lines from requests from the load balancers, I suppose when IdleConnection fails
  • Requests from the same host, directly to apache

I determined the above by trying to call envoy with the same kind of request; envoy correctly rejects such requests before even trying to forward them to apache httpd:

cumin1001:~$ openssl s_client -connect mw1331.eqiad.wmnet:443
...
GET /testjoe HTTP/1.1


...

---
read R BLOCK
HTTP/1.1 400 Bad Request
Date: Mon, 22 Feb 2021 10:42:00 GMT
Server: envoy
Content-Length: 0

And no log line is found in apache.
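
The 400 is consistent with the HTTP/1.1 specification (RFC 7230, section 5.4), which requires a server to reject any HTTP/1.1 request that lacks a Host header. The following is a simplified model of that rule, not envoy's actual implementation; the function name is made up for illustration.

```python
def status_for_request(raw: bytes) -> int:
    """Return 400 for an HTTP/1.1 request without a Host header, else 200."""
    head = raw.split(b"\r\n\r\n", 1)[0].decode("latin-1")
    request_line, *header_lines = head.split("\r\n")
    parts = request_line.split()
    if len(parts) != 3:
        return 400  # malformed request line
    version = parts[2]
    # Header names are case-insensitive, so compare them lowercased.
    header_names = {
        line.split(":", 1)[0].strip().lower()
        for line in header_lines
        if ":" in line
    }
    if version == "HTTP/1.1" and "host" not in header_names:
        return 400
    return 200
```

Under this model, the request sent through `openssl s_client` above maps to 400, while the same request with a Host header would be accepted.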

I suspect there is a bug somewhere in our code causing this, but it's quite hard to pin down what exactly. I'll think of ways to find out (possibly logging any request that doesn't set a Host header in MediaWiki itself, although I somehow doubt these requests use our standard libraries).

Another possibility is that this is a consequence of some bug in envoy causing such sporadic log lines. At this point, though, I don't think we need further investigation for this task: those are a few requests, none of which is correctly served, so we can ignore them for the scope of this task.

Change 665090 merged by Giuseppe Lavagetto:
[operations/puppet@production] restbase: remove references to the non-https LVS

https://gerrit.wikimedia.org/r/665090

Change 612461 abandoned by Giuseppe Lavagetto:
[operations/puppet@production] scb: add service proxy, use it in the applications.

Reason:
scb is no more!

https://gerrit.wikimedia.org/r/612461