Create a service-to-service proxy for handling HTTP calls from services to other entities
Closed, Resolved, Public

Description

Given the scalability issues we've been seeing on php-fpm when many higher-latency HTTP calls are involved, the need for a proxy that handles connections between services has become apparent.

More generally, we want a middleware that gives us the following capabilities when making RPC calls to other services:

  • Allow connection pooling
  • Work well with our DNS discovery mechanism
  • Enable end-to-end TLS without relying on every single service doing encryption the "right" way
  • Allow configuring per-endpoint timeouts
  • Global and local-only rate limiting
  • Allow monitoring of RPC calls (telemetry and tracing)

We've evaluated nginx in the past, and the non-commercial version lacks even the most important of these features, as it can support either DNS discovery or connection pooling, but not both. We already use envoy as a TLS terminator on most servers, so we can probably use it to implement such a middleware, which is also what envoy was designed for.
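
As an illustration of what this means for a calling service, here is a minimal sketch, assuming the proxy exposes one plain-HTTP listener per upstream on localhost. The service name, the port, and the helper function are hypothetical and are not taken from this task; the point is only that the caller talks plain HTTP to a local port and leaves pooling, discovery, TLS, and timeouts to the proxy.

```python
import urllib.request

# Hypothetical mapping of upstream service name -> local proxy listener port.
# Neither the name nor the port comes from this task; they only illustrate the idea.
LOCAL_LISTENERS = {"example-service": 6001}


def build_proxied_request(service: str, path: str, host_header: str) -> urllib.request.Request:
    """Build a plain-HTTP request aimed at the local proxy listener.

    In this sketch the proxy, not the caller, would be responsible for
    connection pooling, DNS discovery, TLS to the upstream, and timeouts.
    """
    port = LOCAL_LISTENERS[service]
    url = f"http://127.0.0.1:{port}{path}"
    # The Host header names the virtual host the upstream should serve.
    return urllib.request.Request(url, headers={"Host": host_header})
```

Calling `build_proxied_request("example-service", "/v1/status", "svc.example.internal")` yields a request object that could be passed to `urllib.request.urlopen` with a caller-side timeout as a safety net.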

We need to do the following for each service:

  • Add TLS termination
  • Add service proxy support

Once that's done across all services, we can move each of them through the following steps:

  • Add a TLS LVS endpoint
  • Switch the service proxy to use the TLS endpoint
  • Remove the HTTP LVS endpoint

Here is the current situation across the board:

service                        tls termination  service proxy  TLS LVS  cleanup http LVS (optional)
mediawiki                      x                x              x        x
restbase                       x                x              x        x
ores                           x                x              x        x
blubberoid                     x                -              x        x
citoid                         x                x              x        x
echostore                      x                -              x        x
sessionstore                   x                -              x        x
termbox                        x                x              x        x
push-notifications             x                x              x        -
mobileapps                     x                x              x        x
cxserver                       x                x              x        x
apertium                       x                -              x        x
eventgate-analytics            x                -              x        x
eventgate-analytics-external   x                -              x        x
eventgate-logging-external     x                -              x        x
eventgate-main                 x                -              x        x
eventstreams                   x                -              x        x
mathoid                        x                -              x        x
proton                         x                -              x        x
wikifeeds                      x                x              x        x
zotero                         x                -              x        x

Details

Repo                             Branch      Lines +/-
operations/puppet                production  +33 -34
operations/puppet                production  +261 -261
operations/puppet                production  +5 -51
operations/puppet                production  +1 -1
operations/puppet                production  +1 -1
operations/puppet                production  +2 -47
operations/puppet                production  +1 -1
operations/puppet                production  +1 -1
operations/puppet                production  +1 -1
operations/puppet                production  +10 -1
operations/puppet                production  +19 -52
operations/puppet                production  +1 -1
operations/mediawiki-config      master      +1 -1
operations/puppet                production  +1 -1
operations/puppet                production  +0 -36
operations/puppet                production  +1 -1
operations/deployment-charts     master      +5 -5
operations/puppet                production  +42 -8
operations/puppet                production  +2 -3
operations/puppet                production  +1 -4
operations/puppet                production  +41 -2
operations/deployment-charts     master      +6 -2
operations/deployment-charts     master      +142 -51
operations/deployment-charts     master      +34 -4
mediawiki/services/ores/deploy   master      +90 -45
mediawiki/services/ores/deploy   master      +3 -1
operations/deployment-charts     master      +360 -14
operations/puppet                production  +6 -0
operations/puppet                production  +2 -2
operations/deployment-charts     master      +269 -250
operations/deployment-charts     master      +1 -1
operations/deployment-charts     master      +2 -2
operations/deployment-charts     master      +277 -0
operations/mediawiki-config      master      +1 -1
operations/mediawiki-config      master      +1 -1
operations/mediawiki-config      master      +4 -4
operations/mediawiki-config      master      +1 -1
operations/mediawiki-config      master      +2 -0
operations/mediawiki-config      master      +5 -1
operations/deployment-charts     master      +4 -4
operations/deployment-charts     master      +16 -5
operations/deployment-charts     master      +267 -166
operations/mediawiki-config      master      +1 -1
operations/mediawiki-config      master      +1 -1
operations/deployment-charts     master      +4 -4
operations/deployment-charts     master      +1 -1
operations/mediawiki-config      master      +1 -1
operations/mediawiki-config      master      +1 -1
operations/mediawiki-config      master      +2 -2
operations/puppet                production  +6 -0
operations/puppet                production  +4 -0
operations/mediawiki-config      master      +6 -6
operations/puppet                production  +2 -8
operations/puppet                production  +2 -0
operations/puppet                production  +417 -0
operations/puppet                production  +87 -77

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 582792 merged by jenkins-bot:
[operations/deployment-charts@master] Add local service proxy to the tls terminator v0.2

https://gerrit.wikimedia.org/r/582792

Change 621206 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/services/ores/deploy@master] Set testwiki API extractor to use the internal endpoint instead

https://gerrit.wikimedia.org/r/621206

Change 621206 merged by Ladsgroup:
[mediawiki/services/ores/deploy@master] Set testwiki API extractor to use the internal endpoint instead

https://gerrit.wikimedia.org/r/621206

Mentioned in SAL (#wikimedia-operations) [2020-08-20T12:44:24Z] <oblivian@deploy1001> Started deploy [ores/deploy@a208a0e]: switch testwiki to use envoy as a service proxy T244843

Mentioned in SAL (#wikimedia-operations) [2020-08-20T12:51:27Z] <oblivian@deploy1001> Finished deploy [ores/deploy@a208a0e]: switch testwiki to use envoy as a service proxy T244843 (duration: 07m 03s)

Mentioned in SAL (#wikimedia-operations) [2020-08-20T13:00:20Z] <oblivian@deploy1001> Started deploy [ores/deploy@a208a0e]: switch testwiki to use envoy as a service proxy T244843

Mentioned in SAL (#wikimedia-operations) [2020-08-20T13:11:38Z] <oblivian@deploy1001> Finished deploy [ores/deploy@a208a0e]: switch testwiki to use envoy as a service proxy T244843 (duration: 11m 19s)

Mentioned in SAL (#wikimedia-operations) [2020-08-20T13:14:41Z] <oblivian@deploy1001> Started deploy [ores/deploy@74677b6]: switch testwiki to use envoy as a service proxy T244843 (take 2)

Mentioned in SAL (#wikimedia-operations) [2020-08-20T13:26:18Z] <oblivian@deploy1001> Finished deploy [ores/deploy@74677b6]: switch testwiki to use envoy as a service proxy T244843 (take 2) (duration: 11m 37s)

Change 621522 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/services/ores/deploy@master] Migrate rest of wikis to Envoy

https://gerrit.wikimedia.org/r/621522

Change 621522 merged by Ladsgroup:
[mediawiki/services/ores/deploy@master] Migrate rest of wikis to Envoy

https://gerrit.wikimedia.org/r/621522

Mentioned in SAL (#wikimedia-operations) [2020-08-20T13:39:14Z] <oblivian@deploy1001> Started deploy [ores/deploy@e860508]: switch everything to use envoy as a service proxy T244843

Mentioned in SAL (#wikimedia-operations) [2020-08-20T13:53:14Z] <oblivian@deploy1001> Finished deploy [ores/deploy@e860508]: switch everything to use envoy as a service proxy T244843 (duration: 14m 00s)

Change 622580 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/deployment-charts@master] termbox: switch to use envoy to call MediaWiki

https://gerrit.wikimedia.org/r/622580

Change 622580 merged by jenkins-bot:
[operations/deployment-charts@master] termbox: switch to use envoy to call MediaWiki

https://gerrit.wikimedia.org/r/622580

Change 624290 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] Revert "Convert proton to the new layout"

https://gerrit.wikimedia.org/r/624290

Change 624290 merged by jenkins-bot:
[operations/deployment-charts@master] Revert "Convert proton to the new layout"

https://gerrit.wikimedia.org/r/624290

Change 625839 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/deployment-charts@master] default-network-policy: allow restbase HTTPS port

https://gerrit.wikimedia.org/r/625839

Change 625839 merged by jenkins-bot:
[operations/deployment-charts@master] default-network-policy: allow restbase HTTPS port

https://gerrit.wikimedia.org/r/625839

Change 628799 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] services: add TLS encrypted endpoint for ores (1/2)

https://gerrit.wikimedia.org/r/628799

Change 628800 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] services: add TLS encrypted endpoint for ores (2/2)

https://gerrit.wikimedia.org/r/628800

Change 628801 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] services: use TLS to connect to ORES

https://gerrit.wikimedia.org/r/628801

Change 628802 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] services: retire the ORES http endpoint (1/2)

https://gerrit.wikimedia.org/r/628802

Change 628803 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] services: retire the ORES http endpoint (2/2)

https://gerrit.wikimedia.org/r/628803

Change 628799 merged by Giuseppe Lavagetto:
[operations/puppet@production] services: add TLS encrypted endpoint for ores (1/2)

https://gerrit.wikimedia.org/r/628799

Change 628800 merged by Giuseppe Lavagetto:
[operations/puppet@production] services: add TLS encrypted endpoint for ores (2/2)

https://gerrit.wikimedia.org/r/628800

Change 628801 merged by Giuseppe Lavagetto:
[operations/puppet@production] services: use TLS to connect to ORES

https://gerrit.wikimedia.org/r/628801

Change 574988 abandoned by Giuseppe Lavagetto:
[operations/puppet@production] mediawiki::common: use envoy for tls termination too in nodes using it

Reason:
Superseded

https://gerrit.wikimedia.org/r/574988

Change 630537 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/deployment-charts@master] changeprop: use https to connect to ORES, restbase

https://gerrit.wikimedia.org/r/630537

Change 630537 abandoned by Giuseppe Lavagetto:
[operations/deployment-charts@master] changeprop: use https to connect to ORES, restbase

Reason:
Already merged elsewhere

https://gerrit.wikimedia.org/r/630537

Joe updated the task description.

Change 630562 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] service::configuration: connect to restbase via TLS

https://gerrit.wikimedia.org/r/630562

Change 628802 merged by JMeybohm:
[operations/puppet@production] services: retire the ORES http endpoint (1/2)

https://gerrit.wikimedia.org/r/628802

Change 628803 merged by JMeybohm:
[operations/puppet@production] services: retire the ORES http endpoint (2/2)

https://gerrit.wikimedia.org/r/628803

Mentioned in SAL (#wikimedia-operations) [2020-10-01T14:42:42Z] <jayme> running puppet on lvs servers - T244843 T255878

Mentioned in SAL (#wikimedia-operations) [2020-10-01T14:48:36Z] <jayme> restarting pybal on lvs2010.codfw.wmnet - T244843 T255878

Mentioned in SAL (#wikimedia-operations) [2020-10-01T14:50:02Z] <jayme> restarting pybal on lvs1015.eqiad.wmnet,lvs2009.codfw.wmnet - T244843 T255878

Mentioned in SAL (#wikimedia-operations) [2020-10-01T14:53:48Z] <jayme> running ipvsadm -D -t 10.2.1.10:8081; ipvsadm -D -t 10.2.1.47:8889 on lvs2010.codfw.wmnet,lvs2009.codfw.wmnet - T244843 T255878

Mentioned in SAL (#wikimedia-operations) [2020-10-01T14:55:43Z] <jayme> running ipvsadm -D -t 10.2.2.10:8081; ipvsadm -D -t 10.2.2.47:8889 on lvs1015.eqiad.wmnet - T244843 T255878

Change 630562 merged by Giuseppe Lavagetto:
[operations/puppet@production] service::configuration: connect to restbase via TLS

https://gerrit.wikimedia.org/r/630562

Change 578497 abandoned by Giuseppe Lavagetto:
[operations/mediawiki-config@master] Switch restbase to use envoy

Reason:

https://gerrit.wikimedia.org/r/578497

To catch non-monitoring calls to MediaWiki that go directly to port 80, I'm checking the apache logs for requests without a request-id, for now on a single appserver. The assumption is that every application using envoy in one form or another goes directly to the TLS port.

I am running the following on mw1331:

$ tail -f /var/log/apache2/other_vhosts_access.log | awk '{if ($(NF-1) == "-") { print $0 }}' | fgrep -v check_http/v2.2 | fgrep -v 'Twisted PageGetter' | fgrep -v server-status/ | fgrep -v wmf-icinga/check_etcd_mw_config_lastindex.py > no-reqid-apache.log
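
For readers less familiar with awk, the filtering logic of the pipeline above can be sketched in Python. This is only an illustrative equivalent, not what was run on the host: the function name is made up, and it assumes, as the awk expression does, that the request-id is the second-to-last whitespace-separated field. Unlike awk, it simply skips lines with fewer than two fields.

```python
from typing import Iterable, Iterator

# Substrings identifying known monitoring traffic, copied from the pipeline above.
MONITORING_MARKERS = (
    "check_http/v2.2",
    "Twisted PageGetter",
    "server-status/",
    "wmf-icinga/check_etcd_mw_config_lastindex.py",
)


def lines_without_request_id(lines: Iterable[str]) -> Iterator[str]:
    """Yield log lines whose second-to-last field is "-" and that are not monitoring."""
    for line in lines:
        fields = line.split()  # like awk's default: split on runs of whitespace
        if len(fields) < 2 or fields[-2] != "-":
            continue  # a request-id is present, or the line is too short to tell
        if any(marker in line for marker in MONITORING_MARKERS):
            continue  # known monitoring check, not interesting here
        yield line
```

Feeding it an open log file, for example `lines_without_request_id(open(path))` with a path of your choosing, yields the same kind of lines the shell pipeline writes to its output file.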

And I found we get a good amount of malformed requests that result in log lines like:

2021-02-17T23:43:21     35      10.64.32.33     -/400   226     GET     http://-/wiki/Coat_of_arms_of_Bern      -       text/html       -       -       -       -       -       -       -       10.64.32.33     - -

A request with no data besides the URL can be reproduced by manually sending one *from mw1331 itself* as follows:

$ telnet mw1331 80
GET /wiki/Coat_of_arms_Bern HTTP/1.1

with no Host header. I wonder what is causing this.
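
The telnet reproduction above can also be scripted. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical, and the target host and port are left to the caller, so it should only be pointed at a test server you control.

```python
import socket


def build_hostless_request(path: str) -> bytes:
    """Return a raw HTTP/1.1 request with a request line and no headers at all."""
    # HTTP/1.1 requires a Host header; omitting it is what makes the request malformed.
    return f"GET {path} HTTP/1.1\r\n\r\n".encode("ascii")


def send_raw(host: str, port: int, payload: bytes, timeout_s: float = 5.0) -> bytes:
    """Send raw bytes over TCP and return the first chunk of the server's reply."""
    with socket.create_connection((host, port), timeout=timeout_s) as conn:
        conn.sendall(payload)
        return conn.recv(4096)
```

For example, `send_raw(host, 80, build_hostless_request("/wiki/Coat_of_arms_Bern"))` sends the same bytes as the telnet session.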

Change 665089 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] services: remove restbase http LVS endpoint

https://gerrit.wikimedia.org/r/665089

Change 665090 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] restbase: remove references to the non-https LVS

https://gerrit.wikimedia.org/r/665090

Change 665089 merged by Giuseppe Lavagetto:
[operations/puppet@production] services: remove restbase http LVS endpoint

https://gerrit.wikimedia.org/r/665089

I confirmed that such requests don't go through envoy, and they're all either:

  • completely empty log lines from requests from the load balancers, I suppose when IdleConnection fails
  • Requests from the same host, directly to apache

I determined the above by calling envoy with the same kind of request; envoy correctly rejects such requests before even trying to forward them to apache httpd:

cumin1001:~$ openssl s_client -connect mw1331.eqiad.wmnet:443
...
GET /testjoe HTTP/1.1


...

---
read R BLOCK
HTTP/1.1 400 Bad Request
Date: Mon, 22 Feb 2021 10:42:00 GMT
Server: envoy
Content-Length: 0

And no log line is found in apache.
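
If this check had to be repeated across hosts, the reply could be classified programmatically. The sketch below parses a raw response head like the one shown above and reports whether it looks like envoy's own rejection; the function names are illustrative, and treating "400 plus `Server: envoy`" as the signature of an envoy-side rejection is an assumption drawn from this single transcript.

```python
from typing import Dict, Tuple


def parse_response_head(raw: str) -> Tuple[int, Dict[str, str]]:
    """Split a raw HTTP response head into (status code, lower-cased headers)."""
    lines = raw.strip().splitlines()
    # The status line looks like "HTTP/1.1 400 Bad Request".
    status = int(lines[0].split(" ", 2)[1])
    headers: Dict[str, str] = {}
    for line in lines[1:]:
        if not line.strip():
            break  # a blank line ends the header block
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status, headers


def rejected_by_envoy(raw: str) -> bool:
    """True when the reply is a 400 whose Server header names envoy."""
    status, headers = parse_response_head(raw)
    return status == 400 and headers.get("server", "").lower() == "envoy"
```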

I suspect there is a bug somewhere in our code causing this, but it's quite hard to pin down what exactly. I'll think of ways to find out (possibly logging any request that does not set a Host header in MediaWiki itself, although I somehow doubt these requests use our standard libraries).

Another possibility is that this is a consequence of some bug in envoy causing such sporadic log lines. At this point, though, I don't think further investigation is needed for this task: these are a few requests, none of which is correctly served, so we can ignore them here.

Change 665090 merged by Giuseppe Lavagetto:
[operations/puppet@production] restbase: remove references to the non-https LVS

https://gerrit.wikimedia.org/r/665090

Change 612461 abandoned by Giuseppe Lavagetto:
[operations/puppet@production] scb: add service proxy, use it in the applications.

Reason:
scb is no more!

https://gerrit.wikimedia.org/r/612461

Change 747098 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] conftool: clean up references to obsolete restbase service

https://gerrit.wikimedia.org/r/747098

Change 766571 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] mx: use https when connecting to the mw api

https://gerrit.wikimedia.org/r/766571

Change 766572 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] api: remove monitoring from http endpoint

https://gerrit.wikimedia.org/r/766572

Change 766573 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] api: remove http endpoint from pybal

https://gerrit.wikimedia.org/r/766573

Change 766574 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] api: remove non-https endpoint from backends

https://gerrit.wikimedia.org/r/766574

Change 766575 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] appservers: remove monitoring for http-only

https://gerrit.wikimedia.org/r/766575

Change 766576 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] appserver: remove unencrypted LVS endpoint

https://gerrit.wikimedia.org/r/766576

Change 766577 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] appserver: remove http pool from backends

https://gerrit.wikimedia.org/r/766577

Change 766578 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] conftool: remove http pools for mediawiki

https://gerrit.wikimedia.org/r/766578

Change 766571 merged by Giuseppe Lavagetto:

[operations/puppet@production] mx: use https when connecting to the mw api

https://gerrit.wikimedia.org/r/766571

Change 766572 merged by Giuseppe Lavagetto:

[operations/puppet@production] api: remove monitoring from http endpoint

https://gerrit.wikimedia.org/r/766572

Change 766573 merged by Giuseppe Lavagetto:

[operations/puppet@production] api: remove http endpoint from pybal

https://gerrit.wikimedia.org/r/766573

Change 766574 merged by Giuseppe Lavagetto:

[operations/puppet@production] api: remove non-https endpoint from backends

https://gerrit.wikimedia.org/r/766574

Change 766575 merged by Giuseppe Lavagetto:

[operations/puppet@production] appservers: remove monitoring for http-only

https://gerrit.wikimedia.org/r/766575

Change 766576 merged by Giuseppe Lavagetto:

[operations/puppet@production] appserver: remove unencrypted LVS endpoint

https://gerrit.wikimedia.org/r/766576

Mentioned in SAL (#wikimedia-operations) [2022-03-01T11:18:21Z] <_joe_> also removed the ipvsadm entry for apaches:80 T244843

Mentioned in SAL (#wikimedia-operations) [2022-03-01T11:21:46Z] <_joe_> restarted pybal, removed ipvsadm entry on lvs1019. Now all of MediaWiki has no http LVS endpoint available. T244843

Joe updated the task description.

Change 766577 merged by Giuseppe Lavagetto:

[operations/puppet@production] appserver: remove http pool from backends

https://gerrit.wikimedia.org/r/766577

Change 766578 merged by Giuseppe Lavagetto:

[operations/puppet@production] conftool: remove http pools for mediawiki

https://gerrit.wikimedia.org/r/766578

Change 747098 abandoned by Hnowlan:

[operations/puppet@production] conftool: clean up references to obsolete restbase service

Reason:

https://gerrit.wikimedia.org/r/747098