
Create a service-to-service proxy for handling HTTP calls from services to other entities
Open, High, Public

Description

Given the scalability issues we've been seeing on php-fpm when many higher-latency HTTP calls are involved, the need for a proxy that can handle connections between services has become apparent.

More generally, we want a middleware that gives us the following capabilities when making RPC calls to other services:

  • Allow connection pooling
  • Work well with our DNS discovery mechanism
  • Enable end-to-end TLS without relying on every single service doing encryption the "right" way
  • Allow configuring per-endpoint timeouts
  • Global and local-only rate limiting
  • Allow monitoring RPC calls (telemetry and tracing)
  • Tracing of RPC calls

We've evaluated nginx in the past, and the non-commercial version lacks even the most important of these features: it can support either DNS discovery or connection pooling, but not both. We already use Envoy as a TLS terminator on most servers, so we can probably use it to implement such a middleware, which is also what Envoy was designed for.
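To make the connection-pooling item concrete, here is a minimal Python sketch of the idea, assuming each application request would otherwise open its own upstream connection. `ConnectionPool`, `FakeConnection`, and the host name are hypothetical names used for illustration only; this is not how Envoy implements pooling.

```python
import queue


class FakeConnection:
    """Stand-in for a TCP/TLS connection; counts how many were opened."""

    opened = 0

    def __init__(self, host: str) -> None:
        FakeConnection.opened += 1
        self.host = host

    def request(self, path: str) -> str:
        return f"200 OK from {self.host}{path}"


class ConnectionPool:
    """Keeps idle connections to one upstream so later calls can reuse them."""

    def __init__(self, host: str, max_idle: int = 4) -> None:
        self.host = host
        self._idle: queue.Queue = queue.Queue(maxsize=max_idle)

    def call(self, path: str) -> str:
        try:
            conn = self._idle.get_nowait()    # reuse an idle connection
        except queue.Empty:
            conn = FakeConnection(self.host)  # otherwise open a new one
        try:
            return conn.request(path)
        finally:
            try:
                self._idle.put_nowait(conn)   # hand it back for the next call
            except queue.Full:
                pass                          # pool is full: drop the connection


pool = ConnectionPool("upstream.example")
responses = [pool.call(f"/item/{i}") for i in range(3)]
```

Three sequential calls share a single connection here, which is the saving a long-lived local proxy can offer to short-lived application workers.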

For each service, we need to do the following:

  • Add TLS termination
  • Add service proxy support

Once that's done across all services, we can move each of them through the following steps:

  • Add a TLS LVS endpoint
  • Switch the service proxy to use the TLS endpoint
  • Remove the HTTP LVS endpoint

Here is the current situation across the board:

service                       tls termination  service proxy  TLS LVS  cleanup http LVS (optional)
mediawiki                     x                x              x
restbase                      x                x              x
ores                          x                x              x        x
blubberoid                    x                -              x        x
citoid                        x                x              x        x
echostore                     x                -              x        x
sessionstore                  x                -              x        x
termbox                       x                x              x        x
push-notifications            x                x              x        -
mobileapps                    x                x              x        x
cxserver                      x                x              x        x
eventgate-analytics           x                -              x        x
eventgate-analytics-external  x                -              x        x
eventgate-logging-external    x                -              x        x
eventgate-main                x                -              x        x
eventstreams                  x                -              x        x
mathoid                       x                -              x        x
proton                        x                -              x        x
wikifeeds                     x                x              x        x
zotero                        x                -              x        x
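
Reading the status table row by row, the next pending step for a service is the first column without a mark. The following is a minimal sketch of that reading, assuming `x` means "done" and `-` means "not applicable" (the task does not spell this out); `next_step` and the example rows are hypothetical.

```python
from typing import Optional

# Rollout steps in the order given in the task description.
STEPS = ("tls termination", "service proxy", "TLS LVS", "cleanup http LVS")


def next_step(status: dict) -> Optional[str]:
    """Return the first step not marked "x" (done) or "-" (not applicable)."""
    for step in STEPS:
        if status.get(step, "") not in ("x", "-"):
            return step
    return None  # nothing left to do


# Example rows using the assumed markers.
finished = {
    "tls termination": "x",
    "service proxy": "-",
    "TLS LVS": "x",
    "cleanup http LVS": "x",
}
in_progress = {"tls termination": "x", "service proxy": "x"}
```

With these rows, `next_step(finished)` returns `None` and `next_step(in_progress)` returns `"TLS LVS"`.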

Details

Project                         Branch      Lines +/-
operations/mediawiki-config     master      +1 -1
operations/puppet               production  +1 -1
operations/puppet               production  +0 -36
operations/puppet               production  +1 -1
operations/deployment-charts    master      +5 -5
operations/puppet               production  +42 -8
operations/puppet               production  +2 -3
operations/puppet               production  +1 -4
operations/puppet               production  +41 -2
operations/deployment-charts    master      +6 -2
operations/deployment-charts    master      +142 -51
operations/deployment-charts    master      +34 -4
mediawiki/services/ores/deploy  master      +90 -45
mediawiki/services/ores/deploy  master      +3 -1
operations/deployment-charts    master      +360 -14
operations/puppet               production  +10 -1
operations/puppet               production  +6 -0
operations/puppet               production  +2 -2
operations/deployment-charts    master      +269 -250
operations/deployment-charts    master      +1 -1
operations/deployment-charts    master      +2 -2
operations/deployment-charts    master      +277 -0
operations/mediawiki-config     master      +1 -1
operations/mediawiki-config     master      +1 -1
operations/mediawiki-config     master      +4 -4
operations/mediawiki-config     master      +1 -1
operations/mediawiki-config     master      +2 -0
operations/mediawiki-config     master      +5 -1
operations/deployment-charts    master      +4 -4
operations/deployment-charts    master      +16 -5
operations/deployment-charts    master      +267 -166
operations/mediawiki-config     master      +1 -1
operations/mediawiki-config     master      +1 -1
operations/deployment-charts    master      +4 -4
operations/deployment-charts    master      +1 -1
operations/mediawiki-config     master      +1 -1
operations/mediawiki-config     master      +1 -1
operations/mediawiki-config     master      +2 -2
operations/puppet               production  +6 -0
operations/puppet               production  +4 -0
operations/mediawiki-config     master      +6 -6
operations/puppet               production  +2 -8
operations/puppet               production  +2 -0
operations/puppet               production  +417 -0
operations/puppet               production  +87 -77

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 578503 merged by jenkins-bot:
[operations/deployment-charts@master] Bump up memory limits for echostore

https://gerrit.wikimedia.org/r/578503

@akosiaris Hm, yes, let's try! We are going to have issues with changes I made as part of T242861: Clarify multi-service instance concepts in helm charts and enable canary releases. The two main changes are:

  1. Every resource gets at least the following labels:

     labels:
       chart: {{ template "wmf.chartname" . }}   # eventgate
       app: {{ .Values.main_app.name }}          # eventgate-main
       release: {{ .Release.Name }}              # production or canary

  2. The Service uses the routing_tag label as a selector to select which pods it should route to. The Deployment pod template has labels:

     labels:
       chart: {{ template "wmf.chartname" . }}
       app: {{ .Values.main_app.name }}
       release: {{ .Release.Name }}
       routing_tag: {{ .Values.service.routing_tag | default .Release.Name }}

(Hm, it looks like I lost using wmf.chartid as a label somewhere; happy to add that back in somehow.)
I think to do this we have to finish resolving T242861 for all charts, not just eventgate and eventstreams.
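
The label scheme described in this comment can be sketched in Python as follows. This is an illustration under assumptions: the Service selector is taken to contain only `routing_tag`, and the example tag value `eventgate-main` is hypothetical; `pod_labels` and `selects` are made-up helper names, not part of the charts.

```python
from typing import Optional


def pod_labels(chart: str, app: str, release: str,
               routing_tag: Optional[str] = None) -> dict:
    """Labels of the Deployment pod template, as in the snippet above."""
    return {
        "chart": chart,
        "app": app,
        "release": release,
        # Helm's `default` falls back when the value is unset or empty.
        "routing_tag": routing_tag or release,
    }


def selects(selector: dict, labels: dict) -> bool:
    """Kubernetes equality-based selector: every selector pair must match."""
    return all(labels.get(key) == value for key, value in selector.items())


# Hypothetical values: both releases share one explicit routing_tag.
production = pod_labels("eventgate", "eventgate-main", "production", "eventgate-main")
canary = pod_labels("eventgate", "eventgate-main", "canary", "eventgate-main")
service_selector = {"routing_tag": "eventgate-main"}
shared = [selects(service_selector, p) for p in (production, canary)]

# Without an explicit routing_tag, each release is tagged with its own name.
canary_default = pod_labels("eventgate", "eventgate-main", "canary")
isolated = selects({"routing_tag": "production"}, canary_default)
```

With a shared `routing_tag`, one Service routes to both production and canary pods; with the default, a Service selecting `production` does not reach canary pods.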

Change 578492 merged by jenkins-bot:
[operations/mediawiki-config@master] Use Envoy to talk to echostore

https://gerrit.wikimedia.org/r/578492

Change 578494 abandoned by Giuseppe Lavagetto:
Add ores, wdqs to ProductionServices

Reason:
Not needed.

https://gerrit.wikimedia.org/r/578494

Change 578493 merged by jenkins-bot:
[operations/mediawiki-config@master] Move Termbox to ProductionServices, use envoy

https://gerrit.wikimedia.org/r/578493

@Joe @akosiaris all deployments of eventgate and eventstreams have been updated to use tls.resources etc.

Krinkle updated the task description. Mar 11 2020, 1:42 AM

Change 578495 merged by jenkins-bot:
[operations/mediawiki-config@master] wdqs-internal: switch to use envoy

https://gerrit.wikimedia.org/r/578495

Joe added a comment. Mar 11 2020, 8:21 AM

> @Joe @akosiaris all deployments of eventgate and eventstreams have been updated to use tls.resources etc.

Thanks a lot! I hope to get to switch eventgate-analytics to TLS today then!

Change 578496 merged by jenkins-bot:
[operations/mediawiki-config@master] Switch ores to use envoy

https://gerrit.wikimedia.org/r/578496

Change 582777 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/deployment-charts@master] Make configuration of envoy a ConfigMap

https://gerrit.wikimedia.org/r/582777

Change 582792 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/deployment-charts@master] Add local service proxy to the tls terminator v0.2

https://gerrit.wikimedia.org/r/582792

Change 576009 merged by jenkins-bot:
[operations/mediawiki-config@master] ProductionServices: switch eventgate-main to use envoy

https://gerrit.wikimedia.org/r/576009

Mentioned in SAL (#wikimedia-operations) [2020-03-26T12:57:13Z] <oblivian@deploy1001> Synchronized wmf-config/ProductionServices.php: eventgate-main to use envoy T244843 (duration: 01m 07s)

Change 582777 merged by jenkins-bot:
[operations/deployment-charts@master] Make configuration of envoy a ConfigMap

https://gerrit.wikimedia.org/r/582777

Change 597229 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] tls_helper: qoute idle_timeout default value

https://gerrit.wikimedia.org/r/597229

Change 597229 merged by jenkins-bot:
[operations/deployment-charts@master] tls_helper: qoute idle_timeout default value

https://gerrit.wikimedia.org/r/597229

Change 597240 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] tls_helper: fix typo in template reference

https://gerrit.wikimedia.org/r/597240

Change 597240 merged by jenkins-bot:
[operations/deployment-charts@master] tls_helper: fix typo in template reference

https://gerrit.wikimedia.org/r/597240

Change 597303 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] tls_helper: fix the envoy config configmap

https://gerrit.wikimedia.org/r/597303

Change 597303 merged by jenkins-bot:
[operations/deployment-charts@master] tls_helper: fix the envoy config configmap

https://gerrit.wikimedia.org/r/597303

Change 612461 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] scb: add service proxy, use it in the applications.

https://gerrit.wikimedia.org/r/612461

Change 612462 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] maps: add the service proxy

https://gerrit.wikimedia.org/r/612462

Change 612463 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] maps: use the service proxy to connect to wdqs

https://gerrit.wikimedia.org/r/612463

TK-999 added a subscriber: TK-999. Jul 31 2020, 4:39 PM

Change 582792 merged by jenkins-bot:
[operations/deployment-charts@master] Add local service proxy to the tls terminator v0.2

https://gerrit.wikimedia.org/r/582792

jijiki moved this task from Incoming 🐫 to Unsorted on the serviceops board. Aug 17 2020, 11:46 PM

Change 621206 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/services/ores/deploy@master] Set testwiki API extractor to use the internal endpoint instead

https://gerrit.wikimedia.org/r/621206

Change 621206 merged by Ladsgroup:
[mediawiki/services/ores/deploy@master] Set testwiki API extractor to use the internal endpoint instead

https://gerrit.wikimedia.org/r/621206

Mentioned in SAL (#wikimedia-operations) [2020-08-20T12:44:24Z] <oblivian@deploy1001> Started deploy [ores/deploy@a208a0e]: switch testwiki to use envoy as a service proxy T244843

Mentioned in SAL (#wikimedia-operations) [2020-08-20T12:51:27Z] <oblivian@deploy1001> Finished deploy [ores/deploy@a208a0e]: switch testwiki to use envoy as a service proxy T244843 (duration: 07m 03s)

Mentioned in SAL (#wikimedia-operations) [2020-08-20T13:00:20Z] <oblivian@deploy1001> Started deploy [ores/deploy@a208a0e]: switch testwiki to use envoy as a service proxy T244843

Mentioned in SAL (#wikimedia-operations) [2020-08-20T13:11:38Z] <oblivian@deploy1001> Finished deploy [ores/deploy@a208a0e]: switch testwiki to use envoy as a service proxy T244843 (duration: 11m 19s)

Mentioned in SAL (#wikimedia-operations) [2020-08-20T13:14:41Z] <oblivian@deploy1001> Started deploy [ores/deploy@74677b6]: switch testwiki to use envoy as a service proxy T244843 (take 2)

Mentioned in SAL (#wikimedia-operations) [2020-08-20T13:26:18Z] <oblivian@deploy1001> Finished deploy [ores/deploy@74677b6]: switch testwiki to use envoy as a service proxy T244843 (take 2) (duration: 11m 37s)

Change 621522 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/services/ores/deploy@master] Migrate rest of wikis to Envoy

https://gerrit.wikimedia.org/r/621522

Change 621522 merged by Ladsgroup:
[mediawiki/services/ores/deploy@master] Migrate rest of wikis to Envoy

https://gerrit.wikimedia.org/r/621522

Mentioned in SAL (#wikimedia-operations) [2020-08-20T13:39:14Z] <oblivian@deploy1001> Started deploy [ores/deploy@e860508]: switch everything to use envoy as a service proxy T244843

Mentioned in SAL (#wikimedia-operations) [2020-08-20T13:53:14Z] <oblivian@deploy1001> Finished deploy [ores/deploy@e860508]: switch everything to use envoy as a service proxy T244843 (duration: 14m 00s)

Change 622580 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/deployment-charts@master] termbox: switch to use envoy to call MediaWiki

https://gerrit.wikimedia.org/r/622580

Change 622580 merged by jenkins-bot:
[operations/deployment-charts@master] termbox: switch to use envoy to call MediaWiki

https://gerrit.wikimedia.org/r/622580

Joe updated the task description. Sep 2 2020, 8:45 AM
JMeybohm updated the task description. Sep 2 2020, 1:35 PM
JMeybohm updated the task description. Sep 3 2020, 9:28 AM
JMeybohm updated the task description. Sep 3 2020, 9:58 AM

Change 624290 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] Revert "Convert proton to the new layout"

https://gerrit.wikimedia.org/r/624290

Change 624290 merged by jenkins-bot:
[operations/deployment-charts@master] Revert "Convert proton to the new layout"

https://gerrit.wikimedia.org/r/624290

Joe updated the task description. Sep 7 2020, 9:39 AM
Joe updated the task description. Sep 8 2020, 6:37 AM
Joe updated the task description. Sep 8 2020, 6:44 AM
JMeybohm updated the task description. Sep 8 2020, 7:27 AM

Change 625839 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/deployment-charts@master] default-network-policy: allow restbase HTTPS port

https://gerrit.wikimedia.org/r/625839

Change 625839 merged by jenkins-bot:
[operations/deployment-charts@master] default-network-policy: allow restbase HTTPS port

https://gerrit.wikimedia.org/r/625839

Joe updated the task description. Sep 8 2020, 1:33 PM
Joe updated the task description. Sep 14 2020, 7:39 AM
Joe updated the task description. Sep 14 2020, 3:02 PM
Joe updated the task description. Sep 16 2020, 4:10 PM
JMeybohm updated the task description. Sep 17 2020, 1:52 PM

Change 628799 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] services: add TLS encrypted endpoint for ores (1/2)

https://gerrit.wikimedia.org/r/628799

Change 628800 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] services: add TLS encrypted endpoint for ores (2/2)

https://gerrit.wikimedia.org/r/628800

Change 628801 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] services: use TLS to connect to ORES

https://gerrit.wikimedia.org/r/628801

Change 628802 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] services: retire the ORES http endpoint (1/2)

https://gerrit.wikimedia.org/r/628802

Change 628803 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] services: retire the ORES http endpoint (2/2)

https://gerrit.wikimedia.org/r/628803

Change 628799 merged by Giuseppe Lavagetto:
[operations/puppet@production] services: add TLS encrypted endpoint for ores (1/2)

https://gerrit.wikimedia.org/r/628799

Change 628800 merged by Giuseppe Lavagetto:
[operations/puppet@production] services: add TLS encrypted endpoint for ores (2/2)

https://gerrit.wikimedia.org/r/628800

JMeybohm updated the task description. Sep 22 2020, 10:04 AM
Joe updated the task description. Sep 22 2020, 11:13 AM

Change 628801 merged by Giuseppe Lavagetto:
[operations/puppet@production] services: use TLS to connect to ORES

https://gerrit.wikimedia.org/r/628801

JMeybohm updated the task description. Sep 22 2020, 2:15 PM

Change 574988 abandoned by Giuseppe Lavagetto:
[operations/puppet@production] mediawiki::common: use envoy for tls termination too in nodes using it

Reason:
Superseded

https://gerrit.wikimedia.org/r/574988

Change 630537 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/deployment-charts@master] changeprop: use https to connect to ORES, restbase

https://gerrit.wikimedia.org/r/630537

Change 630537 abandoned by Giuseppe Lavagetto:
[operations/deployment-charts@master] changeprop: use https to connect to ORES, restbase

Reason:
Already merged elsewhere

https://gerrit.wikimedia.org/r/630537

Joe updated the task description. Sep 28 2020, 10:19 AM
Joe updated the task description.

Change 630562 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] service::configuration: connect to restbase via TLS

https://gerrit.wikimedia.org/r/630562

JMeybohm updated the task description. Sep 29 2020, 8:34 AM

Change 628802 merged by JMeybohm:
[operations/puppet@production] services: retire the ORES http endpoint (1/2)

https://gerrit.wikimedia.org/r/628802

Change 628803 merged by JMeybohm:
[operations/puppet@production] services: retire the ORES http endpoint (2/2)

https://gerrit.wikimedia.org/r/628803

Mentioned in SAL (#wikimedia-operations) [2020-10-01T14:42:42Z] <jayme> running puppet on lvs servers - T244843 T255878

Mentioned in SAL (#wikimedia-operations) [2020-10-01T14:48:36Z] <jayme> restarting pybal on lvs2010.codfw.wmnet - T244843 T255878

Mentioned in SAL (#wikimedia-operations) [2020-10-01T14:50:02Z] <jayme> restarting pybal on lvs1015.eqiad.wmnet,lvs2009.codfw.wmnet - T244843 T255878

Mentioned in SAL (#wikimedia-operations) [2020-10-01T14:53:48Z] <jayme> running ipvsadm -D -t 10.2.1.10:8081; ipvsadm -D -t 10.2.1.47:8889 on lvs2010.codfw.wmnet,lvs2009.codfw.wmnet - T244843 T255878

Mentioned in SAL (#wikimedia-operations) [2020-10-01T14:55:43Z] <jayme> running ipvsadm -D -t 10.2.2.10:8081; ipvsadm -D -t 10.2.2.47:8889 on lvs1015.eqiad.wmnet - T244843 T255878

JMeybohm updated the task description. Oct 2 2020, 8:35 AM
JMeybohm updated the task description. Oct 2 2020, 9:22 AM
JMeybohm updated the task description. Oct 2 2020, 9:27 AM

Change 630562 merged by Giuseppe Lavagetto:
[operations/puppet@production] service::configuration: connect to restbase via TLS

https://gerrit.wikimedia.org/r/630562

Change 578497 abandoned by Giuseppe Lavagetto:
[operations/mediawiki-config@master] Switch restbase to use envoy

Reason:

https://gerrit.wikimedia.org/r/578497