Create a service-to-service proxy for handling HTTP calls from services to other entities
Open, High · Public

Description

With the scalability issues we've been seeing on php-fpm when many higher-latency HTTP calls are involved, the need for a proxy that can handle connections between services has become apparent.

More generally, we want a middleware that gives us the following capabilities when dealing with RPC calls to other services:

  • Allow connection pooling
  • Work well with our DNS discovery mechanism
  • Enable end-to-end TLS without relying on every single service doing encryption the "right" way
  • Allow configuring per-endpoint timeouts
  • Global and local-only rate limiting
  • Allow monitoring RPC calls (telemetry)
  • Tracing of RPC calls

We've evaluated nginx in the past, and the non-commercial version lacks even the most important of these features: it can support either DNS discovery or connection pooling, but not both. We already use Envoy as a TLS terminator on most servers, so we can probably use it to implement such a middleware, which is also what Envoy was designed for.
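
As a rough illustration of the capabilities listed above, a local Envoy listener that forwards plain HTTP from the application to a TLS-enabled upstream might look like the sketch below. This is not the configuration used in production: the service name, hostname, ports and timeout values are all hypothetical, and the field names assume Envoy's v3 API. The STRICT_DNS cluster type keeps re-resolving the upstream hostname, Envoy keeps upstream connections open for reuse, and the per-route `timeout` bounds each request.

```yaml
# Hypothetical sketch only: names, hostnames, ports and timeouts are assumptions.
static_resources:
  listeners:
  - name: mathoid_local
    address:
      socket_address:
        address: 127.0.0.1        # only reachable from the local host
        port_value: 6003          # hypothetical local port
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: mathoid_egress   # prefix for the telemetry emitted by this listener
          route_config:
            virtual_hosts:
            - name: mathoid
              domains: ["*"]
              routes:
              - match:
                  prefix: "/"
                route:
                  cluster: mathoid
                  timeout: 5s           # per-route request timeout
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  - name: mathoid
    type: STRICT_DNS              # re-resolve the DNS name periodically
    connect_timeout: 1s
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: mathoid
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: mathoid.example.internal   # hypothetical discovery name
                port_value: 4001                    # hypothetical upstream TLS port
    transport_socket:             # encrypt traffic to the upstream
      name: envoy.transport_sockets.tls
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
        sni: mathoid.example.internal
```

With a setup along these lines, the application would be pointed at http://127.0.0.1:6003 instead of the remote service, leaving TLS, connection reuse and timeouts to the proxy.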

Details

Related Gerrit Patches:
operations/deployment-charts : master : Make configuration of envoy a ConfigMap
operations/mediawiki-config : master : Switch restbase to use envoy
operations/mediawiki-config : master : ProductionServices: switch eventgate-main to use envoy
operations/deployment-charts : master : Add local service proxy to the tls terminator v0.2
operations/mediawiki-config : master : Switch ores to use envoy
operations/mediawiki-config : master : wdqs-internal: switch to use envoy
operations/mediawiki-config : master : Move Termbox to ProductionServices, use envoy
operations/mediawiki-config : master : Add ores, wdqs to ProductionServices
operations/mediawiki-config : master : Use Envoy to talk to echostore
operations/deployment-charts : master : Bump up memory limits for echostore
operations/deployment-charts : master : tls: Supply sane default values for resources
operations/deployment-charts : master : Package charts that support the new resource limits
operations/mediawiki-config : master : ProductionServices:switch eventgate-analytics to use envoy
operations/mediawiki-config : master : ProductionServices: use envoy to connect to mathoid
operations/deployment-charts : master : sessionstore: bump memory limits in production
operations/deployment-charts : master : sessionstore: Bump memory limits
operations/mediawiki-config : master : ProductionServices: Revert to using discovery for sessionstore.
operations/mediawiki-config : master : ProductionServices: use the local proxy for sessionstore
operations/mediawiki-config : master : ProductionServices: use local http proxy for parsoid, parsoidphp
operations/puppet : production : mediawiki: stop installing the nginx-based proxy
operations/puppet : production : mediawiki: stop installing the nginx proxy on the canaries
operations/mediawiki-config : master : ProductionServices: switch search to use envoy instead of nginx
operations/puppet : production : role::mediawiki::common: install envoy as a forward proxy everywhere.
operations/puppet : production : mediawiki::common: use envoy for tls termination too in nodes using it
operations/puppet : production : services_proxy::envoy: specify preferred ciphers
operations/puppet : production : profile::services_proxy: envoy-based version
operations/puppet : production : envoy: split base profile out of tlsproxy

Event Timeline

Joe created this task. Feb 11 2020, 10:18 AM
Restricted Application added a subscriber: Aklapper. · Feb 11 2020, 10:18 AM
Joe triaged this task as High priority. Feb 11 2020, 10:19 AM
Joe claimed this task. Feb 17 2020, 12:54 PM

Change 572831 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] envoy: split base profile out of tlsproxy

https://gerrit.wikimedia.org/r/572831

Change 572832 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] profile::services_proxy: envoy-based version

https://gerrit.wikimedia.org/r/572832

Change 572831 merged by Giuseppe Lavagetto:
[operations/puppet@production] envoy: split base profile out of tlsproxy

https://gerrit.wikimedia.org/r/572831

Change 572832 merged by Giuseppe Lavagetto:
[operations/puppet@production] profile::services_proxy: envoy-based version

https://gerrit.wikimedia.org/r/572832

Change 574988 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] mediawiki::common: use envoy for tls termination too in nodes using it

https://gerrit.wikimedia.org/r/574988

Change 575015 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] services_proxy::envoy: specify preferred ciphers

https://gerrit.wikimedia.org/r/575015

Change 575015 merged by Giuseppe Lavagetto:
[operations/puppet@production] services_proxy::envoy: specify preferred ciphers

https://gerrit.wikimedia.org/r/575015

Change 575225 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] role::mediawiki::common: install envoy as a forward proxy everywhere.

https://gerrit.wikimedia.org/r/575225

Change 575225 merged by Giuseppe Lavagetto:
[operations/puppet@production] role::mediawiki::common: install envoy as a forward proxy everywhere.

https://gerrit.wikimedia.org/r/575225

Change 575268 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/mediawiki-config@master] ProductionServices: switch search to use envoy instead of nginx

https://gerrit.wikimedia.org/r/575268

Change 575269 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/mediawiki-config@master] ProductionServices: use local http proxy for parsoid, parsoidphp

https://gerrit.wikimedia.org/r/575269

Change 575270 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/mediawiki-config@master] ProductionServices: use the local proxy for sessionstore

https://gerrit.wikimedia.org/r/575270

Change 576007 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/mediawiki-config@master] ProductionServices: use envoy to connect to mathoid

https://gerrit.wikimedia.org/r/576007

Change 576008 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/mediawiki-config@master] ProductionServices:switch eventgate-analytics to use envoy

https://gerrit.wikimedia.org/r/576008

Change 576009 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/mediawiki-config@master] ProductionServices: switch eventgate-main to use envoy

https://gerrit.wikimedia.org/r/576009

Change 575268 merged by jenkins-bot:
[operations/mediawiki-config@master] ProductionServices: switch search to use envoy instead of nginx

https://gerrit.wikimedia.org/r/575268

Change 576067 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] mediawiki: stop installing the nginx proxy on the canaries

https://gerrit.wikimedia.org/r/576067

Change 576067 merged by Giuseppe Lavagetto:
[operations/puppet@production] mediawiki: stop installing the nginx proxy on the canaries

https://gerrit.wikimedia.org/r/576067

Change 576078 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] mediawiki: stop installing the nginx-based proxy

https://gerrit.wikimedia.org/r/576078

Change 576078 merged by Giuseppe Lavagetto:
[operations/puppet@production] mediawiki: stop installing the nginx-based proxy

https://gerrit.wikimedia.org/r/576078

Change 575269 merged by jenkins-bot:
[operations/mediawiki-config@master] ProductionServices: use local http proxy for parsoid, parsoidphp

https://gerrit.wikimedia.org/r/575269

Change 575270 merged by jenkins-bot:
[operations/mediawiki-config@master] ProductionServices: use the local proxy for sessionstore

https://gerrit.wikimedia.org/r/575270

Change 577208 had a related patch set uploaded (by Hnowlan; owner: Hnowlan):
[operations/mediawiki-config@master] ProductionServices: Revert to using discovery for sessionstore.

https://gerrit.wikimedia.org/r/577208

Change 577208 abandoned by Hnowlan:
ProductionServices: Revert to using discovery for sessionstore.

Reason:
addressed in 577207

https://gerrit.wikimedia.org/r/577208

Change 578301 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] sessionstore: Bump memory limits

https://gerrit.wikimedia.org/r/578301

Change 578301 merged by jenkins-bot:
[operations/deployment-charts@master] sessionstore: Bump memory limits

https://gerrit.wikimedia.org/r/578301

Ladsgroup added a subscriber: Ladsgroup.

Change 578318 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/deployment-charts@master] sessionstore: bump memory limits in production

https://gerrit.wikimedia.org/r/578318

Change 578318 merged by Giuseppe Lavagetto:
[operations/deployment-charts@master] sessionstore: bump memory limits in production

https://gerrit.wikimedia.org/r/578318

Change 576007 merged by Giuseppe Lavagetto:
[operations/mediawiki-config@master] ProductionServices: use envoy to connect to mathoid

https://gerrit.wikimedia.org/r/576007

Change 576008 merged by jenkins-bot:
[operations/mediawiki-config@master] ProductionServices:switch eventgate-analytics to use envoy

https://gerrit.wikimedia.org/r/576008

Change 578478 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] tls: Supply sane default values for resources

https://gerrit.wikimedia.org/r/578478

Change 578478 merged by jenkins-bot:
[operations/deployment-charts@master] tls: Supply sane default values for resources

https://gerrit.wikimedia.org/r/578478

Change 578480 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] Package charts that support the new resource limits

https://gerrit.wikimedia.org/r/578480

Change 578480 merged by jenkins-bot:
[operations/deployment-charts@master] Package charts that support the new resource limits

https://gerrit.wikimedia.org/r/578480

Mentioned in SAL (#wikimedia-operations) [2020-03-10T09:21:28Z] <akosiaris> update blubberoid, cxserver, citoid to push the TLS resources changes T244843

Copying from the last comment of https://gerrit.wikimedia.org/r/578478

@Ottomata: eventgate and eventstreams don't use the shared _tls_helpers, so they can't benefit from this; the changes have to be applied to those charts manually. Could you please take care of that? This is a blocker for switching eventstreams and eventgate-{analytics,main} to TLS.

Change 578492 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/mediawiki-config@master] Use Envoy to talk to echostore

https://gerrit.wikimedia.org/r/578492

Change 578493 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/mediawiki-config@master] Move Termbox to ProductionServices, use envoy

https://gerrit.wikimedia.org/r/578493

Change 578494 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/mediawiki-config@master] Add ores, wdqs to ProductionServices

https://gerrit.wikimedia.org/r/578494

Change 578495 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/mediawiki-config@master] wdqs-internal: switch to use envoy

https://gerrit.wikimedia.org/r/578495

Change 578496 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/mediawiki-config@master] Switch ores to use envoy

https://gerrit.wikimedia.org/r/578496

Change 578497 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/mediawiki-config@master] Switch restbase to use envoy

https://gerrit.wikimedia.org/r/578497

Change 578503 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/deployment-charts@master] Bump up memory limits for echostore

https://gerrit.wikimedia.org/r/578503

Change 578503 merged by jenkins-bot:
[operations/deployment-charts@master] Bump up memory limits for echostore

https://gerrit.wikimedia.org/r/578503

@akosiaris Hm, yes, let's try! We are going to have issues with changes I made as part of T242861: Clarify multi-service instance concepts in helm charts and enable canary releases. The two main changes are:

  1. Every resource gets at least the following labels:
     labels:
       chart: {{ template "wmf.chartname" . }}   # eventgate
       app: {{ .Values.main_app.name }}          # eventgate-main
       release: {{ .Release.Name }}              # production or canary
  2. The Service uses the routing_tag label as a selector to select which pods it should route to. The Deployment pod template has labels:
     labels:
       chart: {{ template "wmf.chartname" . }}
       app: {{ .Values.main_app.name }}
       release: {{ .Release.Name }}
       routing_tag: {{ .Values.service.routing_tag | default .Release.Name }}

(Hm, it looks like I lost using wmf.chartid as a label somewhere; happy to add that back in somehow.)
I think to do this we have to finish resolving T242861 for all charts, not just eventgate and eventstreams.
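
For readers unfamiliar with the selector arrangement described in the comment above, a Service template along these lines would pick up every pod carrying the same routing_tag. This is only a sketch: the labels and the routing_tag expression are taken from the comment, while the Service name and the `.Values.service.port` key are assumptions made for illustration and may not match the real charts.

```yaml
# Hypothetical sketch of a Service that routes by routing_tag.
apiVersion: v1
kind: Service
metadata:
  name: {{ .Values.main_app.name }}-{{ .Release.Name }}   # naming scheme is an assumption
  labels:
    chart: {{ template "wmf.chartname" . }}
    app: {{ .Values.main_app.name }}
    release: {{ .Release.Name }}
spec:
  selector:
    app: {{ .Values.main_app.name }}
    # Matches the routing_tag label set on the Deployment pod template.
    routing_tag: {{ .Values.service.routing_tag | default .Release.Name }}
  ports:
  - name: http
    port: {{ .Values.service.port }}   # hypothetical values key
```

Under this sketch, a canary release whose service.routing_tag is set to the production release's tag would have its pods matched by the production Service's selector, which is presumably how the canary releases mentioned in T242861 receive a share of traffic.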

Change 578492 merged by jenkins-bot:
[operations/mediawiki-config@master] Use Envoy to talk to echostore

https://gerrit.wikimedia.org/r/578492

Change 578494 abandoned by Giuseppe Lavagetto:
Add ores, wdqs to ProductionServices

Reason:
Not needed.

https://gerrit.wikimedia.org/r/578494

Change 578493 merged by jenkins-bot:
[operations/mediawiki-config@master] Move Termbox to ProductionServices, use envoy

https://gerrit.wikimedia.org/r/578493

@Joe @akosiaris all deployments of eventgate and eventstreams have been updated to use tls.resources etc.

Krinkle updated the task description. Mar 11 2020, 1:42 AM

Change 578495 merged by jenkins-bot:
[operations/mediawiki-config@master] wdqs-internal: switch to use envoy

https://gerrit.wikimedia.org/r/578495

Joe added a comment. Mar 11 2020, 8:21 AM

> @Joe @akosiaris all deployments of eventgate and eventstreams have been updated to use tls.resources etc.

Thanks a lot! I hope to get to switch eventgate-analytics to TLS today then!

Change 578496 merged by jenkins-bot:
[operations/mediawiki-config@master] Switch ores to use envoy

https://gerrit.wikimedia.org/r/578496

Change 582777 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/deployment-charts@master] Make configuration of envoy a ConfigMap

https://gerrit.wikimedia.org/r/582777

Change 582792 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/deployment-charts@master] Add local service proxy to the tls terminator v0.2

https://gerrit.wikimedia.org/r/582792

Change 576009 merged by jenkins-bot:
[operations/mediawiki-config@master] ProductionServices: switch eventgate-main to use envoy

https://gerrit.wikimedia.org/r/576009

Mentioned in SAL (#wikimedia-operations) [2020-03-26T12:57:13Z] <oblivian@deploy1001> Synchronized wmf-config/ProductionServices.php: eventgate-main to use envoy T244843 (duration: 01m 07s)