Deploy the RESTBase front-end service (RESTRouter) to Kubernetes
Closed, Declined. Public. 0 Estimated Story Points.

Description

We are splitting RESTBase into two components: the (public) REST API router and the storage service (cf. T220449: Split RESTBase in two services: storage service and API router/proxy). This task is about deploying the front-end REST router in Kubernetes.

Service Info

Service name: RESTRouter (name still under discussion, cf. T220761)
Owners: @Pchelolo and @mobrovac (Platform Engineering)
Repository: mediawiki/services/restbase
ETA: by the end of Q4 FY18/19
Description: RESTRouter is the routing part of (the current) RESTBase. It accepts external requests, validates them (performs access checks if needed) and performs all the business logic related to the request: it looks up the storage for possible data hits and, if needed, issues requests to back-end services to complete the requests, sending the response to storage prior to returning it to the client.
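
The request flow described above can be sketched as a read-through cache in front of a back-end service. This is only a toy model under stated assumptions: the class name, the in-memory dict standing in for the storage service, and the callables standing in for access checks and back-end services are all hypothetical and do not reflect the actual RESTBase code.

```python
from typing import Callable, Dict, List, Optional


class RestRouterSketch:
    """Toy model of the described request flow (all names are hypothetical)."""

    def __init__(self, backend: Callable[[str], str],
                 is_allowed: Callable[[str], bool] = lambda path: True) -> None:
        self.storage: Dict[str, str] = {}  # stands in for the storage service
        self.backend = backend             # stands in for a back-end service
        self.is_allowed = is_allowed       # stands in for validation / access checks

    def handle(self, path: str) -> Optional[str]:
        # 1. Validate the request and perform access checks.
        if not self.is_allowed(path):
            return None
        # 2. Look up storage for a possible data hit.
        hit = self.storage.get(path)
        if hit is not None:
            return hit
        # 3. On a miss, ask the back-end service to complete the request.
        response = self.backend(path)
        # 4. Send the response to storage before returning it to the client.
        self.storage[path] = response
        return response


calls: List[str] = []


def fake_backend(path: str) -> str:
    calls.append(path)  # record every back-end hit
    return f"rendered:{path}"


router = RestRouterSketch(fake_backend,
                          is_allowed=lambda p: not p.startswith("/private"))
first = router.handle("/page/html/Foo")
second = router.handle("/page/html/Foo")  # served from storage, no back-end call
denied = router.handle("/private/x")      # rejected by the access check
```

The second request for the same path never reaches the back-end, which is the behaviour the split has to preserve once storage lives in a separate service.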

Deployment Plans

RESTRouter migration plans. Some steps are common to all plans; they are listed below.

  • First, we deploy RESTRouter to k8s.
  • We expose the storage routes in RESTBase (cf. PR #1103).
  • We test RESTRouter under load (options include synthetic traffic, mirroring, or using only background updates/internal requests).

Plan 1

Have restbase listen on both 7231 and 7233 and configure LVS restbase.svc.$::site.wmnet to also use 7233
Instantiate restrouter on a new LVS IP and DNS (restrouter.svc.$::site.wmnet) and have it talk to restbase.svc.$::site.wmnet:7233
Move services 1 by 1 to restrouter.discovery.wmnet (the site aware discovery records for restrouter.svc.$::site.wmnet)

Pros

  • The move is gradual on a service level. Services are migrated one by one based on their configuration, unearthing potential problems one at a time.
  • The currently stable and battle-tested restbase installation is kept around even while more and more services are moved.
  • It's rather easy configuration-wise and rather easy to do in steps.
  • No downtime for services.

Cons

  • The migration might take time if issues arise, but at least blockers will be service-specific.
  • There is no gradual traffic switchover. For every service it's a "canary host first", then all-or-nothing approach. Even the canary host receives, depending on the DC, between 13% and 25% of traffic.
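
One way to read the 13% to 25% range above: if a service's traffic is spread evenly over its hosts in a DC, a single canary host receives 1/N of it. The sketch below only illustrates that arithmetic; the even-spread assumption and the host counts (4 and 8) are hypothetical and not taken from this task.

```python
def canary_share(hosts_in_dc: int) -> float:
    """Fraction of traffic hitting one canary host, assuming an even spread."""
    if hosts_in_dc < 1:
        raise ValueError("need at least one host")
    return 1.0 / hosts_in_dc


# Hypothetical host counts per DC, chosen only for illustration.
for n in (4, 8):
    print(f"{n} hosts -> {canary_share(n):.1%} of traffic on the canary")
```

Under these assumptions a 4-host pool puts a quarter of the traffic on the canary and an 8-host pool about an eighth, so even the first step of a per-service switch is far from a small trickle.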

Plan 2

Have restbase listen on both 7231 and 7233
Add a new LVS IP on the restbase hosts and name it restbase-backend.svc.$::site.wmnet
Configure restrouter to connect to restbase-backend.svc.$::site.wmnet:7233
Add the LVS IP for restbase.svc.$::site.wmnet to kubernetes hosts
Add the kubernetes hosts to LVS for restbase.svc.$::site.wmnet
Slowly migrate the traffic from the current restbase hosts to kubernetes hosts
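
The last step can be pictured with a small calculation, assuming the load balancer distributes requests in proportion to backend weights. The pool sizes (6 current restbase hosts, 4 kubernetes hosts) and the weight steps are hypothetical; this is a sketch of the idea, not of the actual LVS configuration.

```python
from typing import List, Tuple


def k8s_traffic_share(old_hosts: int, old_weight: int,
                      k8s_hosts: int, k8s_weight: int) -> float:
    """Share of traffic reaching kubernetes hosts under proportional weighting."""
    total = old_hosts * old_weight + k8s_hosts * k8s_weight
    if total == 0:
        raise ValueError("all backends have zero weight")
    return k8s_hosts * k8s_weight / total


# (old_weight, k8s_weight) per migration step; hypothetical values.
steps: List[Tuple[int, int]] = [(10, 0), (10, 1), (10, 5), (10, 10), (0, 10)]
shares = [k8s_traffic_share(6, old_w, 4, k8s_w) for old_w, k8s_w in steps]

for (old_w, k8s_w), share in zip(steps, shares):
    print(f"old={old_w:2d} k8s={k8s_w:2d} -> {share:.1%} on kubernetes")
```

Pausing or rolling back is then just a matter of holding or reverting a weight step, which is the main attraction of this plan.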

Pros

  • The services see 0 changes. Everything happens transparently to them.
  • The move of traffic is gradual, allowing us to roll back quickly and easily, as well as to pause the migration.
  • No downtime for services

Cons

  • Rather convoluted configuration-wise, with some margin for mistakes.
  • All-or-nothing approach as far as services go. There is no way to distinguish between them.
  • The migration might take a long time, because issues that arise will probably be global blockers for all services.
  • Rollbacks are possible, but if issues arise, it's probably going to be a full rollback to the old installation.
  • The resulting restbase.svc.$::site.wmnet DNS name does not reflect the actual software powering the frontend (restrouter), possibly leading to future misunderstandings/confusion.

Post migration

In the post-deploy clean-up step, we remove public route handling from RESTBase, effectively turning it into the back-end storage service.

Comment from Giuseppe:

I think plan 1 is much simpler. It requires more patches and more attention to not leave anything behind, but it's probably the better plan. Please be mindful that restrouter will need to be terminating SSL as well, like restbase does.
I vote plan 1.

Marko:

  • My vote goes for plan 1 as well. Even though it will probably take longer, it makes it clear to all parties involved that changes are happening; that also means service owners will be more aware in case of problems, so those will be easier to detect.
  • I agree that the end result is better with restrouter.svc than restbase.svc.

RESTRouter will effectively take over request handling from RESTBase, so we will need to divert traffic to it without interruption.

Benchmarking:

Details

Repo | Branch | Lines +/-
operations/deployment-charts | master | +119 -94
operations/deployment-charts | master | +115 -91
operations/deployment-charts | master | +116 -93
operations/deployment-charts | master | +0 -6
operations/deployment-charts | master | +27 -8
operations/deployment-charts | master | +2 -0
operations/dns | master | +2 -2
operations/deployment-charts | master | +6 -0
operations/deployment-charts | master | +3 -3
operations/puppet | production | +72 -12
operations/dns | master | +6 -0
operations/deployment-charts | master | +3 -3
operations/deployment-charts | master | +6 -0
operations/deployment-charts | master | +27 -26
mediawiki/services/restbase/deploy | master | +10 -0
operations/deployment-charts | master | +107 -84
operations/deployment-charts | master | +33 -32
operations/puppet | production | +52 -24
operations/puppet | production | +5 -1
mediawiki/services/restbase/deploy | master | +10 -1
operations/puppet | production | +49 -0
operations/deployment-charts | master | +98 -75
operations/deployment-charts | master | +390 -0
operations/deployment-charts | master | +66 -66
operations/deployment-charts | master | +1 K -64

Event Timeline

Change 526448 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/dns@master] Assign restrouter LVS IPs

https://gerrit.wikimedia.org/r/526448

Change 526449 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/dns@master] Activate restrouter discovery records

https://gerrit.wikimedia.org/r/526449

Change 526632 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] restrouter: Add kubernetes stanzas

https://gerrit.wikimedia.org/r/526632

Change 526719 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] restrouter: Add helmfile stanzas

https://gerrit.wikimedia.org/r/526719

Change 527130 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] restrouter: Switch to event_service_uri

https://gerrit.wikimedia.org/r/527130

Change 526719 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] restrouter: Add helmfile stanzas

https://gerrit.wikimedia.org/r/526719

Change 527130 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] restrouter: Switch to event_service_uri

https://gerrit.wikimedia.org/r/527130

Change 526632 merged by Alexandros Kosiaris:
[operations/puppet@production] restrouter: Add kubernetes stanzas

https://gerrit.wikimedia.org/r/526632

restrouter was temporarily deployed in the staging cluster today. The deployment was rolled back because it was failing while trying to reach restbase on port 7233, on which restbase does not listen yet. As soon as we figure out the exact details of the migration plan this should be ready to go. Those are:

  • Restbase listening on port 7233 as well
  • Deciding the best plan on how to switchover the traffic (percentage based, per service based)

Change 521572 merged by Mobrovac:
[mediawiki/services/restbase/deploy@master] Expose both ports 7231 and 7233.

https://gerrit.wikimedia.org/r/521572

Mentioned in SAL (#wikimedia-operations) [2019-08-26T13:06:05Z] <mobrovac@deploy1001> Started deploy [restbase/deploy@38c313d]: Bring the dev cluster up to date and expose RB on both 7231 and 7233 in it - T223953

Mentioned in SAL (#wikimedia-operations) [2019-08-26T13:06:10Z] <mobrovac@deploy1001> deploy aborted: Bring the dev cluster up to date and expose RB on both 7231 and 7233 in it - T223953 (duration: 00m 04s)

Mentioned in SAL (#wikimedia-operations) [2019-08-26T13:06:26Z] <mobrovac@deploy1001> Started deploy [restbase/deploy@38c313d] (dev-cluster): Bring the dev cluster up to date and expose RB on both 7231 and 7233 in it - T223953

Mentioned in SAL (#wikimedia-operations) [2019-08-26T13:09:48Z] <mobrovac@deploy1001> Finished deploy [restbase/deploy@38c313d] (dev-cluster): Bring the dev cluster up to date and expose RB on both 7231 and 7233 in it - T223953 (duration: 03m 22s)

Mentioned in SAL (#wikimedia-operations) [2019-08-26T13:15:58Z] <mobrovac@deploy1001> Started deploy [restbase/deploy@38c313d]: Expose RB on both 7231 and 7233 - T223953

Mentioned in SAL (#wikimedia-operations) [2019-08-26T13:38:57Z] <mobrovac@deploy1001> Finished deploy [restbase/deploy@38c313d]: Expose RB on both 7231 and 7233 - T223953 (duration: 23m 00s)

Change 532382 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[operations/puppet@production] RESTBase: Temporarily allow access to port 7233 as well

https://gerrit.wikimedia.org/r/532382

Change 534430 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] WIP: LVS: Setup port 7233 for restbase-backend

https://gerrit.wikimedia.org/r/534430

Change 532382 merged by Alexandros Kosiaris:
[operations/puppet@production] RESTBase: Temporarily allow access to port 7233 as well

https://gerrit.wikimedia.org/r/532382

Going forward with Plan #1 (which I also find better)

Change 534430 merged by Alexandros Kosiaris:
[operations/puppet@production] LVS: Setup port 7233 for restbase-backend

https://gerrit.wikimedia.org/r/534430

Change 538238 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[operations/deployment-charts@master] RESTRouter: Clean up the config && add the wikifeeds URI

https://gerrit.wikimedia.org/r/538238

Change 538238 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] RESTRouter: Clean up the config && add the wikifeeds URI

https://gerrit.wikimedia.org/r/538238

Change 538242 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] Release restrouter chart version 0.0.3

https://gerrit.wikimedia.org/r/538242

Change 538242 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] Release restrouter chart version 0.0.3

https://gerrit.wikimedia.org/r/538242

Change 538288 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/restbase/deploy@master] Expose the key_value buckets to production IPs

https://gerrit.wikimedia.org/r/538288

Change 538288 merged by Mobrovac:
[mediawiki/services/restbase/deploy@master] Expose the key_value buckets to production IPs

https://gerrit.wikimedia.org/r/538288

Mentioned in SAL (#wikimedia-operations) [2019-09-24T10:29:22Z] <mobrovac@deploy1001> Started deploy [restbase/deploy@19d0f44]: Expose the key_value buckets to production IPs - T223953

Mentioned in SAL (#wikimedia-operations) [2019-09-24T10:51:41Z] <mobrovac@deploy1001> Finished deploy [restbase/deploy@19d0f44]: Expose the key_value buckets to production IPs - T223953 (duration: 22m 20s)

Change 538882 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[operations/deployment-charts@master] RESTRouter: Add missing back-end svc URIs

https://gerrit.wikimedia.org/r/538882

Change 538882 merged by jenkins-bot:
[operations/deployment-charts@master] RESTRouter: Add missing back-end svc URIs

https://gerrit.wikimedia.org/r/538882

Change 538894 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] restrouter: Skip probes for the first 60 seconds

https://gerrit.wikimedia.org/r/538894

Change 538894 merged by jenkins-bot:
[operations/deployment-charts@master] restrouter: Skip probes for the first 60 seconds

https://gerrit.wikimedia.org/r/538894

Change 538899 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] restrouter: Skip using https for mwapi_uri

https://gerrit.wikimedia.org/r/538899

Change 538899 merged by jenkins-bot:
[operations/deployment-charts@master] restrouter: Skip using https for mwapi_uri

https://gerrit.wikimedia.org/r/538899

Change 526448 merged by Alexandros Kosiaris:
[operations/dns@master] Assign restrouter LVS IPs

https://gerrit.wikimedia.org/r/526448

Change 521584 merged by Alexandros Kosiaris:
[operations/puppet@production] LVS for RESTRouter.

https://gerrit.wikimedia.org/r/521584

Change 539109 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] restrouter: Fix the parsoid port in the configuration

https://gerrit.wikimedia.org/r/539109

Change 539109 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] restrouter: Fix the parsoid port in the configuration

https://gerrit.wikimedia.org/r/539109

Change 539115 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] calico: Add port 8000 (parsoid) to restrouter

https://gerrit.wikimedia.org/r/539115

Change 539115 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] calico: Add port 8000 (parsoid) to restrouter

https://gerrit.wikimedia.org/r/539115

Change 526449 merged by Alexandros Kosiaris:
[operations/dns@master] Activate restrouter discovery records

https://gerrit.wikimedia.org/r/526449

akosiaris claimed this task.

restrouter is up and running, LVS is set up and the discovery records have been merged. I think the migration can start. A draft dashboard is present at https://grafana.wikimedia.org/d/ZA_JiypZk/restrouter; however, the statsd metrics restrouter emits differ enough from those of the other service-runner based services that I don't feel qualified to delve further into this. Feel free to amend it to your needs.

I'll resolve this for now; we should try the migration in a different task.

Reopening as there are two more things we have to do before RESTRouter can be used:

  • decrease the service start-up time (currently at ~55s, which is too long for production use)
  • set up the rate-limiting DHT inside k8s for RESTRouter (this is currently disabled, and not having rate-limiting is not acceptable)

I am working on the former. For the latter, @akosiaris we'll have to get creative.

Change 539280 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[operations/deployment-charts@master] RESTRouter: Skip resources on start-up and add nqo.wp.org

https://gerrit.wikimedia.org/r/539280

  • set up the rate-limiting DHT inside k8s for RESTRouter (this is currently disabled, and not having rate-limiting is not acceptable)

I think we are now in a position to actually do that, but I was wondering if we have numbers about how often we rate-limit clients in restbase.

Change 539280 merged by jenkins-bot:
[operations/deployment-charts@master] RESTRouter: Skip resources on start-up and add nqo.wp.org

https://gerrit.wikimedia.org/r/539280

@akosiaris regarding rate limiting, you mentioned a (semi-)permanent DNS entry. We can set that up, but the important bit is to have it always pointing to an active pod. That means that it has to be stable during transitions, i.e. deployments of new versions of RESTRouter. The way the rate-limiting DHT works is that a new process (node/pod) contacts an existing one and joins the network. There will be a bit of churn during deploy windows, but that is tolerable as long as new pods are contacting a pod that will stick around after the deploy. That obviously will not be the case for the first pod in a deployment, but that should be fine as long as the DNS can be switched easily during the deployment.

Having rate-limiting is really a crucial feature without which we cannot start using RESTRouter in production.

@akosiaris regarding rate limiting, you mentioned a (semi-)permanent DNS entry.

An automatically updated one that is local to the kubernetes cluster (and not really visible outside of it). We already have it for cxserver, e.g.

$ dig cxserver-production-kademlia.cxserver.svc.cluster.local
<snip>
cxserver-production-kademlia.cxserver.svc.cluster.local. 5 IN A	10.64.65.213
cxserver-production-kademlia.cxserver.svc.cluster.local. 5 IN A	10.64.65.149
cxserver-production-kademlia.cxserver.svc.cluster.local. 5 IN A	10.64.65.24
cxserver-production-kademlia.cxserver.svc.cluster.local. 5 IN A	10.64.65.151
cxserver-production-kademlia.cxserver.svc.cluster.local. 5 IN A	10.64.65.19
cxserver-production-kademlia.cxserver.svc.cluster.local. 5 IN A	10.64.65.129
cxserver-production-kademlia.cxserver.svc.cluster.local. 5 IN A	10.64.65.239
cxserver-production-kademlia.cxserver.svc.cluster.local. 5 IN A	10.64.65.231

We can set that up, but the important bit is to have it always pointing to an active pod. That means that it has to be stable during transitions, i.e. deployments of new versions of RESTRouter. The way the rate-limiting DHT works is that a new process (node/pod) contacts an existing one and joins the network. There will be a bit of churn during deploy windows, but that is tolerable as long as new pods are contacting a pod that will stick around after the deploy. That obviously will not be the case for the first pod in a deployment, but that should be fine as long as the DNS can be switched easily during the deployment.

It will always be pointing to all active pods and it's up to the client library to pick whichever one it wants. During deployments, the DNS record will be updated as the deployment progresses in a rolling fashion removing old pods and adding new ones. Given the default 25% rate for a rolling deployment, at least 75% of pods will be under that record. There will be however no pod that "sticks" around after the deploy, but given the above I don't think it's necessary, right?
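
The 75% figure and the "client picks whichever pod it wants" behaviour can be sketched as follows. Kubernetes rounds a percentage maxUnavailable down to a whole number of pods; everything else here is an assumption for illustration: the restrouter record name is hypothetical (modelled on the cxserver one above), the resolver is a stand-in that returns a fixed list, and pick_seed is not the API of the actual rate-limiting library.

```python
import math
import random
from typing import Callable, List, Optional


def min_pods_during_rollout(replicas: int, max_unavailable_pct: int = 25) -> int:
    """Pods guaranteed to stay available during a rolling update."""
    # Kubernetes rounds a percentage maxUnavailable down to whole pods.
    max_unavailable = math.floor(replicas * max_unavailable_pct / 100)
    return replicas - max_unavailable


def pick_seed(resolve: Callable[[str], List[str]], name: str, own_ip: str,
              rng: Optional[random.Random] = None) -> Optional[str]:
    """Pick any peer behind the DNS record other than ourselves to join the DHT."""
    peers = [ip for ip in resolve(name) if ip != own_ip]
    if not peers:
        return None  # first pod of a fresh deployment: nobody to join yet
    return (rng or random).choice(peers)


# Stand-in resolver returning sample addresses instead of doing a real DNS query.
def fake_resolve(name: str) -> List[str]:
    return ["10.64.65.213", "10.64.65.149", "10.64.65.24", "10.64.65.151"]


# Hypothetical record name, not confirmed by the task.
RECORD = "restrouter-production-kademlia.restrouter.svc.cluster.local"
seed = pick_seed(fake_resolve, RECORD, own_ip="10.64.65.24",
                 rng=random.Random(0))
print(min_pods_during_rollout(8), seed)
```

With 8 replicas at most 2 pods are unavailable at any point, so a joining pod always finds live peers behind the record even though no individual pod survives the whole deploy.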

Having rate-limiting is really a crucial feature without which we cannot start using RESTRouter in production.

As I've already said, we should have graphs in grafana for such a crucial feature. It's great we already have logs (they will help us immensely in the migration), but stats are essential as well.

Change 540131 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] restrouter: Add ratelimiting support to chart

https://gerrit.wikimedia.org/r/540131

Change 540131 merged by jenkins-bot:
[operations/deployment-charts@master] restrouter: Add ratelimiting support to chart

https://gerrit.wikimedia.org/r/540131

Change 540365 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] restrouter: Revert the initialDelay seconds

https://gerrit.wikimedia.org/r/540365

Change 540365 merged by Mobrovac:
[operations/deployment-charts@master] restrouter: Revert the initialDelay seconds

https://gerrit.wikimedia.org/r/540365

Change 540841 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[operations/deployment-charts@master] RESTRouter: Bump image tag to v1.1.2 and release v0.0.7

https://gerrit.wikimedia.org/r/540841

Change 540841 merged by jenkins-bot:
[operations/deployment-charts@master] RESTRouter: Bump image tag to v1.1.2 and release v0.0.7

https://gerrit.wikimedia.org/r/540841

The start-up time is now pretty good: around 3-5s per worker.

However, it seems that rate limiting is not working. I issued requests for restrouter.svc.eqiad.wmnet:7231/wikimedia.org/v1/metrics/pageviews/aggregate/en.wikipedia/all-access/all-agents/hourly/1970010100/1970010100 - a route that is limited to 100 req/s - but after issuing thousands of requests with varying concurrency no rate-limiting logs were produced.
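
To make concrete what a 100 req/s limit would be expected to catch in such a test, here is a minimal local simulation. It uses a plain fixed-window counter, which is a swapped-in technique chosen for simplicity and not the DHT-based limiter RESTRouter uses; the function name and the traffic shapes are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def count_limited(timestamps: List[float], limit_per_second: int = 100) -> int:
    """Count requests exceeding the limit in fixed one-second windows.

    Timestamps are seconds since the start of the test (non-negative).
    """
    seen: Dict[int, int] = defaultdict(int)
    limited = 0
    for ts in timestamps:
        window = int(ts)  # one-second bucket this request falls into
        seen[window] += 1
        if seen[window] > limit_per_second:
            limited += 1
    return limited


# 3000 requests spread evenly over 10 seconds, i.e. 300 req/s.
burst = [i / 300 for i in range(3000)]
# 500 requests spread evenly over 10 seconds, i.e. 50 req/s.
calm = [i / 50 for i in range(500)]

over_burst = count_limited(burst)
over_calm = count_limited(calm)
print(over_burst, over_calm)
```

A sustained rate well above the limit should therefore produce a steady stream of limited requests, so seeing none after thousands of requests points at the limiter not being wired up rather than at the test being too gentle.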

Change 541278 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] restrouter: Kadelmia should listen on all IPs

https://gerrit.wikimedia.org/r/541278

Change 541278 merged by jenkins-bot:
[operations/deployment-charts@master] restrouter: Kademlia should listen on all IPs

https://gerrit.wikimedia.org/r/541278

Change 541771 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] restrouter: Allow the kademlia port in ingress

https://gerrit.wikimedia.org/r/541771

Change 541771 merged by jenkins-bot:
[operations/deployment-charts@master] restrouter: Allow the kademlia port in ingress

https://gerrit.wikimedia.org/r/541771

In the interest of splitting off from this task what is probably going to be somewhat of a discussion, I've created subtask T235437 for the rate limiting functionality of RESTBase/RESTRouter.

akosiaris changed the task status from Open to Stalled. Dec 16 2019, 4:19 PM