
Deploy the RESTBase front-end service (RESTRouter) to Kubernetes
Open, Normal · Public · 0 Story Points

Description

We are splitting RESTBase into two components: the (public) REST API router and the storage service (cf. T220449: Split RESTBase in two services: storage service and API router/proxy). This task is about deploying the front-end REST router in Kubernetes.

Service Info

Service name: RESTRouter (name still under discussion, cf. T220761)
Owners: @Pchelolo and @mobrovac (Core Platform Team)
Repository: mediawiki/services/restbase
ETA: by the end of Q4 FY18/19
Description: RESTRouter is the routing part of (the current) RESTBase. It accepts external requests, validates them (performing access checks if needed) and performs all the business logic related to the request: it checks storage for possible data hits and, if needed, issues requests to back-end services to complete the request, sending the response to storage before returning it to the client.

Deployment Plans

RESTRouter migration plans. Some steps are the same for all plans; they are listed below:

  • First, we deploy RESTRouter to k8s.
  • Then we expose the storage routes in RESTBase (cf. PR #1103).
  • Finally, we test RESTRouter under load (options include synthetic traffic, mirroring, or using only background updates/internal requests).

Plan 1

  • Have restbase listen on both 7231 and 7233 and configure LVS restbase.svc.$::site.wmnet to also use 7233
  • Instantiate restrouter on a new LVS IP and DNS (restrouter.svc.$::site.wmnet) and have it talk to restbase.svc.$::site.wmnet:7233
  • Move services 1 by 1 to restrouter.discovery.wmnet (the site aware discovery records for restrouter.svc.$::site.wmnet)

Pros

  • The move is gradual on a service level. Services are migrated one by one based on their configuration, unearthing potential problems one at a time
  • The currently stable and battle tested restbase installation is kept around even while more and more services are moved around
  • It's rather easy configuration wise, rather easy to do in steps
  • No downtime for services.

Cons

  • The migration might take time as issues arise, but at least blockers will be service-specific
  • There is no gradual traffic switchover. For every service it is a "canary host first", then all-or-nothing approach. Even the canary host alone receives, depending on the DC, between 13% and 25% of traffic

Plan 2

  • Have restbase listen on both 7231 and 7233
  • Add a new LVS IP on the restbase hosts and name it restbase-backend.svc.$::site.wmnet
  • Configure restrouter to connect to restbase-backend.svc.$::site.wmnet:7233
  • Add the LVS IP for restbase.svc.$::site.wmnet to kubernetes hosts
  • Add the kubernetes hosts to LVS for restbase.svc.$::site.wmnet
  • Slowly migrate the traffic from the current restbase hosts to kubernetes hosts
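
One way to reason about the last step is in terms of load-balancer weights. The sketch below assumes that traffic is split in proportion to each host's weight; the plan itself does not say how the shift would be done. The host names, the weights and the `traffic_share` helper are invented for illustration and are not taken from the production LVS configuration.

```python
def traffic_share(weights, group):
    """Fraction of traffic expected to reach hosts in `group`,
    assuming traffic is split in proportion to weight."""
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one host needs a non-zero weight")
    return sum(w for host, w in weights.items() if host in group) / total


# Hypothetical pool: three existing restbase hosts and two kubernetes hosts.
weights = {
    "restbase-a": 10,
    "restbase-b": 10,
    "restbase-c": 10,
    "kube-a": 5,
    "kube-b": 5,
}
kube_hosts = {"kube-a", "kube-b"}

# Early in the migration the kubernetes hosts carry 10 of 40 weight units.
early_share = traffic_share(weights, kube_hosts)

# End of the migration: the old hosts are drained by setting their weight to 0.
drained = {h: (w if h in kube_hosts else 0) for h, w in weights.items()}
final_share = traffic_share(drained, kube_hosts)
```

Under this assumption, raising or lowering the weights moves the share in either direction, so the same mechanism would serve for pausing or rolling back.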

Pros

  • The services see 0 changes. Everything happens transparently to them.
  • The move of traffic is gradual, which allows rolling back quickly and easily, as well as pausing the migration
  • No downtime for services

Cons

  • Rather convoluted configuration wise, with some margin for mistakes
  • All or nothing approach as far as services go. No way to distinguish between them
  • The migration might take a long time, because any issues that arise will probably be global blockers for all services
  • Rollbacks are possible, but if issues arise, it's probably going to be a full rollback to the old installation
  • The resulting restbase.svc.$::site.wmnet DNS name does not reflect the actual software powering the front end (restrouter), possibly leading to future misunderstandings/confusion

Post migration

In the post-deploy clean-up step, we remove public route handling from RESTBase, effectively turning it into the back-end storage service.

Comment from Giuseppe:

I think plan 1 is much simpler. It requires more patches and more attention to not leave anything behind, but it's probably the better plan. Please be mindful that restrouter will need to terminate SSL as well, like restbase does.
I vote plan 1.

Marko:

  • My vote goes for plan 1 as well. Even though it will probably take longer, it makes it clear to all parties involved that changes are happening; that also means service owners will be more aware in case of problems, so problems will be easier to detect
  • I agree that the end result is better with restrouter.svc than restbase.svc

RESTRouter will effectively take over request handling from RESTBase, so we will need to divert traffic to it without interruption.

Benchmarking:

Event Timeline

mobrovac triaged this task as Normal priority. · May 21 2019, 6:58 AM
mobrovac created this task.
Restricted Application added a subscriber: Aklapper. · May 21 2019, 6:58 AM
Restricted Application edited projects, added Operations, Services; removed Services (next). · May 21 2019, 7:25 AM
mobrovac renamed this task from Deploy the RESTBase front-end service to Kubernetes to Deploy the RESTBase front-end service (RESTRouter) to Kubernetes. · May 21 2019, 7:26 AM

PR #1141 adds the needed Blubber config.

Change 512923 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[operations/deployment-charts@master] RESTRouter: Add initial Helm chart

https://gerrit.wikimedia.org/r/512923

mobrovac updated the task description. · May 29 2019, 9:57 AM
fsero moved this task from Backlog to Goal tasks on the serviceops board. · Jun 20 2019, 2:12 PM
akosiaris moved this task from Goal tasks to Backlog on the serviceops board. · Jun 21 2019, 8:26 AM

Regarding the deployment plan, the main pain point is that we will need to have both front-end and back-end processes behind the same LVS and then slowly withdraw the back-end ones behind a separate LVS end point. As doing so while running all of them on the same port poses some challenges, @Joe and I came up with an interesting idea. For the back-end service, we could have it bind to two different ports: the one eventually to be used by the front-end service and the one to be used by the back-end service. That way, we can shift processes from one LVS to the other seamlessly. On the implementation side, this can be achieved by leveraging service-runner's ability to run multiple services inside one process: essentially, we can declare the back-end service twice in config.yaml, but assign it different ports:

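services:
  # First declaration: the existing listener on port 7231
  - name: restbase
    module: hyperswitch
    conf: &rb_conf
      port: 7231
      spec: *spec_root
      salt: secret
      # ... remaining options omitted
  # Second declaration: same module and config, with only the port overridden
  - name: restbase
    module: hyperswitch
    conf:
      <<: *rb_conf
      port: 7233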

This will make service-runner bind to both 7231 and 7233 but execute the same code paths for requests regardless of the port on which they are received.
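
As a rough illustration of what the `<<: *rb_conf` merge key does in the snippet above, the following Python sketch models the two `conf` mappings as dictionaries. It is not service-runner code: the `merge_conf` helper and the placeholder value for `spec` are hypothetical, and it assumes the usual YAML merge-key behaviour in which keys set explicitly in a mapping override the merged ones.

```python
def merge_conf(base, overrides):
    """Return a copy of base with overrides applied, like a YAML '<<' merge."""
    merged = dict(base)       # copy so the first service's conf is untouched
    merged.update(overrides)  # explicitly set keys win over merged ones
    return merged


# Stand-in for the first service's conf block (values are placeholders).
rb_conf = {"port": 7231, "spec": "<spec_root>", "salt": "secret"}

# The second declaration reuses everything and overrides only the port.
rb_conf_7233 = merge_conf(rb_conf, {"port": 7233})

# Keys whose values differ between the two declarations.
changed = {key for key in rb_conf if rb_conf[key] != rb_conf_7233[key]}
```

In this model the two declarations differ only in `port`, which matches the intent that both listeners run the same configuration.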

Regarding the deployment plan, the main pain point is that we will need to have both front-end and back-end processes behind the same LVS and then slowly withdraw the back-end ones behind a separate LVS end point. As doing so while running all of them on the same port poses some challenges,

What challenges are we talking about here?

Note: Running multiple services will present certain challenges as well, mainly that we've never run such a configuration in production, so it seems a bit risky. However, if I understand correctly, the idea is to run RESTRouter on 7231 too, so we do need to switch RESTBase to 7233 while still supporting existing traffic on 7231 on RESTBase. I guess there's no other way than to expose the same service on both ports...

Change 521572 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/restbase/deploy@master] Expose both ports 7231 and 7233.

https://gerrit.wikimedia.org/r/521572

Change 521584 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/puppet@production] LVS for RESTRouter.

https://gerrit.wikimedia.org/r/521584

Change 512923 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] RESTRouter: Add initial Helm chart

https://gerrit.wikimedia.org/r/512923

Change 522151 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] Publish restrouter 0.0.1

https://gerrit.wikimedia.org/r/522151

Change 522151 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] Publish restrouter 0.0.1

https://gerrit.wikimedia.org/r/522151

Change 526448 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/dns@master] Assign restrouter LVS IPs

https://gerrit.wikimedia.org/r/526448

Change 526449 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/dns@master] Activate restrouter discovery records

https://gerrit.wikimedia.org/r/526449

Change 526632 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] restrouter: Add kubernetes stanzas

https://gerrit.wikimedia.org/r/526632

Change 526719 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] restrouter: Add helmfile stanzas

https://gerrit.wikimedia.org/r/526719

Change 527130 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] restrouter: Switch to event_service_uri

https://gerrit.wikimedia.org/r/527130

Change 526719 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] restrouter: Add helmfile stanzas

https://gerrit.wikimedia.org/r/526719

Change 527130 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] restrouter: Switch to event_service_uri

https://gerrit.wikimedia.org/r/527130

Change 526632 merged by Alexandros Kosiaris:
[operations/puppet@production] restrouter: Add kubernetes stanzas

https://gerrit.wikimedia.org/r/526632

restrouter was temporarily deployed in the staging cluster today. The deployment was rolled back because it was failing while trying to reach restbase on port 7233, where restbase does not listen yet. As soon as we figure out the exact details of the migration plan, this should be ready to go. The open items are:

  • Restbase listening on port 7233 as well
  • Deciding on the best plan for switching over the traffic (percentage-based or per-service)
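
The first prerequisite above can be verified before retrying a deployment with a simple TCP probe. This is a minimal sketch using only the Python standard library; the `port_is_open` and `check_ports` helpers are hypothetical names, and a successful connection only shows that something accepts connections on the port, not that it is RESTBase.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name resolution failures.
        return False


def check_ports(host, ports):
    """Map each port to whether something is accepting connections on it."""
    return {port: port_is_open(host, port) for port in ports}
```

For example, `check_ports(host, [7231, 7233])`, with `host` set to the restbase service address for the site in question, would report whether both listeners are reachable from where the probe runs.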

Change 521572 merged by Mobrovac:
[mediawiki/services/restbase/deploy@master] Expose both ports 7231 and 7233.

https://gerrit.wikimedia.org/r/521572

Mentioned in SAL (#wikimedia-operations) [2019-08-26T13:06:05Z] <mobrovac@deploy1001> Started deploy [restbase/deploy@38c313d]: Bring the dev cluster up to date and expose RB on both 7231 and 7233 in it - T223953

Mentioned in SAL (#wikimedia-operations) [2019-08-26T13:06:10Z] <mobrovac@deploy1001> deploy aborted: Bring the dev cluster up to date and expose RB on both 7231 and 7233 in it - T223953 (duration: 00m 04s)

Mentioned in SAL (#wikimedia-operations) [2019-08-26T13:06:26Z] <mobrovac@deploy1001> Started deploy [restbase/deploy@38c313d] (dev-cluster): Bring the dev cluster up to date and expose RB on both 7231 and 7233 in it - T223953

Mentioned in SAL (#wikimedia-operations) [2019-08-26T13:09:48Z] <mobrovac@deploy1001> Finished deploy [restbase/deploy@38c313d] (dev-cluster): Bring the dev cluster up to date and expose RB on both 7231 and 7233 in it - T223953 (duration: 03m 22s)

Mentioned in SAL (#wikimedia-operations) [2019-08-26T13:15:58Z] <mobrovac@deploy1001> Started deploy [restbase/deploy@38c313d]: Expose RB on both 7231 and 7233 - T223953

Mentioned in SAL (#wikimedia-operations) [2019-08-26T13:38:57Z] <mobrovac@deploy1001> Finished deploy [restbase/deploy@38c313d]: Expose RB on both 7231 and 7233 - T223953 (duration: 23m 00s)

Change 532382 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[operations/puppet@production] RESTBase: Temporarily allow access to port 7233 as well

https://gerrit.wikimedia.org/r/532382

Change 534430 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] WIP: LVS: Setup port 7233 for restbase-backend

https://gerrit.wikimedia.org/r/534430

Change 532382 merged by Alexandros Kosiaris:
[operations/puppet@production] RESTBase: Temporarily allow access to port 7233 as well

https://gerrit.wikimedia.org/r/532382

akosiaris updated the task description. · Tue, Sep 17, 9:36 AM