We are splitting RESTBase into two components: the (public) REST API router and the storage service (cf. T220449: Split RESTBase in two services: storage service and API router/proxy). This task is about deploying the front-end REST router in Kubernetes.
Service name: RESTRouter (name still under discussion, cf. T220761)
Owners: @Pchelolo and @mobrovac (Core Platform Team)
ETA: by the end of Q4 FY18/19
Description: RESTRouter is the routing part of (the current) RESTBase. It accepts external requests, validates them (performing access checks if needed) and performs all the business logic related to the request: it checks storage for possible data hits and, if needed, issues requests to back-end services to complete the request, sending the response to storage prior to returning it to the client.
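The request flow described above can be sketched as a read-through lookup. This is only an illustration, not RESTRouter's actual code: `handle_request`, `AccessDenied`, `is_allowed` and the in-memory dict standing in for the storage service are all hypothetical names chosen for the example.

```python
from typing import Callable, Dict


class AccessDenied(Exception):
    """Raised when a request fails the (hypothetical) access check."""


def handle_request(
    path: str,
    storage: Dict[str, str],
    fetch_from_backend: Callable[[str], str],
    is_allowed: Callable[[str], bool] = lambda path: True,
) -> str:
    """Serve a request from storage if possible, else from a back-end service."""
    # 1. Validate the request / perform access checks if needed.
    if not is_allowed(path):
        raise AccessDenied(path)

    # 2. Look for a hit in storage (a dict stands in for the storage service).
    stored = storage.get(path)
    if stored is not None:
        return stored

    # 3. On a miss, ask the back-end service to complete the request.
    response = fetch_from_backend(path)

    # 4. Send the response to storage before returning it to the client.
    storage[path] = response
    return response
```

With this shape, a second request for the same path is answered from storage without contacting the back-end service again.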
RESTRouter migration plans: some steps are the same for all plans. Those are listed below.
- First, we deploy RESTRouter to k8s.
- We expose the storage routes in RESTBase (cf. PR #1103).
- We test RESTRouter under load (options include synthetic traffic, mirroring, or using only background updates/internal requests).
Plan 1:
- Have restbase listen on both 7231 and 7233 and configure LVS restbase.svc.$::site.wmnet to also use 7233
- Instantiate restrouter on a new LVS IP and DNS (restrouter.svc.$::site.wmnet) and have it talk to restbase.svc.$::site.wmnet:7233
- Move services one by one to restrouter.discovery.wmnet (the site-aware discovery records for restrouter.svc.$::site.wmnet)
Pros:
- The move is gradual at the service level. Services are migrated one by one through their configuration, which surfaces potential problems one at a time
- The currently stable and battle-tested restbase installation is kept around while more and more services are moved
- It's rather easy configuration-wise and rather easy to do in steps
- No downtime for services
Cons:
- The migration might take time as issues arise, but at least blockers will be service-specific
- There is no gradual traffic switchover. For every service it's a "canary host first", then all-or-nothing approach. Even the canary host receives, depending on the DC, between 13% and 25% of traffic
Plan 2:
- Have restbase listen on both 7231 and 7233
- Add a new LVS IP on the restbase hosts and name it restbase-backend.svc.$::site.wmnet
- Configure restrouter to connect to restbase-backend.svc.$::site.wmnet:7233
- Add the LVS IP for restbase.svc.$::site.wmnet to the kubernetes hosts
- Add the kubernetes hosts to LVS for restbase.svc.$::site.wmnet
- Slowly migrate the traffic from the current restbase hosts to the kubernetes hosts
Pros:
- The services see zero changes. Everything happens transparently to them
- The traffic move is gradual, which allows rolling back quickly and easily, as well as pausing the migration
- No downtime for services
Cons:
- Rather convoluted configuration-wise, with some margin for mistakes
- All-or-nothing approach as far as services go. No way to distinguish between them
- The migration might take a long time, as issues that arise will probably be global blockers for all services
- Rollbacks are possible, but if issues arise, it's probably going to be a full rollback to the old installation
- The resulting restbase.svc.$::site.wmnet DNS name does not reflect the actual software powering the frontend (restrouter), possibly leading to future misunderstandings/confusion
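To make the gradual traffic shift between the current restbase hosts and the kubernetes hosts concrete, here is a rough sketch of how the share of traffic reaching kubernetes could be estimated from per-host weights. It assumes the load balancer distributes connections in proportion to each host's weight, as a weighted scheduler does; the host names, the weights and the `traffic_share` helper are hypothetical and not taken from the actual configuration.

```python
from typing import Dict


def traffic_share(weights: Dict[str, int], prefix: str) -> float:
    """Return the fraction of traffic going to hosts whose name starts with prefix.

    Assumes traffic is split proportionally to the per-host weights.
    """
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one host must have a non-zero weight")
    selected = sum(w for host, w in weights.items() if host.startswith(prefix))
    return selected / total


# Hypothetical pool: two existing restbase hosts, two kubernetes hosts at low weight.
pool = {
    "restbase1001": 10,
    "restbase1002": 10,
    "kubernetes1001": 1,
    "kubernetes1002": 1,
}
print(f"{traffic_share(pool, 'kubernetes'):.1%}")  # prints 9.1%
```

Raising the kubernetes weights (or lowering the restbase ones) step by step increases that share, and restoring the previous weights is the rollback.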
In the post-deploy clean-up step, we remove public route handling from RESTBase, effectively turning it into the back-end storage service.
Comment from Giuseppe:
I think plan 1 is much simpler. It requires more patches and more attention to not leave anything behind, but it's probably the better plan. Please be mindful that restrouter will need to be terminating SSL as well, like restbase does.
I vote plan 1.
- My vote goes for plan 1 as well. Even though it will probably take longer, it makes it clear to all parties involved that changes are happening; service owners will therefore be more aware in case of problems, so these will be easier to detect
- I agree that the end result is better with restrouter.svc than restbase.svc
RESTRouter will effectively take over request handling from RESTBase, so we will need to divert traffic to it without interruption.