
Migrate mobileapps to k8s and node 10
Closed, Resolved, Public

Description

To migrate mobileapps to Node.js 10 we will need to do the following:

  • Update the blubberfile to use node 10 base images
  • Benchmark the service
  • Write a Helm chart
  • Create kubernetes tokens (@akosiaris will do that)
  • Create kubernetes namespace (@akosiaris will do that)
  • Deploy
  • Switch traffic to the new deployment
  • Switch LVS conftool to the kubernetes cluster
  • Remove mobileapps from scb conftool config
  • Remove mobileapps from scb
  • Remove mobileapps from puppet

Details

Repo                           Branch      Lines +/-
operations/puppet              production  +0 -45
operations/puppet              production  +3 -3
operations/puppet              production  +42 -42
operations/puppet              production  +2 -2
operations/deployment-charts   master      +2 -2
operations/deployment-charts   master      +304 -304
operations/deployment-charts   master      +18 -0
operations/deployment-charts   master      +2 -2
operations/deployment-charts   master      +323 -303
operations/deployment-charts   master      +322 -302
operations/deployment-charts   master      +329 -326
operations/puppet              production  +35 -11
operations/deployment-charts   master      +36 -0
operations/deployment-charts   master      +121 -0
operations/deployment-charts   master      +85 -3
operations/puppet              production  +42 -0
labs/private                   master      +24 -0
operations/deployment-charts   master      +549 -0
mediawiki/services/mobileapps  master      +1 -0
mediawiki/services/mobileapps  master      +2 -2
mediawiki/services/mobileapps  master      +28 -0

Related Objects

Status     Assigned
Stalled    None
Resolved   None
Resolved   akosiaris
Resolved   Jdforrester-WMF
Resolved   Jdforrester-WMF
Resolved   Reedy
Resolved   Reedy
Resolved   Bawolff
Resolved   Anomie
Resolved   Bawolff
Resolved   Bawolff
Resolved   Legoktm
Resolved   Lucas_Werkmeister_WMDE
Resolved   Bawolff
Resolved   sbassett
Resolved   sbassett
Resolved   Jdforrester-WMF
Resolved   sbassett
Resolved   sbassett
Resolved   Reedy
Resolved   Reedy
Resolved   Jdforrester-WMF
Resolved   Reedy
Resolved   Reedy
Resolved   Reedy
Resolved   Jdforrester-WMF
Resolved   Jdforrester-WMF
Resolved   Reedy
Resolved   Reedy
Resolved   Reedy
Resolved   Jdforrester-WMF
Resolved   hashar
Resolved   hashar
Resolved   Jdforrester-WMF
Resolved   hashar
Declined   MoritzMuehlenhoff
Invalid    thcipriani
Resolved   mmodell
Resolved   hashar
Resolved   Joe
Resolved   JMeybohm
Resolved   JMeybohm
Duplicate  Dzahn
Declined   Dzahn
Resolved   Jdforrester-WMF
Open       None
Open       None
Resolved   Jdforrester-WMF
Resolved   akosiaris
Declined   None
Resolved   Mholloway

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 570162 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] Add chart for mobileapps

https://gerrit.wikimedia.org/r/570162

Change 580294 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[labs/private@master] Add k8s dummy tokens for 3 new services.

https://gerrit.wikimedia.org/r/580294

Change 580295 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Kubernetes: Create token stanzas for some new services

https://gerrit.wikimedia.org/r/580295

Change 599812 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] Create namespaces/calico rules for new services

https://gerrit.wikimedia.org/r/599812

Change 580294 merged by Alexandros Kosiaris:
[labs/private@master] Add k8s dummy tokens for 3 new services.

https://gerrit.wikimedia.org/r/580294

Change 580295 merged by Alexandros Kosiaris:
[operations/puppet@production] Kubernetes: Create token stanzas for some new services

https://gerrit.wikimedia.org/r/580295

Change 599812 merged by jenkins-bot:
[operations/deployment-charts@master] Create namespaces/calico rules for new services

https://gerrit.wikimedia.org/r/599812

@Mholloway, @bearND namespaces, rules, and tokens have been created. The chart has been merged and published. You are free to deploy. You will need a change like 968132909b4d24192b2f69a657c14bb30acd7a42 in order to instantiate the first deploy; feel free to add me as a reviewer.

Awesome! Thanks, @akosiaris! I'll get that change going shortly.

Maybe I'm getting ahead of myself, but looking ahead to switching traffic over to the k8s mobileapps deployment, I think it would be best to do it in stages. Could we create a new service discovery domain, something like mobileapps-docker.discovery.wmnet, to point certain requests to the new k8s deployment during the switchover process?

Change 602155 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[operations/deployment-charts@master] Mobileapps: Add initial helmfile stanzas

https://gerrit.wikimedia.org/r/602155

Before answering the question let me paint a picture of the plan that SRE wants to follow for switching the traffic to the k8s mobileapps deployment, in case that helps. It looks more or less like this:

  • Get mobileapps deployed on k8s
  • Ensure all automated healthchecks pass in exactly the same manner as the current deployment (this is more or less automated: it uses the exact same healthchecks that pass on the old scb cluster, which are based on the openapi/swagger spec the software has).
  • Make sure whatever other manual checks (if any) succeed as well.
  • Add the k8s deployment to the traffic pool, albeit with inactive status.
  • Set very low weights for the k8s endpoints and enable them in the traffic pool
  • Pick the inactive datacenter, codfw in this case.
  • Slowly, over multiple days, increase the weight of the k8s endpoints so that traffic moves from scb to k8s. The exact increments are open to discussion; up to now (we've done this multiple times) the usual steps are: ~1%, ~2%, ~5%, ~10%, ~25%, ~50%, ~75%, 100%. The ~ is because the weights are integers and good precision isn't possible, so you may see something like 78% instead of 75%, but the ballpark goal remains the same. Also, depending on how well this goes, we may add steps or skip steps, but overall we tend to follow the approach above.
  • Once scb serves 0%, remove it from the pool.
  • Repeat the above 2 bullet points for the active datacenter, eqiad in this case.
  • Wait out multiple days in case an emergency rollback is warranted.
  • Undeploy from scb.
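
The weight-based shift in the plan above can be sketched as follows. This is only an illustration, not the tooling SRE used: the endpoint counts, the scb weight, the `max_weight` bound, and the helper names `k8s_share` and `pick_k8s_weight` are all assumptions made for the example. It assumes traffic is split in proportion to the sum of endpoint weights, and it shows why integer weights only approximate the target percentages.

```python
from fractions import Fraction

# Intermediate target steps quoted in the plan; 100% is reached by
# depooling scb rather than by choosing a weight.
TARGET_STEPS = [1, 2, 5, 10, 25, 50, 75]


def k8s_share(k8s_nodes: int, k8s_weight: int, scb_nodes: int, scb_weight: int) -> Fraction:
    """Share of traffic reaching k8s, assuming traffic is proportional
    to the sum of endpoint weights on each side."""
    k8s_total = k8s_nodes * k8s_weight
    return Fraction(k8s_total, k8s_total + scb_nodes * scb_weight)


def pick_k8s_weight(target_pct: int, k8s_nodes: int, scb_nodes: int,
                    scb_weight: int, max_weight: int = 1000) -> int:
    """Integer k8s weight in 1..max_weight whose share is closest to target_pct."""
    target = Fraction(target_pct, 100)
    return min(
        range(1, max_weight + 1),
        key=lambda w: abs(k8s_share(k8s_nodes, w, scb_nodes, scb_weight) - target),
    )


if __name__ == "__main__":
    # Hypothetical pool: 4 k8s endpoints, 6 scb endpoints at weight 100.
    for pct in TARGET_STEPS:
        w = pick_k8s_weight(pct, k8s_nodes=4, scb_nodes=6, scb_weight=100)
        actual = float(k8s_share(4, w, 6, 100)) * 100
        print(f"target {pct:>2}% -> k8s weight {w:>3} -> actual {actual:.2f}%")
```

With small integer weights the low-percentage steps are the hardest to hit, which is one reason the targets are written with a ~.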

Now, to answer your question:

As you can tell, this is a quantitative approach across the entirety of the HTTP endpoints. This means that if you want to move over specific HTTP endpoints only, it isn't possible. If that's your desire, we can probably create some temporary mobileapps-migration.svc.${dc}.wmnet DNS record, with the caveat that any callers' configuration (e.g. restbase) would have to be updated at least twice (once for the migration to the temporary DNS record and once more back to the normal one).

I would be wary of creating a service discovery record for the migration itself as discovery records are about pooling/depooling datacenters from the service pool, which adds complexity to the process I don't see a reason for yet (but please correct me, I might have missed something).

Thanks for that explanation, @akosiaris, that process sounds great. Of course with mobileapps the outlier in terms of traffic is the /page/summary endpoint, but given the cautious approach you've described in terms of increasing percentages, I don't think we need (or want) to try to do anything like migrate /page/summary over separately. Does the plan sound good to you, @bearND?

Yes, this plan sounds good to me. I think we should probably try to deploy both SCB and k8s instances at roughly the same times during this transition period.

If you want to start with TLS (via envoy) right away (which would be great!) you need to go through the extra steps of generating certificates (current document draft at https://wikitech.wikimedia.org/wiki/User:Giuseppe_Lavagetto/Add_Tls_On_Kubernetes) and "register" a TCP port at https://wikitech.wikimedia.org/wiki/Service_ports

Thanks, @JMeybohm. Using TLS right away sounds great! I've reserved port 4102 for mobileapps TLS (as well as 4103 for chromium-render TLS) but I don't believe either @bearND or I have the access to the puppet private repo to generate certificates.

Oh, my bad. Then we'll create them for you ofc.
Unfortunately starting with TLS right away would not permit the gradual traffic shift Alex was suggesting so it's probably better to start without and migrate to TLS in a second step. :-/

+1. Yes, let's decouple those 2 steps.

Change 602155 merged by jenkins-bot:
[operations/deployment-charts@master] Mobileapps: Add initial helmfile stanzas

https://gerrit.wikimedia.org/r/602155

Change 612273 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] mobileapps: Add a temporary non-TLS release

https://gerrit.wikimedia.org/r/612273

Change 612567 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] mobileapps: Add kubernetes nodes next to scb nodes

https://gerrit.wikimedia.org/r/612567

Change 612273 merged by jenkins-bot:
[operations/deployment-charts@master] mobileapps: Add a temporary non-TLS release

https://gerrit.wikimedia.org/r/612273

Change 612567 merged by Alexandros Kosiaris:
[operations/puppet@production] mobileapps: Add kubernetes nodes next to scb nodes

https://gerrit.wikimedia.org/r/612567

Mentioned in SAL (#wikimedia-operations) [2020-07-15T14:12:00Z] <akosiaris> increase codfw mobileapps kubernetes traffic to 2% T218733

Change 612959 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] mobileapps: Amend statsd exporter buckets

https://gerrit.wikimedia.org/r/612959

Change 612959 merged by jenkins-bot:
[operations/deployment-charts@master] mobileapps: Amend statsd exporter buckets

https://gerrit.wikimedia.org/r/612959

Change 612966 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] mobileapps: fix prometheus statsd exporter issue in 0.0.10

https://gerrit.wikimedia.org/r/612966

Change 612966 merged by jenkins-bot:
[operations/deployment-charts@master] mobileapps: fix prometheus statsd exporter issue in 0.0.10

https://gerrit.wikimedia.org/r/612966

Mentioned in SAL (#wikimedia-operations) [2020-07-16T12:35:58Z] <akosiaris> increase codfw mobileapps kubernetes traffic to 5% T218733

Mentioned in SAL (#wikimedia-operations) [2020-07-16T13:02:13Z] <akosiaris> increase codfw mobileapps kubernetes traffic to 10% T218733

Mentioned in SAL (#wikimedia-operations) [2020-07-16T13:36:06Z] <akosiaris> increase codfw mobileapps kubernetes traffic to 25% T218733

Mentioned in SAL (#wikimedia-operations) [2020-07-16T15:15:30Z] <akosiaris> lower codfw mobileapps kubernetes traffic to 10% T218733. Will open up task for it

Mentioned in SAL (#wikimedia-operations) [2020-07-20T16:27:42Z] <akosiaris> increase codfw mobileapps kubernetes traffic to 25% T218733. Take #2

Mentioned in SAL (#wikimedia-operations) [2020-07-21T08:37:36Z] <akosiaris> increase codfw mobileapps kubernetes traffic to 47% T218733

Mentioned in SAL (#wikimedia-operations) [2020-07-21T09:58:01Z] <akosiaris> increase codfw mobileapps kubernetes traffic to 72.727272% T218733

Mentioned in SAL (#wikimedia-operations) [2020-07-21T09:59:48Z] <akosiaris> move all codfw mobileapps nodes (kubernetes and scb) to weight 10. Traffic level remains at 72.727272% flowing to kubernetes, the rest to scb T218733
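
As a worked check on the figure in the entry above: 72.727272% is 8/11 written as a percentage. When every endpoint has the same weight, the weight cancels out and the split depends only on the endpoint counts, so an 8:3 ratio of kubernetes to scb endpoints would give exactly this share. The counts below are an assumption for illustration; the real pool sizes are not stated in this task, and `share_at_equal_weight` is a hypothetical helper.

```python
from fractions import Fraction


def share_at_equal_weight(k8s_nodes: int, scb_nodes: int, weight: int = 10) -> Fraction:
    """k8s traffic share when every endpoint carries the same weight.

    The weight cancels out, so only the endpoint counts matter.
    """
    k8s_total = k8s_nodes * weight
    return Fraction(k8s_total, k8s_total + scb_nodes * weight)


# Hypothetical counts: any 8:3 ratio of k8s to scb endpoints gives the same share.
share = share_at_equal_weight(k8s_nodes=8, scb_nodes=3)
print(share, f"{float(share) * 100:.6f}%")
```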

Mentioned in SAL (#wikimedia-operations) [2020-07-21T13:41:57Z] <akosiaris> increase codfw mobileapps kubernetes traffic to 96% T218733

Mentioned in SAL (#wikimedia-operations) [2020-07-22T08:16:01Z] <akosiaris> increase codfw mobileapps kubernetes traffic to 96% T218733. Take #2. Let's see if I can reproduce the weird increases in p99 latencies and figure out their cause

Change 615416 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] mobileapps: Bump memory limit by 25%

https://gerrit.wikimedia.org/r/615416

Change 615416 merged by jenkins-bot:
[operations/deployment-charts@master] mobileapps: Bump memory limit by 25%

https://gerrit.wikimedia.org/r/615416

Mentioned in SAL (#wikimedia-operations) [2020-07-22T09:25:48Z] <akosiaris> bump memory limits for mobileapps by 25% T218733

Mentioned in SAL (#wikimedia-operations) [2020-07-22T09:40:45Z] <akosiaris> increase codfw mobileapps kubernetes traffic to 100% T218733

Mentioned in SAL (#wikimedia-operations) [2020-07-22T09:46:31Z] <akosiaris> codfw mobileapps kubernetes traffic back to 96% T218733 again. scb pooled again.

Mentioned in SAL (#wikimedia-operations) [2020-07-22T09:55:37Z] <akosiaris> bump memory in codfw mobileapps another 20% T218733
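
The two memory bumps compound rather than add, assuming the second 20% is applied on top of the already raised limit. A small sketch of the arithmetic follows; the starting limit is hypothetical, since the actual values are not given in this task.

```python
from fractions import Fraction

# Successive increases mentioned in the log entries above: +25%, then +20%.
BUMPS = [Fraction(25, 100), Fraction(20, 100)]


def cumulative_factor(bumps):
    """Multiply out successive percentage increases."""
    factor = Fraction(1)
    for bump in bumps:
        factor *= 1 + bump
    return factor


factor = cumulative_factor(BUMPS)
base_limit_mib = 400  # hypothetical starting limit, not taken from the chart
new_limit_mib = int(base_limit_mib * factor)
print(f"overall factor: {float(factor)}x, new limit: {new_limit_mib} MiB")
```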

Change 615484 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] mobileapps: Bump 20x the number of replicas

https://gerrit.wikimedia.org/r/615484

Change 615484 merged by jenkins-bot:
[operations/deployment-charts@master] mobileapps: Bump 20x the number of replicas

https://gerrit.wikimedia.org/r/615484

Change 615494 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] mobileapps: Bump quotas

https://gerrit.wikimedia.org/r/615494

Change 615494 merged by jenkins-bot:
[operations/deployment-charts@master] mobileapps: Bump quotas

https://gerrit.wikimedia.org/r/615494

Change 615671 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] mobileapps: Bump memory limits another 20%

https://gerrit.wikimedia.org/r/615671

Change 615672 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] mobileapps: Lower replicas to 80 from 240

https://gerrit.wikimedia.org/r/615672

Change 615671 merged by jenkins-bot:
[operations/deployment-charts@master] mobileapps: Bump memory limits another 20%

https://gerrit.wikimedia.org/r/615671

Change 615672 merged by jenkins-bot:
[operations/deployment-charts@master] mobileapps: Lower replicas to 80 from 240

https://gerrit.wikimedia.org/r/615672

Mentioned in SAL (#wikimedia-operations) [2020-07-23T09:19:56Z] <akosiaris> lower replica count back to 80 for mobileapps. T218733

Mentioned in SAL (#wikimedia-operations) [2020-07-23T09:51:53Z] <akosiaris> prepare for pooling kubernetes mobileapps capacity in eqiad. T218733

Mentioned in SAL (#wikimedia-operations) [2020-07-23T10:11:43Z] <akosiaris> pool kubernetes in mobileapps/eqiad. T218733

Mentioned in SAL (#wikimedia-operations) [2020-07-23T11:18:01Z] <akosiaris> depool scb in mobileapps/eqiad. T218733

akosiaris lowered the priority of this task from High to Low. Jul 23 2020, 11:32 AM
akosiaris updated the task description.

Traffic has been switched fully for the last 18 hours in codfw and for the last 1.5 hours in eqiad. I've added a few followup items on the task for SRE, but otherwise this is looking good.

Change 618485 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] mobileapps: Switch conftool to kubernetes/kubesvc

https://gerrit.wikimedia.org/r/618485

Change 618486 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] mobileapps: Remove from scb conftool config

https://gerrit.wikimedia.org/r/618486

Change 618487 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] mobileapps: Remove mobileapps from scb

https://gerrit.wikimedia.org/r/618487

Change 618488 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] mobileapps: Remove the profile and the role

https://gerrit.wikimedia.org/r/618488

Change 618485 merged by Alexandros Kosiaris:
[operations/puppet@production] mobileapps: Switch conftool to kubernetes/kubesvc

https://gerrit.wikimedia.org/r/618485

Change 618486 merged by Alexandros Kosiaris:
[operations/puppet@production] mobileapps: Remove from scb conftool config

https://gerrit.wikimedia.org/r/618486

Change 618487 merged by Alexandros Kosiaris:
[operations/puppet@production] mobileapps: Remove mobileapps from scb

https://gerrit.wikimedia.org/r/618487

Change 618488 merged by Alexandros Kosiaris:
[operations/puppet@production] mobileapps: Remove the profile and the role

https://gerrit.wikimedia.org/r/618488

akosiaris claimed this task.
akosiaris updated the task description.

Resolving. The final puppet-related mobileapps pieces have been cleaned up from scb and the repo \o/