
Migrate mobileapps to k8s
Closed, Resolved, Public

Description

We want to move the traffic for mobileapps from the on-prem api cluster to the mw-api-int cluster on kubernetes.

Given the volume of requests we're talking about, around 3k rps, this is a *large* chunk of our traffic: about 40% of all the API calls still going to the on-premises cluster!

Calculating from our current usage on mw-api-ext (which might be wrong), we need about 1 replica per 20 rps to keep usage low enough (although I would argue we can live with higher usage for mw-api-int), which means we need about 150 replicas. Assuming ~6 cores allocated to a single replica, we would need about 20 to 22 servers to host them all.
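The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: the request rate, rps per replica, and cores per replica come from the description, while the allocatable cores per server (42 and 45) are hypothetical values chosen to show how a 20 to 22 server range can arise.

```python
import math

TOTAL_RPS = 3000        # "around 3k rps" from the description
RPS_PER_REPLICA = 20    # estimate derived from mw-api-ext usage
CORES_PER_REPLICA = 6   # "~6 cores allocated to a single replica"

replicas = math.ceil(TOTAL_RPS / RPS_PER_REPLICA)
total_cores = replicas * CORES_PER_REPLICA


def servers_needed(cores_per_server: int) -> int:
    """Servers required to host total_cores, rounding up to whole servers."""
    return math.ceil(total_cores / cores_per_server)


print(f"replicas: {replicas}, cores: {total_cores}")
# Hypothetical allocatable cores per server; not stated in the task.
for cores_per_server in (42, 45):
    print(f"{cores_per_server} cores/server -> {servers_needed(cores_per_server)} servers")
```

If the real rps-per-replica figure for mw-api-int turns out to be higher than 20, the replica and server counts shrink proportionally.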

I don't think it's doable to move over that number of servers in one go; we should rather look into moving a portion of the traffic and increasing it progressively.

Envoy has tools to split traffic between different backends, and I think it should be the way to go.
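Envoy's route configuration supports weighted clusters, where each request is sent to one of several upstreams with a probability proportional to its weight. The snippet below is a minimal Python sketch of that idea, not of Envoy's implementation; the backend names and the 90/10 weights are illustrative assumptions.

```python
import random
from collections import Counter


def pick_backend(weights: dict[str, int], rng: random.Random) -> str:
    """Choose one backend per request, with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Hypothetical backends and weights: 10% of requests go to the new cluster.
weights = {"api-on-prem": 90, "mw-api-int": 10}
rng = random.Random(0)  # seeded only to make the simulation repeatable

total = 10_000
counts = Counter(pick_backend(weights, rng) for _ in range(total))
share = counts["mw-api-int"] / total
print(f"mw-api-int share of simulated requests: {share:.1%}")
```

Because the choice is made per request, changing the weights takes effect as soon as the new configuration is loaded, which is what makes this approach attractive for a gradual migration.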

Details

Other Assignee
kamila
Repo | Branch | Lines +/-
operations/puppet | production | +0 -12
operations/deployment-charts | master | +1 -1
operations/puppet | production | +1 -1
operations/deployment-charts | master | +1 -1
operations/deployment-charts | master | +2 -2
operations/puppet | production | +1 -1
operations/deployment-charts | master | +1 -1
operations/puppet | production | +1 -1
operations/deployment-charts | master | +1 -1
operations/puppet | production | +1 -1
operations/puppet | production | +1 -1
operations/deployment-charts | master | +1 -1
operations/deployment-charts | master | +1 -1
operations/puppet | production | +1 -1
operations/puppet | production | +1 -1
operations/deployment-charts | master | +3 -8
operations/deployment-charts | master | +119 -26
operations/deployment-charts | master | +1 -1
operations/deployment-charts | master | +2 -2
operations/deployment-charts | master | +6 -1
operations/deployment-charts | master | +4 -5
operations/deployment-charts | master | +16 -3

Event Timeline

Joe updated the task description.
Joe updated Other Assignee, added: kamila.

Couldn't we just add another mobileapps release (like a canary) that connects to mw-api-int, and scale that up slowly while scaling the existing one down? That would not require any Envoy config patching.

> Couldn't we just add another mobileapps release (like a canary) that connects to mw-api-int and scale that up slowly while scaling the existing one down? That would not require any envoy config patching

Sure, we could also do that; I just thought it would be cleaner in more than one way to have this capability in our mesh :)

Change 973179 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] mobileapps: introduce canary release

https://gerrit.wikimedia.org/r/973179

Change 973180 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] mobileapps: add egress networkpolicy for mesh

https://gerrit.wikimedia.org/r/973180

Change 973181 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] mobileapps: switch canary to mw-on-k8s

https://gerrit.wikimedia.org/r/973181

Change 973182 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] mobileapps: move traffic to mw on k8s

https://gerrit.wikimedia.org/r/973182

Change 973183 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] mw-api-int: double the number of replicas

https://gerrit.wikimedia.org/r/973183

Change 973184 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] mobileapps: move 20% of replicas to k8s

https://gerrit.wikimedia.org/r/973184

As is clear from the patches, I chose to take the sage advice of @JMeybohm and go down the path of least resistance :)

Joe triaged this task as High priority. Nov 13 2023, 10:38 AM
Joe moved this task from Incoming 🐫 to Doing 😎 on the serviceops board.

I decided we should move about 10% of the mobileapps traffic at a time. That means about 300 rps per step, which I think we should be able to serve by converting about 2-3 api servers into k8s nodes, i.e. an additional 15 pods.
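As a rough check of the per-step numbers, here is a small sketch that reuses the figures from the task description (3k rps total, 20 rps per pod, ~6 cores per pod). The function name and the integer-percentage interface are illustrative only.

```python
import math

TOTAL_RPS = 3000     # total mobileapps traffic, from the task description
RPS_PER_POD = 20     # same per-replica estimate used in the description
CORES_PER_POD = 6    # "~6 cores allocated to a single replica"


def step_cost(percent: int) -> tuple[int, int, int]:
    """Return (rps, pods, cores) needed to absorb `percent` of the traffic."""
    rps = TOTAL_RPS * percent // 100
    pods = math.ceil(rps / RPS_PER_POD)
    return rps, pods, pods * CORES_PER_POD


rps, pods, cores = step_cost(10)
print(f"10% step: {rps} rps, {pods} pods, {cores} cores")
```

Spreading that many cores over 2-3 converted servers corresponds to roughly 30 to 45 usable cores per node, which is in line with the server count estimated in the description.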

Change 973179 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: introduce canary release

https://gerrit.wikimedia.org/r/973179

Change 973180 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: add egress networkpolicy for mesh

https://gerrit.wikimedia.org/r/973180

Change 973181 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: switch canary to mw-on-k8s

https://gerrit.wikimedia.org/r/973181

Change 973182 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: move traffic to mw on k8s

https://gerrit.wikimedia.org/r/973182

Change 973183 merged by jenkins-bot:

[operations/deployment-charts@master] mw-api-int: double the number of replicas

https://gerrit.wikimedia.org/r/973183

Change 974991 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] modules/mesh: add capability for traffic splitting

https://gerrit.wikimedia.org/r/974991

Change 975816 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] mobileapps: switch to use the traffic percentage split endpoint

https://gerrit.wikimedia.org/r/975816

Change 974991 merged by jenkins-bot:

[operations/deployment-charts@master] modules/mesh: add capability for traffic splitting

https://gerrit.wikimedia.org/r/974991

As you might have noticed from the patches here, we've pivoted to traffic splitting in the mesh: splitting traffic to the canaries via kube-proxy converges over hours, not over seconds, which is what we'll need once we start raising our percentages.

Change 975816 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: switch to use the traffic percentage split endpoint

https://gerrit.wikimedia.org/r/975816

Change 976218 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] mobileapps: switch 15% of traffic to mw-on-k8s

https://gerrit.wikimedia.org/r/976218

Change 976219 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] mobileapps: 20% to mw-on-k8s

https://gerrit.wikimedia.org/r/976219

Change 976220 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] mobileapps: 30% to mw-on-k8s

https://gerrit.wikimedia.org/r/976220

Change 976221 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] mobileapps: 45% to mw on k8s

https://gerrit.wikimedia.org/r/976221

Change 976222 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] mobileapps: 60% to mw-api-int

https://gerrit.wikimedia.org/r/976222

Change 976223 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] mobileapps: 75% to mw-on-k8s

https://gerrit.wikimedia.org/r/976223

Change 976224 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] mobileapps: 90% to mw on k8s

https://gerrit.wikimedia.org/r/976224

Change 976218 merged by Giuseppe Lavagetto:

[operations/puppet@production] mobileapps: switch 15% of traffic to mw-on-k8s

https://gerrit.wikimedia.org/r/976218

Change 976219 merged by Kamila Součková:

[operations/puppet@production] mobileapps: 20% to mw-on-k8s

https://gerrit.wikimedia.org/r/976219

Change 977628 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/deployment-charts@master] mobileapps: increase replicas to 114

https://gerrit.wikimedia.org/r/977628

Change 977628 abandoned by Kamila Součková:

[operations/deployment-charts@master] mobileapps: increase replicas to 114

Reason:

I am very confused and should be increasing mw-api-int, not this

https://gerrit.wikimedia.org/r/977628

Change 977683 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/deployment-charts@master] mw-api-int: increase replicas by 50%

https://gerrit.wikimedia.org/r/977683

Change 977683 merged by jenkins-bot:

[operations/deployment-charts@master] mw-api-int: increase replicas by 50%

https://gerrit.wikimedia.org/r/977683

Change 976220 merged by Kamila Součková:

[operations/puppet@production] mobileapps: 30% to mw-on-k8s

https://gerrit.wikimedia.org/r/976220

Change 976221 merged by Kamila Součková:

[operations/puppet@production] mobileapps: 45% to mw on k8s

https://gerrit.wikimedia.org/r/976221

Change 980888 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/deployment-charts@master] mw-api-int: increase replicas by 33%

https://gerrit.wikimedia.org/r/980888

Change 980888 merged by jenkins-bot:

[operations/deployment-charts@master] mw-api-int: increase replicas by 33%

https://gerrit.wikimedia.org/r/980888

Change 976222 merged by Kamila Součková:

[operations/puppet@production] mobileapps: 60% to mw-api-int

https://gerrit.wikimedia.org/r/976222

Change 983163 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/deployment-charts@master] mw-api-int: replicas x125%

https://gerrit.wikimedia.org/r/983163

Change 983163 merged by jenkins-bot:

[operations/deployment-charts@master] mw-api-int: replicas x125%

https://gerrit.wikimedia.org/r/983163

Change 976223 merged by Kamila Součková:

[operations/puppet@production] mobileapps: 75% to mw-on-k8s

https://gerrit.wikimedia.org/r/976223

Change 973184 abandoned by Giuseppe Lavagetto:

[operations/deployment-charts@master] mobileapps: move 20% of replicas to k8s

Reason:

https://gerrit.wikimedia.org/r/973184

Change 987976 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/deployment-charts@master] mw-api-int: replicas x1.3

https://gerrit.wikimedia.org/r/987976

Change 987976 merged by jenkins-bot:

[operations/deployment-charts@master] mw-api-int: replicas x1.3

https://gerrit.wikimedia.org/r/987976

Change 976224 merged by Kamila Součková:

[operations/puppet@production] mobileapps: 90% to mw on k8s

https://gerrit.wikimedia.org/r/976224

Change 991043 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/deployment-charts@master] mobileapps: switch service discovery to k8s only

https://gerrit.wikimedia.org/r/991043

Change 991043 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: switch service discovery to k8s only

https://gerrit.wikimedia.org/r/991043

Change 991394 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] service catalog: remove mw-api-async-transition

https://gerrit.wikimedia.org/r/991394

kamila subscribed.

All traffic is now going to k8s \o/

I will keep an eye on PHP worker saturation, but it should be fine, so I'm calling it resolved.