Migrate mobileapps to k8s
Closed, Resolved, Public

Assigned To

Authored By

	Joe
	Nov 9 2023, 8:22 AM

Description

We want to move the traffic for mobileapps from the on-prem api cluster to the mw-api-int cluster on kubernetes.

At around 3k rps, this is a *large* chunk of our traffic: about 40% of all the API calls still going to the on-premises cluster!

Extrapolating from our current usage on mw-api-ext (which might be wrong), we need about 1 replica per 20 rps to keep usage low enough (although I would argue we can live with a higher usage for mw-api-int), which means about 150 replicas. Assuming ~6 cores allocated to a single replica, we would need about 20 to 22 servers to host them all.
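
For anyone who wants to re-run the estimate with different numbers, here is a minimal sketch of the arithmetic above. The inputs (3k rps, 20 rps per replica, 6 cores per replica) come from this description; the function and variable names are made up for illustration. The cores-per-server figure is not stated anywhere in this task, so the sketch only prints the value implied by the 20 to 22 server estimate.

```python
import math


def replicas_needed(total_rps: int, rps_per_replica: int) -> int:
    """Replicas required to serve total_rps, rounding up to a whole replica."""
    return math.ceil(total_rps / rps_per_replica)


# Figures quoted in the task description.
total_rps = 3000
rps_per_replica = 20
cores_per_replica = 6

replicas = replicas_needed(total_rps, rps_per_replica)  # 150
total_cores = replicas * cores_per_replica              # 900

print(f"replicas: {replicas}, cores: {total_cores}")

# Cores each server would have to dedicate to these pods for the
# 20-22 server estimate to hold (derived here, not a measured number).
for servers in (20, 22):
    print(f"{servers} servers -> {total_cores / servers:.1f} cores per server")
```

Accepting a higher usage on mw-api-int, as suggested above, amounts to raising `rps_per_replica`, which shrinks the replica and server counts proportionally.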

I don't think it's doable to move over that many servers in one go; we should rather look into moving a portion of the traffic and increasing it progressively.

Envoy has tools to split traffic between different backends, and I think that is the way to go.

Details

Other Assignee: kamila

Subject	Repo	Branch	Lines +/-
service catalog: remove mw-api-async-transition	operations/puppet	production	+0 -12
mobileapps: switch service discovery to k8s only	operations/deployment-charts	master	+1 -1
mobileapps: 90% to mw on k8s	operations/puppet	production	+1 -1
mw-api-int: replicas x1.3	operations/deployment-charts	master	+1 -1
mobileapps: move 20% of replicas to k8s	operations/deployment-charts	master	+2 -2
mobileapps: 75% to mw-on-k8s	operations/puppet	production	+1 -1
mw-api-int: replicas x125%	operations/deployment-charts	master	+1 -1
mobileapps: 60% to mw-api-int	operations/puppet	production	+1 -1
mw-api-int: increase replicas by 33%	operations/deployment-charts	master	+1 -1
mobileapps: 45% to mw on k8s	operations/puppet	production	+1 -1
mobileapps: 30% to mw-on-k8s	operations/puppet	production	+1 -1
mw-api-int: increase replicas by 50%	operations/deployment-charts	master	+1 -1
mobileapps: increase replicas to 114	operations/deployment-charts	master	+1 -1
mobileapps: 20% to mw-on-k8s	operations/puppet	production	+1 -1
mobileapps: switch 15% of traffic to mw-on-k8s	operations/puppet	production	+1 -1
mobileapps: switch to use the traffic percentage split endpoint	operations/deployment-charts	master	+3 -8
modules/mesh: add capability for traffic splitting	operations/deployment-charts	master	+119 -26
mw-api-int: double the number of replicas	operations/deployment-charts	master	+1 -1
mobileapps: move traffic to mw on k8s	operations/deployment-charts	master	+2 -2
mobileapps: switch canary to mw-on-k8s	operations/deployment-charts	master	+6 -1
mobileapps: add egress networkpolicy for mesh	operations/deployment-charts	master	+4 -5
mobileapps: introduce canary release	operations/deployment-charts	master	+16 -3

Related Objects

Status	Assigned	Task
Stalled	None	T255792 Quibble runs core:unit tests twice!
Open	None	T328919 Upgrade to PHPUnit 10
Open	None	T338103 Micro-optimize ApiResult::isMetadataKey with str_starts_with once we support PHP8+
Open	None	T328921 Drop PHP 7.4 support from MediaWiki
Stalled	None	T334726 Use return type `never` in Wikibase
Open	None	T328922 Drop PHP 8.0 support from MediaWiki
Stalled	None	T319055 Upgrade to psr/container 2.x
Stalled	Krinkle	T319432 Migrate WMF production from PHP 7.4 to PHP 8.1
Open	None	T291916 Tracking task for Bullseye migrations in production
Stalled	None	T356293 Migrate MW appservers' base images to bullseye
Open	None	T290536 Serve production traffic via Kubernetes
In Progress	Clement_Goubert	T333120 Migrate internal traffic to k8s
Resolved	Joe	T350846 Migrate mobileapps to k8s

Event Timeline

Joe created this task. Nov 9 2023, 8:22 AM

Restricted Application removed a project: Patch-For-Review. Nov 9 2023, 8:22 AM

Joe claimed this task. Nov 9 2023, 8:23 AM

Joe updated the task description.

Joe updated Other Assignee, added: kamila.

Couldn't we just add another mobileapps release (like a canary) that connects to mw-api-int and scale that up slowly while scaling the existing one down? That would not require any envoy config patching

In T350846#9318457, @JMeybohm wrote:

Couldn't we just add another mobileapps release (like a canary) that connects to mw-api-int and scale that up slowly while scaling the existing one down? That would not require any envoy config patching

Sure, we could also do that; I just thought it would be cleaner in more than one way to have this capability in our mesh :)

Krinkle unsubscribed. Nov 9 2023, 11:18 AM

Change 973179 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] mobileapps: introduce canary release

https://gerrit.wikimedia.org/r/973179

Change 973180 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] mobileapps: add egress networkpolicy for mesh

https://gerrit.wikimedia.org/r/973180

Change 973181 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] mobileapps: switch canary to mw-on-k8s

https://gerrit.wikimedia.org/r/973181

Change 973182 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] mobileapps: move traffic to mw on k8s

https://gerrit.wikimedia.org/r/973182

Change 973183 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] mw-api-int: double the number of replicas

https://gerrit.wikimedia.org/r/973183

Change 973184 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] mobileapps: move 20% of replicas to k8s

https://gerrit.wikimedia.org/r/973184

As is clear from the patches, I chose to take the sage advice of @JMeybohm and go down the path of least resistance :)

Joe triaged this task as High priority. Nov 13 2023, 10:38 AM

Joe moved this task from Incoming 🐫 to Doing 😎 on the serviceops board.

I decided we should move about 10% of the mobileapps traffic at a time. That means about 300 rps per step, which I think we should be able to serve by converting about 2-3 api servers into k8s nodes, or an additional 15 pods.
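
A quick sketch of where the 300 rps and 15 pods come from, reusing the 3k rps total, the 20 rps per replica and the ~6 cores per replica from the task description. The helper name is hypothetical, and the real per-pod throughput on mw-api-int may differ from the mw-api-ext estimate.

```python
import math


def pods_for_step(total_rps: int, step_percent: int, rps_per_replica: int) -> int:
    """Extra pods needed to absorb step_percent of total_rps."""
    step_rps = total_rps * step_percent / 100
    return math.ceil(step_rps / rps_per_replica)


total_rps = 3000        # from the task description
rps_per_replica = 20    # from the task description
cores_per_replica = 6   # from the task description

step_rps = total_rps * 10 / 100                             # 300.0 rps per 10% step
step_pods = pods_for_step(total_rps, 10, rps_per_replica)   # 15 pods
step_cores = step_pods * cores_per_replica                  # 90 cores

print(f"{step_rps:.0f} rps per step -> {step_pods} pods, {step_cores} cores")
```

Spreading those 90 cores over 2-3 converted servers works out to 30-45 cores per server, which is in line with the 20 to 22 server estimate for the full 150 replicas.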

Change 973179 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: introduce canary release

https://gerrit.wikimedia.org/r/973179

Change 973180 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: add egress networkpolicy for mesh

https://gerrit.wikimedia.org/r/973180

Change 973181 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: switch canary to mw-on-k8s

https://gerrit.wikimedia.org/r/973181

Change 973182 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: move traffic to mw on k8s

https://gerrit.wikimedia.org/r/973182

Change 973183 merged by jenkins-bot:

[operations/deployment-charts@master] mw-api-int: double the number of replicas

https://gerrit.wikimedia.org/r/973183

Clement_Goubert moved this task from Backlog to In Progress on the MW-on-K8s board. Nov 15 2023, 3:45 PM

Change 974991 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] modules/mesh: add capability for traffic splitting

https://gerrit.wikimedia.org/r/974991

Change 975816 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] mobileapps: switch to use the traffic percentage split endpoint

https://gerrit.wikimedia.org/r/975816

Change 974991 merged by jenkins-bot:

[operations/deployment-charts@master] modules/mesh: add capability for traffic splitting

https://gerrit.wikimedia.org/r/974991

As you might have noticed from the patches here, we've pivoted to traffic splitting in the mesh, because splitting traffic to the canaries via kube-proxy converges over hours, not the seconds we'll need once we start raising our percentages.
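
To illustrate what a percentage split means per request, here is a toy simulation. It is not the mesh module's code nor Envoy's implementation, just a weighted random choice; the backend labels and the 85/15 weights are illustrative, mirroring the 15% step in the patches.

```python
import random
from collections import Counter


def pick_backend(weights: dict[str, int], rng: random.Random) -> str:
    """Pick one backend with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Illustrative labels and weights: 15% of requests go to mw-api-int.
weights = {"on-prem-api": 85, "mw-api-int": 15}

rng = random.Random(0)  # fixed seed so the run is repeatable
total_requests = 10_000
counts = Counter(pick_backend(weights, rng) for _ in range(total_requests))
share = counts["mw-api-int"] / total_requests

print(f"mw-api-int received {share:.1%} of {total_requests} simulated requests")
```

Because the choice is made independently for each request, a change in weights applies to the very next request, which is the property we need when raising percentages.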

Change 975816 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: switch to use the traffic percentage split endpoint

https://gerrit.wikimedia.org/r/975816

Change 976218 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] mobileapps: switch 15% of traffic to mw-on-k8s

https://gerrit.wikimedia.org/r/976218

Change 976219 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] mobileapps: 20% to mw-on-k8s

https://gerrit.wikimedia.org/r/976219

Change 976220 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] mobileapps: 30% to mw-on-k8s

https://gerrit.wikimedia.org/r/976220

Change 976221 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] mobileapps: 45% to mw on k8s

https://gerrit.wikimedia.org/r/976221

Change 976222 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] mobileapps: 60% to mw-api-int

https://gerrit.wikimedia.org/r/976222

Change 976223 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] mobileapps: 75% to mw-on-k8s

https://gerrit.wikimedia.org/r/976223

Change 976224 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] mobileapps: 90% to mw on k8s

https://gerrit.wikimedia.org/r/976224

Change 976218 merged by Giuseppe Lavagetto:

[operations/puppet@production] mobileapps: switch 15% of traffic to mw-on-k8s

https://gerrit.wikimedia.org/r/976218

Change 976219 merged by Kamila Součková:

[operations/puppet@production] mobileapps: 20% to mw-on-k8s

https://gerrit.wikimedia.org/r/976219

Change 977628 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/deployment-charts@master] mobileapps: increase replicas to 114

https://gerrit.wikimedia.org/r/977628

Change 977628 abandoned by Kamila Součková:

[operations/deployment-charts@master] mobileapps: increase replicas to 114

Reason:

I am very confused and should be increasing mw-api-int, not this

https://gerrit.wikimedia.org/r/977628

Change 977683 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/deployment-charts@master] mw-api-int: increase replicas by 50%

https://gerrit.wikimedia.org/r/977683

Change 977683 merged by jenkins-bot:

[operations/deployment-charts@master] mw-api-int: increase replicas by 50%

https://gerrit.wikimedia.org/r/977683

Change 976220 merged by Kamila Součková:

[operations/puppet@production] mobileapps: 30% to mw-on-k8s

https://gerrit.wikimedia.org/r/976220

KOfori moved this task from Backlog to Radar/Not for service by Traffic on the Traffic board. Dec 4 2023, 1:54 PM

Change 976221 merged by Kamila Součková:

[operations/puppet@production] mobileapps: 45% to mw on k8s

https://gerrit.wikimedia.org/r/976221

Change 980888 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/deployment-charts@master] mw-api-int: increase replicas by 33%

https://gerrit.wikimedia.org/r/980888

Change 980888 merged by jenkins-bot:

[operations/deployment-charts@master] mw-api-int: increase replicas by 33%

https://gerrit.wikimedia.org/r/980888

Change 976222 merged by Kamila Součková:

[operations/puppet@production] mobileapps: 60% to mw-api-int

https://gerrit.wikimedia.org/r/976222

Change 983163 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/deployment-charts@master] mw-api-int: replicas x125%

https://gerrit.wikimedia.org/r/983163

Change 983163 merged by jenkins-bot:

[operations/deployment-charts@master] mw-api-int: replicas x125%

https://gerrit.wikimedia.org/r/983163

Change 976223 merged by Kamila Součková:

[operations/puppet@production] mobileapps: 75% to mw-on-k8s

https://gerrit.wikimedia.org/r/976223

Change 973184 abandoned by Giuseppe Lavagetto:

[operations/deployment-charts@master] mobileapps: move 20% of replicas to k8s

Reason:

https://gerrit.wikimedia.org/r/973184

Change 987976 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/deployment-charts@master] mw-api-int: replicas x1.3

https://gerrit.wikimedia.org/r/987976

Change 987976 merged by jenkins-bot:

[operations/deployment-charts@master] mw-api-int: replicas x1.3

https://gerrit.wikimedia.org/r/987976

Change 976224 merged by Kamila Součková:

[operations/puppet@production] mobileapps: 90% to mw on k8s

https://gerrit.wikimedia.org/r/976224

Change 991043 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/deployment-charts@master] mobileapps: switch service discovery to k8s only

https://gerrit.wikimedia.org/r/991043

Change 991043 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: switch service discovery to k8s only

https://gerrit.wikimedia.org/r/991043

Change 991394 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] service catalog: remove mw-api-async-transition

https://gerrit.wikimedia.org/r/991394

All traffic is now going to k8s \o/

I will keep an eye on PHP worker saturation, but it should be fine, so I'm calling it resolved.

akosiaris mentioned this in T333120: Migrate internal traffic to k8s. Feb 20 2024, 4:58 PM
