
Migrate internal traffic to k8s
Open, In Progress, Medium, Public

Description

We need to progressively migrate internal traffic so that our services call the MediaWiki API in the mw-api-int cluster on k8s.

Right now we have (via this thanos query):

  • Mobileapps making 3k rps to the mediawiki API (!!!) <- Moved in 2nd stage
  • restbase making 600 rps <- Moved in 2nd stage
  • ores making 75-100 rps <- Deprecated
  • wikifeeds making ~ 70 rps <- Moved in 2nd stage
  • flink making ~ 40 rps <- Moved in 2nd stage

Everything else is basically marginal.

I propose we start moving all services on kubernetes to use mw-api-int now, with the exception of the ones named above.
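As a rough illustration of why the named callers are treated separately, the sketch below takes the request rates quoted in the list above and works out each caller's share of the named traffic. It assumes the upper bound of 100 rps for ores and treats the approximate figures for wikifeeds and flink as exact. The dictionary and function names are hypothetical and are not part of any Wikimedia tooling.

```python
# Request rates (rps) as quoted in the task description.
# Assumptions: ores uses the upper bound of its 75-100 range,
# and the "~" values for wikifeeds and flink are taken as exact.
NAMED_CALLERS_RPS = {
    "mobileapps": 3000,
    "restbase": 600,
    "ores": 100,
    "wikifeeds": 70,
    "flink": 40,
}


def traffic_shares(rps_by_service):
    """Return each service's share of the total rps as a fraction between 0 and 1."""
    total = sum(rps_by_service.values())
    return {name: rps / total for name, rps in rps_by_service.items()}


if __name__ == "__main__":
    shares = traffic_shares(NAMED_CALLERS_RPS)
    # Print the callers from largest to smallest share.
    for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:<12} {NAMED_CALLERS_RPS[name]:>5} rps  {share:6.1%}")
```

With these figures mobileapps alone accounts for most of the named traffic, which is consistent with holding it back for a later stage.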

Kubernetes services calling mediawiki


Event Timeline


Change 904061 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] service_catalog: Add mw-api-int k8s service - 3

https://gerrit.wikimedia.org/r/904061

Change 904065 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/dns@master] mw-api-int: add geo and metafo records

https://gerrit.wikimedia.org/r/904065

Mentioned in SAL (#wikimedia-operations) [2023-03-29T09:57:03Z] <claime> Adding mw-api-int to service_catalog in service_setup - T333120

Change 903217 merged by Clément Goubert:

[operations/puppet@production] service_catalog: Add mw-api-int k8s service - 1

https://gerrit.wikimedia.org/r/903217

Mentioned in SAL (#wikimedia-operations) [2023-03-29T09:58:56Z] <claime> running puppet on O:kubernetes::worker and O:lvs::balancer - T333120

Mentioned in SAL (#wikimedia-operations) [2023-03-29T10:37:09Z] <claime> Switching mw-api-int to lvs_setup - T333120

Change 904060 merged by Clément Goubert:

[operations/puppet@production] service_catalog: Add mw-api-int k8s service - 2

https://gerrit.wikimedia.org/r/904060

Mentioned in SAL (#wikimedia-operations) [2023-03-29T10:41:03Z] <cgoubert@cumin1001> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1020*,lvs2010*} and A:lvs (T333120)

Mentioned in SAL (#wikimedia-operations) [2023-03-29T10:42:59Z] <cgoubert@cumin1001> END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1020*,lvs2010*} and A:lvs (T333120)

Mentioned in SAL (#wikimedia-operations) [2023-03-29T10:46:23Z] <cgoubert@cumin1001> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1019*,lvs2009*} and A:lvs (T333120)

Mentioned in SAL (#wikimedia-operations) [2023-03-29T10:49:17Z] <cgoubert@cumin1001> END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1019*,lvs2009*} and A:lvs (T333120)

Mentioned in SAL (#wikimedia-operations) [2023-03-29T10:50:22Z] <claime> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1019*,lvs2009*} and A:lvs (T333120)

Mentioned in SAL (#wikimedia-operations) [2023-03-29T10:50:57Z] <claime> Switching mw-api-int to production - T333120

Change 904061 merged by Clément Goubert:

[operations/puppet@production] service_catalog: Add mw-api-int k8s service - 3

https://gerrit.wikimedia.org/r/904061

Mentioned in SAL (#wikimedia-operations) [2023-03-29T10:52:30Z] <claime> Running puppet on dns-auth - T333120

Change 904065 merged by Clément Goubert:

[operations/dns@master] mw-api-int: add discovery records

https://gerrit.wikimedia.org/r/904065

Mentioned in SAL (#wikimedia-operations) [2023-03-29T10:58:30Z] <claime> authdns-update successful on all nodes - T333120

The mw-api-int and mw-api-int-ro services are now in production. We can proceed with creating the envoy listeners in https://gerrit.wikimedia.org/r/c/operations/puppet/+/903595/ and then switch services to use them.
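The SAL entries above show the service being moved through three states in order: service_setup, then lvs_setup, then production. A minimal sketch of that progression follows. It models only the three states visible in this log (the real service catalog may define others), and the class and function names are hypothetical.

```python
from enum import IntEnum


class ServiceState(IntEnum):
    """States taken from the SAL entries above, in the order they were applied."""
    SERVICE_SETUP = 1
    LVS_SETUP = 2
    PRODUCTION = 3


def next_state(current):
    """Return the state that follows `current`, or None if it is already the last one."""
    states = list(ServiceState)
    index = states.index(current)
    return states[index + 1] if index + 1 < len(states) else None


def rollout_plan(start=ServiceState.SERVICE_SETUP):
    """List the states a service passes through from `start` up to the last state."""
    plan = [start]
    while (following := next_state(plan[-1])) is not None:
        plan.append(following)
    return plan


if __name__ == "__main__":
    print(" -> ".join(state.name.lower() for state in rollout_plan()))
```

In the log, each state change was followed by a puppet run or a pybal restart before the next one was applied.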

Change 903595 merged by Clément Goubert:

[operations/puppet@production] P:services_proxy::envoy: Add mw-api-int

https://gerrit.wikimedia.org/r/903595

Change 908542 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] admin_ng: Add mw-on-k8s Egress rules

https://gerrit.wikimedia.org/r/908542

Change 908542 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: Add mw-on-k8s Egress rules

https://gerrit.wikimedia.org/r/908542

Change 908553 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] cxserver: Add mesh egress

https://gerrit.wikimedia.org/r/908553

Change 908553 merged by jenkins-bot:

[operations/deployment-charts@master] cxserver: Add mesh egress

https://gerrit.wikimedia.org/r/908553

Change #1043062 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mediawiki: Switch backend calls to mw-api-int

https://gerrit.wikimedia.org/r/1043062

Change #1043062 merged by Clément Goubert:

[operations/puppet@production] mediawiki: Switch backend calls to mw-api-int

https://gerrit.wikimedia.org/r/1043062

Mentioned in SAL (#wikimedia-operations) [2024-06-13T14:15:25Z] <cgoubert@deploy1002> Started scap: Change mwapi listener to mw-api-int - T333120

Mentioned in SAL (#wikimedia-operations) [2024-06-13T14:21:24Z] <cgoubert@deploy1002> Finished scap: Change mwapi listener to mw-api-int - T333120 (duration: 06m 47s)

Looks like this is now done except for "some straggling traffic" for the api-gateway?

image.png (958×2 px, 146 KB)

Yes, but I will close it when I'm sure I have zero internal traffic on the bare metal clusters.

I believe the "straggling traffic" here is a misreading of the graph: the API gateway's envoy config refers to traffic to the mediawiki API as "mwapi_cluster" internally, both before and after the hostname was changed. I believe these requests are normal and are already being routed to k8s. Drop mwapi_cluster from the graph and we're at zero! 🎉
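The check described in this comment could be sketched as below, assuming the graph's data is available as (envoy cluster label, rps) pairs. The sample values and the second cluster name are invented for illustration. Only the "mwapi_cluster" label comes from the comment itself.

```python
def residual_rps(samples, ignored_clusters=("mwapi_cluster",)):
    """Sum rps over all samples whose cluster label is not in `ignored_clusters`."""
    return sum(rps for cluster, rps in samples if cluster not in ignored_clusters)


if __name__ == "__main__":
    # Hypothetical samples: (envoy cluster label, requests per second).
    samples = [
        ("mwapi_cluster", 12.5),            # label kept across the hostname change
        ("hypothetical_other_cluster", 0.0),
    ]
    print(residual_rps(samples))
```

If the only non-zero series carries the mwapi_cluster label, the residual is zero, which matches the conclusion in the comment.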