
Migrate internal traffic to k8s
Closed, Resolved | Public

Description

We need to progressively migrate our services so that they call the MediaWiki API in the mw-api-int cluster on k8s.

Right now we have (via this Thanos query):

  • Mobileapps making 3k rps to the mediawiki API (!!!) <- Moved in 2nd stage
  • restbase making 600 rps <- Moved in 2nd stage
  • ores making 75-100 rps <- Deprecated
  • wikifeeds making ~ 70 rps <- Moved in 2nd stage
  • flink making ~ 40 rps <- Moved in 2nd stage

Everything else is basically marginal.
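
To put the figures above in proportion, here is a minimal sketch that computes each caller's share of the listed traffic. The rates are the approximate ones quoted in the list; using the midpoint of the 75-100 range for ores is an assumption, and the names `CALLER_RPS` and `traffic_shares` are purely illustrative.

```python
# Approximate request rates (requests per second) quoted in the task description.
# The ores figure is given as a 75-100 range; the midpoint is an assumption.
CALLER_RPS = {
    "mobileapps": 3000.0,
    "restbase": 600.0,
    "ores": (75 + 100) / 2,
    "wikifeeds": 70.0,
    "flink": 40.0,
}


def traffic_shares(rps_by_caller):
    """Return each caller's fraction of the total listed request rate."""
    total = sum(rps_by_caller.values())
    return {name: rps / total for name, rps in rps_by_caller.items()}


if __name__ == "__main__":
    shares = traffic_shares(CALLER_RPS)
    # Print callers from largest to smallest share.
    for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{name:<12}{share:6.1%}")
```

With these numbers, mobileapps alone accounts for roughly four fifths of the listed traffic, which is consistent with leaving the largest callers for a later stage.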

I propose we start moving all services on kubernetes to use mw-api-int now, with the exception of the ones named above.

Kubernetes services calling mediawiki

Related Objects

Status    Assigned
Resolved  None
Resolved  Clement_Goubert
Resolved  Clement_Goubert
Resolved  Clement_Goubert
Resolved  Clement_Goubert
Invalid   Clement_Goubert
Resolved  Joe
Resolved  Clement_Goubert
Resolved  Clement_Goubert
Resolved  Clement_Goubert
Resolved  Clement_Goubert
Resolved  Joe
Resolved  Joe
Resolved  Joe
Resolved  JMeybohm
Resolved  Joe
Resolved  Clement_Goubert
Resolved  Clement_Goubert
Resolved  Clement_Goubert
Declined  Clement_Goubert
Resolved  Clement_Goubert
Resolved  elukey

Event Timeline

There are a very large number of changes, so older changes are hidden.

Mentioned in SAL (#wikimedia-operations) [2023-03-29T09:57:03Z] <claime> Adding mw-api-int to service_catalog in service_setup - T333120

Change 903217 merged by Clément Goubert:

[operations/puppet@production] service_catalog: Add mw-api-int k8s service - 1

https://gerrit.wikimedia.org/r/903217

Mentioned in SAL (#wikimedia-operations) [2023-03-29T09:58:56Z] <claime> running puppet on O:kubernetes::worker and O:lvs::balancer - T333120

Mentioned in SAL (#wikimedia-operations) [2023-03-29T10:37:09Z] <claime> Switching mw-api-int to lvs_setup - T333120

Change 904060 merged by Clément Goubert:

[operations/puppet@production] service_catalog: Add mw-api-int k8s service - 2

https://gerrit.wikimedia.org/r/904060

Mentioned in SAL (#wikimedia-operations) [2023-03-29T10:41:03Z] <cgoubert@cumin1001> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1020*,lvs2010*} and A:lvs (T333120)

Mentioned in SAL (#wikimedia-operations) [2023-03-29T10:42:59Z] <cgoubert@cumin1001> END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1020*,lvs2010*} and A:lvs (T333120)

Mentioned in SAL (#wikimedia-operations) [2023-03-29T10:46:23Z] <cgoubert@cumin1001> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1019*,lvs2009*} and A:lvs (T333120)

Mentioned in SAL (#wikimedia-operations) [2023-03-29T10:49:17Z] <cgoubert@cumin1001> END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1019*,lvs2009*} and A:lvs (T333120)

Mentioned in SAL (#wikimedia-operations) [2023-03-29T10:50:22Z] <claime> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1019*,lvs2009*} and A:lvs (T333120)

Mentioned in SAL (#wikimedia-operations) [2023-03-29T10:50:57Z] <claime> Switching mw-api-int to production - T333120

Change 904061 merged by Clément Goubert:

[operations/puppet@production] service_catalog: Add mw-api-int k8s service - 3

https://gerrit.wikimedia.org/r/904061

Mentioned in SAL (#wikimedia-operations) [2023-03-29T10:52:30Z] <claime> Running puppet on dns-auth - T333120

Change 904065 merged by Clément Goubert:

[operations/dns@master] mw-api-int: add discovery records

https://gerrit.wikimedia.org/r/904065

Mentioned in SAL (#wikimedia-operations) [2023-03-29T10:58:30Z] <claime> authdns-update successful on all nodes - T333120

mw-api-int and mw-api-int-ro services now in production, we can proceed with creating the envoy listeners in https://gerrit.wikimedia.org/r/c/operations/puppet/+/903595/ and then switching services to use them.
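
As a rough illustration of what "switching services to use them" could look like from a client's point of view, the sketch below builds a MediaWiki API request that is sent to a local proxy listener, with the target wiki named in the Host header. This is an assumption about the calling pattern, not something stated in this task: the port number, the `/w/api.php` path, and the function name `build_mwapi_request` are all placeholders, and the real listener settings are defined in the puppet change linked above.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical local port for the mw-api-int listener; the real value is
# defined in the puppet change and is not stated in this task.
LISTENER_PORT = 6500


def build_mwapi_request(wiki_host, params, port=LISTENER_PORT):
    """Build (but do not send) a MediaWiki API request aimed at a local proxy listener."""
    url = f"http://localhost:{port}/w/api.php?{urlencode(params)}"
    # The connection targets the local listener; the Host header names the wiki.
    return Request(url, headers={"Host": wiki_host})


if __name__ == "__main__":
    req = build_mwapi_request(
        "en.wikipedia.org",
        {"action": "query", "meta": "siteinfo", "format": "json"},
    )
    print(req.full_url)
    print(req.get_header("Host"))
```

Under this pattern, moving a service to mw-api-int would mostly mean pointing it at a different local listener, while the request itself stays the same.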

Change 903595 merged by Clément Goubert:

[operations/puppet@production] P:services_proxy::envoy: Add mw-api-int

https://gerrit.wikimedia.org/r/903595

Change 908542 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] admin_ng: Add mw-on-k8s Egress rules

https://gerrit.wikimedia.org/r/908542

Change 908542 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: Add mw-on-k8s Egress rules

https://gerrit.wikimedia.org/r/908542

Change 908553 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] cxserver: Add mesh egress

https://gerrit.wikimedia.org/r/908553

Change 908553 merged by jenkins-bot:

[operations/deployment-charts@master] cxserver: Add mesh egress

https://gerrit.wikimedia.org/r/908553

Change #1043062 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mediawiki: Switch backend calls to mw-api-int

https://gerrit.wikimedia.org/r/1043062

Change #1043062 merged by Clément Goubert:

[operations/puppet@production] mediawiki: Switch backend calls to mw-api-int

https://gerrit.wikimedia.org/r/1043062

Mentioned in SAL (#wikimedia-operations) [2024-06-13T14:15:25Z] <cgoubert@deploy1002> Started scap: Change mwapi listener to mw-api-int - T333120

Mentioned in SAL (#wikimedia-operations) [2024-06-13T14:21:24Z] <cgoubert@deploy1002> Finished scap: Change mwapi listener to mw-api-int - T333120 (duration: 06m 47s)

Looks like this is now done except for "some straggling traffic" for the api-gateway?

image.png (958×2 px, 146 KB)

Yes, but I will close it when I'm sure I have zero internal traffic on the bare metal clusters.

I believe the straggling traffic here is a misnomer, or rather a misreading of the graph: the API gateway's envoy config refers to traffic to the mediawiki API as "mwapi_cluster" internally, both before and after the hostname was changed. I believe these requests are normal and are already being routed to k8s. Drop mwapi_cluster from the graph and we're at zero! 🎉
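
The check described in the comment above can be sketched as a small filter: sum the request rates per envoy cluster label, ignoring labels known to point at k8s already. Only the label `mwapi_cluster` comes from this task; the other label and all values in `sample` are made up for illustration, and `residual_rps` is a hypothetical helper name.

```python
def residual_rps(rps_by_cluster, already_on_k8s):
    """Sum request rates for cluster labels not known to be routed to k8s."""
    return sum(
        rps for name, rps in rps_by_cluster.items() if name not in already_on_k8s
    )


# Hypothetical sample: every value, and every label other than mwapi_cluster,
# is invented for this example.
sample = {"mwapi_cluster": 12.5, "example_bare_metal_cluster": 0.0}

if __name__ == "__main__":
    print(residual_rps(sample, already_on_k8s={"mwapi_cluster"}))
```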

All internal traffic has been migrated.