
Move 25% of mediawiki external requests to mw on k8s
Closed, ResolvedPublic

Description

Progressively move external traffic to mw on k8s in steps of 15%, 20%, and 25%.

Info from T351074: Move servers from the appserver/api cluster to kubernetes:

For every 5% of external traffic we move, we've needed to bump mw-web by 12-13 replicas and mw-api-ext by 10 replicas.

This means that every 5% increase in traffic requires 22-23 additional replicas. Given that each pod requires 5.6 CPUs, that comes to about 123 cores per traffic bump, or roughly 3 servers, as our servers have 48 cores each.

The above calculation is per-datacenter, of course.
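
As a back-of-the-envelope check of the numbers above (a sketch only; the constants are the figures quoted from T351074, and the script is illustrative, not part of any repo):

```
# Rough per-datacenter capacity estimate for each 5% of external traffic
# moved to mw on k8s, using the figures quoted above.
MW_WEB_REPLICAS_PER_5PCT = 12       # 12-13 in practice
MW_API_EXT_REPLICAS_PER_5PCT = 10
CPUS_PER_POD = 5.6
CORES_PER_SERVER = 48

replicas_per_bump = MW_WEB_REPLICAS_PER_5PCT + MW_API_EXT_REPLICAS_PER_5PCT
cores_per_bump = replicas_per_bump * CPUS_PER_POD
servers_per_bump = cores_per_bump / CORES_PER_SERVER

print(f"{replicas_per_bump} replicas -> {cores_per_bump:.0f} cores "
      f"(~{servers_per_bump:.1f} servers) per 5% bump, per datacenter")
# 22 replicas -> 123 cores (~2.6 servers) per 5% bump, per datacenter
```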

Event Timeline

Change 964447 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] trafficserver: move 15% of traffic to mw on k8s

https://gerrit.wikimedia.org/r/964447

Change 964448 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] trafficserver: move 20% of traffic to mw on k8s

https://gerrit.wikimedia.org/r/964448

Change 964449 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] trafficserver: move 25% of traffic to mw on k8s

https://gerrit.wikimedia.org/r/964449

Change 964457 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-api-ext, mw-web: Raise replicas 50%

https://gerrit.wikimedia.org/r/964457
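
(For context, a 50% replica raise is what you would expect if replicas scale roughly linearly with traffic share, e.g. going from 10% to 15% of external traffic. A minimal sketch of that proportional scaling, using hypothetical baseline replica counts rather than the actual values-file numbers:)

```
import math

def scaled_replicas(current_replicas: int, current_pct: float, target_pct: float) -> int:
    """Scale a deployment's replica count in proportion to its share of traffic."""
    return math.ceil(current_replicas * target_pct / current_pct)

# Hypothetical baselines, purely for illustration:
print(scaled_replicas(24, current_pct=10, target_pct=15))  # 36, i.e. a 50% raise
print(scaled_replicas(20, current_pct=10, target_pct=15))  # 30
```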

The Kubernetes work so far has caused problems with cross-wiki Echo notifications (see T223413, T342201). Please help resolve this before further rollouts. Thanks!

Clement_Goubert changed the task status from Open to In Progress. Oct 11 2023, 2:42 PM

> The Kubernetes work so far has caused problems with cross-wiki Echo notifications (see T223413, T342201). Please help resolve this before further rollouts. Thanks!

We have fixed the cause of T342201, but I wouldn't know how to further help on T223413 unless the root cause was the same (which I suppose is possible but not guaranteed).

It seems it was the same cause, as both issues look fixed to me. Thanks!

Change 964457 merged by jenkins-bot:

[operations/deployment-charts@master] mw-api-ext, mw-web: Raise replicas 50%

https://gerrit.wikimedia.org/r/964457

Change 964447 merged by Giuseppe Lavagetto:

[operations/puppet@production] trafficserver: move 15% of traffic to mw on k8s

https://gerrit.wikimedia.org/r/964447

Change 974514 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-web, mw-api-ext: Raise replicas for 20% traffic

https://gerrit.wikimedia.org/r/974514

Change 974514 merged by jenkins-bot:

[operations/deployment-charts@master] mw-web, mw-api-ext: Raise replicas for 20% traffic

https://gerrit.wikimedia.org/r/974514

Change 964448 merged by Clément Goubert:

[operations/puppet@production] trafficserver: move 20% of traffic to mw on k8s

https://gerrit.wikimedia.org/r/964448

Mentioned in SAL (#wikimedia-operations) [2023-11-15T14:35:11Z] <claime> Raised mw-on-k8s to 20% of external traffic, rollout will happen over the next half hour - T348122

Change 976689 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-web, mw-api-ext: Raise replicas for 25% traffic

https://gerrit.wikimedia.org/r/976689

Change 976689 merged by jenkins-bot:

[operations/deployment-charts@master] mw-web, mw-api-ext: Raise replicas for 25% traffic

https://gerrit.wikimedia.org/r/976689

Mentioned in SAL (#wikimedia-operations) [2023-11-22T12:59:05Z] <claime> Raising mw-web and mw-api-ext replicas for traffic bump - T348122

Change 964449 merged by Clément Goubert:

[operations/puppet@production] trafficserver: move 25% of traffic to mw on k8s

https://gerrit.wikimedia.org/r/964449