Direct 5% of all traffic to mw-on-k8s
Closed, Resolved, Public

Description

Excluding votewiki, commons, and wikidata, redirect 5% of all production traffic to mw-on-k8s.
All traffic increases up until now have been done without having to change our deployment; this one, however, will be a bigger jump.
Thinking in terms of replicas and worker saturation, 1% of the traffic is ~25% average worker saturation, so going from 10 to 35 replicas per deployment should keep us just under 40% worker saturation.
On the global wikikube cluster, going to 1% raised the CPU load spikes from 22% to 25%, so we should be OK, and we can still reduce traffic if needed while we add more kubelets.
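
The sizing estimate above can be checked with a quick back-of-the-envelope calculation. This is only a sketch under stated assumptions: saturation is taken to scale linearly with traffic share and inversely with replica count, the ~25% figure is taken to apply to the current 10 replicas, and the CPU projection extrapolates the observed 22% to 25% step linearly. The function names are hypothetical and not part of any deployment tooling.

```python
def projected_saturation(traffic_pct, replicas,
                         saturation_at_one_pct=25.0, base_replicas=10):
    """Average worker saturation (%) for a given traffic share and replica count.

    Assumption: saturation grows linearly with traffic and shrinks
    proportionally as replicas are added.
    """
    return saturation_at_one_pct * traffic_pct * base_replicas / replicas


def projected_cpu_spike(traffic_pct, baseline=22.0, at_one_pct=25.0):
    """Cluster-wide CPU spike (%) assuming each 1% of traffic adds the same load."""
    return baseline + (at_one_pct - baseline) * traffic_pct


# 5% of traffic on the current 10 replicas would oversaturate the workers.
print(projected_saturation(5, 10))            # 125.0
# 35 replicas bring the same traffic just under 40%.
print(round(projected_saturation(5, 35), 1))  # 35.7
# Naive linear extrapolation of the cluster-wide CPU spikes at 5%.
print(projected_cpu_spike(5))                 # 37.0
```

Real load is unlikely to be perfectly linear, so these numbers are a rough guide rather than a capacity guarantee.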

Estimated date: Week 30 (2023-07-24)

Event Timeline

Clement_Goubert created this task.
Clement_Goubert moved this task from Incoming 🐫 to this.quarter 🍕 on the serviceops board.

We'll first make the move to 2% of traffic, then ramp up from there during the week.

Change 940881 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Raise workers per pod by ~20%

https://gerrit.wikimedia.org/r/940881

Change 951131 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mw-on-k8s: Raise traffic to 2%

https://gerrit.wikimedia.org/r/951131

Pending more hardware, we will move on to 2% first.

Change 951131 merged by Clément Goubert:

[operations/puppet@production] mw-on-k8s: Raise traffic to 2%

https://gerrit.wikimedia.org/r/951131

Mentioned in SAL (#wikimedia-operations) [2023-08-22T09:11:18Z] <claime> Redirecting 2% of global traffic to mw-on-k8s - T341780

Change 954000 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-api-ext, mw-web: Raise total replicas to 13

https://gerrit.wikimedia.org/r/954000

Change 954002 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mw-on-k8s: Raise traffic to 4%

https://gerrit.wikimedia.org/r/954002

Change 954000 merged by jenkins-bot:

[operations/deployment-charts@master] mw-api-ext, mw-web: Raise total replicas to 13

https://gerrit.wikimedia.org/r/954000

Mentioned in SAL (#wikimedia-operations) [2023-09-01T08:40:04Z] <claime> Raised mw-web and mw-api-ext capacity by ~30% - T341780

Change 954002 merged by Clément Goubert:

[operations/puppet@production] mw-on-k8s: Raise traffic to 4%

https://gerrit.wikimedia.org/r/954002

Mentioned in SAL (#wikimedia-operations) [2023-09-01T09:02:24Z] <claime> Push 4% of global traffic to mw-on-k8s - T341780

Mentioned in SAL (#wikimedia-operations) [2023-09-01T09:04:07Z] <claime> Running puppet on 'A:cp-text and P{P:trafficserver::backend}' - T341780

Change 956388 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-api-ext, mw-web: Raise total replicas to 14

https://gerrit.wikimedia.org/r/956388

Change 956390 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mw-on-k8s: Raise traffic to 5%

https://gerrit.wikimedia.org/r/956390

Change 956388 merged by jenkins-bot:

[operations/deployment-charts@master] mw-api-ext, mw-web: Raise total replicas to 14

https://gerrit.wikimedia.org/r/956388

Mentioned in SAL (#wikimedia-operations) [2023-09-12T08:51:29Z] <claime> mw-api-ext, mw-web: Raise total replicas to 14 - T341780

Change 956390 merged by Clément Goubert:

[operations/puppet@production] mw-on-k8s: Raise traffic to 5%

https://gerrit.wikimedia.org/r/956390

Mentioned in SAL (#wikimedia-operations) [2023-09-12T08:58:04Z] <claime> Sending 5% of global traffic to mw-on-k8s - T341780

Mentioned in SAL (#wikimedia-operations) [2023-09-12T08:58:26Z] <claime> Running puppet on cp-text P:trafficserver::backend - T341780

Is this possibly related to toolforge.org being entirely unavailable right now? No matter what I try, everything fails with 502 or 503. Prominent example: https://admin.toolforge.org.

I doubt it. That's a completely separate infrastructure.

Totally unrelated: Toolforge doesn't use the text cluster of the CDN.

Change 956830 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-web: Raise apc size to 1536

https://gerrit.wikimedia.org/r/956830

Change 956830 merged by jenkins-bot:

[operations/deployment-charts@master] mw-web: Raise apc size to 1536

https://gerrit.wikimedia.org/r/956830

We are now serving 5% of global traffic from mw-on-k8s. Resolving.

Change 959769 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Revert "mw-on-k8s: Lower traffic to 3%"

https://gerrit.wikimedia.org/r/959769

Change 959769 merged by JMeybohm:

[operations/puppet@production] Revert "mw-on-k8s: Lower traffic to 3%"

https://gerrit.wikimedia.org/r/959769

Change 940881 abandoned by Clément Goubert:

[operations/deployment-charts@master] mediawiki: Raise workers per pod by ~20%

Reason:

Superseded

https://gerrit.wikimedia.org/r/940881