☂️ Northward Datacentre Switchover (March 2024)
Closed, Resolved (Public)

Description

This is an umbrella ☂️ task for the upcoming Northward Switchover.

As of September 2023, switchovers take place on predictable dates: the work week of the Solar Equinox.

Important Dates:

Day 1 issues:

  • Kartotherian started running out of resources, so we had to repool kartotherian on codfw and restart the service on both datacentres.
  • Thumbor was using swift.discovery.wmnet, so thumbor on codfw was attempting to access swift on eqiad using codfw's credentials, causing a large number of 401s.
  • mw-on-k8s started working harder than usual, which was expected since we turned off multi-DC. We added some more resources just to be on the safe side: 53 replicas to mw-web and 10 to mw-api-ext.
  • By unfortunate coincidence, changeprop was overwhelmed for unrelated reasons around the time of the services switchover, causing jobs to pile up.

Day 2 issues:

  • While stopping all maintenance scripts (01-stop-maintenance), we found a user-triggered script still running; we killed it manually and continued the process.

Day 3 issues:

  • We switched to deploy1002.eqiad.wmnet without any issues.

Event Timeline

What is the idea? Will codfw remain depooled for a week or two? For DBAs this would be good so we can perform some maintenance in codfw.

As documented in https://wikitech.wikimedia.org/wiki/Switch_Datacenter/Recurring,_Equinox-based,_Data_Center_Switchovers#Overview.

For the next 7 calendar days after the read-only phase of the Switchover, traffic will be flowing solely to one of the 2 data centers, effectively rendering the other data center inactive.
On the Wednesday following the read-only phase of the Switchover, that is right after exactly 7 days, traffic will start flowing, in the normal Multi-DC way, to both data centers.

This has been done for the September 2023 Switchover and will be done for this one too.
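The 7-day rule above can be sketched as simple date arithmetic. This is only an illustration: the function name is hypothetical, and the read-only date of 2024-03-20 is an assumption inferred from the SAL timestamps later in this task (the deployment lock on that day), not something stated in the comment itself.

```python
from datetime import date, timedelta

# Length of the single-DC period, per the comment above.
SINGLE_DC_DAYS = 7


def multi_dc_repool_date(read_only_day: date, single_dc_days: int = SINGLE_DC_DAYS) -> date:
    """Return the day on which traffic goes back to both data centers.

    Hypothetical helper: the repool day is the read-only day plus the
    length of the single-DC period.
    """
    return read_only_day + timedelta(days=single_dc_days)


# Assumption: the read-only phase took place on Wednesday 2024-03-20.
read_only_day = date(2024, 3, 20)
repool_day = multi_dc_repool_date(read_only_day)

# weekday() == 2 means Wednesday, so the repool lands on the same weekday.
print(repool_day.isoformat(), repool_day.weekday() == read_only_day.weekday())
```

Passing `single_dc_days=5` models the shorter window floated in the open item, under the same assumptions.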

We do have an open item to see whether it makes sense to shorten this from 7 days to 5 by repooling in Multi-DC for the duration of the weekend, but this is best handled in a different task and at the proper time.

I'd love it if it could be a bit longer than 7 days, as we could do lots of operational maintenance and save a bunch of time, but anyway, this can be discussed once the time is approaching.
Thanks Alex

jijiki renamed this task from Northward Datacentre Switchover (March 2024) to ☂️ Northward Datacentre Switchover (March 2024). Feb 22 2024, 3:04 PM
jijiki updated the task description.

Change 1009854 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/cookbooks@master] sre.switchdc.mediawiki: update descriptions

https://gerrit.wikimedia.org/r/1009854

FYI: I redid a dry run and live test for 01-stop-maintenance.py after https://gerrit.wikimedia.org/r/1008583 and it's good to go.

Change 1012645 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/dns@master] traffic: Completely depool codfw from user traffic (switchover #1)

https://gerrit.wikimedia.org/r/1012645

Change 1009854 merged by jenkins-bot:

[operations/cookbooks@master] sre.switchdc.mediawiki: update descriptions

https://gerrit.wikimedia.org/r/1009854

Change 1012645 merged by Effie Mouzeli:

[operations/dns@master] traffic: Completely depool codfw from user traffic (switchover #1)

https://gerrit.wikimedia.org/r/1012645

Mentioned in SAL (#wikimedia-operations) [2024-03-19T14:07:25Z] <effie> Completely depool codfw from user traffic - T357547

jiji@cumin1002 - Cookbook cookbooks.sre.discovery.datacenter depool all services in codfw: Northward DC Switchover, March 2024 - T357547 started.

Mentioned in SAL (#wikimedia-operations) [2024-03-19T14:16:55Z] <jiji@cumin1002> START - Cookbook sre.discovery.datacenter depool all services in codfw: Northward DC Switchover, March 2024 - T357547

Mentioned in SAL (#wikimedia-operations) [2024-03-19T14:22:12Z] <effie> depooling services from codfw - T357547

jiji@cumin1002 - Cookbook cookbooks.sre.discovery.datacenter depool all services in codfw: Northward DC Switchover, March 2024 - T357547 completed.

Mentioned in SAL (#wikimedia-operations) [2024-03-19T14:40:03Z] <jiji@cumin1002> END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) depool all services in codfw: Northward DC Switchover, March 2024 - T357547

Change 1012686 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-on-k8s: raise replicas for add ro traffic

https://gerrit.wikimedia.org/r/1012686

Mentioned in SAL (#wikimedia-operations) [2024-03-19T15:17:27Z] <claime> Raising mw-web and mw-api-ext replicas for additional read-only traffic - T357547

Change 1012686 merged by jenkins-bot:

[operations/deployment-charts@master] mw-on-k8s: raise replicas for add ro traffic

https://gerrit.wikimedia.org/r/1012686

Change 1012698 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-web: Bump replicas another 15%

https://gerrit.wikimedia.org/r/1012698

Change 1012698 merged by jenkins-bot:

[operations/deployment-charts@master] mw-web: Bump replicas another 15%

https://gerrit.wikimedia.org/r/1012698

We had to repool kartotherian in codfw as we had a CPU exhaustion event in eqiad right after the services switchover. Since some kartotherian endpoints create an amplification effect to kartotherian itself, we opted for restarting kartotherian in eqiad to fix that.

Change 1012706 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/alerts@master] mw-on-k8s: Lower idle %age for saturation alert

https://gerrit.wikimedia.org/r/1012706

Some tweaking of replica counts was needed on mw-on-k8s, which was expected, as this is the first switchover where more of the external traffic goes to it than to the bare-metal clusters.

Change 1012706 merged by jenkins-bot:

[operations/alerts@master] mw-on-k8s: Lower idle %age for saturation alert

https://gerrit.wikimedia.org/r/1012706

Change 1013005 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/dns@master] wmnet: Update DNS records for master dbs to eqiad (switchover #2)

https://gerrit.wikimedia.org/r/1013005

Noting here for future reference - we found that thumbor was incorrectly using the global discovery record for swift, which meant that codfw-thumbor was trying to talk to eqiad-swift after codfw-swift was depooled, resulting in a rise in TempAuth errors (and 401s):

[Graph: swift_tempauth_graph.png]

This was fixed by initially repooling codfw swift, and then via https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1013033
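The failure mode described above can be illustrated with a toy model. This is a sketch under stated assumptions, not the actual thumbor or swift configuration: only `swift.discovery.wmnet` comes from this task, while the DC-local hostname scheme, the resolver logic, and all function names are hypothetical. It also does not claim to reproduce what the linked patch changed.

```python
# Global discovery record named in this task.
GLOBAL_DISCOVERY = "swift.discovery.wmnet"
DATACENTERS = ("codfw", "eqiad")


def local_record(dc: str) -> str:
    """Hypothetical DC-local hostname; the real naming scheme is not given in the task."""
    return f"swift.{dc}.example.invalid"


def resolve_dc(hostname: str, client_dc: str, pooled: set[str]) -> str:
    """Toy resolver: return the data center whose swift would answer."""
    if hostname == GLOBAL_DISCOVERY:
        if not pooled:
            raise ValueError("no data center is pooled")
        # Assumption: the global record follows pool state, preferring the
        # client's own DC and falling back to any other pooled DC.
        return client_dc if client_dc in pooled else sorted(pooled)[0]
    for dc in DATACENTERS:
        if hostname == local_record(dc):
            return dc  # a DC-local record is pinned regardless of pool state
    raise ValueError(f"unknown hostname: {hostname}")


def auth_status(client_dc: str, hostname: str, pooled: set[str]) -> int:
    """Each client holds credentials for its own DC only (assumption from the task)."""
    serving_dc = resolve_dc(hostname, client_dc, pooled)
    return 200 if serving_dc == client_dc else 401


# codfw swift depooled: the global record sends codfw thumbor to eqiad swift.
print(auth_status("codfw", GLOBAL_DISCOVERY, {"eqiad"}))
# A DC-local record keeps codfw thumbor talking to codfw swift.
print(auth_status("codfw", local_record("codfw"), {"eqiad"}))
```

In this model, repooling codfw swift (adding `"codfw"` back to `pooled`) also makes the global record resolve locally again, which matches the initial mitigation mentioned above.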

Mentioned in SAL (#wikimedia-operations) [2024-03-20T13:48:50Z] <jiji@deploy2002> Locking from deployment [ALL REPOSITORIES]: Datacenter Switchover - T357547

Change 1013005 merged by Effie Mouzeli:

[operations/dns@master] DBs: Update DNS records for master DBs to eqiad (switchover #2)

https://gerrit.wikimedia.org/r/1013005

Change 1013064 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/dns@master] maintenance: Update DNS records for maintenance host (switchover #3)

https://gerrit.wikimedia.org/r/1013064

Change 1013064 merged by Effie Mouzeli:

[operations/dns@master] maintenance: Update DNS records for maintenance host (switchover #3)

https://gerrit.wikimedia.org/r/1013064

Mentioned in SAL (#wikimedia-operations) [2024-03-20T14:50:18Z] <jiji@deploy2002> Unlocked for deployment [ALL REPOSITORIES]: Datacenter Switchover - T357547 (duration: 61m 28s)

Change 1013070 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/dns@master] geo-maps: make eqiad the default datacentre (switchover #4)

https://gerrit.wikimedia.org/r/1013070

Change 1013070 merged by Effie Mouzeli:

[operations/dns@master] geo-maps: make eqiad the default datacentre (switchover #4)

https://gerrit.wikimedia.org/r/1013070

Change 1013083 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/mediawiki-config@master] debug.json: List primary DC servers first (switchover #5)

https://gerrit.wikimedia.org/r/1013083

jijiki updated the task description.
jijiki updated the task description.

Change 1013083 merged by jenkins-bot:

[operations/mediawiki-config@master] debug.json: List primary DC servers first (switchover #5)

https://gerrit.wikimedia.org/r/1013083

Mentioned in SAL (#wikimedia-operations) [2024-03-20T15:58:14Z] <jiji@deploy2002> Started scap: Backport for [[gerrit:1013083|debug.json: List primary DC servers first (switchover #5) (T357547)]]

Mentioned in SAL (#wikimedia-operations) [2024-03-20T16:03:34Z] <jiji@deploy2002> jiji: Backport for [[gerrit:1013083|debug.json: List primary DC servers first (switchover #5) (T357547)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-03-20T16:22:41Z] <jiji@deploy2002> Finished scap: Backport for [[gerrit:1013083|debug.json: List primary DC servers first (switchover #5) (T357547)]] (duration: 24m 27s)

Change 1013272 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/dns@master] deployment: update deployment DNS record to deploy1002 (switchover #6)

https://gerrit.wikimedia.org/r/1013272

Change 1013274 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] hieradata: update deployment_server to deploy1002 (switchover #7)

https://gerrit.wikimedia.org/r/1013274

Change 1013272 merged by Effie Mouzeli:

[operations/dns@master] deployment: update deployment DNS record to deploy1002 (switchover #6)

https://gerrit.wikimedia.org/r/1013272

Change 1013274 merged by Effie Mouzeli:

[operations/puppet@production] hieradata: update deployment_server to deploy1002 (switchover #7)

https://gerrit.wikimedia.org/r/1013274

jijiki updated the task description.

Change #1015037 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/dns@master] traffic: Pool codfw for user traffic (switchover #8)

https://gerrit.wikimedia.org/r/1015037

Mentioned in SAL (#wikimedia-operations) [2024-03-27T14:19:08Z] <effie> Day 8: Pool active/active services on codfw - T357547

jiji@cumin1002 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in codfw: Pool active/active services on codfw - T357547 started.

Mentioned in SAL (#wikimedia-operations) [2024-03-27T14:21:28Z] <jiji@cumin1002> START - Cookbook sre.discovery.datacenter pool all active/active services in codfw: Pool active/active services on codfw - T357547

jiji@cumin1002 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in codfw: Pool active/active services on codfw - T357547 completed.

Mentioned in SAL (#wikimedia-operations) [2024-03-27T14:40:59Z] <jiji@cumin1002> END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) pool all active/active services in codfw: Pool active/active services on codfw - T357547

Mentioned in SAL (#wikimedia-operations) [2024-03-27T14:45:32Z] <effie> Day 8: Pool codfw for user traffic - T357547

Change #1015037 merged by Effie Mouzeli:

[operations/dns@master] traffic: Pool codfw for user traffic (switchover #8)

https://gerrit.wikimedia.org/r/1015037

jijiki claimed this task.

The switchover is done: it is Day 8, and we are back to Multi-DC. Thank you serviceops and @akosiaris for being good teammates and keeping an eye on things.