
🚀 Southward Datacenter Switchover (Sept. 2025)
Closed, Resolved (Public)

Description

This is an umbrella task for the upcoming Sept 2025 Southward (eqiad to codfw) DC Switchover.

As of Sept 2023, switchovers take place on predictable dates: the work week of the solar equinox.

Important Dates:

For 7 calendar days after the read-only phase of the Switchover, traffic will flow solely to one of the two datacenters (in this case codfw), effectively rendering the other DC (eqiad) inactive. On the Thursday following the read-only phase, after exactly 7 days, traffic will resume flowing to both datacenters in the normal Multi-DC way.

If you have any issues or related work, please file your tasks below.

Event Timeline


Change #1187544 had a related patch set uploaded (by Jasmine; author: Jasmine):

[operations/cookbooks@master] switchdc: call delete_collection_namespaced_cron_job if available

https://gerrit.wikimedia.org/r/1187544

Change #1187544 merged by jenkins-bot:

[operations/cookbooks@master] switchdc: call delete_collection_namespaced_cron_job if available

https://gerrit.wikimedia.org/r/1187544

jasmine@cumin1003 - Cookbook cookbooks.sre.switchdc.mediawiki.00-downtime-db-readonly-checks for datacenter switchover from codfw to eqiad - finished with status: SUCCESS elapsed time: 0:00:13.715927

Mentioned in SAL (#wikimedia-operations) [2025-09-18T16:16:02Z] <jasmine@deploy1003> Locking from deployment [ALL REPOSITORIES]: Datacenter Switchover - T399891

jasmine@cumin1003 - Cookbook cookbooks.sre.switchdc.mediawiki.00-reduce-ttl for datacenter switchover from codfw to eqiad - finished with status: SUCCESS elapsed time: 0:05:41.797885

jasmine@cumin1003 - Cookbook cookbooks.sre.switchdc.mediawiki.01-stop-maintenance for datacenter switchover from codfw to eqiad - finished with status: SUCCESS elapsed time: 0:00:00.291898

jasmine@cumin1003 - Cookbook cookbooks.sre.switchdc.mediawiki.02-set-readonly for datacenter switchover from codfw to eqiad - [DRY-RUN] MediaWiki read-only period starts at: 2025-09-18 16:19:18.465479

jasmine@cumin1003 - Cookbook cookbooks.sre.switchdc.mediawiki.02-set-readonly for datacenter switchover from codfw to eqiad - finished with status: SUCCESS elapsed time: 0:00:17.956647

jasmine@cumin1003 - Cookbook cookbooks.sre.switchdc.mediawiki.03-set-db-readonly for datacenter switchover from codfw to eqiad - finished with status: SUCCESS elapsed time: 0:00:48.444685

jasmine@cumin1003 - Cookbook cookbooks.sre.switchdc.mediawiki.04-switch-mediawiki for datacenter switchover from codfw to eqiad - finished with status: SUCCESS elapsed time: 0:00:13.533196

jasmine@cumin1003 - Cookbook cookbooks.sre.switchdc.mediawiki.06-set-db-readwrite for datacenter switchover from codfw to eqiad - finished with status: SUCCESS elapsed time: 0:00:03.341663

jasmine@cumin1003 - Cookbook cookbooks.sre.switchdc.mediawiki.07-set-readwrite for datacenter switchover from codfw to eqiad - [DRY-RUN] MediaWiki read-only period ends at: 2025-09-18 16:21:21.591133

jasmine@cumin1003 - Cookbook cookbooks.sre.switchdc.mediawiki.07-set-readwrite for datacenter switchover from codfw to eqiad - finished with status: SUCCESS elapsed time: 0:00:05.392814

jasmine@cumin1003 - Cookbook cookbooks.sre.switchdc.mediawiki.08-restart-mw-jobrunner for datacenter switchover from codfw to eqiad - finished with status: SUCCESS elapsed time: 0:00:24.615379

jasmine@cumin1003 - Cookbook cookbooks.sre.switchdc.mediawiki.08-start-maintenance for datacenter switchover from codfw to eqiad - finished with status: SUCCESS elapsed time: 0:00:12.101005

jasmine@cumin1003 - Cookbook cookbooks.sre.switchdc.mediawiki.09-restore-ttl for datacenter switchover from codfw to eqiad - finished with status: SUCCESS elapsed time: 0:00:30.382926

Mentioned in SAL (#wikimedia-operations) [2025-09-18T16:32:29Z] <jasmine@deploy1003> Unlocked for deployment [ALL REPOSITORIES]: Datacenter Switchover - T399891 (duration: 16m 26s)

jasmine@cumin1003 - Cookbook cookbooks.sre.switchdc.mediawiki.09-run-puppet-on-db-masters for datacenter switchover from codfw to eqiad - finished with status: SUCCESS elapsed time: 0:12:11.860413

Change #1189587 had a related patch set uploaded (by Jasmine; author: Jasmine):

[operations/dns@master] wmnet: update CNAME records for DB masters to codfw

https://gerrit.wikimedia.org/r/1189587

Change #1189598 had a related patch set uploaded (by Jasmine; author: Jasmine):

[operations/dns@master] geo-maps: update map default to list codfw first

https://gerrit.wikimedia.org/r/1189598

Change #1189939 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/dns@master] wmnet: update CNAME records for DB masters for dc switchover

https://gerrit.wikimedia.org/r/1189939

Change #1190293 had a related patch set uploaded (by Jasmine; author: Jasmine):

[operations/mediawiki-config@master] debug.json: order codfw (primary) DC backends first

https://gerrit.wikimedia.org/r/1190293

Change #1190298 had a related patch set uploaded (by Jasmine; author: Jasmine):

[operations/dns@master] wmnet: update deployment CNAME record to deploy2002

https://gerrit.wikimedia.org/r/1190298

Change #1190300 had a related patch set uploaded (by Jasmine; author: Jasmine):

[operations/puppet@production] hieradata: update deployment_server to deploy2002

https://gerrit.wikimedia.org/r/1190300

Just a note that the following discovery records were created after the last switchover and as such will be part of the global process for the first time:

search.discovery.wmnet
search-chi.discovery.wmnet
search-psi.discovery.wmnet

They will cause a latency penalty for MediaWiki search traffic coming from eqiad until all MediaWiki traffic is switched over to codfw.

Thanks to @dcausse for the heads up.

Change #1190697 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] mw-web: upsize for single-DC serving

https://gerrit.wikimedia.org/r/1190697

Change #1190697 merged by jenkins-bot:

[operations/deployment-charts@master] mw-web: upsize for single-DC serving

https://gerrit.wikimedia.org/r/1190697

Mentioned in SAL (#wikimedia-operations) [2025-09-23T15:22:59Z] <swfrench-wmf> upsizing mw-web in advance of services switchover - T399891

Mentioned in SAL (#wikimedia-operations) [2025-09-23T15:30:28Z] <jasmine@cumin1003> START - Cookbook sre.dns.admin DNS admin: depool site eqiad [reason: Moving traffic to codfw, Southward DC Switchover Day 1, T399891]

Mentioned in SAL (#wikimedia-operations) [2025-09-23T15:30:51Z] <jasmine@cumin1003> END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site eqiad [reason: Moving traffic to codfw, Southward DC Switchover Day 1, T399891]

jasmine@cumin1003 - Cookbook cookbooks.sre.discovery.datacenter depool all services in eqiad: Moving services to codfw, Southward DC Switchover Day 1 - T399891 started.

Mentioned in SAL (#wikimedia-operations) [2025-09-23T15:52:22Z] <jasmine@cumin1003> START - Cookbook sre.discovery.datacenter depool all services in eqiad: Moving services to codfw, Southward DC Switchover Day 1 - T399891

jasmine@cumin1003 - Cookbook cookbooks.sre.discovery.datacenter depool all services in eqiad: Moving services to codfw, Southward DC Switchover Day 1 - T399891 failed.

Mentioned in SAL (#wikimedia-operations) [2025-09-23T16:03:29Z] <jasmine@cumin1003> END (ERROR) - Cookbook sre.discovery.datacenter (exit_code=93) depool all services in eqiad: Moving services to codfw, Southward DC Switchover Day 1 - T399891

jasmine@cumin1003 - Cookbook cookbooks.sre.discovery.datacenter depool all services in eqiad: Moving services to codfw, Southward DC Switchover Day 1 - T399891 started.

Mentioned in SAL (#wikimedia-operations) [2025-09-23T16:03:46Z] <jasmine@cumin1003> START - Cookbook sre.discovery.datacenter depool all services in eqiad: Moving services to codfw, Southward DC Switchover Day 1 - T399891

Mentioned in SAL (#wikimedia-operations) [2025-09-23T16:25:41Z] <bking@cumin1002> START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch2093.codfw.wmnet for thread pool rejections - bking@cumin1002 - T399891

Mentioned in SAL (#wikimedia-operations) [2025-09-23T16:25:47Z] <bking@cumin1002> END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch2093.codfw.wmnet for thread pool rejections - bking@cumin1002 - T399891

jasmine@cumin1003 - Cookbook cookbooks.sre.discovery.datacenter depool all services in eqiad: Moving services to codfw, Southward DC Switchover Day 1 - T399891 completed.

Mentioned in SAL (#wikimedia-operations) [2025-09-23T16:31:38Z] <jasmine@cumin1003> END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) depool all services in eqiad: Moving services to codfw, Southward DC Switchover Day 1 - T399891

Mentioned in SAL (#wikimedia-operations) [2025-09-24T14:35:40Z] <jasmine@deploy1003> Locking from deployment [ALL REPOSITORIES]: Datacenter Switchover - T399891

jasmine@cumin1003 - Cookbook cookbooks.sre.switchdc.mediawiki.00-downtime-db-readonly-checks for datacenter switchover from eqiad to codfw - finished with status: SUCCESS elapsed time: 0:00:13.941469

jasmine@cumin1003 - Cookbook cookbooks.sre.switchdc.mediawiki.00-reduce-ttl for datacenter switchover from eqiad to codfw - finished with status: SUCCESS elapsed time: 0:05:33.392529

jasmine@cumin1003 - Cookbook cookbooks.sre.switchdc.mediawiki.01-stop-maintenance for datacenter switchover from eqiad to codfw - finished with status: SUCCESS elapsed time: 0:01:51.367197

jasmine@cumin1003 - Cookbook cookbooks.sre.switchdc.mediawiki.02-set-readonly for datacenter switchover from eqiad to codfw - MediaWiki read-only period starts at: 2025-09-24 15:02:35.395589

jasmine@cumin1003 - Cookbook cookbooks.sre.switchdc.mediawiki.02-set-readonly for datacenter switchover from eqiad to codfw - finished with status: SUCCESS elapsed time: 0:00:27.827057

jasmine@cumin1003 - Cookbook cookbooks.sre.switchdc.mediawiki.03-set-db-readonly for datacenter switchover from eqiad to codfw - finished with status: SUCCESS elapsed time: 0:00:49.198461

jasmine@cumin1003 - Cookbook cookbooks.sre.switchdc.mediawiki.04-switch-mediawiki for datacenter switchover from eqiad to codfw - finished with status: SUCCESS elapsed time: 0:00:39.797481

jasmine@cumin1003 - Cookbook cookbooks.sre.switchdc.mediawiki.06-set-db-readwrite for datacenter switchover from eqiad to codfw - finished with status: SUCCESS elapsed time: 0:00:03.841884

jasmine@cumin1003 - Cookbook cookbooks.sre.switchdc.mediawiki.07-set-readwrite for datacenter switchover from eqiad to codfw - MediaWiki read-only period ends at: 2025-09-24 15:05:16.845948

jasmine@cumin1003 - Cookbook cookbooks.sre.switchdc.mediawiki.07-set-readwrite for datacenter switchover from eqiad to codfw - finished with status: SUCCESS elapsed time: 0:00:21.097168

jasmine@cumin1003 - Cookbook cookbooks.sre.switchdc.mediawiki.08-restart-mw-jobrunner for datacenter switchover from eqiad to codfw - finished with status: SUCCESS elapsed time: 0:00:29.537489

jasmine@cumin1003 - Cookbook cookbooks.sre.switchdc.mediawiki.08-start-maintenance for datacenter switchover from eqiad to codfw - finished with status: SUCCESS elapsed time: 0:02:07.927202

jasmine@cumin1003 - Cookbook cookbooks.sre.switchdc.mediawiki.09-restore-ttl for datacenter switchover from eqiad to codfw - finished with status: SUCCESS elapsed time: 0:00:33.261021
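For reference, the length of the read-only window can be derived from the start and end timestamps logged by the 02-set-readonly and 07-set-readwrite steps above. This is a minimal sketch: the helper name `read_only_window` is hypothetical and not part of the cookbooks, and the timestamps are copied verbatim from this timeline (the 2025-09-18 pair comes from the dry run).

```python
from datetime import datetime

# Timestamp layout used in the cookbook log lines on this task.
FMT = "%Y-%m-%d %H:%M:%S.%f"


def read_only_window(start: str, end: str) -> float:
    """Return the length of a read-only window in seconds (hypothetical helper)."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


# Dry run on 2025-09-18 (codfw to eqiad rehearsal).
dry_run = read_only_window("2025-09-18 16:19:18.465479", "2025-09-18 16:21:21.591133")
# Live switchover on 2025-09-24 (eqiad to codfw).
live = read_only_window("2025-09-24 15:02:35.395589", "2025-09-24 15:05:16.845948")

print(f"dry run read-only window: {dry_run:.1f} s")
print(f"live read-only window:    {live:.1f} s")
```

By this calculation the live read-only window was roughly 2 minutes 41 seconds, and the dry-run window roughly 2 minutes 3 seconds.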

Change #1189587 merged by Jasmine:

[operations/dns@master] wmnet: update CNAME records for DB masters to codfw

https://gerrit.wikimedia.org/r/1189587

jasmine@cumin1003 - Cookbook cookbooks.sre.switchdc.mediawiki.09-run-puppet-on-db-masters for datacenter switchover from eqiad to codfw - finished with status: SUCCESS elapsed time: 0:11:56.776712

Change #1189598 merged by Jasmine:

[operations/dns@master] geo-maps: update map default to list codfw first

https://gerrit.wikimedia.org/r/1189598

Change #1190293 merged by jenkins-bot:

[operations/mediawiki-config@master] debug.json: order codfw (primary) DC backends first

https://gerrit.wikimedia.org/r/1190293

Mentioned in SAL (#wikimedia-operations) [2025-09-24T15:40:22Z] <jasmine@deploy1003> Unlocked for deployment [ALL REPOSITORIES]: Datacenter Switchover - T399891 (duration: 64m 41s)

Mentioned in SAL (#wikimedia-operations) [2025-09-24T15:41:19Z] <jasmine@deploy1003> Started scap sync-world: Backport for [[gerrit:1190293|debug.json: order codfw (primary) DC backends first (T399891)]]

Mentioned in SAL (#wikimedia-operations) [2025-09-24T15:47:51Z] <jasmine@deploy1003> jasmine: Backport for [[gerrit:1190293|debug.json: order codfw (primary) DC backends first (T399891)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-09-24T15:55:09Z] <jasmine@deploy1003> Finished scap sync-world: Backport for [[gerrit:1190293|debug.json: order codfw (primary) DC backends first (T399891)]] (duration: 13m 49s)

Icinga downtime and Alertmanager silence (ID=911d3db0-aa54-470b-b74f-5229af925589) set by jasmine@cumin1003 for 2:00:00 on 2 host(s) and their services with reason: Deployment server switchover

releases2003.codfw.wmnet,releases1003.eqiad.wmnet

Change #1190298 merged by Jasmine:

[operations/dns@master] wmnet: update deployment CNAME record to deploy2002

https://gerrit.wikimedia.org/r/1190298

Change #1190300 merged by Jasmine:

[operations/puppet@production] hieradata: update deployment_server to deploy2002

https://gerrit.wikimedia.org/r/1190300

Mentioned in SAL (#wikimedia-operations) [2025-09-25T16:52:55Z] <jasmine@deploy2002> Started scap sync-world: Test deployment to validate deployment server switchover - T399891.

Mentioned in SAL (#wikimedia-operations) [2025-09-25T17:32:23Z] <jasmine@deploy2002> Finished scap sync-world: Test deployment to validate deployment server switchover - T399891. (duration: 39m 28s)

There is a small issue with releases2003 trying to rsync /srv/patches from the active deployment server: it is still contacting deploy1003 instead of deploy2002, and since the rsyncd there no longer has the config (rsync modules) for it, the sync now fails.

This leads to alerts about a failed systemd unit, which in turn auto-creates a ticket: T405646.

Looking into that.

relevant code: class profile::releases::mediawiki::security

Edit: Puppet was disabled on releases2003 for debugging an unrelated issue (T405352). Because it was disabled during the deployment server switchover, the host never picked up the change of deployment server, which caused the failures above.

I just re-enabled Puppet on that host, and it replaced deploy1003 with deploy2002 in a couple of places as it should, so the host will sync security patches from the correct server again.

Mentioned in SAL (#wikimedia-operations) [2025-10-02T16:42:34Z] <jasmine@cumin1003> START - Cookbook sre.dns.admin DNS admin: pool site eqiad [reason: Repool Eqiad following DC switchover (T399891), T399891]

Mentioned in SAL (#wikimedia-operations) [2025-10-02T16:42:51Z] <jasmine@cumin1003> END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site eqiad [reason: Repool Eqiad following DC switchover (T399891), T399891]

jasmine@cumin1003 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: Repool services in Eqiad following DC switchover (T399891) - T399891 started.

Mentioned in SAL (#wikimedia-operations) [2025-10-02T17:03:23Z] <jasmine@cumin1003> START - Cookbook sre.discovery.datacenter pool all active/active services in eqiad: Repool services in Eqiad following DC switchover (T399891) - T399891

jasmine@cumin1003 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: Repool services in Eqiad following DC switchover (T399891) - T399891 completed.

Mentioned in SAL (#wikimedia-operations) [2025-10-02T17:25:44Z] <jasmine@cumin1003> END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) pool all active/active services in eqiad: Repool services in Eqiad following DC switchover (T399891) - T399891

Following up from discussion in IRC: shortly after MediaWiki (mw-api-ext, mw-web) RO traffic returned to eqiad around 17:10 UTC today, there was a ~5-minute period where circuit-breaking for es6 (cluster30) and es7 (cluster31) kicked in (mediawiki-errors logstash). This was accompanied by a spike in HTTP 500 responses with an aggregate peak of ~240 rps (ATS view), the vast majority of which were served by mw-web.

This resolved without intervention, and is likely the result of various caches being cold upon the return of traffic to eqiad, leading to an influx of connections to replicas in these sections. Spot-checking connection counts on a handful of es6 replicas suggests we did not quite reach the trigger threshold (500), but of course sparsely sampled metrics only tell part of the story.

In any case, this is an interesting case study in how circuit breaking, as implemented, responds to what is effectively a step function in load.
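To illustrate the point about sparse sampling, here is a toy model, not the actual MediaWiki circuit-breaking logic. It assumes the breaker trips whenever a replica's connection count reaches the threshold of 500 mentioned above. The baseline, burst size, burst timing, and 60-second sampling interval are all invented for illustration.

```python
THRESHOLD = 500  # trigger threshold mentioned in the comment above


def connection_series(duration_s=300, baseline=120,
                      burst_start=100, burst_len=20, burst_peak=650):
    """Hypothetical per-second connection counts: flat baseline plus one short burst."""
    return [
        burst_peak if burst_start <= t < burst_start + burst_len else baseline
        for t in range(duration_s)
    ]


def breaker_open_seconds(series, threshold=THRESHOLD):
    """Count the seconds in which the (assumed) breaker condition holds."""
    return sum(1 for count in series if count >= threshold)


def sampled_max(series, interval_s=60):
    """Highest value seen when the series is only sampled every interval_s seconds."""
    return max(series[::interval_s])


series = connection_series()
print("true peak:            ", max(series))
print("seconds over threshold:", breaker_open_seconds(series))
print("peak seen by sampling: ", sampled_max(series))
```

In this made-up series the burst falls entirely between two samples, so the sampled peak stays at the baseline even though the threshold is exceeded for 20 seconds. That is one way a breaker could trip without the dashboards showing connection counts at the threshold.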