
Northward Datacenter Switchover (March 2026; codfw to eqiad)
Closed, Resolved (Public)

Description

This is an umbrella task for the upcoming March 2026 Northward (codfw to eqiad) DC Switchover.

As of September 2023, switchovers take place on predictable dates: the work week of the solar equinox.

Important Dates:

For the 7 calendar days following the read-only phase of the Switchover, traffic will flow solely to one of the 2 datacenters (in this case eqiad), effectively rendering the other DC (codfw) inactive. On the Thursday following the read-only phase of the Switchover, after exactly 7 days, traffic will start flowing to both datacenters again, in the normal Multi-DC way.

If you have any issues or related work, please file your tasks below.

Event Timeline

Blake updated the task description.
Aklapper renamed this task from Northward Datacenter Switchover (Mar. 2026) to Northward Datacenter Switchover (March 2026; codfw to eqiad). Jan 8 2026, 10:03 AM
Blake triaged this task as Medium priority. Jan 15 2026, 9:42 AM

Change #1244621 had a related patch set uploaded (by Blake; author: Blake):

[operations/dns@master] geo-maps: update map default to list eqiad first

https://gerrit.wikimedia.org/r/1244621

Change #1244628 had a related patch set uploaded (by Blake; author: Blake):

[operations/mediawiki-config@master] debug: reorder debug backends for eqiad switchover

https://gerrit.wikimedia.org/r/1244628

Change #1251045 had a related patch set uploaded (by Blake; author: Blake):

[operations/deployment-charts@master] mw-web: upsize for single-DC serving

https://gerrit.wikimedia.org/r/1251045

Change #1251045 merged by jenkins-bot:

[operations/deployment-charts@master] mw-web: upsize for single-DC serving

https://gerrit.wikimedia.org/r/1251045

Pooled status of services pre-switchover:

Service                       Type           eqiad     codfw                                                                 
=================================================================                                                            
apertium                      Active/Active  pooled    pooled                                                                
api-gateway                   Active/Active  pooled    pooled                                                                
apt                           Active/Passive pooled                                                                          
apus                          Active/Active  pooled    pooled                                                                
citoid                        Active/Active  pooled    pooled                                                                
config-master                 Active/Active  pooled    pooled                                                                
cxserver                      Active/Active  pooled    pooled                                                                
device-analytics              Active/Active  pooled    pooled                                                                
docker-registry               Active/Passive           pooled                                                                
echostore                     Active/Active  pooled    pooled                                                                
eventgate-analytics           Active/Active  pooled    pooled                                                                
eventgate-analytics-external  Active/Active  pooled    pooled                                                                
eventgate-logging-external    Active/Active  pooled    pooled                                                                
eventgate-main                Active/Active  pooled    pooled                                                                
eventstreams                  Active/Active  pooled    pooled                                                                
eventstreams-internal         Active/Active  pooled    pooled                                                                
helm-charts                   Active/Active  pooled    pooled                                                                
inference                     Active/Active  pooled    pooled                                                                
k8s-ingress-aux-ro            Active/Active  pooled    pooled                                                                
k8s-ingress-aux-rw            Active/Passive pooled                                                                          
k8s-ingress-dse               Active/Passive pooled                                                                          
k8s-ingress-dse-aa            Active/Active  pooled    pooled                                                                
k8s-ingress-ml-serve          Active/Active  pooled    pooled                                                                
k8s-ingress-ml-staging        Active/Passive           pooled                                                                
k8s-ingress-staging           Active/Passive pooled                                                                          
k8s-ingress-wikikube-ro       Active/Active  pooled    pooled
k8s-ingress-wikikube-rw       Active/Passive           pooled                                                        
kartotherian                  Active/Active  pooled    pooled                                                                
linkrecommendation            Active/Active  pooled    pooled                                                                
logstash                      Active/Passive pooled                                                                          
mathoid                       Active/Active  pooled    pooled                                                                
mobileapps                    Active/Active  pooled    pooled                                                                
mw-api-ext                    Active/Passive           pooled                                                                
mw-api-ext-next               Active/Passive           pooled                                                                
mw-api-ext-next-ro            Active/Active  pooled    pooled                                                                
mw-api-ext-ro                 Active/Active  pooled    pooled                                                                
mw-api-int                    Active/Passive           pooled                                                                
mw-api-int-ro                 Active/Active  pooled    pooled                                                                
mw-jobrunner                  Active/Passive           pooled                                                                
mw-parsoid                    Active/Passive           pooled                                                                
mw-web                        Active/Passive           pooled                                                                
mw-web-next                   Active/Passive           pooled                                                                
mw-web-next-ro                Active/Active  pooled    pooled                                                                
mw-web-ro                     Active/Active  pooled    pooled                                                                
mwdebug                       Active/Active            pooled                                                                
mwdebug-next                  Active/Active            pooled                                                                
netbox                        Active/Passive pooled                                                                          
pki                           Active/Active  pooled    pooled                                                                
proton                        Active/Active  pooled    pooled                                                                
proxoid                       Active/Active  pooled    pooled                                                                
puppetboard                   Active/Active  pooled    pooled
puppetdb-api                  Active/Active  pooled    pooled    
push-notifications            Active/Active  pooled    pooled    
recommendation-api            Active/Active  pooled    pooled    
releases                      Active/Active  pooled    pooled    
rest-gateway                  Active/Passive           pooled    
rest-gateway-ro               Active/Active  pooled    pooled    
restbase                      Active/Active  pooled    pooled    
restbase-async                Active/Active  pooled    pooled    
schema                        Active/Active  pooled    pooled    
search                        Active/Active  pooled    pooled    
search-omega                  Active/Active  pooled    pooled    
search-psi                    Active/Active  pooled    pooled    
sessionstore                  Active/Active  pooled    pooled    
shellbox                      Active/Active  pooled    pooled    
shellbox-constraints          Active/Active  pooled    pooled    
shellbox-media                Active/Active  pooled    pooled    
shellbox-syntaxhighlight      Active/Active  pooled    pooled    
shellbox-timeline             Active/Active  pooled    pooled    
shellbox-video                Active/Active  pooled    pooled    
swift                         Active/Active  pooled    pooled    
tegola-vector-tiles           Active/Active  pooled    pooled    
termbox                       Active/Active  pooled    pooled    
thanos-query                  Active/Active  pooled    pooled    
thanos-swift                  Active/Active  pooled    pooled    
thanos-web                    Active/Active  pooled    pooled    
toolhub                       Active/Passive pooled              
wcqs                          Active/Active  pooled    pooled    
wdqs-internal-main            Active/Active  pooled    pooled    
wdqs-internal-scholarly       Active/Active  pooled    pooled    
wdqs-main                     Active/Active  pooled    pooled    
wdqs-scholarly                Active/Active  pooled    pooled    
wikifeeds                     Active/Active  pooled    pooled    
zotero                        Active/Active  pooled    pooled

k8s-ingress-wikikube-rw and rest-gateway are not excluded from the switchover in hieradata, but are marked active/passive. This will result in a day of cross-DC calls for services behind the ingress and the rest-gateway, but that's known and acceptable.

I had a brief worried moment where it looked like the mediawiki services were not excluded, but they're excluded separately from the exclude_from_switchover hieradata property.
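For illustration, here is a minimal sketch of how a pasted copy of the pre-switchover table above could be scanned for Active/Passive services that are pooled only in codfw. The function names (`parse_pooled_table`, `codfw_only_passive`) and the sample subset of rows are assumptions made for this example; this is not part of the cookbooks or any existing tooling.

```python
import re

# Sample subset of the pasted table (rows copied from the listing above).
TABLE = """\
Service                       Type           eqiad     codfw
=================================================================
apt                           Active/Passive pooled
docker-registry               Active/Passive           pooled
k8s-ingress-wikikube-rw       Active/Passive           pooled
mw-web                        Active/Passive           pooled
mw-web-ro                     Active/Active  pooled    pooled
rest-gateway                  Active/Passive           pooled
"""


def parse_pooled_table(text):
    """Parse the fixed-width table into a list of dicts, one per service."""
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.startswith("=")]
    header = lines[0]
    # Column start offsets are taken from the header line.
    cols = {dc: header.index(dc) for dc in ("eqiad", "codfw")}
    rows = []
    for ln in lines[1:]:
        name, kind = ln.split()[:2]
        pooled = {dc: False for dc in cols}
        # Assign each "pooled" token to the column whose header offset is
        # closest, so slightly ragged whitespace does not break the parse.
        for m in re.finditer(r"\bpooled\b", ln):
            nearest = min(cols, key=lambda dc: abs(cols[dc] - m.start()))
            pooled[nearest] = True
        rows.append({"service": name, "type": kind, **pooled})
    return rows


def codfw_only_passive(rows):
    """Active/Passive services currently pooled in codfw but not in eqiad."""
    return [r["service"] for r in rows
            if r["type"] == "Active/Passive" and r["codfw"] and not r["eqiad"]]


if __name__ == "__main__":
    for service in codfw_only_passive(parse_pooled_table(TABLE)):
        print(service)
```

Comparing that list against the hieradata exclusions is a manual step here; the sketch only extracts the candidates from the table.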

blake@cumin1003 - Cookbook cookbooks.sre.discovery.datacenter depool all services in codfw: Datacenter Switchover - T413974 started.

Mentioned in SAL (#wikimedia-operations) [2026-03-24T15:20:16Z] <blake@cumin1003> START - Cookbook sre.discovery.datacenter depool all services in codfw: Datacenter Switchover - T413974

blake@cumin1003 - Cookbook cookbooks.sre.discovery.datacenter depool all services in codfw: Datacenter Switchover - T413974 completed.

Mentioned in SAL (#wikimedia-operations) [2026-03-24T15:46:33Z] <blake@cumin1003> END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) depool all services in codfw: Datacenter Switchover - T413974

I haven't noticed the read-only time. Was it shorter than usual?

The read-only time has not yet started - it's targeted for 15:00 UTC today.

The read-only time is over and the switchover has been completed successfully. Thank you!

Change #1244621 merged by Blake:

[operations/dns@master] geo-maps: update map default to list eqiad first

https://gerrit.wikimedia.org/r/1244621

Change #1244628 merged by jenkins-bot:

[operations/mediawiki-config@master] debug: reorder debug backends for eqiad switchover

https://gerrit.wikimedia.org/r/1244628

Mentioned in SAL (#wikimedia-operations) [2026-03-25T15:34:39Z] <blake@deploy2002> Started scap sync-world: Backport for [[gerrit:1244628|debug: reorder debug backends for eqiad switchover (T413974)]]

Mentioned in SAL (#wikimedia-operations) [2026-03-25T15:37:00Z] <blake@deploy2002> blake: Backport for [[gerrit:1244628|debug: reorder debug backends for eqiad switchover (T413974)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-03-25T15:42:20Z] <blake@deploy2002> Finished scap sync-world: Backport for [[gerrit:1244628|debug: reorder debug backends for eqiad switchover (T413974)]] (duration: 07m 41s)

Blake reopened this task as Open.

I'll leave this open until we repool next week.

The read-only time is over and the switchover has been completed successfully. Thank you!

How long was it?

From timestamps in IRC, the RO time was 02:28.832528, just under 2 and a half minutes.
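As a quick check of that arithmetic, a minimal sketch that converts the elapsed-time string into seconds. It assumes the value is formatted as minutes:seconds.microseconds, which matches the "just under 2 and a half minutes" reading; `parse_elapsed` is a hypothetical helper name.

```python
from datetime import timedelta


def parse_elapsed(text):
    """Parse an 'MM:SS.ffffff' elapsed-time string into a timedelta."""
    minutes, seconds = text.split(":")
    return timedelta(minutes=int(minutes), seconds=float(seconds))


ro_time = parse_elapsed("02:28.832528")
half_past_two = timedelta(minutes=2, seconds=30)

# Total read-only time in seconds, and the margin under 2.5 minutes.
print(f"{ro_time.total_seconds():.3f} s")
print(f"{(half_past_two - ro_time).total_seconds():.3f} s under 2m30s")
```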

Change #1261464 had a related patch set uploaded (by Blake; author: Blake):

[operations/dns@master] wmnet: update deployment CNAME record to deploy2002

https://gerrit.wikimedia.org/r/1261464

Change #1261465 had a related patch set uploaded (by Blake; author: Blake):

[operations/puppet@production] hieradata: update deployment_server to deploy1003

https://gerrit.wikimedia.org/r/1261465

Icinga downtime and Alertmanager silence (ID=c9e3e1df-943b-431c-a9ff-53adfb5c4229) set by blake@cumin1003 for 2:00:00 on 2 host(s) and their services with reason: Deployment server switchover

releases2003.codfw.wmnet,releases1003.eqiad.wmnet

Change #1261464 merged by Blake:

[operations/dns@master] wmnet: update deployment CNAME record to deploy1003

https://gerrit.wikimedia.org/r/1261464

Change #1261465 merged by Blake:

[operations/puppet@production] hieradata: update deployment_server to deploy1003

https://gerrit.wikimedia.org/r/1261465

Mentioned in SAL (#wikimedia-operations) [2026-03-26T15:46:03Z] <blake@deploy1003> Started scap sync-world: Test deployment to validate deployment server switchover - T413974

Mentioned in SAL (#wikimedia-operations) [2026-03-26T16:17:12Z] <blake@deploy1003> Finished scap sync-world: Test deployment to validate deployment server switchover - T413974 (duration: 31m 09s)

Change #1266213 had a related patch set uploaded (by Blake; author: Blake):

[operations/deployment-charts@master] mw-web: downsize for multi-DC serving

https://gerrit.wikimedia.org/r/1266213

Hi folks, for visibility, we will be repooling codfw tomorrow, April 2nd at 14:00 UTC, rather than 15:00 UTC.

Codfw has been successfully repooled as of 13:41 UTC this morning. The cookbook runs were mistakenly logged against a different task, so I'm pasting the summaries below:

[13:18 UTC] jasmine@cumin1003 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in codfw: maintenance -  started.
[13:41 UTC] jasmine@cumin1003 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in codfw: maintenance -  completed.

Additionally, some notes:

  1. We repooled codfw slightly earlier than anticipated (13:18 UTC) in an effort to mitigate T422130.
  2. After the repool we encountered T422166, which captures an unexpected split-brain situation following the move of the restricted registry to s3-on-apus.

Change #1266213 merged by jenkins-bot:

[operations/deployment-charts@master] mw-web: downsize for multi-DC serving

https://gerrit.wikimedia.org/r/1266213

Now that we've repooled and resized, closing this out.