March 2023 Datacenter Switchover
Closed, Resolved (Public)

Description

This is the meta task for the March 2023 Datacenter switchover (eqiad -> codfw).

Switchover

Schedule

Services: Tuesday, February 28th, 2023 14:00 UTC
Traffic: Tuesday, February 28th, 2023 15:00 UTC
MediaWiki: Wednesday, March 1st, 2023 14:00 UTC

Repooling

Traffic repooling of eqiad: Wednesday, March 8th, 2023
restbase-async eqiad pooling: Wednesday, March 8th, 2023
Services/mediawiki-RO eqiad pooling: Tuesday, March 14th, 2023

Checklist

See also:

Switchback

Services: Tuesday, April 25th, 2023 14:00 UTC
Switching back: Wednesday, April 26th, 2023 14:00 UTC

Checklist

Related Objects

Status | Subtype | Assigned
Resolved | - | Clement_Goubert
Resolved | - | Trizek-WMF
Resolved | - | Clement_Goubert
Resolved | - | RLazarus
Resolved | - | Clement_Goubert
Resolved | - | Clement_Goubert
Resolved | - | Clement_Goubert
Resolved | - | Clement_Goubert
Resolved | - | Clement_Goubert
Resolved | - | Clement_Goubert
Resolved | BUG REPORT | Clement_Goubert
Resolved | - | Clement_Goubert
Resolved | - | Clement_Goubert
Resolved | - | Clement_Goubert
Resolved | - | Clement_Goubert
Open | - | Clement_Goubert
Open | - | None
Open | - | None
Open | - | None
Resolved | - | Marostegui
Resolved | - | Andrew
Resolved | - | Marostegui
Resolved | - | Andrew
Declined | - | Andrew
Resolved | - | Andrew
Resolved | - | Andrew
Resolved | - | Ladsgroup
Duplicate | - | None
Resolved | - | Bstorm
Declined | - | None
Resolved | - | taavi
Resolved | - | Jdforrester-WMF
Declined | - | None
Open | - | jijiki
Open | - | None
Resolved | - | jbond
Open | - | None
Open | - | None
Resolved | BUG REPORT | Clement_Goubert
In Progress | - | Clement_Goubert
Open | - | None
Resolved | - | Clement_Goubert
Resolved | - | Clement_Goubert
Resolved | - | eoghan
Resolved | - | eoghan
Resolved | - | jbond
Resolved | - | Dzahn
Resolved | - | Dzahn
Resolved | - | Clement_Goubert
Resolved | - | Clement_Goubert
Resolved | - | Clement_Goubert
Resolved | - | Marostegui
Resolved | - | Clement_Goubert
Declined | - | Dzahn
Resolved | - | ayounsi
Invalid | - | Marostegui
Resolved | - | Marostegui
Resolved | - | Marostegui
Resolved | - | Marostegui
Resolved | - | Clement_Goubert
Resolved | - | Clement_Goubert
Resolved | - | Clement_Goubert
Open | - | None
Resolved | - | cmooney
Resolved | - | cmooney
Resolved | - | Marostegui
Resolved | - | Marostegui
Resolved | - | Marostegui
Resolved | - | Marostegui
Resolved | - | Marostegui
Resolved | - | Marostegui
Resolved | - | Marostegui
Resolved | - | ayounsi
Resolved | - | Ladsgroup
Resolved | - | herron
Resolved | - | herron
Declined | - | herron
Open | - | herron
Resolved | - | Jclark-ctr
Resolved | - | Jclark-ctr
Resolved | - | Joe
Resolved | - | Cmjohnson
Resolved | - | Jclark-ctr
Resolved | Request | Jclark-ctr
Resolved | - | sgrabarczuk
Resolved | - | Clement_Goubert
Resolved | - | Marostegui
Resolved | - | Marostegui

Event Timeline

MwHttpRequest (that is, Guzzle/php-curl) and the URLs from https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews. I don't know if RESTBase is involved in that in some way. If you mean the VirtualRESTServiceClient in MediaWiki, that's not used (so there is no parallelism). See WikimediaPageViewService for the code.

Those URLs are RESTBase, all right. And if they are being used as-is from that page, that is:

  • e.g. GET https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Albert_Einstein/daily/2015100100/2015103100

vs the internal service-mesh

  • e.g. http://localhost:6011/wikimedia.org/v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Albert_Einstein/daily/2015100100/2015103100

then MediaWiki goes via the edge caches, incurring some extra latency since there are another 3 hops before reaching the actual serving cluster. In any case, this is something to fix after the switchover. Thanks for the heads up, we'll keep it in mind in case we have big issues.
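
To make the difference between the two URL forms concrete, here is a minimal Python sketch of the rewrite, derived only from the two example URLs above. The function name `to_mesh_url` and the constants are hypothetical, the port 6011 is simply taken from the example, and the real fix would be made in MediaWiki's PHP code rather than in Python.

```python
from urllib.parse import urlsplit

# Assumption: local service-mesh listener, copied from the example URL above.
MESH_BASE = "http://localhost:6011"
# Path prefix used by the public REST API in the example URL above.
PUBLIC_PREFIX = "/api/rest_v1/"


def to_mesh_url(public_url: str) -> str:
    """Map a public rest_v1 URL to the internal service-mesh form.

    https://<domain>/api/rest_v1/<rest>  ->  http://localhost:6011/<domain>/v1/<rest>
    """
    parts = urlsplit(public_url)
    if not parts.path.startswith(PUBLIC_PREFIX):
        raise ValueError(f"not a public rest_v1 URL: {public_url}")
    rest = parts.path[len(PUBLIC_PREFIX):]
    url = f"{MESH_BASE}/{parts.netloc}/v1/{rest}"
    if parts.query:
        # Preserve any query string unchanged.
        url += "?" + parts.query
    return url


if __name__ == "__main__":
    print(to_mesh_url(
        "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
        "en.wikipedia/all-access/all-agents/Albert_Einstein/daily/"
        "2015100100/2015103100"
    ))
```

Using the local listener avoids the round trip through the edge caches; only the scheme, host and path prefix change, and the rest of the path is carried over untouched.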

Mentioned in SAL (#wikimedia-operations) [2023-03-01T13:10:04Z] <claime> Adding scheduled maintenance for switchover to statuspage - T327920

Mentioned in SAL (#wikimedia-operations) [2023-03-01T13:31:04Z] <claime> Locking scap deployments for datacenter switchover - T327920

Mentioned in SAL (#wikimedia-operations) [2023-03-01T13:40:10Z] <claime> Starting mediawiki datacenter switchover step 0 - T327920

Change 891552 merged by Clément Goubert:

[operations/dns@master] db: Switch dns master alias to codfw

https://gerrit.wikimedia.org/r/891552

Mentioned in SAL (#wikimedia-operations) [2023-03-01T14:08:18Z] <claime> Phase 9.5 Update DNS records for new database masters - T327920

Mentioned in SAL (#wikimedia-operations) [2023-03-01T14:09:56Z] <claime> Phase 9.5 DNS records for new database masters updated - T327920

Change 893479 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/dns@master] wmnet: Update pcX DNS

https://gerrit.wikimedia.org/r/893479

Change 892428 merged by jenkins-bot:

[operations/mediawiki-config@master] debug.json: List primary DC servers first

https://gerrit.wikimedia.org/r/892428

Mentioned in SAL (#wikimedia-operations) [2023-03-01T14:18:24Z] <cgoubert@deploy2002> Started scap: Backport for [[gerrit:892428|debug.json: List primary DC servers first (T327920)]]

Mentioned in SAL (#wikimedia-operations) [2023-03-01T14:20:30Z] <cgoubert@deploy2002> cgoubert: Backport for [[gerrit:892428|debug.json: List primary DC servers first (T327920)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet

Change 893479 merged by Marostegui:

[operations/dns@master] wmnet: Update pcX DNS

https://gerrit.wikimedia.org/r/893479

Mentioned in SAL (#wikimedia-operations) [2023-03-01T14:26:18Z] <cgoubert@deploy2002> Finished scap: Backport for [[gerrit:892428|debug.json: List primary DC servers first (T327920)]] (duration: 07m 54s)

Mentioned in SAL (#wikimedia-operations) [2023-03-01T14:27:02Z] <claime> End mediawiki datacenter switchover - T327920

Change 893675 had a related patch set uploaded (by Zabe; author: Zabe):

[operations/dns@master] wmnet: Update maintenance.eqiad.wmnet to point to mwmaint2002

https://gerrit.wikimedia.org/r/893675

Change 893675 merged by Clément Goubert:

[operations/dns@master] wmnet: Update maintenance.eqiad.wmnet to point to mwmaint2002

https://gerrit.wikimedia.org/r/893675

> And if they are being used as-is from that page, that is:
>
>   • e.g. GET https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Albert_Einstein/daily/2015100100/2015103100
>
> vs the internal service-mesh
>
>   • e.g. http://localhost:6011/wikimedia.org/v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Albert_Einstein/daily/2015100100/2015103100
>
> then MediaWiki goes via the edge caches, incurring some extra latency since there are another 3 hops before getting the actual serving cluster.

Yeah, it uses the public URL. Ping me when it is a good time to fix that, it should be trivial.

Mentioned in SAL (#wikimedia-operations) [2023-04-24T10:27:40Z] <claime> Datacenter switchover live testing setting db to read-only and back in eqiad - T327920

Mentioned in SAL (#wikimedia-operations) [2023-04-24T10:29:58Z] <claime> Datacenter switchover live testing setting db to read-only and back in eqiad successful - T327920

Change 911780 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/cookbooks@master] sre.switchdc.mediawiki: Add mw-api-int to mediawiki services

https://gerrit.wikimedia.org/r/911780

Change 911780 merged by jenkins-bot:

[operations/cookbooks@master] sre.switchdc.mediawiki: Add mw-api-int to mediawiki services

https://gerrit.wikimedia.org/r/911780

Change 912171 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/dns@master] wmnet: Update parsercache CNAME

https://gerrit.wikimedia.org/r/912171

Change 912235 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/dns@master] db: Switch dns master alias to eqiad

https://gerrit.wikimedia.org/r/912235

Mentioned in SAL (#wikimedia-operations) [2023-04-26T13:13:59Z] <claime> Locking scap for datacenter switchback - T327920

Mentioned in SAL (#wikimedia-operations) [2023-04-26T13:15:43Z] <cgoubert@deploy1002> Locking from deployment [ALL REPOSITORIES]: Datacenter Switchback - T327920

Mentioned in SAL (#wikimedia-operations) [2023-04-26T13:23:27Z] <claime> Starting mediawiki datacenter switchback preparation - T327920

Mentioned in SAL (#wikimedia-operations) [2023-04-26T13:45:16Z] <claime> Stopping maintenance scripts for datacenter switchback - T327920

Mentioned in SAL (#wikimedia-operations) [2023-04-26T13:59:49Z] <claime> Going to read-only for mediawiki datacenter switchback - T327920

Mentioned in SAL (#wikimedia-operations) [2023-04-26T14:05:00Z] <claime> Restarting maintenance jobs - T327920

Change 912235 merged by Clément Goubert:

[operations/dns@master] db: Switch dns master alias to eqiad

https://gerrit.wikimedia.org/r/912235

Change 912302 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/dns@master] wmnet: Update parsercache CNAME

https://gerrit.wikimedia.org/r/912302

Change 912171 abandoned by Marostegui:

[operations/dns@master] wmnet: Update parsercache CNAME

Reason:

I messed up the rebase, so abandoning in favour of 912302

https://gerrit.wikimedia.org/r/912171

Change 912302 merged by Marostegui:

[operations/dns@master] wmnet: Update parsercache CNAME

https://gerrit.wikimedia.org/r/912302

Mentioned in SAL (#wikimedia-operations) [2023-04-26T14:16:00Z] <marostegui> Update dns for parsercache T327920

Mentioned in SAL (#wikimedia-operations) [2023-04-26T14:24:46Z] <cgoubert@deploy1002> Unlocked for deployment [ALL REPOSITORIES]: Datacenter Switchback - T327920 (duration: 69m 03s)
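
As a sanity check on the durations reported in the log lines above, the following sketch recomputes them from the SAL timestamps. The helper names are hypothetical, and the `NNm NNs` output format merely imitates the log lines; how scap itself measures a duration is not shown in this task.

```python
from datetime import datetime, timezone

SAL_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # timestamp layout used in the SAL entries


def parse_sal(ts: str) -> datetime:
    """Parse a SAL timestamp such as 2023-04-26T13:15:43Z as UTC."""
    return datetime.strptime(ts, SAL_FORMAT).replace(tzinfo=timezone.utc)


def format_duration(start: str, end: str) -> str:
    """Return the elapsed time between two SAL timestamps as 'MMm SSs'."""
    seconds = int((parse_sal(end) - parse_sal(start)).total_seconds())
    if seconds < 0:
        raise ValueError("end precedes start")
    minutes, secs = divmod(seconds, 60)
    return f"{minutes:02d}m {secs:02d}s"


if __name__ == "__main__":
    # Deployment lock during the switchback (lock and unlock entries).
    print(format_duration("2023-04-26T13:15:43Z", "2023-04-26T14:24:46Z"))
    # debug.json backport during the switchover (scap start and finish entries).
    print(format_duration("2023-03-01T14:18:24Z", "2023-03-01T14:26:18Z"))
```

With the timestamps from this task, the lock/unlock pair gives 69m 03s and the backport start/finish pair gives 07m 54s, matching the durations in the log.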

Clement_Goubert updated the task description.