September 2021 Datacenter switchover (codfw -> eqiad)
Closed, Resolved, Public

Description

This is the meta task for the September 2021 Datacenter switchover (codfw -> eqiad).

Schedule:

Services Monday, Sept 13th 14:00 UTC
Traffic Monday, Sept 13th 15:00 UTC
MediaWiki Tuesday, Sept 14th 14:00 UTC

This schedule should be considered confirmed if there are no reasonable objections to these dates by Aug 2 (a little less than a week from now). Schedule is confirmed now.

Previously: T281515: June 2021 Datacenter switchover

See also: https://wikitech.wikimedia.org/wiki/Switch_Datacenter

Event Timeline

Legoktm triaged this task as Medium priority. Jul 27 2021, 11:45 PM
Legoktm created this task.

Do we have a deployment this week? T281164: 1.37.0-wmf.23 deployment blockers has been created as usual, covering the Train Deployment for the week of September 13th.

Yes, with the caveat that if something goes wrong/poorly with the switchover it might be delayed or cancelled. We did it last time and it seemed to work out fine.

Thank you @Legoktm, I updated our public messages accordingly.

Heads up: at the moment Swift traffic is in eqiad because of the codfw hardware rebalance (T288458). The eqiad Swift hardware is now ready to be put into service, so I'll flip Swift (read) traffic back to codfw tomorrow and kick off at least an initial eqiad rebalance, since the cluster is ~93% full. We can decide by Tuesday what to do with Swift read traffic; any of these should be fine: move fully to eqiad, remain in codfw, or run active/active.

Change 719556 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/cookbooks@master] sre.switchdc.services: Temporarily exclude swift

https://gerrit.wikimedia.org/r/719556

Change 719556 merged by Filippo Giunchedi:

[operations/cookbooks@master] sre.switchdc.services: Temporarily exclude swift

https://gerrit.wikimedia.org/r/719556

Mentioned in SAL (#wikimedia-operations) [2021-09-09T08:59:32Z] <godog> move swift traffic fully to codfw to rebalance eqiad - T287539

Change 720776 had a related patch set uploaded (by Legoktm; author: Jelto):

[operations/dns@master] traffic: Depool codfw from user traffic for switchover

https://gerrit.wikimedia.org/r/720776

Change 720776 merged by Jelto:

[operations/dns@master] traffic: Depool codfw from user traffic for switchover

https://gerrit.wikimedia.org/r/720776

The notice that has just gone up at English Wikipedia reads

<span class="cbnnr-headline">Technical maintenance will be performed soon</span>
<span class="cbnnr-text"><span dir="ltr">06:00 UTC - 06:30 UTC</span></span>
<span class="cbnnr-cta">During this time you might not be able to save any edits.</span>

06:00 UTC was several hours ago.

The notice that has just gone up at English Wikipedia reads
<snip>

Yeah, I think the wrong banner went up accidentally, see T287546#7351929.

Five minutes ago it said "15:00 UTC - 16:00 UTC". It's now reading "14:00 UTC - 15:00 UTC". I get the impression that the devs concerned are in a non-European timezone and don't know how their local TZ relates to UTC. The time as I write this is 14:39 UTC. This is also 14:39 GMT, and 15:39 BST.
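
For anyone wanting to double-check the UTC/BST arithmetic above, here is a minimal sketch using Python's standard `zoneinfo` module. The date is an assumption: the comment gives only a time of day, so Sept 14, 2021 is taken from the MediaWiki slot in the schedule. The helper name `fmt` is made up for illustration.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Assumed date: the comment states only "14:39 UTC"; Sept 14, 2021 comes
# from the MediaWiki slot in the task's schedule.
now_utc = datetime(2021, 9, 14, 14, 39, tzinfo=timezone.utc)

# Europe/London observes BST (UTC+1) in mid-September.
london = now_utc.astimezone(ZoneInfo("Europe/London"))


def fmt(dt):
    """Render a timezone-aware datetime as 'HH:MM ZONE' (hypothetical helper)."""
    return dt.strftime("%H:%M %Z")


print(fmt(now_utc))  # 14:39 UTC
print(fmt(london))   # 15:39 BST
```

GMT and UTC share the same offset, which is why only the BST figure differs by an hour.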

Change 721008 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/dns@master] wmnet: Update db masters aliases

https://gerrit.wikimedia.org/r/721008

Change 721008 merged by Marostegui:

[operations/dns@master] wmnet: Update db masters aliases

https://gerrit.wikimedia.org/r/721008

Five minutes ago it said "15:00 UTC - 16:00 UTC". It's now reading "14:00 UTC - 15:00 UTC". I get the impression that the devs concerned are in a non-European timezone and don't know how their local TZ relates to UTC. The time as I write this is 14:39 UTC. This is also 14:39 GMT, and 15:39 BST.

The operation has been delayed a bit, so the banner has been adjusted.

I sent a short recap to wikitech-l: https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/message/6UZCCACCBCZLN5MHROZQXUG6ZOQTDCLO/ - we were read-only for 2m42s.

The main thing left to do before resolving this is to repool codfw caches/traffic on Monday.

Change 722397 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/dns@master] Revert "traffic: Depool codfw from user traffic for switchover"

https://gerrit.wikimedia.org/r/722397

Change 722397 merged by Legoktm:

[operations/dns@master] Revert "traffic: Depool codfw from user traffic for switchover"

https://gerrit.wikimedia.org/r/722397

Legoktm claimed this task.

All done!