CommRel support for March 2023 Datacenter Switchover
Closed, Resolved, Public

Description

Hi CommRel,

We are planning a datacenter switchover for the week of March 1st (week 9) with the following schedule:

The switchback date is currently set to Wednesday, April 26th, 2023 (week 17)

As the June 2021 switchover resulted in 1m57s of read-only time, do you think it is reasonable to forgo putting up a banner this time?

Thanks!

Planning

As soon as the task is received by CRS

  • Ask SRE if anything major changed since the last time (noticeable things that are worth being announced).
    • If yes, update the process or the message.
  • Confirm precisely when the wikis will be in read-only
  • Add the date to Asana, with a link to this task

Three weeks before (Feb 6 - week 6)

  • Tech News message (initial warning)
  • Update the message to communities
    • Check on dates and links
  • Have the message translated by emailing both:
    • translators-l for translations
    • wikitech-l for information and translations
  • Monitor the message's talk page

Two weeks before (Feb 13 - week 7)

The week before (Feb 20 - week 8)

  • Tech News message (reminder)
  • Email mailing lists:
    • wikitech-l and translators-l once again
    • wmfall (done by Ops)
  • Add to the news on the Meta front page
  • Send the message to communities

The week it happens (Feb 27 - week 9)

  • Switch of all traffic back to the primary data center on 1 March.

The week after (March 6 - week 10)

  • Debrief how it went, and document (comment on this task)
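
The week labels in the checklist above (Feb 6 - week 6, Feb 13 - week 7, and so on) can be cross-checked with a few lines of Python. This is only a sketch: it assumes the week numbers are ISO 8601 week numbers and that each planning week starts on a Monday, and the helper name `planning_weeks` is hypothetical, not part of any existing tool.

```python
from datetime import date, timedelta


def planning_weeks(switchover, weeks_before=3, weeks_after=1):
    """Return (offset, monday, iso_week) for each planning week.

    Assumption: planning weeks start on Monday and are numbered
    with ISO 8601 week numbers.
    """
    # Monday of the week that contains the switchover date.
    monday = switchover - timedelta(days=switchover.weekday())
    schedule = []
    for offset in range(-weeks_before, weeks_after + 1):
        start = monday + timedelta(weeks=offset)
        schedule.append((offset, start, start.isocalendar()[1]))
    return schedule


# Switchover date taken from this task: Wednesday, March 1st, 2023.
schedule = planning_weeks(date(2023, 3, 1))
for offset, start, week in schedule:
    print(f"{offset:+d} week(s): starts {start.isoformat()} (week {week})")
```

Running the same helper with the switchback date, `date(2023, 4, 26)`, would give the matching checklist dates for that operation.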

Details

Other Assignee
RZamora-WMF

Event Timeline

Clement_Goubert moved this task from Incoming 🐫 to this.quarter 🍕 on the serviceops board.
Clement_Goubert moved this task from Backlog to Radar on the SRE board.

@Clement_Goubert Has anything major changed in your process since the last time (noticeable things that are worth being announced)?

As you gave 3 dates in the task description, can you confirm precisely when the wikis will be in a read-only mode noticeable by anyone? Thank you!

"As you gave 3 dates in the task description, can you confirm precisely when the wikis will be in a read-only mode noticeable by anyone? Thank you!"

The wikis will be read-only briefly on Wednesday, March 1st, 2023.

"@Clement_Goubert Has anything major changed in your process since the last time (noticeable things that are worth being announced)?"

There's at least one big thing: it's the first switchover since we went multi-DC. That has a number of implications, but the biggest is that since codfw is already getting read traffic, cache warmup should be a lot shorter (or nonexistent, with a bit of luck).

I'm gathering other possible changes from the rest of SRE.

RZamora-WMF assigned this task to Trizek-WMF.
RZamora-WMF updated Other Assignee, added: RZamora-WMF.

Thank you Clément.

@RZamora-WMF is my backup for this task. She will review all steps once I have completed them.

Task follow-up:

I will send the emails for translations once these two items have been checked.

Trizek-WMF changed the task status from Open to In Progress. Feb 6 2023, 5:41 PM

Apart from multi-DC, the other possibly notable thing is that a GitLab switchover will also be performed. I'll let @LSobanski and/or @thcipriani decide if it's worth communicating since it'll affect the technical community.

It is worth communicating anything that disturbs one's habits. :) Better safe than sorry!

@Clement_Goubert @LSobanski @thcipriani

I'd like to ping translators before the end of this week. Before this, it is better to add any relevant information to the message.

Can you confirm whether there is any possible disturbance regarding the GitLab switchover? Are other services, such as mailing lists, Etherpad, or Phabricator, affected by this switchover?

It is not a matter of deciding not to announce a disturbance because there is a 99% chance of things going okay. I'm looking for that 1% chance of a minimal disturbance that could be noticed, reported, and become a drama. As I often write, "better safe than sorry!" :)

I'll let @LSobanski answer authoritatively for Phabricator and Etherpad.
We are not switching over the mailing lists, nor are we switching over WDQS.
For more information, excluded services (service catalog names) are tracked at T329193: March 2023 Datacenter Switchover Excluded services

  • GitLab failover requires a ~1.5h maintenance window during which GitLab will be unavailable.
  • We won't be switching Phabricator or Etherpad over.

While not directly linked to the switchover as it does not have a codfw deployment, Toolhub will probably be impacted by the eqiad cluster's Kubernetes 1.23 upgrade that will happen during the switched-over period.

Central Notice

Other tasks

I checked the time and date in all translations. I had to fix all of them manually, at least the ones that were already fully translated but still needed the time and date updated. Translator involvement was lower than it used to be for these operations (even if higher than usual).

The most important thing is that the time and date are now correctly displayed for all those languages, even if the grammar might not be correct.

I pinged users who formerly translated the message. Also, further updates will happen when the message is distributed to village pumps (on Monday). Hopefully, it will be okay for the 30-minute display on the banner on Wednesday.

As the switchover happens on Wednesday, I think it is better to distribute the message to village pumps on Monday. This way, we can reply to questions almost immediately, rather than posting this message now and leaving communities to ask questions and then guess because we don't reply. :)

After multiple reviews, fixes, and the last translations being done, the message has been sent to 832 community pages + 419 pages used for bot coordination.

I'll check on the banners' translations tomorrow (they have to be manually approved).

It happened.

The next step, next week: debrief the process.

A post-action document has been created. There is nothing special to highlight beyond some improvements to be made, which will be covered in T292543: Improve the community relations process for data center switchover.

We noted SRE's wish to improve the process in order to reduce the read-only time to zero, which would also reduce the work on these types of tasks.

See you for the switchback!