MoveComms support for Southward Datacenter Switchover (September 2024)
Closed, Resolved · Public

Description

Dear MoveComms,

We are planning a Datacenter Switchover (eqiad to codfw) for the week of September 23rd with the following schedule:

The expected impact is 2-3 minutes of read-only on Wednesday, 25 September 2024 @ 15:00 UTC.

Note: The above times are all 15:00 UTC (i.e., 1 hour later than the 14:00 UTC target of the March 2024 switchover).

As with the last two switchovers, we are following the process described in Recurring, Equinox-based, Data Center Switchovers, in particular:

  • There is no switchback! We are staying in codfw until the next switchover.
  • Switchovers are predictable and take place every 6 months; always on the week of an equinox.

See T358233 for the equivalent task tracking the previous (March 2024) switchover.

Let serviceops know if you need more info on the changes.

Thank you!

Planning

As soon as the task is received by Movement Communications

Three weeks before

Two weeks before

The week before

The week it happens

  • monitor the wikis, Monday
  • monitor the wikis, Tuesday
  • when the read-only is done and confirmed by SRE, remove the banner.

The week after

  • Debrief how it went, and document (comment on this task)

Event Timeline

Trizek-WMF triaged this task as Medium priority.

Switching from 14:00 to 15:00 UTC required extra checks on the existing translations showing the time. This is now done, but I'll check them next week once again.

That's an interesting point that we missed when discussing the switch from 14:00 UTC to 15:00 UTC. We'll update our documentation as well to point that out.

Done at https://wikitech.wikimedia.org/w/index.php?title=Switch_Datacenter/Coordination&diff=prev&oldid=2226701

Thank you for catching that @Trizek-WMF and @akosiaris for updating the docs.

Read-only phase is done. I removed the banner at 15:05 UTC.


Caught during the read-only phase: when you click Edit, you get an error message from Parsoid, not one about the read-only mode:

[Screenshot: parsoid.jpeg]

Filed as T375638: When entering editing with VE while the read-only mode is set, provide a message related to the read-only, not about Parsoid not loading.

The retro is a bit late, sorry.

  • Great work with @Scott_French, who kindly answered my questions and, on the day of the read-only, kept me updated in real time on completion.
  • Announcements went well; communities are used to it now. No particular feedback about the process.
  • We are still relying on volunteers to review the announcement templates we use (message and banner). This is why we designed the announcement and the banner in a way that requires minimal or no changes to existing translations.
    • Changing the hour of the event meant a lot of work for Movement Communications, who had to review and change local times in translated messages (see T371130#10141416). @akosiaris, would it be possible to keep 15:00 UTC as the read-only time in the future?

Thanks for the retro, we appreciate it.

Overall we are targeting 14:00 UTC. We moved it this time at the request of the point person, who is in a difficult time zone. We even amended the docs to say:

Disruptive operations such as the MediaWiki Switchover (see below) will target 14:00 UTC as their start time. However, SRE Service Operations reserves the right to adjust this by up to +/- 2h with sufficient prior notification.

Given the realities of a globally distributed workforce, and the fact that we can have people anywhere from UTC+10 to UTC-11, I don't think we are in a position, at least as things are set up right now, to set it in stone any more than we currently have. As a consolation, given that SRE is currently spread across UTC+3 to UTC-8, I don't see us diverging drastically from the above target time any time soon; that is, we won't suddenly say "we are now targeting 20:00 UTC".

As an alternative, is there any way we could help with automating the work to change local times? Maybe we can write a bot for you that does this based on some source of truth?

The reasons you give make a lot of sense. I'll document our process so that the person in charge books the time needed to change local times in the different messages, if not done by a volunteer first.

Using a bot would be nice, but we don't have the skills within the team to have one running. :) Maybe a template based on the language shown... I'll find something!

We do. We can probably code one to run in Toolforge and update all pages you want.

We already have all future switchovers up to 2050 documented at https://wikitech.wikimedia.org/wiki/Switch_Datacenter/Switchover_Dates (alongside the very simple code that generated it). For every switchover, we'll need some way to override the 14:00 UTC part (when applicable; easy enough) and a list of all the pages you want updated (we can start with sandbox copies of those), plus a way for you to update that list of pages. The assumption, of course, is that the bot will alter only the date and time and nothing else (how exactly it will be limited to just that remains to be seen).

We can probably put it in the APP as a hypothesis for the quarter before the next switchover (March 19th 2025), so calendar-year Q1 of 2025.
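
To make the bot idea a bit more concrete, here is a minimal sketch of the time-handling part only. It is not an existing tool: the function names, the zone list, and the output format are all hypothetical, and it assumes Python 3.9+ with a time zone database available to `zoneinfo`. It takes a switchover date, applies the documented 14:00 UTC target unless an override within +/- 2h is given, and renders the local times a translated message might need. Fetching and editing the wiki pages is deliberately left out.

```python
from datetime import date, datetime, time, timedelta, timezone
from typing import Dict, List, Optional
from zoneinfo import ZoneInfo

DEFAULT_START = time(14, 0)      # documented target start time (UTC)
MAX_SHIFT = timedelta(hours=2)   # documented +/- 2h adjustment window


def read_only_start(day: date, override: Optional[time] = None) -> datetime:
    """Return the UTC start of the read-only window (hypothetical helper).

    Uses 14:00 UTC unless an override is given; rejects overrides that
    fall outside the +/- 2h window.
    """
    target = datetime.combine(day, DEFAULT_START, tzinfo=timezone.utc)
    if override is None:
        return target
    start = datetime.combine(day, override, tzinfo=timezone.utc)
    if abs(start - target) > MAX_SHIFT:
        raise ValueError(f"{override} is more than 2h away from 14:00 UTC")
    return start


def local_times(start_utc: datetime, zones: List[str]) -> Dict[str, str]:
    """Render the start time in each IANA time zone (illustrative format)."""
    return {
        zone: start_utc.astimezone(ZoneInfo(zone)).strftime("%Y-%m-%d %H:%M")
        for zone in zones
    }


# September 2024 switchover, moved from 14:00 to 15:00 UTC.
start = read_only_start(date(2024, 9, 25), override=time(15, 0))
print(local_times(start, ["Europe/Paris", "Asia/Tokyo"]))
# {'Europe/Paris': '2024-09-25 17:00', 'Asia/Tokyo': '2024-09-26 00:00'}
```

Keeping the override check next to the default means a change like 14:00 to 15:00 only has to be entered once, and the local times for every language follow from it.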