CommRel support for FY2020-2021 Q1 DC switchover
Closed, ResolvedPublic

Description

Hi CommRel! Our planned Q3 data center switchover is now a planned Q4 data center switchover, due to some rearrangement of plans in SRE. We haven't settled on a date yet, or even a month, so this task is a bit of a placeholder for now, but let's move the conversation here anyway.

Just like in previous years, the switchover will involve a brief period of read-only time for all wikis, but we expect it to be shorter than before -- probably somewhere in the neighborhood of three minutes of actual read-only time, within a longer announced window.

Notes from last time: https://office.wikimedia.org/wiki/Community_Relations_Specialists/codfw/2018_lessons

I'll update this when we've picked out a date -- if you have any input on that, I'd love to hear it (Q4 team offsites, community events that we shouldn't interrupt, that sort of thing). Apart from that, let me know if you have any questions about the switchover, and otherwise feel free to ignore this task until the plan solidifies. Thanks!


Date: September 1
Time: between 14:00 and 15:00 UTC

Announcements planning:

Week 33

  • Tech News week 33 (initial warning)
  • Contacting translators to have the message translated
  • message wikitech-l

Week 34

  • Tech News week 34 (reminder)

Week 35

  • Emailing mailing lists:
    • wikitech-l once again
    • wikimedia-l
    • wmfall (done by Ops)
    • GLAM
    • education
  • Added to the news on the Meta front page
  • Message to communities
    • village pumps
    • bot coordination pages (done on Monday Aug. 31)

Week 36

  • Tech News week 36 (this is happening this week)
  • Banner set for a display on all wikis 30 minutes before it happens

Week 37

  • Retrospective
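
As a rough illustration of the schedule above, the week numbers can be derived from the switchover date. This is only a sketch: it assumes the "Week NN" labels are ISO week numbers, and the `PLAN` offsets and the `announcement_weeks` helper are hypothetical names made up for this example, not an existing tool.

```python
from datetime import date, timedelta

# Hypothetical offsets (in weeks) relative to the switchover week,
# summarizing the plan listed above.
PLAN = {
    -3: "Tech News initial warning; contact translators; message wikitech-l",
    -2: "Tech News reminder",
    -1: "Mailing lists; Meta front page; village pumps; bot coordination pages",
    0: "Tech News (happening this week); banner 30 minutes before",
    1: "Retrospective",
}


def announcement_weeks(switchover: date):
    """Return (ISO week number, Monday of that week, step) for each plan entry."""
    # Monday of the week that contains the switchover date.
    monday = switchover - timedelta(days=switchover.weekday())
    rows = []
    for offset, step in sorted(PLAN.items()):
        start = monday + timedelta(weeks=offset)
        rows.append((start.isocalendar()[1], start, step))
    return rows


for week, start, step in announcement_weeks(date(2020, 9, 1)):
    print(f"Week {week} (from {start}): {step}")
```

For a September 1, 2020 switchover this yields weeks 33 through 37, matching the plan.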

Event Timeline

Aklapper renamed this task from CommRel support for 2020 Q4 DC switchover to CommRel support for FY2019-2020 Q4 DC switchover. Feb 11 2020, 2:07 AM
jbond triaged this task as Medium priority. Feb 11 2020, 11:51 AM

@RLazarus assuming this is still pretty much in the air?

Yeah, as you can imagine with some folks working reduced hours, SRE is mostly focusing on critical work and we've pushed this off. I haven't asked around in a little bit, but at a guess I'd say that late Q4 is still possible but extremely uncertain.

I'm assuming you're in basically the same situation? I won't take anything you say as a commitment, but it'd be good to know whether this is plausible on your end if we find the time.

We have a lot of requests right now, but if we get enough warning (which your team is always good about), then I think it's likely that someone will be able to help.

We've ruled out a switchover in Q4. We'll continue to do all the non-user-impacting prep work we can, so we might be ready to go early in Q1, if all goes well -- but obviously that's way past the horizon for predicting the continuing impact of COVID-19 on our work capacity, so I'll keep you updated.

I'm retitling the task and moving it to your Jul-Sep workboard accordingly, but given the continuing uncertainty let me know if you'd rather track it differently. Thanks for your patience.

RLazarus renamed this task from CommRel support for FY2019-2020 Q4 DC switchover to CommRel support for FY2020-2021 Q1 DC switchover. May 4 2020, 7:03 PM
Elitre changed the task status from Open to Stalled. May 11 2020, 10:36 AM

Any update since last month? Q1 starts in one month and this task would need some preparation and some scheduling ahead.

Q1 is also summer vacation for a lot of us. :)

Thanks for checking -- not sure yet, but as we're planning out Q1 on our side too, I'm starting to take everyone's temperature about it. I'll let you know as soon as I have some idea whether it will happen, and I'll make sure to clear any potential dates with you.

It's OK, and thanks a lot for this update!

It looks like we'll try to do this: ideally we'll aim to do the switchover from eqiad to codfw in either mid-to-late August or early September, and the switch back to eqiad about a month later. (I can promise we won't do anything in July; we wouldn't ask you to work on that kind of short notice, and we won't have our act together yet anyway.)

Tentatively, how would you feel about targeting a switchover date of Aug 18, Aug 25, or Sept 1 (all Tuesdays)? Please do strike any of those that are incompatible with your vacations, community events, or other plans -- if none work we can push to later.

I'm fully available to prepare and handle the September 1 event.

Thanks @Trizek-WMF! It took a moment to get everything else lined up, but we're moving forward with September 1.

Trizek-WMF moved this task from To Triage to Not ready to announce on the User-notice board.

Is there any information about when it will happen (timeframe) and how long it would last?

The only user-impacting section of the process will be a read-only period for all wikis while we move MediaWiki itself -- that should last about 3-5 minutes, somewhere between 13:30 and 15:30 UTC on 2020-09-01.

For fuller context: We're still putting together a planned timeline for the various phases of the switchover, which will span a couple of days' work -- we'll also be shifting traffic for the cache layer, image storage, and some other odds and ends. But apart from the RO period necessary to move MW, none of those other changes should be user-visible at all, and several of them are routine -- we slosh traffic around for other reasons sometimes. My instinct is that only the RO period needs to be announced, but of course we'll defer to your judgment.

Trizek-WMF changed the task status from Stalled to Open. Aug 3 2020, 2:19 PM

Thank you!

I'm starting the pre-announcements.

Reading the parent task, I realized that we will have the switchover and also the switchback. When would the switchback happen? Two weeks after?

No, it'll be roughly a month -- there's a variety of maintenance we'd like to do in eqiad while we're serving from codfw.

Likely candidate dates are September 29 or October 6, but I've been figuring we would finalize one after the switchover is finished. (Since it's been so long, there's a decent chance that the switchover will uncover some work we need to do to improve the process itself, which might affect scheduling -- and a slim but nonzero chance we'll decide to abort it and schedule another attempt.)

If it's better for you, we can pick a date now and change it later if necessary.

Likely candidate dates are September 29 or October 6, but I've been figuring we would finalize one after the switchover is finished.

Good idea. The two dates you had in mind are the ones for my move. :-p

Let's pick a date after the first move.

I plan to reuse the message we used in 2018. This would let translators benefit from existing translations or reuse former ones, which should increase our chances of getting more translations.

However, I need someone involved in the switchover to check if the 2020 message, based on the previous one, is still accurate concerning technical details.

The 2018 message said that the switchover started at 14:00 UTC. Would it be possible to start the 2020 switch at the same time? Again, it is only to ease translations. :)

Is there a public page on wikitech documenting this switchover? https://wikitech.wikimedia.org/wiki/Switch_Datacenter lists the 2018 switch, not the 2020.

Thank you!

I bet we can do 14:00 UTC. I'm finishing up the timeline with my SRE colleagues this week, I'll confirm and get back to you. After that we'll post on Wikitech, it'll be on the same page.

Everything else about that message still looks good. (We expect the read-only period to be much shorter than an hour, but "up to an hour" is still accurate, and I'm fine with leaving plenty of wiggle room.)

Great!

This message will be posted on wikis in week 35.

I plan to send the announcement to communities tomorrow.

At the moment, https://wikitech.wikimedia.org/wiki/Switch_Datacenter is still detailing the 2018 process. Can you update the page?

Yeah, sorry that's later than I expected -- we're meeting today to confirm the timing details and I'll post the update immediately after, so a little over two hours from now.

@Trizek-WMF Question from @debt earlier, will you be posting to wikitech-l also, or only wikitech-ambassadors? No wrong answer AFAIK, just confirming the plan.

I sent a message to wikitech-l earlier today. Maybe it is pending moderation.

I prepared the banner that will be displayed from 13:30 to 14:05 UTC on all wikis, for both logged-in and logged out users. At 14:05, the global lock maintenance banner will be displayed by ops.

Trizek-WMF raised the priority of this task from Medium to High. Aug 31 2020, 10:23 AM
Trizek-WMF updated the task description.
Trizek-WMF lowered the priority of this task from High to Medium. Sep 1 2020, 4:12 PM

Some thoughts about this, in random order. They will be reused for the retro I plan to do next week.

  • I think it went fine, since I haven't received many questions about this operation.
  • Reusing the message we used previously allowed us to have a lot of translations.
  • We may also have had more translations because I posted the translation request on both translators-l and wikitech-ambassadors.
  • This message should mention that a banner will be displayed, because some users asked me about it (example, example).
  • Warnings distributed in Tech News were noticed; some users asked whether more were scheduled for forthcoming issues (and they were).
  • Two people were surprised to get this message on bot coordination pages (example). I argued that posting the information in multiple places is important for this kind of operation.
  • Caching makes the banner appear one or two minutes after the scheduled time (some folks reacted to this on the IRC ops channel). Consider scheduling banners two minutes before the planned time.
  • Some people (example) didn't see any changes in their editing habits during the operation (me included), and were surprised by how fast it went.
  • The banner got improved for Monobook users
  • A user asked a question about how our servers work
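
To make the banner timing point concrete, here is a minimal sketch of shifting a banner's scheduled start earlier to absorb cache delay. It assumes the 13:30 to 14:05 UTC window mentioned earlier in this task and a two-minute lead; `banner_window` and `cache_lead` are hypothetical names for illustration, not part of any banner tooling.

```python
from datetime import datetime, timedelta, timezone


def banner_window(planned_start, planned_end, cache_lead=timedelta(minutes=2)):
    """Return the (start, end) to schedule so the banner is visible by planned_start.

    cache_lead is an assumed allowance for caching delay; the end time is unchanged.
    """
    return planned_start - cache_lead, planned_end


start, end = banner_window(
    datetime(2020, 9, 1, 13, 30, tzinfo=timezone.utc),
    datetime(2020, 9, 1, 14, 5, tzinfo=timezone.utc),
)
print(start.isoformat(), end.isoformat())
```

With these inputs the banner would be scheduled from 13:28 UTC rather than 13:30 UTC.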

Let's wrap up and close:

Planning: the plan, based on the previous switchover, is now stable. We can safely reuse it.

Communication went fine.

  • Reusing the previously used message allows a lot of translation work to be reused. This is something to keep.
  • Posting the translation request on both translators-l and wikitech-ambassadors increased the number of translations.
  • This message is perceived as a generic one, to be posted at central places. If it is posted in other places (such as bot coordination pages), we should explain why.
  • The next message should say that banners will be displayed by staff.

Overall, we have a good system there. I documented it on office wiki: https://office.wikimedia.org/wiki/Community_Relations_Specialists/codfw

Bonus question: when is the switch back? :)

Thank you @Trizek-WMF! Really appreciate all your work on this, and your team's.

One note for you, maybe to do with translation? Before we were read-only, @MoritzMuehlenhoff noted that dewiki was showing two messages, the second of which said in part "Hintergrund ist die Einrichtung eines zweiten Datenzentrums" ("the reason for this is the establishment of a second data center") which is inaccurate.

Any idea where that text came from, and how we can improve it for next time?

For the switchback date: my leading candidates are October 6, 7, 13, or 14 -- we could flex later, but I think no earlier. Any preferences or conflicts, either wrt your availability or community events?

Concerning the message, I received some feedback about it on wiki talk pages. Some people said that the message was still visible after the switch ended. I investigated the case but I wasn't able to find anything. However, I couldn't tell whether it was the first message displayed (with the yellow border), which is ours, or the green one, which is from an unknown source, very likely a local initiative. For next time, I recommend adding to the message distributed to communities that we will take care of the banners. This would avoid local duplicates.

Concerning the next date, I think we should have 3 weeks of warnings. We had 4 for last month's switch, which was comfortable. I think the week of October 7 is too early (and I'm not available). I would suggest the week of October 21, if possible. If not, I can make plans for the week of October 13, even if it is not the best option for me personally.

October 21 looks good, tentatively. Let me confirm with folks.

Would you like to plan the switchback here, or start a fresh task?

Thank you. A fresh task would be nice. :)

@RLazarus, any news about the date confirmation and the task?

Apologies, I've had a hectic few days. :) It turns out the engprod offsite is the week of Oct 26, so there already won't be a MW release train that week -- releng is understandably reluctant to skip the release two weeks in a row, which is a good argument against Oct 21. We don't need anyone from engprod online for the switchover itself, though, so we could run it during their offsite.

Tentatively, how would you feel about switching on Oct 27? (The 28th is a holiday in Greece so several SREs will be offline; it's suboptimal for us but workable in a pinch.)

Apologies, I've had a hectic few days. :)

It is a feeling I know very well. :)

Tentatively, how would you feel about switching on Oct 27?

Works for me!

Thanks for the shoutout in All Staff BTW!