
Document communication expectations around planning a DC switchover
Closed, Resolved (Public)

Description

On the notes etherpad, @LSobanski wrote:

We need to notify all SREs (not just the ones in "SRE team") about the switchover, both ahead of time and when starting. Also other people, e.g. Community Relations, ...

And @Kormat wrote:

Even within the "SRE team", there wasn't any mail to the SRE or Ops lists about this. I believe a mail was sent to wikitech-l, but a lot of people aren't on that.

Personally, I (@Legoktm) don't really understand why people aren't subscribed to wikitech-l given that's where our public technical discussion is expected to take place, which is why I sent the announcements there and expected people to read it that way.

@LSobanski suggested to me yesterday to create a calendar event, which I did for sre@wm.o, however it seems that doesn't cover embedded SREs. Maybe there's a different calendar that should be used instead? It's also been listed on the deployment calendar since last week, are people not checking that?

I also get the impression that there's a desire for DC switchovers to be a "normal thing we do" and a non-event, if that's the case then do we need to make special announcements?

I would consider this task resolved when the "Weeks in advance preparation" and "Days in advance preparation" sections on https://wikitech.wikimedia.org/wiki/Switch_Datacenter#MediaWiki are updated to reflect at what point what communication should be sent.

Event Timeline

A few points:

  • Anyone who works in the technical wikimedia community should be subscribed to wikitech-l
  • Anyone who releases software should (1) ensure that they know what's happening around them and (2) look at the other deployments scheduled on the calendar (and the switchover was listed there). The information was one "jouncebot: next" message on IRC away from anyone doing a deployment.

X-posting the announcement to ops-l might have helped, but I definitely do not think we should keep announcing them org-wide or with a banner on-wiki in the future, because as @Legoktm wrote, they're going to become "normal" maintenance windows.

Rather than expanding how much we communicate the switchover, I think we should instead have a way to technically lock all deployments, be they scap/scap3/helm-driven, for the duration of the switchover.
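
To make the lock idea concrete, here is a minimal sketch of what such a guard could look like. It is only an illustration: the function names (deploys_locked, guard_deploy) and the example window are hypothetical, and nothing here reflects how scap, scap3, or helm actually behave.

```python
from datetime import datetime, timezone


def deploys_locked(now, lock_start, lock_end):
    """Return True if `now` falls inside the half-open window [lock_start, lock_end)."""
    return lock_start <= now < lock_end


def guard_deploy(tool, now, lock_start, lock_end):
    """Hypothetical pre-deploy hook: refuse to proceed while the lock is active."""
    if deploys_locked(now, lock_start, lock_end):
        raise RuntimeError(
            f"{tool} deployment refused: deployments are locked until {lock_end.isoformat()}"
        )


# Made-up switchover window, for illustration only.
LOCK_START = datetime(2030, 1, 1, 14, 0, tzinfo=timezone.utc)
LOCK_END = datetime(2030, 1, 1, 15, 0, tzinfo=timezone.utc)
```

Each deployment tool would call the same guard before doing any work, so the lock would be enforced in one place rather than relying on people reading announcements.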

If I had to choose, I'd decline this task.

In T285806, @Legoktm wrote:

Personally, I (@Legoktm) don't really understand why people aren't subscribed to wikitech-l given that's where our public technical discussion is expected to take place, which is why I sent the announcements there and expected people to read it that way.

  • Anyone who works in the technical wikimedia community should be subscribed to wikitech-l

These are very MW-centric viewpoints. I've had a look over the last 50 threads on wikitech-l, just to get a feel for things. They fall into these categories:

  • The switchover thread.
  • 95% irrelevant to me.
  • 5% relevant, but i know about it through other fora.

That's a lot of noise to ask everyone to wade through.

I also get the impression that there's a desire for DC switchovers to be a "normal thing we do" and a non-event, if that's the case then do we need to make special announcements?

Speaking with my DBA hat on, that's a nice idea, but we are literally *years* away from that being reality for my team, at least.

X-posting the announcement to ops-l might have helped

Yes, it absolutely would have helped.

Speaking of communication expectations, it would be good to write down the notice period for the switch itself. Personally, i'd suggest 2 months, minimum. In this case we didn't have a confirmed day until less than 2 weeks before. This causes a lot of stress and makes planning difficult for us. The same goes for not having a "we'll be in the other DC for at least X amount of time".

We are not yet at the point where a DC switch is a non-event, and even when we get there, it's still an operation that can cause broad impact if we run into unexpected issues, so I don't think we should be limiting the awareness of it. There is certainly an opportunity to bake notifications into the automated process.

P.S. The original comment in Etherpad was mine, I updated the description to reflect this. Happy to discuss my point of view (and assumptions of who reads what) in a more interactive setting :)

I've tried to summarize a combination of what I did and the feedback here into https://wikitech.wikimedia.org/wiki/Switch_Datacenter/Coordination - please suggest changes/additions or just make them yourself :)

In T285806, @Legoktm wrote:

Personally, I (@Legoktm) don't really understand why people aren't subscribed to wikitech-l given that's where our public technical discussion is expected to take place, which is why I sent the announcements there and expected people to read it that way.

  • Anyone who works in the technical wikimedia community should be subscribed to wikitech-l

These are very MW-centric viewpoints. I've had a look over the last 50 threads on wikitech-l, just to get a feel for things. They fall into these categories:

  • The switchover thread.
  • 95% irrelevant to me.
  • 5% relevant, but i know about it through other fora.

That's a lot of noise to ask everyone to wade through.

I believe that's true regarding just about every not-small mailing list I'm subscribed to ;-) But in general, wikitech-l is the only *public* mailing list we have for this kind of discussion, and given that by default we work in the open, it should be our default choice for just about everything that isn't intentionally private.

X-posting the announcement to ops-l might have helped

Yes, it absolutely would have helped.

Added to the wiki page.

I also get the impression that there's a desire for DC switchovers to be a "normal thing we do" and a non-event, if that's the case then do we need to make special announcements?

Speaking with my DBA hat on, that's a nice idea, but we are literally *years* away from that being reality for my team, at least.

Fair! I tried to explain this in the lead of the wiki page, please expand/edit if you don't think it captures that sentiment.

Speaking of communication expectations, it would be good to write down the notice period for the switch itself. Personally, i'd suggest 2 months, minimum. In this case we didn't have a confirmed day until less than 2 weeks before. This causes a lot of stress and makes planning difficult for us.

Ack, included in the wiki page.

The same goes for not having a "we'll be in the other DC for at least X amount of time".

So you'd like the switch back to be scheduled at the same time, basically? Or do you just want to know the minimum duration we'll be switched?

After talking off-phabricator with a few people, I think what we have seen is more of a failure of coordination between affected SRE teams than of external communication. I take the blame for not making things clearer on this topic with @Legoktm in the lead-up to the switchover.

In essence, there are several SRE teams that depend on the MediaWiki+Services switchover to perform fundamental maintenance in eqiad, and need to know with some advance notice when to stop maintenance in codfw.

What we should've done is set up a kick-off meeting with the other SRE teams just to better plan the work for everyone. We should have one once everyone's back from their time off, to at least coordinate on the date of the switchback.

I apologize to everyone: when I repeatedly said that I think the switchover should be a 'non-event', I meant it should be one from the point of view of our users and the rest of the organization:

  • We shouldn't need to put up banners on all wikis
  • We shouldn't need to halt deployments for one week
  • In general, everything should be business as usual for most folks at the WMF and WMDE

This wasn't meant to include the rest of SRE that is directly impacted by the switchover - I'm thinking of at least DBAs, netops and dcops as teams that have a big stake in the procedure. So @LSobanski I think we're in full agreement there :)

Thanks everybody for the feedback on the communications for the DC switchover process. We will spend some time this quarter (Q1) working through the typical timelines for the process and defining the groups involved and the best mechanism to make sure that everybody is aware of the status and progress. Please take a look at https://wikitech.wikimedia.org/wiki/Switch_Datacenter and https://wikitech.wikimedia.org/wiki/Switch_Datacenter/Coordination to see if what we have captured there makes sense.

Thanks everybody for the feedback on the communications for the DC switchover process. We will spend some time this quarter (Q1) working through the typical timelines for the process and defining the groups involved and the best mechanism to make sure that everybody is aware of the status and progress. Please take a look at https://wikitech.wikimedia.org/wiki/Switch_Datacenter and https://wikitech.wikimedia.org/wiki/Switch_Datacenter/Coordination to see if what we have captured there makes sense.

Hey @wkandek, thanks for this. It looks like it covers all the important things from my POV (but i defer to @Marostegui in case i've overlooked something :).

That would work for me too @wkandek - thanks!

  • (1) From my perspective, the switchover went smoothly. Most tasks were well documented and automated. I know of no serious consequences of the late communication.
  • (2) [Proposed solution to an issue described below]
    • A Technology dep team should send messages to the following mailing lists:
      • ops - two weeks before
      • wikitech-l - two and one week before
      • Optionally, any internal WMF mailing list - two weeks before?
    • Two weeks before (after the message to wikitech-l), the CRS team will send a message to translators-l - with a request to proofread this message.
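
As a rough sketch of how the proposed timeline could be turned into concrete send dates (for example, to bake notifications into the automated process), something like the following might work. The offsets and senders come from the proposal above; the function name, the data layout, and the example date in the usage note are assumptions made purely for illustration, and the optional internal WMF list is left out because its timing is still an open question.

```python
from datetime import date, timedelta

# (channel, weeks before the switchover, sender), as proposed above.
# The order matters: translators-l is listed after the two-week
# wikitech-l entry because its message follows the wikitech-l one.
ANNOUNCEMENTS = [
    ("ops", 2, "Technology dep team"),
    ("wikitech-l", 2, "Technology dep team"),
    ("translators-l", 2, "CRS team"),
    ("wikitech-l", 1, "Technology dep team"),
]


def announcement_schedule(switchover):
    """Return (send_date, channel, sender) tuples ordered by send date."""
    rows = [
        (switchover - timedelta(weeks=weeks), channel, sender)
        for channel, weeks, sender in ANNOUNCEMENTS
    ]
    # sorted() is stable, so same-day entries keep the order given above.
    return sorted(rows, key=lambda row: row[0])
```

Calling announcement_schedule(date(2030, 1, 15)) with a made-up switchover date would list the ops, wikitech-l and translators-l messages for 1 January and the second wikitech-l message for 8 January.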

The issue: currently, in Switch Datacenter/Coordination, it is not stated which team sends the announcements to the mailing lists. This is unclear because:

  • In June (see T281209), wikitech-l and translators-l were assigned to CRSs, ops was not mentioned, and wmfall (currently: optional or official) was listed with a note: "done by Ops."
  • Now, no internal WMF-wide mailing list is mentioned, and wikitech-l and ops are listed together as if this was a task for one team. I presume (I may be wrong) that it's a Technology dep team rather than CRS.

  • (3) As early as possible, the Movement Communications team within the Comms dep should be contacted. Last FY, they began collecting information about any planned community-facing WMF activities, and they created a process of keeping everything recorded in one internal calendar. To shorten the communication chain, I'd suggest assigning this task to a Technology dep team.

@Legoktm: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup , I am resetting the assignee of this task because there has not been progress lately (please correct me if I am wrong!). Resetting the assignee avoids the impression that somebody is already working on this task. It also allows others to potentially work towards fixing this task. Please claim this task again when you plan to work on it (via Add Action...Assign / Claim in the dropdown menu) - it would be welcome. Thanks for your understanding!

Joe claimed this task.

Given we now have switchovers at regular intervals, we can resolve this task. There is no need to do a lot of communication around it.