Stop announcing and scheduling primary database switchovers
Closed, ResolvedPublic

Assigned To
Authored By
Ladsgroup
Mar 11 2022, 1:37 PM

Description

Status quo:
For primary switchovers, we currently create a ticket for the CRS team (example), schedule it around two weeks in advance, and add a note about it in Tech News. CRS sometimes also sends an email about it to mailing lists and sets up banners before the window. We usually request half an hour of read-only time, but switchovers barely take more than a minute (they used to take longer).

Issues:

  • This drastically reduces the speed of database maintenance. We are currently making a lot more changes, and sometimes more than five schema changes are waiting for primary switchovers (example). Besides most schema changes, other maintenance work such as OS upgrades, reboots for security updates, and MariaDB upgrades needs primary switchovers as well.
    • A schema change that doesn't need a primary switchover takes around a week. A similar one that needs primary switchovers takes at least two months or so, hindering other teams' ability to do their work.
  • This takes a lot of CRS resources for coordination and work.
  • This also takes a lot of volunteer time to translate the message, while users wouldn't even notice the switchover.
  • It adds noise to Tech News, so other, more important notes get drowned out.

Notes:

  • The read-only time is very short, usually around 30s, although some large writes can cause longer read-only times.
  • The reason we ask for half an hour is that things might go wrong and read-only might take longer, but (1) this has never happened since this process started, and (2) this can happen with any type of maintenance work.
  • I tested locally: if the site is read-only and you try to save your edit (assuming you opened the edit page before read-only started), the response still contains the edit and you can simply click the Submit button. According to my test, nothing gets lost.

Proposed solution
Add two 30-minute slots per week (Tue/Thu) to the deployment calendar at 7:00 UTC, which is an hour before the morning EU backport window. This is when the DB load is lowest. DBAs would do the switchovers during that window (they can simply not use the window if they have nothing planned). No announcement will be sent, but DBAs add the planned tickets to the "Changes" section of that window for bookkeeping.
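
As an illustration only, the proposed schedule can be sketched in a few lines of Python. The function name `windows_for_week` and the constants are hypothetical and are not part of any Wikimedia tooling; the only inputs taken from the proposal are the weekdays (Tue/Thu), the start time (7:00 UTC) and the 30-minute length.

```python
from datetime import date, datetime, time, timedelta, timezone

# Values taken from the proposal: Tuesday and Thursday, 07:00 UTC, 30 minutes each.
WINDOW_WEEKDAYS = (1, 3)  # date.weekday() counts Monday as 0, so Tue=1, Thu=3
WINDOW_START = time(7, 0, tzinfo=timezone.utc)
WINDOW_LENGTH = timedelta(minutes=30)


def windows_for_week(any_day: date) -> list[tuple[datetime, datetime]]:
    """Return (start, end) pairs for the two windows in the week containing any_day."""
    monday = any_day - timedelta(days=any_day.weekday())
    windows = []
    for weekday in WINDOW_WEEKDAYS:
        # combine() keeps the UTC tzinfo attached to WINDOW_START.
        start = datetime.combine(monday + timedelta(days=weekday), WINDOW_START)
        windows.append((start, start + WINDOW_LENGTH))
    return windows


if __name__ == "__main__":
    # Week of the test run discussed later in this task.
    for start, end in windows_for_week(date(2022, 3, 28)):
        print(start.isoformat(), "to", end.isoformat())
```

A DBA who has nothing planned simply leaves a window unused; the sketch only enumerates the slots and does not model which of them are taken.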

Event Timeline

Please retitle to what the actual suggestion is, I almost got a heart attack.

What do you suggest the title should be? 😅

Well, it should mention databases, currently it kind of implies data center switchovers as well :-)

And JFTR, I really like this proposal!

Ladsgroup renamed this task from Stop announcing and scheduling primary switchovers to Stop announcing and scheduling primary database switchovers.Mar 14 2022, 1:25 PM

Oh, Thanks. Changed now.

Please retitle to what the actual suggestion is, I almost got a heart attack.

What do you suggest the title should be? 😅

You are suggesting making dedicated slots, rather than just stopping having switchovers.

Relatedly, the CRS folks who have been working on them so far would like to hear your thoughts about how that is going to help, in practice, the relevant people to be aware and stay informed.

If a switchover becomes like any other item in a development train, doesn't that make it difficult to spot and spread awareness for?

Thanks.

Yes, but do people need to be informed about this? The process is much more stable and faster than it used to be. For comparison, train deploys cause large spikes in errors for around ten minutes after deployment because the caches are cold, and we don't put up banners for that. Lots of maintenance work causes more disruption than primary switchovers.

We can do something similar to the train: have a regular note in Tech News such as "Next week you may not be able to edit around 07:00 UTC on Tue/Thu due to primary database switchovers" to give people enough information. If people want to know more details, they can check the deployment calendar.

Does that make sense to you?

I'm currently working on reducing the burden both for CRS and translators regarding server switches and read-onlys. This proposal comes at the right time.

Whatever the DBA team decides to do, CRS has set up all the messages and is working on making them reusable. For instance, the new read-only message doesn't need to be translated every time. This reduces our dependency on translators. The same applies to Tech News. Also, the work we do on read-onlys has been refined to be usable for server switches (and vice versa).

And, for the record, I had a quick meeting with @Ladsgroup, who gave me the specifics.

If there is no major objection by Monday 28th, I will move forward to implementing this.

For minor objections to details, we can hammer them out before or after.

Change 771334 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/software@master] switchover-tmpl.sh: Remove communication related steps

https://gerrit.wikimedia.org/r/771334

Change 771334 merged by jenkins-bot:

[operations/software@master] switchover-tmpl.sh: Remove communication related steps

https://gerrit.wikimedia.org/r/771334

Is it for a test, or is the decision final?

Hmm, good point. We could change the plan a bit: move it forward, start a two-week test run on March 21, and make the final decision afterwards. But given that the week of 21 March seems to be the week of the trainsperiment, we are back to the original date.

In other words, we do this for two weeks, from March 28 until April 11, and make the final decision on that day.

Thank you.

Do you plan to have a blogpost about it, listing reasons and implications?

What would be the best sentence to have a recurring item on Tech News, like what we have for the train? Example for the train:

The new version of MediaWiki will be on test wikis and MediaWiki.org from 22 March. It will be on non-Wikipedia wikis and some Wikipedias from 16 March. It will be on all wikis from 17 March (calendar).

What would be your suggestion? Can we imagine something along these lines:

A change on databases will be performed on March 22 at 7:00 UTC and on March 24 à 7:00 UTC. Some wikis will be in read-only for a few minutes.

Thank you.

Do you plan to have a blogpost about it, listing reasons and implications?

If CRS thinks that'd be useful/important. Sure.

LGTM. The à should be "at" if I'm not mistaken but otherwise looks good. (We probably can also link to the deployment calendar and say something like "see this for more info" but I leave it to the experts if you think it's needed or not.)

We probably can later revisit this and drop the whole line altogether but let's do one change at a time.

Thank you.

Do you plan to have a blogpost about it, listing reasons and implications?

Is it really needed? We are not making any fundamental changes to our infra, and it is unlikely we'll be using those two allocated windows every week, since it is not that common for us to do switchovers. It depends on the week (e.g. some urgent version fixes or whatever), but normally we don't need to switch over masters that often.

Anyways, as subject matter expert, it is your call - if it makes sense, we can definitely do it!

The best practice nowadays is to have blogposts about changes, showing how our infrastructure, our products and our practices evolve. But it is up to you. :)

Regarding Tech News, it is important for CommRel to have access to a page where you'd schedule DB changes. This way, we would add or remove the line as part of our weekly process. Can you set up this page if it doesn't already exist? When done, I will take care of Tech News processes and doc (T303937).

The sentence I will add to Tech news has variables: the dates.

A change on databases will be performed on $date1 and on $date2. Some wikis will be in read-only for a few minutes.

In order to lighten the translators' burden, it is important to only play with the variables, not the sentence itself. The goal is to have the sentence translated once and for all. It would be nice to agree on a final wording.
Please also note that it won't be possible to remove one of the two dates from the sentence in an easy way. It is very simple in English, but it may not be that simple for other languages. :)
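
To make the "only the variables change" idea concrete, here is a minimal sketch using Python's standard `string.Template`, which happens to use the same `$name` placeholder style as the sentence above. The helper names `format_day` and `render_notice` are hypothetical, and the English date format is an assumption for illustration; this is not how Tech News is actually assembled.

```python
from datetime import date
from string import Template

# Wording taken from the comment above; only $date1 and $date2 vary per issue.
NOTICE = Template(
    "A change on databases will be performed on $date1 and on $date2. "
    "Some wikis will be in read-only for a few minutes."
)


def format_day(day: date) -> str:
    """Render a date like '29 March' (English month name, default locale assumed)."""
    return f"{day.day} {day.strftime('%B')}"


def render_notice(first: date, second: date) -> str:
    """Fill the two date variables without touching the translatable sentence."""
    return NOTICE.substitute(date1=format_day(first), date2=format_day(second))


if __name__ == "__main__":
    print(render_notice(date(2022, 3, 29), date(2022, 3, 31)))
```

Because `substitute` raises `KeyError` when a variable is missing, the sketch also reflects the caveat above: the template always expects both dates, and dropping one would require a different sentence.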

Regarding Tech News, it is important for CommRel to have access to a page where you'd schedule DB changes. This way, we would add or remove the line as part of our weekly process. Can you setup this page if not already existing? When done, I will take care of Tech News processes and doc (T303937).

It'll be in https://wikitech.wikimedia.org/wiki/Deployments

I assume that the first iteration of this new process will be for T302283: Read-only window needed for s6?

If possible, we have another maintenance that we'd like to do first, which is T301850 (basically the same as the one you suggested, but affecting s3 wikis instead of s6 wikis, with the same impact (read-only) and duration).
Let me know if that's fine.

I think a blog post about this would be great, the recent advancements in DB-related automation have been pretty cool but I feel that many people aren't aware of them, or recognize how much manual effort is no longer needed, making this process easier, faster, safer, more routine, etc.

Yeah. This ticket in itself is probably not worth a blogpost but if you add auto_schema and so on, it's a good opportunity to explain what's happening.

Yeah, master swap wise we haven't changed anything in the last months, we have been using the switchover script developed by Jaime and dbctl for more than a year now (or even more). So in that regard we are not changing anything. We have simply built lots of trust (based on loooots of master switchovers) where the downtime was around 30 seconds and we believe we are ready to start making this a normal maintenance operation.

Change 772813 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/tools/release@master] Add primary database switchover windows

https://gerrit.wikimedia.org/r/772813

Change 772813 merged by jenkins-bot:

[mediawiki/tools/release@master] Add primary database switchover windows

https://gerrit.wikimedia.org/r/772813

If possible, we have another maintenance that we'll like to do first, which is T301850 (basically the same as the one you suggested but affecting s3 wikis instead of s6 wikis - but same impact (read only) and duration).
Let me know if that's fine

It is fine. :)
I have to work on T303937: Update Tech News preloads and documentation for the new primary database switchovers workflow a bit more but I think we will be on time there.

In order to really inform the communities, we should have a clear list of targeted wikis. I don't see one besides "s3" in T301850: Switchover s3 master (db1157 -> db1123). Do we have a wiki page listing the wikis in s3 that we can link to? The idea is to have a permanent URL we can use in the message (as the message would be reused all the time).

And you need to start using User-notice in switchover tasks. :)

In order to really inform the communities, we should have a clear list of targeted wikis. I don't see one besides "s3" in T301850: Switchover s3 master (db1157 -> db1123). Do we have a wiki page listing the wikis in s3 that we can link to? The idea is to have a permanent URL we can use in the message (as the message would be reused all the time).

Not a problem, we can start including the list of wikis in our tickets. For future reference, I will leave this here too anyway:
https://noc.wikimedia.org/conf/highlight.php?file=dblists/s1.dblist
https://noc.wikimedia.org/conf/highlight.php?file=dblists/s2.dblist
https://noc.wikimedia.org/conf/highlight.php?file=dblists/s3.dblist
https://noc.wikimedia.org/conf/highlight.php?file=dblists/s4.dblist
https://noc.wikimedia.org/conf/highlight.php?file=dblists/s5.dblist
https://noc.wikimedia.org/conf/highlight.php?file=dblists/s6.dblist
https://noc.wikimedia.org/conf/highlight.php?file=dblists/s7.dblist
https://noc.wikimedia.org/conf/highlight.php?file=dblists/s8.dblist

They are also available at: https://noc.wikimedia.org/db.php
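
For anyone who wants to build the "Affected wikis" list programmatically, the following is a rough sketch. It assumes a dblist is plain text with one database name per line and `#` starting a comment; that format, the helper names `parse_dblist` and `section_of`, and the sample data are all assumptions for illustration (the sample is not a copy of the real files). Fetching the files from the URLs above is deliberately left out.

```python
from typing import Optional


def parse_dblist(text: str) -> list[str]:
    """Return database names from dblist text, skipping blank lines and '#' comments."""
    names = []
    for line in text.splitlines():
        entry = line.split("#", 1)[0].strip()  # drop trailing comments
        if entry:
            names.append(entry)
    return names


def section_of(wiki: str, dblists: dict[str, str]) -> Optional[str]:
    """Return the first section whose dblist contains wiki, or None if not found."""
    for section, text in dblists.items():
        if wiki in parse_dblist(text):
            return section
    return None


# Tiny illustrative sample, NOT the real file contents.
SAMPLE = {
    "s1": "enwiki\n",
    "s4": "# sample comment\ncommonswiki\n",
}

if __name__ == "__main__":
    print(section_of("commonswiki", SAMPLE))
```

With the real dblist contents loaded into the dictionary, the same lookup could be inverted to list every wiki in the section being switched over.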

And you need to start using User-notice in switchover tasks. :)

Good! Done for the first two next week.

Change 773440 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/software@master] switchover-tmpl.sh: Add "Affected wikis" field

https://gerrit.wikimedia.org/r/773440

Not a problem, we can start including the list of wikis in our tickets.

It is very helpful. Thank you!

I'm a bit hesitant on the current state. It's too close to the status quo. We can keep it like this for test periods but I think in the long-term it won't be sustainable. We can also move with incremental changes and slowly stop doing that in later cases.

My ideal solution would be to have the tech news entry be constant and link to the deployment calendar for more info. In other words, currently it's this:

Some wikis will be in read-only for a few minutes because of a change on databases. It will be performed on 29 March at 7:00 UTC (targeted wikis) and on 31 March at 7:00 UTC (targeted wikis). [1][2]

My suggested changes are something along the lines of:

Some wikis might be in read-only for a few minutes because of regular database maintenance. It will be performed on 29 March at 7:00 UTC and on 31 March at 7:00 UTC. See deployment calendar for more information [https://wikitech.wikimedia.org/wiki/Deployments]

This:

  • Removes the need to add user-notice to each one of switchover tickets
  • More importantly, this removes the extra lead time DBAs need (it must be a week beforehand so it makes it into Tech News), which is the biggest obstacle to doing switchovers. Granted, the current mode reduces it from two weeks to one week, but I'm aiming for a mode where a DBA decides on Monday that a switchover is needed and slots it for the day after without any notice.
    • This can be useful for emergency switchovers (not the kind where the master is dead and it needs to be done right away, but the kind where a PDU, NIC, or Ethernet link has issues and might flap at any minute).
    • The other reason is that, in the most extreme cases of schema changes, OS upgrades, security updates, etc., we need to do a rolling switchover of all sections, which would be eight switchovers and would take a month (I think we can probably do several switchovers in one window, but that can be debated). The overhead of a user notice for all eight sections is quite a lot for everyone involved.

How does that sound? Again, we can have this for two weeks as test run and decide to change later or we can go with this suggestion and make these changes a bit later. Whatever works for you.

P.S. I'm sorry I'm not very responsive these days, I have been a bit sick.

That was my assumption too, @Ladsgroup: that all this was for these "tests". I assumed that after those iterations we would simply have those windows reserved for us to do maintenance without having to give any heads-up other than the ticket itself, where we can and should include the affected wikis. I don't mind also tagging User-notice if that's helpful for everyone, but definitely not having to wait for it to go into Tech News. Again, I thought that was just for this initial testing.

Hiyo. After reading all the above, I'm still confused! I think the plan might have changed slightly during the course of related discussions.
It seems like the latest proposal from engineers is that we stop including these items in Tech News entirely, because the process is now stable & brief enough, and regularly needed enough, that it is no longer practical or necessary to schedule these individually more than a week ahead. -- Instead, the new process will simply become part of the work that is done twice a week, whenever needed. (cf. Ladsgroup's last comment)

If that is accurate, then I agree it would be great to have a blog post to point towards, when we announce this change.

Given that it is Thursday already and I don't want to pressure anyone to write a blog-post or decide on all this within 24 hours...
I propose that we do a 'normal' announcement this week, and a clearer announcement when a blog-post is ready.
I.e.
Current draft

Some wikis will be in read-only for a few minutes because of regular database maintenance. It will be performed on 29 March at 7:00 UTC (targeted wikis) and on 31 March at 7:00 UTC (targeted wikis). [1][2] This is a new procedure. The database maintenance will be performed twice a week when needed. [3]

Proposed draft for this week instead (just removing the last 2 sentences):

Some wikis will be in read-only for a few minutes because of regular database maintenance. It will be performed on 29 March at 7:00 UTC (targeted wikis) and on 31 March at 7:00 UTC (targeted wikis). [1][2]

Proposed (very rough) draft for next week (or later), IIUC:

There is a regular need for database maintenance which requires placing some wikis into read-only mode. In recent years this process has become faster, and it now usually only takes ~30 seconds. If you try to Publish an edit during these seconds, it will [????]. Therefore, these read-only periods will no longer be announced ahead of time. Instead, they now have a standard timeslot in the [deployment schedule] every Tuesday and Thursday at 07:00 UTC. You can [LINK | read more about the details and technical background that led to this change].

The aspect I'm still very uncertain about ("[????]" in the draft above) is what exactly happens (from a user-perspective) if we click "Publish" during this read-only time. Ladsgroup wrote "nothing gets lost", but does that mean the UI just stays in the same 'edit-mode' view (and churns for a while?) and the user needs to click "Publish" a second time? or something else? A screenshot/gif might help.

Hiyo. After reading all the above, I'm still confused! I think the plan might have changed slightly during the course of related discussions.
It seems like the latest proposal from engineers is that we stop including these items in Tech News entirely, because the process is now stable & brief enough, and regularly needed enough, that it is no longer practical or necessary to schedule these individually more than a week ahead. -- Instead, the new process will simply become part of the work that is done twice a week, whenever needed. (cf. Ladsgroup's last comment)

Close. I could go with either of these two:

  • Avoid putting anything in Tech News at all.
  • Or have something general in Tech News that doesn't need to be changed every week. The general text would be something like:
Some wikis might be in read-only for a few minutes because of regular database maintenance. It will be performed on $DATE1 at 7:00 UTC and on $DATE2 at 7:00 UTC. See deployment calendar for more information [https://wikitech.wikimedia.org/wiki/Deployments]

Completely stand-alone and no DBA or CRS change on it. We can remove it on no deploy weeks (e.g. Christmas). Something similar to the train rows basically.

I'm happy with either one, whatever CRS picks.

"the user needs to click "Publish" a second time?" is the correct answer. I think the error even asks the user to simply try again. Similar to "session lost" errors.

image.png (1×1 px, 192 KB)

"weee test" is a proper message in production.

Change 773440 merged by Marostegui:

[operations/software@master] switchover-tmpl.sh: Add "Affected wikis" field

https://gerrit.wikimedia.org/r/773440

s3 switchover was done today and read only was 35 seconds.

Change 774819 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/software@master] switchover-tmpl.sh: Change maintenance hour

https://gerrit.wikimedia.org/r/774819

Change 774819 merged by Marostegui:

[operations/software@master] switchover-tmpl.sh: Change maintenance hour

https://gerrit.wikimedia.org/r/774819

As discussed on Slack, the plan is:

  • to announce scheduled switchbacks in TN,
  • otherwise, announce the week after that it happened off-schedule.

This solution reduces the risk of readers ignoring a weekly recurring issue, but we still inform our readers and the work done is documented.
Both sentences can easily be added in TN's init, so that they can be kept in translation memory. It reduces the translators' burden.

We have defined a sentence in Tech News for scheduled switchovers. We will soon have one for off-schedule switchovers.

s5 switchover was done and read only time was 64 seconds.

Change 776892 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/software@master] switchover-tmpl.sh: Add prerequisites link and calendar invite

https://gerrit.wikimedia.org/r/776892

Change 776892 merged by Marostegui:

[operations/software@master] switchover-tmpl.sh: Add prerequisites link and calendar invite

https://gerrit.wikimedia.org/r/776892

s4 (commonswiki) switchover was done today: read only time was 28 seconds.

06:00:57
06:01:25
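
Assuming the two timestamps above mark the start and end of read-only mode on the same day, the reported 28 seconds can be checked with a short calculation. The function name `read_only_seconds` is hypothetical; this is a sketch for verifying the arithmetic, not part of the switchover tooling.

```python
from datetime import datetime


def read_only_seconds(start: str, end: str) -> int:
    """Seconds between two HH:MM:SS timestamps, assumed to fall on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


if __name__ == "__main__":
    # Timestamps from the s4 switchover reported above.
    print(read_only_seconds("06:00:57", "06:01:25"))  # prints 28
```

The three read-only times reported in this task (35, 64 and 28 seconds) are all far below the 30-minute window that is reserved.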

Now that we are good with the solution of two 30-minute slots per week (Tue/Thu) in the deployment calendar, CRS work is done. We have a placeholder in Tech News for upcoming scheduled deployments. The only thing we need is for tasks about upcoming switchovers to be tagged with User-notice.

I think this task can be closed now. :)

+1 to close
Thanks for all the help