switchdc: Improve wgReadOnly message
Closed, Resolved (Public)

Description

As I said on wikitech-l, we should improve the message shown for $wgReadOnly mode when switching datacenters. The overall message is defined on translatewiki and takes an explanation as a parameter (see the manual for further information). What we currently insert as this parameter is
"MediaWiki is in read-only mode for maintenance. Please try again in a few minutes."

We should improve this message to clearly state that we are about to switch over datacenters and link to the meta page: https://meta.wikimedia.org/wiki/Tech/Server_switch_2017

When switching DCs, it is still necessary to edit the configuration files in rOMWC by hand, upload them to Gerrit, merge and git pull them to the deployment host: Example. So instead of just commenting or uncommenting those lines, it would be fairly easy to choose a message that is actually specific to the switchover we're doing at that moment. The problem is that we currently have the generic message hardcoded in the switchdc scripts t02_start_mediawiki_readonly.py and t08_stop_mediawiki_readonly.py, so in reality it isn't as easy to exchange that message as it should be.

BTW I couldn't find the operations-switchdc repository in Diffusion (Gerrit links to the Diffusion repo don't work either) but only on GitHub.

Event Timeline

I will also add that the central notice needs to be changed to show up-to-date messages: it says to join an IRC channel other than wikimedia-tech, which is wrong.

The manual change + commit + deploy of the MW configuration might actually not be needed anymore, it depends on T163398. If that change lands in production before the switchback the related tasks in Switchdc will be updated to use conftool to change those values, hence that hardcoded part will go away anyway.

In case, for any reason, T163398 will not land in production in time, I can modify the Switchdc check to be a regex and just verify if the lines are commented out or not ignoring the actual message. Another approach could be to uncomment/comment $wgReadOnly in the same files (db-$dc.php) instead of having one variable per shard, but I'm not sure the final effect on MW will be the same.
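The regex idea above could look roughly like the following. This is only a minimal sketch: the line layout (`'s1' => 'message',` with an optional `#` or `//` in front) and the names `READONLY_LINE` and `readonly_state` are assumptions for illustration, not the actual switchdc code or the exact format of the db-$dc.php files.

```python
import re

# Hypothetical pattern for a per-shard read-only entry such as
#   's1' => 'any message',
# optionally preceded by a PHP comment marker. The message text is ignored.
READONLY_LINE = re.compile(
    r"""^\s*(?P<comment>\#|//)?\s*            # optional comment marker
        '(?P<shard>[^']+)'\s*=>\s*            # shard key, e.g. 's1' or 'DEFAULT'
        (?P<quote>['"]).*(?P=quote)\s*,?\s*$  # any quoted message
    """,
    re.VERBOSE,
)


def readonly_state(line):
    """Return (shard, is_commented) for a read-only entry, or None otherwise."""
    match = READONLY_LINE.match(line)
    if match is None:
        return None
    return match.group("shard"), match.group("comment") is not None


if __name__ == "__main__":
    sample = "# 's1' => 'MediaWiki is in read-only mode for maintenance.',"
    print(readonly_state(sample))
```

With a check like this, the message could be changed in mediawiki-config without touching the tool, since only the commented/uncommented state is verified.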

The manual change + commit + deploy of the MW configuration might actually not be needed anymore, it depends on T163398. If that change lands in production before the switchback the related tasks in Switchdc will be updated to use conftool to change those values, hence that hardcoded part will go away anyway.

I thought this is to be removed at some point, didn't notice we're already on it. Great! So yeah, that should be the preferred way of course.

In case, for any reason, T163398 will not land in production in time, I can modify the Switchdc check to be a regex and just verify if the lines are commented out or not ignoring the actual message.

I already thought about just exchanging the message both in mediawiki-config and in the switchdc tool. A regex would be better of course; however, if it's just for this single switch on Wed, it may be easier to exchange the message in both places and concentrate on moving to etcd afterwards. If you already have regex checks for that mediawiki.check_config_line part built into switchdc, that's fine, but otherwise I wouldn't recommend spending time implementing this if we'll only need it this week and will replace that part with some etcd-related stuff afterwards anyway. So if you feel it'd be the easier thing to do, we should just exchange the one hardcoded message for another hardcoded message for now. That wouldn't do any harm in this case, given that we will change this script for etcd in the near future anyway.

Another approach could be to uncomment/comment $wgReadOnly in the same files (db-$dc.php) instead of having one variable per shard, but I'm not sure the final effect on MW will be the same.

I really wondered why we keep repeating the very same error message again and again, so switching to $wgReadOnly (after checking that it doesn't do any harm) would actually be something I'd really like.

The manual change + commit + deploy of the MW configuration might actually not be needed anymore, it depends on T163398. If that change lands in production before the switchback the related tasks in Switchdc will be updated to use conftool to change those values, hence that hardcoded part will go away anyway.

I thought this is to be removed at some point, didn't notice we're already on it. Great! So yeah, that should be the preferred way of course.

Me and @Volans helped test this feature tonight, and sadly it still needs some work before being reliably used in production. The fact we tried this tonight means we are very short on time: we have to revert steps in the automation process and test them. So I don't think we have time to do the following:

In case, for any reason, T163398 will not land in production in time, I can modify the Switchdc check to be a regex and just verify if the lines are commented out or not ignoring the actual message.

and sadly, returning the preferred message (which one?) will require an added round of testing we don't really have time for. So the only thing we can try to do (no promise, given the time constraints) is:

  • Agree quickly on a message to show
  • Change the message in the inactive datacenter now. I think we can live with that
  • Change the commit I did this morning for the switchover
  • Change the messages checked in switchdc
  • Add a further commit later on to revert the read-only message in the now-inactive datacenter after the switchover

Another approach could be to uncomment/comment $wgReadOnly in the same files (db-$dc.php) instead of having one variable per shard, but I'm not sure the final effect on MW will be the same.

I really wondered why we keep repeating the very same error message again and again, so switching to $wgReadOnly (after checking that it doesn't do any harm) would actually be something I'd really like.

This could be fine, but we really have no time for this.

So, I'll do my best but I'll err on the side of caution. Any suggestions for an appropriate message to show to the users?

The manual change + commit + deploy of the MW configuration might actually not be needed anymore, it depends on T163398. If that change lands in production before the switchback the related tasks in Switchdc will be updated to use conftool to change those values, hence that hardcoded part will go away anyway.

I thought this is to be removed at some point, didn't notice we're already on it. Great! So yeah, that should be the preferred way of course.

When filing this task I expected to change the hardcoded message for today and to think about how to generalize this in the aftermath of today's switch. In that sense, the fact that you already had something to test tonight is much more than I expected. If it doesn't work now, that's not a problem. As you said, let's just exchange that hardcoded message in switchdc and in your commit for now. All of this generalization work was never meant to be done by today (at least not by me).

Any suggestions for an appropriate message to show to the users?

Well, basically anything that is a bit more informative. For what I want, something like "MediaWiki is currently in read-only mode because we are switching our main datacenter. Find more information [[meta:Server_switch_2017|here]]." would already be fine (I don't know whether to wikitext- or HTML-encode the link, though). Basically it should include a hint that "maintenance" means switching datacenters and a link to that meta page. The meta page itself has all the necessary information.

I'm currently not at home, so I can't amend the patch and submit one for switchdc myself. Sorry about that, and thanks for taking care of this.

Some suggestions from IRC:

  • "This wiki is in read-only mode for a server switch test. See https://meta.wikimedia.org/wiki/codfw for more information." (preferred)
  • "This wiki is in read-only mode for a server switch test. This test has started at [hour] UTC and is scheduled to end at [hour+30 min]."

MediaWiki is in read-only mode for a datacenter switchover test; please try again in a few minutes. See https://meta.wikimedia.org/wiki/Special:MyLanguage/Tech/Server_switch_2017 for details.

"Database is now read-only for a test; please try again in a few minutes. More at https://meta.wikimedia.org/wiki/Special:MyLanguage/Tech/Server_switch_2017."

Change 351614 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/mediawiki-config@master] Change the read-only message to be more informative during the switchover

https://gerrit.wikimedia.org/r/351614

Volans moved this task from In Progress to In Code Review on the SRE-tools board.

So: I tested on a local install with MW 1.29wmf18. I tested setting $wgReadOnly or readOnlyBySection with a link with brackets:

'DEFAULT' => "this is a test. [https://meta.wikimedia.org/wiki/Special:MyLanguage/Tech/Server_switch_2017]"

and a link without brackets:

'MediaWiki is in read-only mode for a datacenter switchover test. https://meta.wikimedia.org/wiki/codfw Please try again in a few minutes.'

Both rendered appropriately, the first with [1] where the '1' is a link, the second with the full text of the url marked up as a clickable link. Note to future self, wrapWikiMsg parses stuff, yay.

Change 351616 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/switchdc@master] Change the expected message in the db configurations

https://gerrit.wikimedia.org/r/351616

Change 351616 merged by Volans:
[operations/switchdc@master] Change the expected message in the db configurations

https://gerrit.wikimedia.org/r/351616

Change 351614 merged by Giuseppe Lavagetto:
[operations/mediawiki-config@master] Change the read-only message to be more informative during the switchover

https://gerrit.wikimedia.org/r/351614

Mentioned in SAL (#wikimedia-operations) [2017-05-03T11:37:36Z] <oblivian@naos> Synchronized wmf-config: Changing the read-only reason for the DC switchover (T164177) (duration: 01m 20s)

Volans claimed this task.
Volans moved this task from In Code Review to Done on the SRE-tools board.

@Volans [Not relevant for today's switchover, feel free to read later if busy]
I don't agree that this is resolved. This task, as shown in the description and various comments, is about two things:

  • show a nice message during the switchover today
  • improve our methods of setting the message, so that we can provide a specific message for each switchover, without hardcoding the message in switchdc script

You're right that the former is done. But the latter is not: we just exchanged one hardcoded string for another hardcoded string in the switchdc script. Please either reopen this task or create another one for the second purpose. While I'd prefer to reopen this one (having all the comments in one place), I'll leave this closed for now and leave the decision up to you.

@EddieGP I agree with you. I closed it because this one was targeting this specific rollout and switchdc, and I didn't want to leave it open until the next switch.

Also I think that the remaining part is more related to MediaWiki in general. For example if we want a translated message, decouple it from the hardcoded configuration, etc.
For the next switchover almost certainly we'll have the etcd-driven configuration and the way to set the message will change.

For example, with the current MediaWiki implementation of the etcd configuration, setting the variable ReadOnly to a string will make MediaWiki use that string as the message. So in that case the message will be in only one place, switchdc. But if translated messages need to be supported, I'm sure things will change.

Feel free to re-open it, updating the Phab tags accordingly.
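
To illustrate keeping the message in one place, here is a minimal sketch of how a switchover-specific message could be built from a single template. The names `READONLY_TEMPLATE` and `build_readonly_message` are hypothetical and not part of switchdc; the wording follows one of the suggestions in this task. How the resulting string would be written to the etcd-backed ReadOnly variable is deliberately left out, since that interface is not described here.

```python
# Hypothetical template; the wording follows a suggestion made in this task.
READONLY_TEMPLATE = (
    "MediaWiki is in read-only mode for {reason}; "
    "please try again in a few minutes. See {url} for details."
)


def build_readonly_message(reason, url):
    """Build the read-only message for one specific switchover."""
    # Assumption: only full https links are accepted, so the URL is
    # rendered as a clickable link in the read-only notice.
    if not url.startswith("https://"):
        raise ValueError("url must be a full https:// link")
    return READONLY_TEMPLATE.format(reason=reason, url=url)


if __name__ == "__main__":
    print(build_readonly_message(
        "a datacenter switchover test",
        "https://meta.wikimedia.org/wiki/Special:MyLanguage/Tech/Server_switch_2017",
    ))
```

Each switchover would then only need a new reason and URL rather than a new hardcoded string.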