Database primary master failover on s8 (wikidatawiki)
Closed, Resolved · Public

Description

We need to replace the current primary database master for wikidatawiki.
This host is old and out of warranty, so it needs to be decommissioned. In addition, we need a host with a bigger disk to be able to continue with the wb_terms table redesign (T221764).

We would need a 30-minute read-only window for Wikidatawiki.

Date: Tue 30th July
Time: 05:00 AM UTC - 05:30 AM UTC (if everything goes as planned, we will not need the full 30-minute window)

Impact: All Wikidatawiki will go read-only. No edits will be allowed. Reads will not be impacted.

Event Timeline

Marostegui triaged this task as Normal priority. Jul 4 2019, 9:18 AM

Thank you!

Johan claimed this task. Jul 9 2019, 11:56 AM
Restricted Application added a project: User-Johan. Jul 9 2019, 11:56 AM
Johan moved this task from Backlog to Do now on the User-Johan board. Jul 9 2019, 12:00 PM
Johan moved this task from Backlog to Started on the CommRel-Specialists-Support (Jul-Sep-2019) board.
Jc86035 added a subscriber: Jc86035. Jul 9 2019, 4:04 PM

Is there an existing procedure for reflecting the page moves and deletions that will occur during the period that Wikidata is read-only?

Items are updated for moves via the job queue, so unless I’m mistaken the job should fail while the wiki is read-only and be retried automatically at a later time.

Johan added a comment. Jul 10 2019, 8:52 AM

Could someone confirm what @Lucas_Werkmeister_WMDE is saying about the job retry? (Asking since he's hedging with "unless I'm mistaken".) Would prefer this to be clear when we communicate this.

Ladsgroup added a subscriber: hoo. Jul 11 2019, 9:31 AM

Lucas is right: technically, any job that fails for any reason gets retried until it passes or reaches the maximum number of failures (it's 30, I think). Strangely, though, UpdateRepoOnMoveJob (more precisely UpdateRepoJob) doesn't fail if it can't edit; it just sends a debug message in saveChanges and acts as if nothing happened. Is that intentional, @hoo, or do you think we should change the behavior?

Johan added a comment. Jul 11 2019, 9:34 AM

Thanks! And since UpdateRepoOnMoveJob doesn't fail and won't try again, the page move wouldn't actually be reflected in Wikidata? Or am I missing what function it has?

Yes, we might fix that or we might say it's a bearable loss. I don't know enough context to say which one is better.

@Ladsgroup @hoo Either way, it'd be great if we could come to a conclusion on which, so we know what to tell the communities.

(I mean, I'd certainly prefer if pages weren't lost for years in the language links because of a move during these minutes, but I don't know what the cost of making sure that wouldn't happen would be.)

but strangely UpdateRepoOnMoveJob (more precisely UpdateRepoJob) doesn't fail if it can't edit, it just sends a debug message and act like nothing happened in saveChanges

More specifically, saveChanges() does return true/false to indicate if everything’s okay, but run() ignores the return value and unconditionally returns true itself. I don’t know if this was intentional at the time, but I also think it would probably be better to retry such cases (i.e. return $this->saveChanges( $item, $user );).
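
To illustrate the difference, here is a minimal sketch in Python rather than the actual Wikibase PHP. The class names, the `run_with_retries` helper, and the `ReadOnlyRepo` stand-in are hypothetical and not the real job queue API; only the contrast between ignoring and returning the `saveChanges()` result is taken from the discussion above.

```python
class ReadOnlyRepo:
    """Hypothetical repo that rejects the first `read_only_for` save attempts."""

    def __init__(self, read_only_for):
        self.read_only_for = read_only_for
        self.saved = []

    def save(self, change):
        if self.read_only_for > 0:
            self.read_only_for -= 1
            return False  # the edit was refused, e.g. because the wiki is read-only
        self.saved.append(change)
        return True


class SwallowingJob:
    """Mirrors the behaviour described above: run() ignores the save result."""

    def __init__(self, repo, change):
        self.repo = repo
        self.change = change

    def save_changes(self):
        return self.repo.save(self.change)

    def run(self):
        self.save_changes()  # result is dropped
        return True          # the queue always sees success


class PropagatingJob(SwallowingJob):
    """Proposed behaviour: run() reports what save_changes() returned."""

    def run(self):
        return self.save_changes()


def run_with_retries(job, max_attempts=30):
    """Toy queue: retry until the job reports success or attempts run out.

    The default of 30 is only the figure guessed at in the thread.
    Returns the number of attempts used, or None if the job never succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        if job.run():
            return attempt
    return None


swallow_repo = ReadOnlyRepo(read_only_for=2)
swallow_attempts = run_with_retries(SwallowingJob(swallow_repo, "page move"))
print(swallow_attempts, swallow_repo.saved)      # 1 [] : reported done, nothing saved

propagate_repo = ReadOnlyRepo(read_only_for=2)
propagate_attempts = run_with_retries(PropagatingJob(propagate_repo, "page move"))
print(propagate_attempts, propagate_repo.saved)  # 3 ['page move'] : saved on the third try
```

In this toy model the swallowing job is reported as done after one attempt even though nothing was saved, while the propagating job is retried until the save goes through.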

Can we see those debug messages anywhere, by the way? I assume the X-Wikimedia-Debug header doesn’t help us with jobs, and I’m not sure if debug-level messages are usually saved elsewhere (except on testwiki and test2wiki, apparently, which aren’t Wikibase repositories).

hoo added a comment. Jul 13 2019, 12:46 PM

But returning the saveChanges() result from run() would also retry on, for example, sitelink conflicts (another item already has that sitelink), but I guess that's bearable.

Why are jobs even being run during read-only time? After quickly skimming the job queue code, this shouldn't happen AFAICT. Is there a task or documentation about that?

Johan added a comment. Jul 15 2019, 9:02 AM

@hoo Shouldn't happen, as in "won't happen, we're worrying unnecessarily"?

Johan added a comment. Jul 15 2019, 9:10 AM

So far posted in:

I'll figure out what to do about banners and include it in the issue of Tech News the week of the read-only period, and then we should be done with this part of the preparations.

hoo added a comment. Jul 15 2019, 9:23 AM

It seems to me, yes.

I talked to @Trizek-WMF about banners. Since edits coming from other wikis will be run after the read-only period, which hopefully will be rather short, I don't think we need a banner for all Wikimedia wikis, but we'll set one up for Wikidata.

Trizek-WMF added a comment (edited). Jul 17 2019, 3:02 PM

An information banner will be displayed on Wikidata, between 04:30 UTC and 05:30 UTC on the 30th of July.

If some messages are planned to be left on the wikis, here is the link for that banner's translations: https://meta.wikimedia.org/w/index.php?title=Special:Translate&group=Centralnotice-tgroup-read_only_banner&task=view&filter=%21translated&action=translate Main languages are translated (at least most of them), but more languages are always welcome.

Marostegui closed this task as Resolved. Jul 30 2019, 5:16 AM

The failover was done successfully.
read-only start: 05:00:50
read-only stop: 05:02:21

Total read-only time: 01:31 (1 minute 31 seconds)
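
The total follows directly from the two timestamps above. A quick check in Python, where the helper name `read_only_duration` is just an illustrative choice and both times are assumed to be UTC on the same day:

```python
from datetime import datetime


def read_only_duration(start, stop, fmt="%H:%M:%S"):
    """Return the time between two same-day clock times as a timedelta."""
    return datetime.strptime(stop, fmt) - datetime.strptime(start, fmt)


duration = read_only_duration("05:00:50", "05:02:21")
total_seconds = int(duration.total_seconds())
formatted = f"{total_seconds // 60:02d}:{total_seconds % 60:02d}"
print(formatted)  # 01:31
```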

Thanks for helping with the communication with the community!

Johan moved this task from Do now to Archive on the User-Johan board. Jul 30 2019, 8:02 AM