
Switchover x2 master db2142 -> db2144
Closed, Resolved, Public

Description

Due to some on-site maintenance in codfw, we need to switch the x2 master before Aug 2nd.
This is the first time we'll do a master switchover on an active-active service, so it will be interesting to see how long it takes...

When: Thursday, July 28th at 06:00 UTC

  • Team calendar invite

Affected wikis: x2

Checklist:

NEW primary: db2144
OLD primary: db2142

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db2142.codfw.wmnet h=db2144.codfw.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover x2 T313811" 'A:db-section-x2'
  • Set NEW primary with weight 0 (and depool it from API or vslow/dump groups if it is present).
sudo dbctl instance db2144 set-weight 0
sudo dbctl config commit -m "Set db2144 with weight 0 T313811"
  • Topology changes: move all replicas under the NEW primary
This didn't work and I had to use orchestrator instead of: sudo db-switchover --replicating-master --timeout=25 --only-slave-move db2142 db2144
  • Disable puppet on both nodes
sudo cumin 'db2142* or db2144*' 'disable-puppet "primary switchover T313811"'

Failover:

  • Log the failover:
!log Starting x2 codfw failover from db2142 to db2144 - T313811
  • Switch primaries:
sudo db-switchover --replicating-master --skip-slave-move db2142 db2144
echo "===== db2142 (OLD)"; sudo db-mysql db2142 -e 'show slave status\G'
echo "===== db2144 (NEW)"; sudo db-mysql db2144 -e 'show slave status\G'
  • Promote NEW primary in dbctl, and remove read-only
sudo dbctl --scope codfw section x2 set-master db2144
sudo dbctl config commit -m "Promote db2144 to x2 primary T313811"
  • Restart puppet on both hosts:
sudo cumin 'db2142* or db2144*' 'run-puppet-agent -e "primary switchover T313811"'

Clean up tasks:

  • Clean up heartbeat table(s).
sudo db-mysql db2144 heartbeat -e "delete from heartbeat where file like 'db2142%';"
sudo db-mysql db1151 heartbeat -e "delete from heartbeat where file like 'db2142%';"
  • Change events for query killer:
events_coredb_master.sql on the new primary db2144
events_coredb_slave.sql on the new replica db2142
  • Update the candidate primary in dbctl and the orchestrator notes:
sudo dbctl instance db2142 set-candidate-master --section x2 true
sudo dbctl instance db2144 set-candidate-master --section x2 false
(dborch1001): sudo orchestrator-client -c untag -i db2144 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i db2142 --tag name=candidate
sudo db-mysql db1115 zarcillo -e "select * from masters where section = 'x2';"
  • (If needed): Depool db2142 for maintenance.
sudo dbctl instance db2142 depool
sudo dbctl config commit -m "Depool db2142 T313811"
  • Change db2142's weight to mimic the previous weight of db2144:
sudo dbctl instance db2142 edit
  • Update/resolve this ticket.

Event Timeline

Marostegui renamed this task from Switchover x2 master to Switchover x2 master db2142 -> db2143. Jul 26 2022, 2:44 PM
Marostegui updated the task description.
Marostegui added subscribers: tstarling, Krinkle.

@Krinkle @tstarling can x2 be set to RO entirely for a few seconds? I would assume that is done via dbctl, like any other section, with: dbctl --scope codfw section x2 ro "maintenance"

Marostegui updated the task description.
Marostegui moved this task from Triage to In progress on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2022-07-26T14:51:16Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set db2143 with weight 0 T313811', diff saved to https://phabricator.wikimedia.org/P31951 and previous config saved to /var/cache/conftool/dbconfig/20220726-145116-root.json

Mentioned in SAL (#wikimedia-operations) [2022-07-26T14:54:13Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set db2144 with weight 0 and db2143 back with 100 T313811', diff saved to https://phabricator.wikimedia.org/P31952 and previous config saved to /var/cache/conftool/dbconfig/20220726-145412-root.json

Marostegui renamed this task from Switchover x2 master db2142 -> db2143 to Switchover x2 master db2142 -> db2144. Jul 26 2022, 2:55 PM
Marostegui updated the task description.

Let's use db2144 instead; db2143's rack (B5) will undergo maintenance the following day, or possibly even the same one.

I wasn't sure whether "set to read-only" here refers to something that controls the MySQL server or etcd/mw-config. Based on a review of dbctl's code, dbctl --scope codfw section x2 ro essentially sets a $wgLBFactoryConf['readOnlyBySection'] key for MediaWiki to find.

readOnlyBySection is documented as setting read-only mode for specific "main" sections, where "main" refers to the distinction between newMainLB and newExternalLB, in other words s1-s8 versus something like parsercache, externalstore (ES text DB hosts), or extension1/extension2. The latter external clusters are shared by all wikis, so a "per section" read-only mode would not be meaningfully different from setting all wikis to read-only. Note that the main purpose of the configured read-only mode in MW is to ensure UI workflows are aware of read-only mode before attempting to submit a write, and thus avoid disappointing editors with a surprise submission failure. To statically configure MW so that it won't even attempt to connect or write to ExternalStore or other external clusters, one should configure MW to be generally read-only at that point, as it would not be meaningfully different.

See also T298876: x1 cannot be set to read only on MW, which covered the same ground; there it was said that dbctl doesn't support setting readonly for x1. I assumed this meant that such a command would fail, rather than give dbctl operators the false impression that it does something useful. If that is not the case, then perhaps that is worth improving in dbctl by limiting the option to the main sections.
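For illustration, the knob dbctl toggles has roughly the following shape on the MediaWiki side (a hedged sketch of the setting's structure; the real config is generated from etcd, and the exact keys and messages may differ):

```php
<?php
// Illustrative only: the shape of $wgLBFactoryConf['readOnlyBySection'].
// MediaWiki's Rdbms LBFactory consults this map; per the discussion above,
// only "main" sections (s1-s8) are meaningful keys here, which is why
// setting it for an external cluster like x2 has no useful effect.
$wgLBFactoryConf['readOnlyBySection'] = [
    's1' => 'Maintenance in progress', // section name => read-only reason
];
```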

I'll also note that the SqlBagOStuff interface we use for parser cache and main stash tolerates write failures gracefully, so encountering a few failed queries while read-only is fine. It behaves the same as when a Memcached query fails, which is to return false as if the key didn't exist.
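To make those semantics concrete, here is a small self-contained sketch (plain Python, not the actual SqlBagOStuff code; all names are illustrative) of the failure behaviour described above: a write attempted while the backend is read-only is swallowed, and the key subsequently reads as a miss, exactly as if a Memcached query had failed:

```python
class ReadOnlyError(Exception):
    """Raised by the storage layer when the server is in read-only mode."""


class GracefulCache:
    """Toy stand-in for a SqlBagOStuff-like cache that tolerates write failures."""

    def __init__(self, store, read_only=False):
        self.store = store          # plain dict standing in for the SQL table
        self.read_only = read_only  # simulates the server having read_only=1

    def set(self, key, value):
        try:
            if self.read_only:
                raise ReadOnlyError("server is read-only")
            self.store[key] = value
            return True
        except ReadOnlyError:
            return False  # the write failure is tolerated, not fatal

    def get(self, key):
        # A miss (or an unwritten key) returns False, mirroring Memcached.
        return self.store.get(key, False)


cache = GracefulCache({}, read_only=True)
assert cache.set("pc:somekey", "html blob") is False  # write silently fails
assert cache.get("pc:somekey") is False               # key reads as a miss
```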

Lastly, the MW-Rdbms library we use for all database interaction (both primary wiki data and parsercache/mainstash etc.) runs SELECT @@GLOBAL.read_only before writing to any DB host; the result is cached for a few seconds, dynamically enabling read-only mode and debouncing write failures.

The dbctl read-only step won't do anything, so there's no point in running it. But setting read_only=1 on the server should cause writes to fail gracefully. So go ahead with the master switch, omitting the read-only mode steps.

Change 817736 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] site.pp: Promote db2144 to x2 codfw master

https://gerrit.wikimedia.org/r/817736

Marostegui updated the task description.
Marostegui updated the task description.

Mentioned in SAL (#wikimedia-operations) [2022-07-28T05:19:41Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover x2 T313811

Mentioned in SAL (#wikimedia-operations) [2022-07-28T05:19:57Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover x2 T313811

It looks like the db-switchover script won't really work with such a multi-master replication setup, so I am going to use orchestrator to move the hosts instead.

Change 817736 merged by Marostegui:

[operations/puppet@production] site.pp: Promote db2144 to x2 codfw master

https://gerrit.wikimedia.org/r/817736

Mentioned in SAL (#wikimedia-operations) [2022-07-28T06:00:14Z] <marostegui> Starting x2 codfw failover from db2142 to db2144 - T313811

Mentioned in SAL (#wikimedia-operations) [2022-07-28T06:00:57Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Promote db2144 to x2 primary T313811', diff saved to https://phabricator.wikimedia.org/P32025 and previous config saved to /var/cache/conftool/dbconfig/20220728-060057-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2022-07-28T06:07:58Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db2142 T313811', diff saved to https://phabricator.wikimedia.org/P32026 and previous config saved to /var/cache/conftool/dbconfig/20220728-060757-root.json

This was done, but it required a lot of manual intervention.
Our master-switchover tooling isn't currently able to deal with circular replication.

The initial topology change failed, and I had to use orchestrator to get everything under db2144 (but not db1151). I could have done it manually as well, but since we have orchestrator and the data is considered volatile, why not use it. It worked.

The step to perform the master switch itself (moving everything under db2144) didn't work either: it moved the replicas under db2144, but it also reset replication on db1151 (the eqiad master). We ended up with db2144 as the primary master and db1151 as a replica, but db2144 was no longer replicating from db1151, as that replication thread got wiped. I had to manually re-enable that replication, grabbing the coordinates from db1151.
This left the topology non-circular for a minute until I got it back into its normal state.
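The manual repair boiled down to re-pointing db2144 at db1151 using db1151's current binlog coordinates. A hedged sketch of the kind of statements involved (the binlog file name and position are placeholders, not the real values; replication credentials are assumed to persist from the previous configuration):

```sql
-- On db1151 (eqiad master): read the current binlog coordinates.
SHOW MASTER STATUS;
-- e.g. File: db1151-bin.001234  Position: 56789  (placeholder values)

-- On db2144 (new codfw master): restore the eqiad -> codfw replication leg.
CHANGE MASTER TO
  MASTER_HOST = 'db1151.eqiad.wmnet',
  MASTER_LOG_FILE = 'db1151-bin.001234',  -- placeholder
  MASTER_LOG_POS = 56789;                 -- placeholder
START SLAVE;

-- Then verify the circular topology is back:
SHOW SLAVE STATUS\G
```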

I will create a task to analyze this and fix the tooling for future use.

Marostegui updated the task description.

Mentioned in SAL (#wikimedia-operations) [2022-08-22T14:22:44Z] <root@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover x2 T313811

Mentioned in SAL (#wikimedia-operations) [2022-08-22T14:22:49Z] <root@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover x2 T313811

Mentioned in SAL (#wikimedia-operations) [2022-08-22T14:23:13Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set db2142 with weight 0 T313811', diff saved to https://phabricator.wikimedia.org/P32749 and previous config saved to /var/cache/conftool/dbconfig/20220822-142312-marostegui.json