
Switchover x2 master db2142 -> db2144
Closed, Resolved, Public

Description

Due to some on-site maintenance in codfw, we need to switch the x2 master before Aug 2nd.
This is the first time we'll do a master switchover on an active-active service, so it will be interesting to see how long it takes...

When: Thursday, July 28th at 06:00 UTC

  • Team calendar invite

Affected wikis: x2

Checklist:

NEW primary: db2144
OLD primary: db2142

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db2142.codfw.wmnet h=db2144.codfw.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover x2 T313811" 'A:db-section-x2'
  • Set NEW primary with weight 0 (and depool it from API or vslow/dump groups if it is present).
sudo dbctl instance db2144 set-weight 0
sudo dbctl config commit -m "Set db2144 with weight 0 T313811"
  • Topology changes: move all replicas under the NEW primary
This didn't work and I had to use orchestrator instead of: sudo db-switchover --replicating-master --timeout=25 --only-slave-move db2142 db2144
  • Disable puppet on both nodes
sudo cumin 'db2142* or db2144*' 'disable-puppet "primary switchover T313811"'

Failover:

  • Log the failover:
!log Starting x2 codfw failover from db2142 to db2144 - T313811
  • Switch primaries:
sudo db-switchover --replicating-master --skip-slave-move db2142 db2144
echo "===== db2142 (OLD)"; sudo db-mysql db2142 -e 'show slave status\G'
echo "===== db2144 (NEW)"; sudo db-mysql db2144 -e 'show slave status\G'
  • Promote NEW primary in dbctl, and remove read-only
sudo dbctl --scope codfw section x2 set-master db2144
sudo dbctl config commit -m "Promote db2144 to x2 primary T313811"
  • Restart puppet on both hosts:
sudo cumin 'db2142* or db2144*' 'run-puppet-agent -e "primary switchover T313811"'

Clean up tasks:

  • Clean up heartbeat table(s).
sudo db-mysql db2144 heartbeat -e "delete from heartbeat where file like 'db2142%';"
sudo db-mysql db1151 heartbeat -e "delete from heartbeat where file like 'db2142%';"
  • Change events for query killer:
events_coredb_master.sql on the new primary db2144
events_coredb_slave.sql on the new replica db2142
  • Update the candidate primary in dbctl and the orchestrator notes:
sudo dbctl instance db2142 set-candidate-master --section x2 true
sudo dbctl instance db2144 set-candidate-master --section x2 false
(dborch1001): sudo orchestrator-client -c untag -i db2144 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i db2142 --tag name=candidate
sudo db-mysql db1115 zarcillo -e "select * from masters where section = 'x2';"
  • (If needed): Depool db2142 for maintenance.
sudo dbctl instance db2142 depool
sudo dbctl config commit -m "Depool db2142 T313811"
  • Change db2142's weight to mimic the previous weight of db2144:
sudo dbctl instance db2142 edit
  • Update/resolve this ticket.

Event Timeline

Marostegui renamed this task from Switchover x2 master to Switchover x2 master db2142 -> db2143. Jul 26 2022, 2:44 PM
Marostegui updated the task description.
Marostegui added subscribers: tstarling, Krinkle.

@Krinkle @tstarling can x2 be set to RO entirely for a few seconds? I would assume that is done via dbctl, like any other section, with: dbctl --scope codfw section x2 ro "maintenance"

Marostegui updated the task description.
Marostegui moved this task from Triage to In progress on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2022-07-26T14:51:16Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set db2143 with weight 0 T313811', diff saved to https://phabricator.wikimedia.org/P31951 and previous config saved to /var/cache/conftool/dbconfig/20220726-145116-root.json

Mentioned in SAL (#wikimedia-operations) [2022-07-26T14:54:13Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set db2144 with weight 0 and db2143 back with 100 T313811', diff saved to https://phabricator.wikimedia.org/P31952 and previous config saved to /var/cache/conftool/dbconfig/20220726-145412-root.json

Marostegui renamed this task from Switchover x2 master db2142 -> db2143 to Switchover x2 master db2142 -> db2144. Jul 26 2022, 2:55 PM
Marostegui updated the task description.

Let's use db2144 instead; db2143's rack (B5) will undergo maintenance the following day, or possibly even the same one.

I wasn't sure whether "set to read-only" here refers to something that controls the MySQL server or etcd/mw-config. Based on a review of dbctl's code, dbctl --scope codfw section x2 ro essentially sets a $wgLBFactoryConf['readOnlyBySection'] key for MediaWiki to find.

readOnlyBySection is documented as setting read-only mode for specific "main" sections, where "main" refers to the distinction between newMainLB and newExternalLB, in other words s1-s8 versus something like parsercache, externalstore (ES text DB hosts), or extension1/extension2. The latter external clusters are shared by all wikis, so a "per section" read-only mode would not be meaningfully different from setting all wikis to read-only. Note that the main purpose of the configured read-only mode in MW is to ensure UI workflows are aware of read-only mode before attempting to submit a write, and thus avoid disappointing editors with a surprise submission failure. To statically configure MW so that it won't even attempt to connect or write to ExternalStore or other external clusters, one should configure MW to be generally read-only at that point, as it would not be meaningfully different.

See also T298876: x1 cannot be set to read only on MW, which covered the same ground; there it was said that dbctl doesn't support setting readonly for x1. I assumed this meant that such a command would fail, rather than give dbctl operators the false impression that it does something useful. If that is not the case, then perhaps that is worth improving in dbctl by limiting the option to the main sections.
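For illustration, the knob dbctl toggles has roughly the following shape on the MediaWiki side (a hedged sketch of the setting's structure; the real config is generated from etcd, and the exact keys and messages may differ):

```php
<?php
// Illustrative only: the shape of $wgLBFactoryConf['readOnlyBySection'].
// MediaWiki's Rdbms LBFactory consults this map; per the discussion above,
// only "main" sections (s1-s8) are meaningful keys here, which is why
// setting it for an external cluster like x2 has no useful effect.
$wgLBFactoryConf['readOnlyBySection'] = [
    's1' => 'Maintenance in progress', // section name => read-only reason
];
```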

I'll also note that the SqlBagOStuff interface we use for parser cache and main stash tolerates write failures gracefully, so encountering a few failed queries while read-only is fine. It behaves the same as when a Memcached query fails, which is to return false as if the key didn't exist.
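To make those semantics concrete, here is a small self-contained sketch (plain Python, not the actual SqlBagOStuff code; all names are illustrative) of the failure behaviour described above: a write attempted while the backend is read-only is swallowed, and the key subsequently reads as a miss, exactly as if a Memcached query had failed:

```python
class ReadOnlyError(Exception):
    """Raised by the storage layer when the server is in read-only mode."""


class GracefulCache:
    """Toy stand-in for a SqlBagOStuff-like cache that tolerates write failures."""

    def __init__(self, store, read_only=False):
        self.store = store          # plain dict standing in for the SQL table
        self.read_only = read_only  # simulates the server having read_only=1

    def set(self, key, value):
        try:
            if self.read_only:
                raise ReadOnlyError("server is read-only")
            self.store[key] = value
            return True
        except ReadOnlyError:
            return False  # the write failure is tolerated, not fatal

    def get(self, key):
        # A miss (or an unwritten key) returns False, mirroring Memcached.
        return self.store.get(key, False)


cache = GracefulCache({}, read_only=True)
assert cache.set("pc:somekey", "html blob") is False  # write silently fails
assert cache.get("pc:somekey") is False               # key reads as a miss
```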

Lastly, the MW-Rdbms library we use for all database interaction (both primary wiki data and parsercache/mainstash etc.) runs SELECT @@GLOBAL.read_only before writing to any DB host; the result is cached for a few seconds, dynamically enabling read-only mode and debouncing write failures.

The dbctl read-only step won't do anything, so there's no point in running it. But setting read_only=1 on the server should cause writes to fail gracefully. So go ahead with the master switch, omitting the read-only mode steps.

Change 817736 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] site.pp: Promote db2144 to x2 codfw master

https://gerrit.wikimedia.org/r/817736

Marostegui updated the task description.
Marostegui updated the task description.

Mentioned in SAL (#wikimedia-operations) [2022-07-28T05:19:41Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover x2 T313811

Mentioned in SAL (#wikimedia-operations) [2022-07-28T05:19:57Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover x2 T313811

It looks like the db-switchover script won't really work with such a multi-master replication setup, so I am going to use orchestrator to move the hosts instead.

Change 817736 merged by Marostegui:

[operations/puppet@production] site.pp: Promote db2144 to x2 codfw master

https://gerrit.wikimedia.org/r/817736

Mentioned in SAL (#wikimedia-operations) [2022-07-28T06:00:14Z] <marostegui> Starting x2 codfw failover from db2142 to db2144 - T313811

Mentioned in SAL (#wikimedia-operations) [2022-07-28T06:00:57Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Promote db2144 to x2 primary T313811', diff saved to https://phabricator.wikimedia.org/P32025 and previous config saved to /var/cache/conftool/dbconfig/20220728-060057-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2022-07-28T06:07:58Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db2142 T313811', diff saved to https://phabricator.wikimedia.org/P32026 and previous config saved to /var/cache/conftool/dbconfig/20220728-060757-root.json

This was done, but it required a lot of manual intervention.
Our master-switchover tooling isn't currently able to deal with circular replication.

The initial topology change failed, and I had to use orchestrator to get everything under db2144 (but not db1151). I could have done it manually as well, but since we have orchestrator and the data is considered volatile, why not use it. It worked.

The step to perform the master switch itself (moving everything under db2144) didn't work either: it moved the replicas under db2144, but it also reset replication on db1151 (the eqiad master). We ended up with db2144 as the primary master and db1151 as a replica, but db2144 was no longer replicating from db1151, as that replication thread got wiped. I had to manually re-enable that replication, grabbing the coordinates from db1151.
This left the topology non-circular for a minute until I got it back into its normal state.
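The manual repair boiled down to re-pointing db2144 at db1151 using db1151's current binlog coordinates. A hedged sketch of the kind of statements involved (the binlog file name and position are placeholders, not the real values; replication credentials are assumed to persist from the previous configuration):

```sql
-- On db1151 (eqiad master): read the current binlog coordinates.
SHOW MASTER STATUS;
-- e.g. File: db1151-bin.001234  Position: 56789  (placeholder values)

-- On db2144 (new codfw master): restore the eqiad -> codfw replication leg.
CHANGE MASTER TO
  MASTER_HOST = 'db1151.eqiad.wmnet',
  MASTER_LOG_FILE = 'db1151-bin.001234',  -- placeholder
  MASTER_LOG_POS = 56789;                 -- placeholder
START SLAVE;

-- Then verify the circular topology is back:
SHOW SLAVE STATUS\G
```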

I will create a task to analyze this and fix the tooling for future use.

Marostegui updated the task description.

Mentioned in SAL (#wikimedia-operations) [2022-08-22T14:22:44Z] <root@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover x2 T313811

Mentioned in SAL (#wikimedia-operations) [2022-08-22T14:22:49Z] <root@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover x2 T313811

Mentioned in SAL (#wikimedia-operations) [2022-08-22T14:23:13Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set db2142 with weight 0 T313811', diff saved to https://phabricator.wikimedia.org/P32749 and previous config saved to /var/cache/conftool/dbconfig/20220822-142312-marostegui.json