
Primary-replica switchover automation
Open, In Progress, Medium, Public

Description

The current process for a master-replica switchover (e.g. https://phabricator.wikimedia.org/T409818) requires multiple manual steps.
This task is to add automation to:

  • Add new safety checks around host and section health
  • Replace copypasting during the switchover process
  • Leverage data now available in the Zarcillo DB
  • Enable extensive unit/functional tests (pytest)
  • Enable end-to-end integration tests on testbed T400056
  • Have it fully documented on wikitech so it can be used by any op
  • Have a dry-run feature that goes over each step without actually changing anything
    • Provide timestamps for each step executed
      • Total read_only time at the MySQL level
    • More pre-flight checks, such as:
      • Is pt-heartbeat running on the current master?
    • Make heartbeat migration more robust until it is migrated to a systemd service or moved remotely (so it is automatic and etcd-dependent)
    • Alter or check some master-related variables automatically (pt-config-diff h=localhost /etc/my.cnf ?):
      • Alter the expire_logs_days variable
      • Alter GTID mode automatically
      • Alter semi-sync automatically
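
To make the dry-run and timestamp items above more concrete, here is a minimal sketch of a step runner. It is not taken from the switchover helper: `StepRunner`, `StepRecord` and the step names are hypothetical, and the read-only duration is only what the tool observes between two steps, not a measurement taken inside MySQL.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class StepRecord:
    name: str
    started: float
    finished: float
    executed: bool  # False when the step was only logged because of dry-run


@dataclass
class StepRunner:
    """Hypothetical helper: runs named steps, or only records them in dry-run mode."""

    dry_run: bool = True
    clock: Callable[[], float] = time.time  # injectable so tests can fake time
    records: List[StepRecord] = field(default_factory=list)

    def run(self, name: str, action: Callable[[], None]) -> None:
        started = self.clock()
        if not self.dry_run:
            action()  # only real runs change anything
        finished = self.clock()
        self.records.append(StepRecord(name, started, finished, not self.dry_run))

    def elapsed_between(self, first: str, last: str) -> float:
        """Seconds from the start of step `first` to the end of step `last`."""
        by_name = {record.name: record for record in self.records}
        return by_name[last].finished - by_name[first].started


# Usage: in dry-run mode every step is timestamped but nothing is changed.
changes: List[str] = []
runner = StepRunner(dry_run=True)
runner.run("set read-only", lambda: changes.append("read-only"))
runner.run("switch primaries", lambda: changes.append("switched"))
runner.run("set read-write", lambda: changes.append("read-write"))
read_only_seconds = runner.elapsed_between("set read-only", "set read-write")
```

Injecting the clock keeps the timing logic testable with pytest without sleeping; a real tool would likely also log each record as it is created.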

The goal is to improve testability and confidence in the automation, so that switchover when the old master is unreachable (T196366) and, later on, emergency failover (T384810) can be implemented on top of it.

Incremental implementation + test progress:

  • functional test
  • run against test-s4 section
  • run against prod on secondary DC
  • run against prod primary DC

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
Draft: Allowing changing a host's replication master | repos/sre/wmfmariadbpy!18 | fceratto | T373436 | switchover-helper
Add switchover helper | repos/sre/wmfmariadbpy!17 | fceratto | switchover-helper | main

Event Timeline

FCeratto-WMF moved this task from Triage to Ready on the DBA board.
FCeratto-WMF changed the task status from Open to In Progress. Nov 17 2025, 12:35 PM
FCeratto-WMF triaged this task as Medium priority.
FCeratto-WMF moved this task from Ready to In progress on the DBA board.

Moving code from https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1129904 into https://gitlab.wikimedia.org/repos/sre/wmfmariadbpy to:

  • initially provide a CLI tool for the team
  • later on, provide a library that can be used in a switchover cookbook by other SREs

Change #1129904 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/cookbooks@master] Add switchover cookbook

https://gerrit.wikimedia.org/r/1129904

Test switchover in s8, active DC codfw, db2165 -> db2161
Fallback without Zarcillo
fetching dbconfig JSON
mock-fetching https://noc.wikimedia.org/dbconfig/codfw.json
Fetching https://raw.githubusercontent.com/wikimedia/puppet/production/hieradata/hosts/db2166.yaml
mock-fetching https://raw.githubusercontent.com/wikimedia/puppet/production/hieradata/hosts/db2166.yaml
Fetching https://raw.githubusercontent.com/wikimedia/puppet/production/hieradata/hosts/db2195.yaml
mock-fetching https://raw.githubusercontent.com/wikimedia/puppet/production/hieradata/hosts/db2195.yaml
Fetching https://raw.githubusercontent.com/wikimedia/puppet/production/hieradata/hosts/db2181.yaml
mock-fetching https://raw.githubusercontent.com/wikimedia/puppet/production/hieradata/hosts/db2181.yaml
Fetching https://raw.githubusercontent.com/wikimedia/puppet/production/hieradata/hosts/db2167.yaml
mock-fetching https://raw.githubusercontent.com/wikimedia/puppet/production/hieradata/hosts/db2167.yaml
Fetching https://raw.githubusercontent.com/wikimedia/puppet/production/hieradata/hosts/db2152.yaml
mock-fetching https://raw.githubusercontent.com/wikimedia/puppet/production/hieradata/hosts/db2152.yaml
Fetching https://raw.githubusercontent.com/wikimedia/puppet/production/hieradata/hosts/db2164.yaml
mock-fetching https://raw.githubusercontent.com/wikimedia/puppet/production/hieradata/hosts/db2164.yaml
Fetching https://raw.githubusercontent.com/wikimedia/puppet/production/hieradata/hosts/db2163.yaml
mock-fetching https://raw.githubusercontent.com/wikimedia/puppet/production/hieradata/hosts/db2163.yaml
Fetching https://raw.githubusercontent.com/wikimedia/puppet/production/hieradata/hosts/db2161.yaml
mock-fetching https://raw.githubusercontent.com/wikimedia/puppet/production/hieradata/hosts/db2161.yaml
Fetching https://raw.githubusercontent.com/wikimedia/puppet/production/hieradata/hosts/db2154.yaml
mock-fetching https://raw.githubusercontent.com/wikimedia/puppet/production/hieradata/hosts/db2154.yaml
mock-fetching https://config-master.wikimedia.org/mediawiki.yaml
Old primary: db2165
Candidates: ['db2161']
DC: codfw. Note: This is a switchover in the PRIMARY DC
asking: Lock section on zarcillo
▶ Check configuration differences between new and old primary:
[switchover_helper.check_dbctl] Check dbctl conf
Old primary db2165 dbctl struct: {'s8': {'candidate_master': False, 'percentage': 100, 'pooled': True, 'weight': 500}}
new primary db2161 dbctl struct: {'s8': {'candidate_master': True, 'percentage': 100, 'pooled': True, 'weight': 500}}
[switchover_helper.check_vars] Comparing MariaDB variables
mock-running <<sudo db-mysql db2165 -e 'SHOW VARIABLES' -N -B>>
mock-running <<sudo db-mysql db2161 -e 'SHOW VARIABLES' -N -B>>
MariaDB variables check:
 ✓ innodb_buffer_pool_size
 ✓ innodb_log_write_ahead_size
 ✓ innodb_flush_log_at_trx_commit
 ✓ innodb_file_per_table
 ✓ binlog_format
 ✓ gtid_mode
 ✓ sync_binlog
 ✓ log_slave_updates
 ✓ max_connections
 ✓ max_allowed_packet
 ✓ sql_mode
 ✓ character_set_server
 ✓ collation_server
 ✓ db2165 is read-write (the DC is active)
asking: Silence alerts on all hosts
[switchover_helper.downtime] Setting downtime on A:db-section-s8
asking: Set new primary db2161 dbctl weight to 0
[switchover_helper.dbctl] Waiting for dbctl diff to be empty
[switchover_helper.set_weight] Setting weight for db2161 to 0
<dbctl.instance.weight announce_message>
▶ Topology change: Move all replicas under the new primary
asking: Execute sudo db-switchover --timeout=25 --only-slave-move db2165 db2161
[switchover_helper.move_replicas] Moving replicas to db2161
asking: Topology changes, move all replicas under the new primary
mock-running <<sudo db-switchover --timeout=25 --only-slave-move db2165 db2161>>
asking: Disable puppet on old primary db2165
asking: Disable puppet on new primary db2161
▶ Merge gerrit puppet change to promote the primary
▶ DIY: run this after merging on Gerrit: ssh puppetserver1001.eqiad.wmnet -t sudo -i puppet-merge
asking: Continue?
▶ Entering primary failover section
asking: Log the failover on irc
asking: Set section s8 in read-only in dbctl?
[switchover_helper.section_readonly] Setting section s8 read-only
▶ Check that s8 is indeed read-only
asking: Switch primaries
[switchover_helper.switch_primary] Switching db2165 db2161 in s8
mock-running <<sudo db-switchover --skip-slave-move db2165 db2161>>
Checking replication status
mock-running <<sudo db-mysql db2165 -e 'SHOW SLAVE STATUS' -B>>
 ✓ db2165 is the primary source and not following replication
 ✓ db2165 is not replicating (confirmed as current primary)
mock-running <<sudo db-mysql db2161 -e 'SHOW SLAVE STATUS' -B>>
 ✖ db2161 Slave_IO_Running: None
 ✖ db2161 Slave_SQL_Running: None
 ✓ db2161 no replication errors
 ✖ db2161 invalid Seconds_Behind_Master: None
 ℹ db2161 is replicating from port None
 ⚠ db2161 is replicating from , not db2165
asking: Promote new primary in dbctl and set section in read-write
▶ Entering cleanup phase
asking: Clean up heartbeat table(s) on new primary db2161
mock-running <<sudo db-mysql db2161 heartbeat -e "DELETE FROM heartbeat WHERE file LIKE 'db2165%';">>
asking: Enable and run puppet on old primary db2165
[switchover_helper.run_puppet] Run puppet on old primary
asking: Enable and run on new primary db2161
[switchover_helper.run_puppet] Run puppet on new primary
asking: Run set-master query on db2161
▶ Changing events for query killer
asking: Run set-replica query on db2165
▶ Changing events for query killer
▶ DIY: Merge the related CR for DNS configuration in Puppet then run ssh dns1004.wikimedia.org -t sudo authdns-update
asking: Confirm when done
asking: Update candidate primary dbctl setting db2165 as candidate and db2161 not candidate
asking: Update Orchestrator candidate tags
mock-running <<sudo cumin 'dborch*' 'orchestrator-client -c untag -i db2161 --tag name=candidate'>>
mock-running <<sudo cumin 'dborch*' 'orchestrator-client -c tag -i db2165 --tag name=candidate'>>
asking: Depool db2165
▶ Completed
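
For reference, the "Comparing MariaDB variables" step in the run above could look roughly like the sketch below. It assumes the `SHOW VARIABLES` output obtained with `-N -B` is one tab-separated `name<TAB>value` pair per line; the function names and the sample values are made up for illustration and are not the helper's actual code. A variable missing on either host is reported as a mismatch here, which is a design assumption rather than the tool's known behaviour.

```python
from typing import Dict, Iterable, List, Tuple


def parse_show_variables(output: str) -> Dict[str, str]:
    """Parse `name<TAB>value` lines (batch output without column names)."""
    variables: Dict[str, str] = {}
    for line in output.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        name, _, value = line.partition("\t")
        variables[name] = value
    return variables


def compare_variables(
    old: Dict[str, str], new: Dict[str, str], names: Iterable[str]
) -> List[Tuple[str, bool]]:
    """Return (variable name, matches) for every variable that must agree."""
    results: List[Tuple[str, bool]] = []
    for name in names:
        # Assumption: a variable absent on either host counts as a mismatch.
        matches = name in old and name in new and old[name] == new[name]
        results.append((name, matches))
    return results


# Illustrative values only, not taken from real hosts.
old_output = "binlog_format\tROW\nsync_binlog\t1\nmax_connections\t5000\n"
new_output = "binlog_format\tROW\nsync_binlog\t0\nmax_connections\t5000\n"

report = compare_variables(
    parse_show_variables(old_output),
    parse_show_variables(new_output),
    ["binlog_format", "sync_binlog", "max_connections"],
)
for name, ok in report:
    print(f" {'✓' if ok else '✖'} {name}")
```

Keeping parsing and comparison as pure functions means both can be covered by pytest using captured command output, without a database connection.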

Some of the comments from our IRC discussion:

  • Fetching from external sources should be informational only and shouldn't break/block the switchover if a source is unavailable. In that case the tool should warn the operator so the candidate master host can be confirmed by some other (manual) means.
  • The lock on Zarcillo isn't really locking anything, which is how it should be.
  • dbctl commands should show the diff and ask the operator for confirmation
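
A rough sketch of how the first and last points could be handled, under the assumption that fetching, warning, prompting and applying are passed in as plain callables. `fetch_optional` and `apply_with_confirmation` are hypothetical names, not functions from wmfmariadbpy or dbctl.

```python
from typing import Callable, Optional


def fetch_optional(
    source: str,
    fetch: Callable[[str], str],
    warn: Callable[[str], None],
) -> Optional[str]:
    """Fetch informational data; warn and return None instead of failing."""
    try:
        return fetch(source)
    except Exception as exc:  # deliberately broad: this data is optional
        warn(
            f"Could not fetch {source}: {exc}. "
            "Confirm the candidate master host manually."
        )
        return None


def apply_with_confirmation(
    diff: str,
    ask: Callable[[str], bool],
    apply: Callable[[], None],
) -> bool:
    """Show the pending diff and apply it only if the operator agrees."""
    if not diff:
        return False  # nothing to change
    if not ask(f"About to apply this diff:\n{diff}\nContinue?"):
        return False
    apply()
    return True
```

In an interactive run, `ask` could wrap `input()` and `warn` could write to the log; in tests both can be replaced with simple lambdas.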
FCeratto-WMF renamed this task from Improve master-replica switchover (flip) automation to Primary-replica switchover automation. Dec 15 2025, 8:13 AM

Change #1129904 abandoned by Federico Ceratto:

[operations/cookbooks@master] Add switchover cookbook

Reason:

The tool has been moved to https://gitlab.wikimedia.org/repos/sre/wmfmariadbpy/-/merge_requests/17

https://gerrit.wikimedia.org/r/1129904