Maniphest T199224

Test database master switchover script on codfw
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	• jcrespo
	Jul 10 2018, 1:35 PM

Description

There is a work in progress script to automate a safe master failover from on server to another.
https://gerrit.wikimedia.org/r/#/c/operations/software/wmfmariadbpy/+/439871/

Do some production testing on passive hosts (codfw) before using for real on es1014 and db1052.

Details

Subject	Repo	Branch	Lines +/-
switchover: Fix bug where shards are added with an extra 'b'	operations/software/wmfmariadbpy	master	+2 -2
mariadb: Fix heartbeat regex	operations/software/wmfmariadbpy	master	+1 -1
switchover: Add the functionality to start and stop heartbeat	operations/software/wmfmariadbpy	master	+99 -3
switchover: Make posible replica migration an optional, separate step	operations/software/wmfmariadbpy	master	+18 -7
mariadb: Migrate db1118 from mysql 8 to mariadb 10.1	operations/puppet	production	+2 -1

Customize query in gerrit

Related Objects
Search...

Status	Assigned	Task
Invalid	ayounsi	T199142 Increase network capacity (2018-19 Q1 Goal)
Resolved	• Cmjohnson	T183585 Rack/cable/configure asw2-b-eqiad switch stack
Resolved	• jcrespo	T197073 switchover es1014 to es1017
Resolved	None	T186320 Decommission db1051-db1060 (DBA tracking)
Resolved	• Marostegui	T197069 Failover db1052 (s1) db primary master
Resolved	LSobanski	T156461 [META ticket] Automation for our DBs tracking task
Resolved	• jcrespo	T199224 Test database master switchover script on codfw

Event Timeline

• jcrespo created this task.Jul 10 2018, 1:35 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJul 10 2018, 1:35 PM

• jcrespo triaged this task as High priority.Jul 10 2018, 1:35 PM

• jcrespo moved this task from Triage to Pending comment on the DBA board.

• jcrespo added parent tasks: T197069: Failover db1052 (s1) db primary master, T197073: switchover es1014 to es1017.

Reminder, we have 4 hosts as spare now in eqiad: T196376

Some progress reporting:

$ python3
Python 3.4.2 (default, Oct  8 2014, 10:45:20) 
[GCC 4.9.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import WMFMariaDB
>>> master=WMFMariaDB.WMFMariaDB(host='db1095')
>>> import WMFReplication
>>> master_replication=WMFReplication.WMFReplication(master)
>>> master_replication.debug()
db1095.eqiad.wmnet:3306/(none)> master: db1067.eqiad.wmnet:3306, io: Yes, sql: Yes, pos: db1067-bin.001280/704908984
>>> master_replication.stop_slave()
{'master_ssl_cipher': '', 'master_server_id': 171974720, 'master_ssl_key': '', 'last_errno': 0, 'last_sql_errno': 0, 'replicate_ignore_server_ids': '', 'last_io_error': '', 'replicate_wild_ignore_table': '', 'replicate_ignore_domain_ids': '', 'relay_master_log_file': 'db1067-bin.001280', 'slave_io_running': 'No', 'replicate_wild_do_table': '', 'master_ssl_verify_server_cert': 'No', 'gtid_io_pos': '', 'seconds_behind_master': None, 'master_host': 'db1067.eqiad.wmnet', 'parallel_mode': 'conservative', 'relay_log_pos': 99104951, 'relay_log_space': 100145534, 'using_gtid': 'No', 'skip_counter': 0, 'read_master_log_pos': 705765165, 'success': True, 'master_port': 3306, 'relay_log_file': 'db1095-relay-bin.000018', 'exec_master_log_pos': 705765165, 'until_log_file': '', 'master_ssl_allowed': 'Yes', 'master_user': 'repl', 'master_ssl_cert': '', 'until_log_pos': 0, 'replicate_do_domain_ids': '', 'replicate_ignore_table': '', 'last_sql_error': '', 'replicate_ignore_db': '', 'master_ssl_ca_file': '', 'replicate_do_table': '', 'connect_retry': 60, 'until_condition': 'None', 'slave_sql_running': 'No', 'last_io_errno': 0, 'master_log_file': 'db1067-bin.001280', 'last_error': '', 'replicate_do_db': '', 'master_ssl_crlpath': '', 'master_ssl_crl': '', 'slave_io_state': '', 'master_ssl_ca_path': ''}
>>> master_replication.debug()
db1095.eqiad.wmnet:3306/(none)> master: db1067.eqiad.wmnet:3306, io: No, sql: No, pos: db1067-bin.001280/705765165
>>> master_replication.debug()
db1095.eqiad.wmnet:3306/(none)> master: db1067.eqiad.wmnet:3306, io: No, sql: No, pos: db1067-bin.001280/705765165
>>> master_replication.reset_slave()
{'database': None, 'success': True, 'numrows': 0, 'port': 3306, 'query': 'RESET SLAVE ALL', 'rows': None, 'host': 'db1095.eqiad.wmnet', 'fields': None}
>>> master_replication.debug()
db1095.eqiad.wmnet:3306/(none)> Not configured as a slave

0.7 seconds on performing a switchover with no writes ongoing (easy case):

root@neodymium:~/wmfmariadbpy/wmfmariadbpy$ time ./switchover.py db1095 db1102
Starting preflight checks...
Setting up original master as read-only
Slave caught up to the master after waiting 0.008965492248535156 seconds
Servers sync at master: db1095-bin.000006:576902809 slave: db1102-bin.000001:576902795
Stopping original master->slave replication
Setting up replica as read-write
All commands where successful, current status: original master read_only: 1 / original slave read_only: 0
Trying to invert replication direction
Verifying everything went as expected...
SUCCESS: Master switch completed successfully

real    0m0.710s
user    0m0.220s
sys     0m0.008s
root@neodymium:~/wmfmariadbpy/wmfmariadbpy$ time ./switchover.py db1102 db1095
Starting preflight checks...
Setting up original master as read-only
Slave caught up to the master after waiting 0.0075452327728271484 seconds
Servers sync at master: db1102-bin.000001:576902795 slave: db1095-bin.000006:576902809
Stopping original master->slave replication
Setting up replica as read-write
All commands where successful, current status: original master read_only: 1 / original slave read_only: 0
Trying to invert replication direction
Verifying everything went as expected...
SUCCESS: Master switch completed successfully

real    0m0.697s
user    0m0.208s
sys     0m0.016s

• jcrespo mentioned this in rOSMD3b21d76b7b81: Add replication managing.Jul 20 2018, 12:27 PM

• jcrespo mentioned this in rOSMD280336b2bab6: Add replication managing.Jul 20 2018, 12:33 PM

• jcrespo mentioned this in rOSMDfd953d6567bd: Add replication managing.Jul 20 2018, 12:35 PM

Change 447365 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Migrate db1118 from mysql 8 to mariadb 10.1

https://gerrit.wikimedia.org/r/447365

Change 447365 merged by Jcrespo:
[operations/puppet@production] mariadb: Migrate db1118 from mysql 8 to mariadb 10.1

https://gerrit.wikimedia.org/r/447365

What the first tests on production-like hosts look like:

root@neodymium:~/wmfmariadbpy/wmfmariadbpy$ time ./move_replica.py db1118 db1095
Purging binary logs to speed up coordinate finding and waiting 5.0 seconds...
Stopping replication on the instance
Stopped at master position: db1102-bin.000015:30633013, slave position db1118-bin.000007:82334
Last GTID executed: BEGIN GTID 171974720-171974720-31485244
Position found on the master at: db1095-bin.000021:82334
SUCCESS: Topology change completed successfully

real    0m5.605s
user    0m0.308s
sys     0m0.028s
root@neodymium:~/wmfmariadbpy/wmfmariadbpy$ time ./move_replica.py db1118 db1102
Purging binary logs to speed up coordinate finding and waiting 5.0 seconds...
Stopping replication on the instance
Stopped at master position: db1095-bin.000021:731731, slave position db1118-bin.000008:253914
Last GTID executed: BEGIN GTID 171974720-171974720-31486754
Position found on the master at: db1102-bin.000016:253930
SUCCESS: Topology change completed successfully

real    0m5.717s
user    0m0.416s
sys     0m0.060s
root@neodymium:~/wmfmariadbpy/wmfmariadbpy$ time ./move_replica.py db1118 db1095
Purging binary logs to speed up coordinate finding and waiting 5.0 seconds...
Stopping replication on the instance
Stopped at master position: db1102-bin.000016:1024067, slave position db1118-bin.000009:452398
Last GTID executed: BEGIN GTID 171974720-171974720-31488149
Position found on the master at: db1095-bin.000022:452398
SUCCESS: Topology change completed successfully

real    0m5.717s
user    0m0.420s
sys     0m0.020s
root@neodymium:~/wmfmariadbpy/wmfmariadbpy$ time ./move_replica.py db1118 db1102
Purging binary logs to speed up coordinate finding and waiting 5.0 seconds...
Stopping replication on the instance
Stopped at master position: db1095-bin.000022:1519934, slave position db1118-bin.000010:140801
Last GTID executed: BEGIN GTID 171974720-171974720-31489628
Position found on the master at: db1102-bin.000014:11087634
SUCCESS: Topology change completed successfully

real    1m30.972s
user    1m17.256s
sys     0m3.680s

This last one was incorrectly detected, so checking now that.

root@neodymium:~/wmfmariadbpy/wmfmariadbpy$ for host in db1095 db1102 db1118; do mysql.py -h $host -e "set sql_log_bin=0; FLUSH BINAR
Y LOGS;"; done; time ./switchover.py db1095 db1118
Starting preflight checks...
* Original read only values are as expected (master: read_only=0, slave: read_only=1)
* The host to fail over is a direct replica of the master
* Replication is up and running between the 2 hosts
* The replication lag is acceptable: 0 (lower than the configured or default timeout)
* The master is not a replica of any other host
Testing if to migrate db1118.eqiad.wmnet:3306/(none)...
Nope
Testing if to migrate db1102.eqiad.wmnet:3306/(none)...
Stopped at master position: db1095-bin.000046:384, slave position db1102-bin.000037:400
Last GTID not executed: BEGIN GTID 171974784-171974784-30197
Position found on the master at: db1118-bin.000034:400
Migrated db1102.eqiad.wmnet:3306/(none) successfully from master to replica
Setting up original master as read-only
Slave caught up to the master after waiting 0.010107278823852539 seconds
Servers sync at master: db1095-bin.000046:1124 slave: db1118-bin.000034:998
Stopping original master->slave replication 
Setting up replica as read-write
All commands where successful, current status: original master read_only: 1 / original slave read_only: 0
Trying to invert replication direction
Verifying everything went as expected...
SUCCESS: Master switch completed successfully

real    0m1.820s
user    0m0.368s
sys     0m0.040s

root@neodymium:~/wmfmariadbpy/wmfmariadbpy$ for host in db1095 db1102 db1118; do mysql.py -h $host -e "set sql_log_bin=0; FLUSH BINARY LOGS;"; done; time ./switchover.py db1118 db1095
Starting preflight checks...
* Original read only values are as expected (master: read_only=0, slave: read_only=1)
* The host to fail over is a direct replica of the master
* Replication is up and running between the 2 hosts
* The replication lag is acceptable: 0 (lower than the configured or default timeout)
* The master is not a replica of any other host
Testing if to migrate db1102.eqiad.wmnet:3306/(none)...
Stopped at master position: db1118-bin.000035:786, slave position db1102-bin.000038:715
Last GTID not executed: BEGIN GTID 171970572-171970572-35
Position found on the master at: db1095-bin.000047:699
Migrated db1102.eqiad.wmnet:3306/(none) successfully from master to replica
Testing if to migrate db1095.eqiad.wmnet:3306/(none)...
Nope
Setting up original master as read-only
Slave caught up to the master after waiting 0.0103302001953125 seconds
Servers sync at master: db1118-bin.000035:1156 slave: db1095-bin.000047:998
Stopping original master->slave replication
Setting up replica as read-write
All commands where successful, current status: original master read_only: 1 / original slave read_only: 0
Trying to invert replication direction
Verifying everything went as expected...
SUCCESS: Master switch completed successfully

real    0m1.794s
user    0m0.364s
sys     0m0.028s

Change 447595 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/wmfmariadbpy@master] switchover: Make posible replica migration an optional, separate step

https://gerrit.wikimedia.org/r/447595

• jcrespo mentioned this in rOSMD3ebc0f7ab21b: switchover: Make posible replica migration an optional, separate step.Jul 24 2018, 9:37 AM

• jcrespo mentioned this in rOSMD8a003a26e3d8: switchover: Make posible replica migration an optional, separate step.Jul 24 2018, 9:48 AM

Change 447595 merged by Jcrespo:
[operations/software/wmfmariadbpy@master] switchover: Make posible replica migration an optional, separate step

https://gerrit.wikimedia.org/r/447595

Change 447603 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/wmfmariadbpy@master] switchover: Add the functionality to start and stop heartbeat

https://gerrit.wikimedia.org/r/447603

• jcrespo mentioned this in rOSMD91b58be12424: switchover: Add the functionality to start and stop heartbeat.Jul 24 2018, 11:02 AM

• jcrespo mentioned this in rOSMDee708238e5e7: switchover: Add the functionality to start and stop heartbeat.Jul 24 2018, 11:05 AM

• jcrespo mentioned this in rOSMD7b4fb9317083: switchover: Add the functionality to start and stop heartbeat.Jul 24 2018, 11:31 AM

• jcrespo mentioned this in rOSMDc31b43c4d304: switchover: Add the functionality to start and stop heartbeat.Jul 24 2018, 11:37 AM

• jcrespo mentioned this in rOSMD394feaeecae7: switchover: Add the functionality to start and stop heartbeat.Jul 24 2018, 11:43 AM

• jcrespo mentioned this in rOSMD67193e78d445: switchover: Add the functionality to start and stop heartbeat.Jul 24 2018, 12:03 PM

• jcrespo mentioned this in rOSMDc93dc94c4798: switchover: Add the functionality to start and stop heartbeat.

• jcrespo mentioned this in rOSMD231c4ade5c30: switchover: Add the functionality to start and stop heartbeat.Jul 24 2018, 3:11 PM

• jcrespo mentioned this in rOSMD05511f020e56: switchover: Add the functionality to start and stop heartbeat.Jul 24 2018, 3:21 PM

It works now fully, including pt-heartbeat handling:

root@neodymium:~/wmfmariadbpy/wmfmariadbpy$ ./switchover.py --skip-slave-move db1102 db1095
Starting preflight checks...
* Original read only values are as expected (master: read_only=0, slave: read_only=1)
* The host to fail over is a direct replica of the master
* Replication is up and running between the 2 hosts
* The replication lag is acceptable: 0 (lower than the configured or default timeout)
* The master is not a replica of any other host
----- OUTPUT of '/bin/ps --no-hea...pid,args -C perl' -----                                                                          
 9405 /usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia --defaults-file=/root/.my.cnf -D heartbeat --shard=test-s1 --datacenter=eqiad --update --replace --interval=1 --set-vars=binlog_format=STATEMENT -S /run/mysqld/mysqld.sock --daemonize --pid /var/run/pt-heartbeat.pid
================                                                                                                                     
PASS:  |████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  3.20hosts/s]     
FAIL:  |                                                                                        |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Stopping heartbeat pid 9405 at db1102.eqiad.wmnet:3306/(none)
----- OUTPUT of '/bin/kill 9405' -----                                                                                               
================                                                                                                                     
PASS:  |████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  3.58hosts/s]     
FAIL:  |                                                                                        |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/kill 9405'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Setting up original master as read-only
Slave caught up to the master after waiting 0.010645627975463867 seconds
Servers sync at master: db1102-bin.000040:1011408046 slave: db1095-bin.000049:1011453017
Stopping original master->slave replication
Setting up replica as read-write
All commands where successful, current status: original master read_only: 1 / original slave read_only: 0
Trying to invert replication direction
Starting heartbeat section test-s1 at db1095.eqiad.wmnet
----- OUTPUT of '/usr/bin/nohup /...d &> /dev/null &' -----                                                                          
================                                                                                                                     
PASS:  |████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  3.98hosts/s]     
FAIL:  |                                                                                        |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/usr/bin/nohup /...d &> /dev/null &'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
----- OUTPUT of '/bin/ps --no-hea...pid,args -C perl' -----                                                                          
17695 /usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia --defaults-file=/root/.my.cnf -D heartbeat --shard=test-s1 --datacenter=eqiad --update --replace --interval=1 --set-vars=binlog_format=STATEMENT -S /run/mysqld/mysqld.sock --daemonize --pid /var/run/pt-heartbeat.pid
================                                                                                                                     
PASS:  |████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  3.90hosts/s]     
FAIL:  |                                                                                        |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Detected heartbeat at db1095.eqiad.wmnet running with PID 17695
Verifying everything went as expected...
SUCCESS: Master switch completed successfully
root@neodymium:~/wmfmariadbpy/wmfmariadbpy$ ./switchover.py --skip-slave-move db1095 db1102
Starting preflight checks...
* Original read only values are as expected (master: read_only=0, slave: read_only=1)
* The host to fail over is a direct replica of the master
* Replication is up and running between the 2 hosts
* The replication lag is acceptable: 0 (lower than the configured or default timeout)
* The master is not a replica of any other host
----- OUTPUT of '/bin/ps --no-hea...pid,args -C perl' -----                                                                          
17695 /usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia --defaults-file=/root/.my.cnf -D heartbeat --shard=test-s1 --datacenter=eqiad --update --replace --interval=1 --set-vars=binlog_format=STATEMENT -S /run/mysqld/mysqld.sock --daemonize --pid /var/run/pt-heartbeat.pid
================                                                                                                                     
PASS:  |████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  3.66hosts/s]     
FAIL:  |                                                                                        |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Stopping heartbeat pid 17695 at db1095.eqiad.wmnet:3306/(none)
----- OUTPUT of '/bin/kill 17695' -----                                                                                              
================                                                                                                                     
PASS:  |████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  4.10hosts/s]     
FAIL:  |                                                                                        |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/kill 17695'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Setting up original master as read-only
Slave caught up to the master after waiting 0.01145792007446289 seconds
Servers sync at master: db1095-bin.000049:1011467915 slave: db1102-bin.000040:1011420097
Stopping original master->slave replication
Setting up replica as read-write
All commands where successful, current status: original master read_only: 1 / original slave read_only: 0
Trying to invert replication direction
Starting heartbeat section test-s1 at db1102.eqiad.wmnet
----- OUTPUT of '/usr/bin/nohup /...d &> /dev/null &' -----                                                                          
================                                                                                                                     
PASS:  |████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  3.43hosts/s]     
FAIL:  |                                                                                        |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/usr/bin/nohup /...d &> /dev/null &'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
----- OUTPUT of '/bin/ps --no-hea...pid,args -C perl' -----                                                                          
 9747 /usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia --defaults-file=/root/.my.cnf -D heartbeat --shard=test-s1 --datacenter=eqiad --update --replace --interval=1 --set-vars=binlog_format=STATEMENT -S /run/mysqld/mysqld.sock --daemonize --pid /var/run/pt-heartbeat.pid
================                                                                                                                     
PASS:  |████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  3.55hosts/s]     
FAIL:  |                                                                                        |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Detected heartbeat at db1102.eqiad.wmnet running with PID 9747
Verifying everything went as expected...
SUCCESS: Master switch completed successfully

This is brilliant. A great step towards getting the failover automation we have always talked about.
Excellent job!
Looking forward to seeing this working tomorrow.

• Marostegui awarded a token.Jul 24 2018, 3:32 PM

• jcrespo mentioned this in rOSMD4e96bac40457: switchover: Add the functionality to start and stop heartbeat.Jul 24 2018, 3:38 PM

Change 447603 merged by Jcrespo:
[operations/software/wmfmariadbpy@master] switchover: Add the functionality to start and stop heartbeat

https://gerrit.wikimedia.org/r/447603

Codfw testing:

root@neodymium:~/wmfmariadbpy/wmfmariadbpy$ ./switchover.py --skip-slave-move es2018 es2017
Starting preflight checks...
* Original read only values are as expected (master: read_only=0, slave: read_only=1)
* The host to fail over is a direct replica of the master
* Replication is up and running between the 2 hosts
* The replication lag is acceptable: 0 (lower than the configured or default timeout)
* The master is not a replica of any other host
----- OUTPUT of '/bin/ps --no-hea...pid,args -C perl' -----                                                                          
================                                                                                                                     
PASS:  |                                                                                        |   0% (0/1) [00:01<?, ?hosts/s]     
FAIL:  |████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00,  1.10s/hosts]     
100.0% (1/1) of nodes failed to execute command '/bin/ps --no-hea...pid,args -C perl': es2018.codfw.wmnet
100.0% (1/1) of nodes failed to execute command '/bin/ps --no-hea...pid,args -C perl': es2018.codfw.wmnet
0.0% (0/1) success ratio (< 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
[WARNING]: Could not find a pt-heartbeat process to kill, using heartbeat table to determine the section
Setting up original master as read-only
Slave caught up to the master after waiting 0.298555850982666 seconds
Servers sync at master: es2018-bin.002107:515019552 slave: es2017-bin.002103:916413841
Stopping original master->slave replication
Setting up replica as read-write
All commands where successful, current status: original master read_only: 1 / original slave read_only: 0
Trying to invert replication direction
Starting heartbeat section b'es3' at es2017.codfw.wmnet
----- OUTPUT of '/usr/bin/nohup /...d &> /dev/null &' -----                                                                          
================                                                                                                                     
PASS:  |████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.05hosts/s]     
FAIL:  |                                                                                        |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/usr/bin/nohup /...d &> /dev/null &'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
----- OUTPUT of '/bin/ps --no-hea...pid,args -C perl' -----                                                                          
 4692 /usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia --defaults-file=/dev/null --user=root --host=localhost -D heartbeat --shard=es3 --datacenter=codfw --update --replace --interval=1 --set-vars=binlog_format=STATEMENT -S /run/mysqld/mysqld.sock --daemonize --pid /var/run/pt-heartbeat.pid
37349 /usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia --defaults-file=/root/.my.cnf -D heartbeat --shard=bes3 --datacenter=codfw --update --replace --interval=1 --set-vars=binlog_format=STATEMENT -S /run/mysqld/mysqld.sock --daemonize --pid /var/run/pt-heartbeat.pid
================                                                                                                                     
PASS:  |████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.45hosts/s]     
FAIL:  |                                                                                        |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Detected heartbeat at es2017.codfw.wmnet running with PID 37349
Verifying everything went as expected...
SUCCESS: Master switch completed successfully

Change 447648 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/wmfmariadbpy@master] mariadb: Fix heartbeat regex

https://gerrit.wikimedia.org/r/447648

• jcrespo mentioned this in rOSMD93ef5c22bff2: mariadb: Fix heartbeat regex.Jul 24 2018, 4:52 PM

Mentioned in SAL (#wikimedia-operations) [2018-07-24T16:58:00Z] <jynus> finishing test on es3 hosts T199224

Change 447648 merged by Jcrespo:
[operations/software/wmfmariadbpy@master] mariadb: Fix heartbeat regex

https://gerrit.wikimedia.org/r/447648

So the issue we had with pt-heartbeat was that the code ran by puppet wasn't actually running there- it doesn't get killed and restarted if it is running, with older parameters.

We need to kill heartbeat and restart it on all masters with the older paramaters. Alternatively, move heartbeat to a systemd-controled service.

Aside from this, I am happy with the testing done- I did a couple of extra runs to make sure everything was ok.

Thanks for the work you've put here. This is a nice work towards getting less downtime and less human errors while doing critical failovers.

Change 447659 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/wmfmariadbpy@master] switchover: Fix bug where shards are added with an extra 'b'

https://gerrit.wikimedia.org/r/447659

Change 447659 merged by Jcrespo:
[operations/software/wmfmariadbpy@master] switchover: Fix bug where shards are added with an extra 'b'

https://gerrit.wikimedia.org/r/447659

• jcrespo mentioned this in rOSMD4c51152b0f7c: switchover: Fix bug where shards are added with an extra 'b'.Jul 24 2018, 6:21 PM

• Marostegui mentioned this in T156465: For switchovers: A way to check if slaves are up to date.Jul 25 2018, 10:40 AM

• jcrespo mentioned this in rOSWBfd953d6567bd: Add replication managing.Sep 1 2020, 8:36 AM

• jcrespo mentioned this in rOSWB8a003a26e3d8: switchover: Make posible replica migration an optional, separate step.

• jcrespo mentioned this in rOSWB4e96bac40457: switchover: Add the functionality to start and stop heartbeat.

• jcrespo mentioned this in rOSWB93ef5c22bff2: mariadb: Fix heartbeat regex.

• jcrespo mentioned this in rOSWB4c51152b0f7c: switchover: Fix bug where shards are added with an extra 'b'.