Page MenuHomePhabricator

Test database master switchover script on codfw
Closed, ResolvedPublic

Description

There is a work in progress script to automate a safe master failover from on server to another.
https://gerrit.wikimedia.org/r/#/c/operations/software/wmfmariadbpy/+/439871/

Do some production testing on passive hosts (codfw) before using for real on es1014 and db1052.

Related Objects

Event Timeline

Reminder, we have 4 hosts as spare now in eqiad: T196376

Some progress reporting:

$ python3
Python 3.4.2 (default, Oct  8 2014, 10:45:20) 
[GCC 4.9.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import WMFMariaDB
>>> master=WMFMariaDB.WMFMariaDB(host='db1095')
>>> import WMFReplication
>>> master_replication=WMFReplication.WMFReplication(master)
>>> master_replication.debug()
db1095.eqiad.wmnet:3306/(none)> master: db1067.eqiad.wmnet:3306, io: Yes, sql: Yes, pos: db1067-bin.001280/704908984
>>> master_replication.stop_slave()
{'master_ssl_cipher': '', 'master_server_id': 171974720, 'master_ssl_key': '', 'last_errno': 0, 'last_sql_errno': 0, 'replicate_ignore_server_ids': '', 'last_io_error': '', 'replicate_wild_ignore_table': '', 'replicate_ignore_domain_ids': '', 'relay_master_log_file': 'db1067-bin.001280', 'slave_io_running': 'No', 'replicate_wild_do_table': '', 'master_ssl_verify_server_cert': 'No', 'gtid_io_pos': '', 'seconds_behind_master': None, 'master_host': 'db1067.eqiad.wmnet', 'parallel_mode': 'conservative', 'relay_log_pos': 99104951, 'relay_log_space': 100145534, 'using_gtid': 'No', 'skip_counter': 0, 'read_master_log_pos': 705765165, 'success': True, 'master_port': 3306, 'relay_log_file': 'db1095-relay-bin.000018', 'exec_master_log_pos': 705765165, 'until_log_file': '', 'master_ssl_allowed': 'Yes', 'master_user': 'repl', 'master_ssl_cert': '', 'until_log_pos': 0, 'replicate_do_domain_ids': '', 'replicate_ignore_table': '', 'last_sql_error': '', 'replicate_ignore_db': '', 'master_ssl_ca_file': '', 'replicate_do_table': '', 'connect_retry': 60, 'until_condition': 'None', 'slave_sql_running': 'No', 'last_io_errno': 0, 'master_log_file': 'db1067-bin.001280', 'last_error': '', 'replicate_do_db': '', 'master_ssl_crlpath': '', 'master_ssl_crl': '', 'slave_io_state': '', 'master_ssl_ca_path': ''}
>>> master_replication.debug()
db1095.eqiad.wmnet:3306/(none)> master: db1067.eqiad.wmnet:3306, io: No, sql: No, pos: db1067-bin.001280/705765165
>>> master_replication.debug()
db1095.eqiad.wmnet:3306/(none)> master: db1067.eqiad.wmnet:3306, io: No, sql: No, pos: db1067-bin.001280/705765165
>>> master_replication.reset_slave()
{'database': None, 'success': True, 'numrows': 0, 'port': 3306, 'query': 'RESET SLAVE ALL', 'rows': None, 'host': 'db1095.eqiad.wmnet', 'fields': None}
>>> master_replication.debug()
db1095.eqiad.wmnet:3306/(none)> Not configured as a slave

0.7 seconds on performing a switchover with no writes ongoing (easy case):

root@neodymium:~/wmfmariadbpy/wmfmariadbpy$ time ./switchover.py db1095 db1102
Starting preflight checks...
Setting up original master as read-only
Slave caught up to the master after waiting 0.008965492248535156 seconds
Servers sync at master: db1095-bin.000006:576902809 slave: db1102-bin.000001:576902795
Stopping original master->slave replication
Setting up replica as read-write
All commands where successful, current status: original master read_only: 1 / original slave read_only: 0
Trying to invert replication direction
Verifying everything went as expected...
SUCCESS: Master switch completed successfully

real    0m0.710s
user    0m0.220s
sys     0m0.008s
root@neodymium:~/wmfmariadbpy/wmfmariadbpy$ time ./switchover.py db1102 db1095
Starting preflight checks...
Setting up original master as read-only
Slave caught up to the master after waiting 0.0075452327728271484 seconds
Servers sync at master: db1102-bin.000001:576902795 slave: db1095-bin.000006:576902809
Stopping original master->slave replication
Setting up replica as read-write
All commands where successful, current status: original master read_only: 1 / original slave read_only: 0
Trying to invert replication direction
Verifying everything went as expected...
SUCCESS: Master switch completed successfully

real    0m0.697s
user    0m0.208s
sys     0m0.016s

Change 447365 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Migrate db1118 from mysql 8 to mariadb 10.1

https://gerrit.wikimedia.org/r/447365

Change 447365 merged by Jcrespo:
[operations/puppet@production] mariadb: Migrate db1118 from mysql 8 to mariadb 10.1

https://gerrit.wikimedia.org/r/447365

What the first tests on production-like hosts look like:

root@neodymium:~/wmfmariadbpy/wmfmariadbpy$ time ./move_replica.py db1118 db1095
Purging binary logs to speed up coordinate finding and waiting 5.0 seconds...
Stopping replication on the instance
Stopped at master position: db1102-bin.000015:30633013, slave position db1118-bin.000007:82334
Last GTID executed: BEGIN GTID 171974720-171974720-31485244
Position found on the master at: db1095-bin.000021:82334
SUCCESS: Topology change completed successfully

real    0m5.605s
user    0m0.308s
sys     0m0.028s
root@neodymium:~/wmfmariadbpy/wmfmariadbpy$ time ./move_replica.py db1118 db1102
Purging binary logs to speed up coordinate finding and waiting 5.0 seconds...
Stopping replication on the instance
Stopped at master position: db1095-bin.000021:731731, slave position db1118-bin.000008:253914
Last GTID executed: BEGIN GTID 171974720-171974720-31486754
Position found on the master at: db1102-bin.000016:253930
SUCCESS: Topology change completed successfully

real    0m5.717s
user    0m0.416s
sys     0m0.060s
root@neodymium:~/wmfmariadbpy/wmfmariadbpy$ time ./move_replica.py db1118 db1095
Purging binary logs to speed up coordinate finding and waiting 5.0 seconds...
Stopping replication on the instance
Stopped at master position: db1102-bin.000016:1024067, slave position db1118-bin.000009:452398
Last GTID executed: BEGIN GTID 171974720-171974720-31488149
Position found on the master at: db1095-bin.000022:452398
SUCCESS: Topology change completed successfully

real    0m5.717s
user    0m0.420s
sys     0m0.020s
root@neodymium:~/wmfmariadbpy/wmfmariadbpy$ time ./move_replica.py db1118 db1102
Purging binary logs to speed up coordinate finding and waiting 5.0 seconds...
Stopping replication on the instance
Stopped at master position: db1095-bin.000022:1519934, slave position db1118-bin.000010:140801
Last GTID executed: BEGIN GTID 171974720-171974720-31489628
Position found on the master at: db1102-bin.000014:11087634
SUCCESS: Topology change completed successfully

real    1m30.972s
user    1m17.256s
sys     0m3.680s

This last one was incorrectly detected, so checking now that.

root@neodymium:~/wmfmariadbpy/wmfmariadbpy$ for host in db1095 db1102 db1118; do mysql.py -h $host -e "set sql_log_bin=0; FLUSH BINAR
Y LOGS;"; done; time ./switchover.py db1095 db1118
Starting preflight checks...
* Original read only values are as expected (master: read_only=0, slave: read_only=1)
* The host to fail over is a direct replica of the master
* Replication is up and running between the 2 hosts
* The replication lag is acceptable: 0 (lower than the configured or default timeout)
* The master is not a replica of any other host
Testing if to migrate db1118.eqiad.wmnet:3306/(none)...
Nope
Testing if to migrate db1102.eqiad.wmnet:3306/(none)...
Stopped at master position: db1095-bin.000046:384, slave position db1102-bin.000037:400
Last GTID not executed: BEGIN GTID 171974784-171974784-30197
Position found on the master at: db1118-bin.000034:400
Migrated db1102.eqiad.wmnet:3306/(none) successfully from master to replica
Setting up original master as read-only
Slave caught up to the master after waiting 0.010107278823852539 seconds
Servers sync at master: db1095-bin.000046:1124 slave: db1118-bin.000034:998
Stopping original master->slave replication 
Setting up replica as read-write
All commands where successful, current status: original master read_only: 1 / original slave read_only: 0
Trying to invert replication direction
Verifying everything went as expected...
SUCCESS: Master switch completed successfully

real    0m1.820s
user    0m0.368s
sys     0m0.040s

root@neodymium:~/wmfmariadbpy/wmfmariadbpy$ for host in db1095 db1102 db1118; do mysql.py -h $host -e "set sql_log_bin=0; FLUSH BINARY LOGS;"; done; time ./switchover.py db1118 db1095
Starting preflight checks...
* Original read only values are as expected (master: read_only=0, slave: read_only=1)
* The host to fail over is a direct replica of the master
* Replication is up and running between the 2 hosts
* The replication lag is acceptable: 0 (lower than the configured or default timeout)
* The master is not a replica of any other host
Testing if to migrate db1102.eqiad.wmnet:3306/(none)...
Stopped at master position: db1118-bin.000035:786, slave position db1102-bin.000038:715
Last GTID not executed: BEGIN GTID 171970572-171970572-35
Position found on the master at: db1095-bin.000047:699
Migrated db1102.eqiad.wmnet:3306/(none) successfully from master to replica
Testing if to migrate db1095.eqiad.wmnet:3306/(none)...
Nope
Setting up original master as read-only
Slave caught up to the master after waiting 0.0103302001953125 seconds
Servers sync at master: db1118-bin.000035:1156 slave: db1095-bin.000047:998
Stopping original master->slave replication
Setting up replica as read-write
All commands where successful, current status: original master read_only: 1 / original slave read_only: 0
Trying to invert replication direction
Verifying everything went as expected...
SUCCESS: Master switch completed successfully

real    0m1.794s
user    0m0.364s
sys     0m0.028s

Change 447595 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/wmfmariadbpy@master] switchover: Make posible replica migration an optional, separate step

https://gerrit.wikimedia.org/r/447595

Change 447595 merged by Jcrespo:
[operations/software/wmfmariadbpy@master] switchover: Make posible replica migration an optional, separate step

https://gerrit.wikimedia.org/r/447595

Change 447603 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/wmfmariadbpy@master] switchover: Add the functionality to start and stop heartbeat

https://gerrit.wikimedia.org/r/447603

It works now fully, including pt-heartbeat handling:

root@neodymium:~/wmfmariadbpy/wmfmariadbpy$ ./switchover.py --skip-slave-move db1102 db1095
Starting preflight checks...
* Original read only values are as expected (master: read_only=0, slave: read_only=1)
* The host to fail over is a direct replica of the master
* Replication is up and running between the 2 hosts
* The replication lag is acceptable: 0 (lower than the configured or default timeout)
* The master is not a replica of any other host
----- OUTPUT of '/bin/ps --no-hea...pid,args -C perl' -----                                                                          
 9405 /usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia --defaults-file=/root/.my.cnf -D heartbeat --shard=test-s1 --datacenter=eqiad --update --replace --interval=1 --set-vars=binlog_format=STATEMENT -S /run/mysqld/mysqld.sock --daemonize --pid /var/run/pt-heartbeat.pid
================                                                                                                                     
PASS:  |████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  3.20hosts/s]     
FAIL:  |                                                                                        |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Stopping heartbeat pid 9405 at db1102.eqiad.wmnet:3306/(none)
----- OUTPUT of '/bin/kill 9405' -----                                                                                               
================                                                                                                                     
PASS:  |████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  3.58hosts/s]     
FAIL:  |                                                                                        |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/kill 9405'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Setting up original master as read-only
Slave caught up to the master after waiting 0.010645627975463867 seconds
Servers sync at master: db1102-bin.000040:1011408046 slave: db1095-bin.000049:1011453017
Stopping original master->slave replication
Setting up replica as read-write
All commands where successful, current status: original master read_only: 1 / original slave read_only: 0
Trying to invert replication direction
Starting heartbeat section test-s1 at db1095.eqiad.wmnet
----- OUTPUT of '/usr/bin/nohup /...d &> /dev/null &' -----                                                                          
================                                                                                                                     
PASS:  |████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  3.98hosts/s]     
FAIL:  |                                                                                        |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/usr/bin/nohup /...d &> /dev/null &'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
----- OUTPUT of '/bin/ps --no-hea...pid,args -C perl' -----                                                                          
17695 /usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia --defaults-file=/root/.my.cnf -D heartbeat --shard=test-s1 --datacenter=eqiad --update --replace --interval=1 --set-vars=binlog_format=STATEMENT -S /run/mysqld/mysqld.sock --daemonize --pid /var/run/pt-heartbeat.pid
================                                                                                                                     
PASS:  |████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  3.90hosts/s]     
FAIL:  |                                                                                        |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Detected heartbeat at db1095.eqiad.wmnet running with PID 17695
Verifying everything went as expected...
SUCCESS: Master switch completed successfully
root@neodymium:~/wmfmariadbpy/wmfmariadbpy$ ./switchover.py --skip-slave-move db1095 db1102
Starting preflight checks...
* Original read only values are as expected (master: read_only=0, slave: read_only=1)
* The host to fail over is a direct replica of the master
* Replication is up and running between the 2 hosts
* The replication lag is acceptable: 0 (lower than the configured or default timeout)
* The master is not a replica of any other host
----- OUTPUT of '/bin/ps --no-hea...pid,args -C perl' -----                                                                          
17695 /usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia --defaults-file=/root/.my.cnf -D heartbeat --shard=test-s1 --datacenter=eqiad --update --replace --interval=1 --set-vars=binlog_format=STATEMENT -S /run/mysqld/mysqld.sock --daemonize --pid /var/run/pt-heartbeat.pid
================                                                                                                                     
PASS:  |████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  3.66hosts/s]     
FAIL:  |                                                                                        |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Stopping heartbeat pid 17695 at db1095.eqiad.wmnet:3306/(none)
----- OUTPUT of '/bin/kill 17695' -----                                                                                              
================                                                                                                                     
PASS:  |████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  4.10hosts/s]     
FAIL:  |                                                                                        |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/kill 17695'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Setting up original master as read-only
Slave caught up to the master after waiting 0.01145792007446289 seconds
Servers sync at master: db1095-bin.000049:1011467915 slave: db1102-bin.000040:1011420097
Stopping original master->slave replication
Setting up replica as read-write
All commands where successful, current status: original master read_only: 1 / original slave read_only: 0
Trying to invert replication direction
Starting heartbeat section test-s1 at db1102.eqiad.wmnet
----- OUTPUT of '/usr/bin/nohup /...d &> /dev/null &' -----                                                                          
================                                                                                                                     
PASS:  |████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  3.43hosts/s]     
FAIL:  |                                                                                        |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/usr/bin/nohup /...d &> /dev/null &'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
----- OUTPUT of '/bin/ps --no-hea...pid,args -C perl' -----                                                                          
 9747 /usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia --defaults-file=/root/.my.cnf -D heartbeat --shard=test-s1 --datacenter=eqiad --update --replace --interval=1 --set-vars=binlog_format=STATEMENT -S /run/mysqld/mysqld.sock --daemonize --pid /var/run/pt-heartbeat.pid
================                                                                                                                     
PASS:  |████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  3.55hosts/s]     
FAIL:  |                                                                                        |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Detected heartbeat at db1102.eqiad.wmnet running with PID 9747
Verifying everything went as expected...
SUCCESS: Master switch completed successfully

This is brilliant. A great step towards getting the failover automation we have always talked about.
Excellent job!
Looking forward to seeing this working tomorrow.

Change 447603 merged by Jcrespo:
[operations/software/wmfmariadbpy@master] switchover: Add the functionality to start and stop heartbeat

https://gerrit.wikimedia.org/r/447603

Codfw testing:

root@neodymium:~/wmfmariadbpy/wmfmariadbpy$ ./switchover.py --skip-slave-move es2018 es2017
Starting preflight checks...
* Original read only values are as expected (master: read_only=0, slave: read_only=1)
* The host to fail over is a direct replica of the master
* Replication is up and running between the 2 hosts
* The replication lag is acceptable: 0 (lower than the configured or default timeout)
* The master is not a replica of any other host
----- OUTPUT of '/bin/ps --no-hea...pid,args -C perl' -----                                                                          
================                                                                                                                     
PASS:  |                                                                                        |   0% (0/1) [00:01<?, ?hosts/s]     
FAIL:  |████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00,  1.10s/hosts]     
100.0% (1/1) of nodes failed to execute command '/bin/ps --no-hea...pid,args -C perl': es2018.codfw.wmnet
100.0% (1/1) of nodes failed to execute command '/bin/ps --no-hea...pid,args -C perl': es2018.codfw.wmnet
0.0% (0/1) success ratio (< 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
[WARNING]: Could not find a pt-heartbeat process to kill, using heartbeat table to determine the section
Setting up original master as read-only
Slave caught up to the master after waiting 0.298555850982666 seconds
Servers sync at master: es2018-bin.002107:515019552 slave: es2017-bin.002103:916413841
Stopping original master->slave replication
Setting up replica as read-write
All commands where successful, current status: original master read_only: 1 / original slave read_only: 0
Trying to invert replication direction
Starting heartbeat section b'es3' at es2017.codfw.wmnet
----- OUTPUT of '/usr/bin/nohup /...d &> /dev/null &' -----                                                                          
================                                                                                                                     
PASS:  |████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.05hosts/s]     
FAIL:  |                                                                                        |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/usr/bin/nohup /...d &> /dev/null &'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
----- OUTPUT of '/bin/ps --no-hea...pid,args -C perl' -----                                                                          
 4692 /usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia --defaults-file=/dev/null --user=root --host=localhost -D heartbeat --shard=es3 --datacenter=codfw --update --replace --interval=1 --set-vars=binlog_format=STATEMENT -S /run/mysqld/mysqld.sock --daemonize --pid /var/run/pt-heartbeat.pid
37349 /usr/bin/perl /usr/local/bin/pt-heartbeat-wikimedia --defaults-file=/root/.my.cnf -D heartbeat --shard=bes3 --datacenter=codfw --update --replace --interval=1 --set-vars=binlog_format=STATEMENT -S /run/mysqld/mysqld.sock --daemonize --pid /var/run/pt-heartbeat.pid
================                                                                                                                     
PASS:  |████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.45hosts/s]     
FAIL:  |                                                                                        |   0% (0/1) [00:00<?, ?hosts/s]     
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/ps --no-hea...pid,args -C perl'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Detected heartbeat at es2017.codfw.wmnet running with PID 37349
Verifying everything went as expected...
SUCCESS: Master switch completed successfully

Change 447648 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/wmfmariadbpy@master] mariadb: Fix heartbeat regex

https://gerrit.wikimedia.org/r/447648

Change 447648 merged by Jcrespo:
[operations/software/wmfmariadbpy@master] mariadb: Fix heartbeat regex

https://gerrit.wikimedia.org/r/447648

jcrespo claimed this task.

So the issue we had with pt-heartbeat was that the code ran by puppet wasn't actually running there- it doesn't get killed and restarted if it is running, with older parameters.

We need to kill heartbeat and restart it on all masters with the older paramaters. Alternatively, move heartbeat to a systemd-controled service.

Aside from this, I am happy with the testing done- I did a couple of extra runs to make sure everything was ok.

Thanks for the work you've put here. This is a nice work towards getting less downtime and less human errors while doing critical failovers.

Change 447659 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/software/wmfmariadbpy@master] switchover: Fix bug where shards are added with an extra 'b'

https://gerrit.wikimedia.org/r/447659

Change 447659 merged by Jcrespo:
[operations/software/wmfmariadbpy@master] switchover: Fix bug where shards are added with an extra 'b'

https://gerrit.wikimedia.org/r/447659