db-switchover got stuck while failing over db2107 to db2104
Closed, Resolved (Public)

Description

While attempting to fail over db2107 to db2104 (T287454) the script got stuck, and I had to ctrl+c it after a couple of minutes (we were already on read-only). This is the output:

root@cumin2001:~# db-switchover --skip-slave-move db2107.codfw.wmnet db2104.codfw.wmnet
Starting preflight checks...
* Original read only values are as expected (master: read_only=False, slave: read_only=True)
* The host to fail over is a direct replica of the master
* Replication is up and running between the 2 hosts
* Binary log format is the same: STATEMENT
* The replication lag is acceptable: 0 (lower than the configured or default timeout)
^CTraceback (most recent call last):
  File "/usr/bin/db-switchover", line 11, in <module>
    load_entry_point('wmfmariadbpy==0.7.1', 'console_scripts', 'db-switchover')()
  File "/usr/lib/python3/dist-packages/wmfmariadbpy/cli_admin/switchover.py", line 740, in main
    handle_new_master_semisync_replication(slave)
  File "/usr/lib/python3/dist-packages/wmfmariadbpy/cli_admin/switchover.py", line 665, in handle_new_master_semisync_replication
    result = slave.execute("SET GLOBAL rpl_semi_sync_master_enabled = 1")
  File "/usr/lib/python3/dist-packages/wmfmariadbpy/WMFMariaDB.py", line 199, in execute
    cursor.execute(command)
  File "/usr/lib/python3/dist-packages/pymysql/cursors.py", line 170, in execute
    result = self._query(query)
  File "/usr/lib/python3/dist-packages/pymysql/cursors.py", line 328, in _query
    conn.query(q)
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 517, in query
    self._affected_rows = self._read_query_result(unbuffered=unbuffered)
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 732, in _read_query_result
    result.read()
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 1075, in read
    first_packet = self.connection._read_packet()
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 657, in _read_packet
    packet_header = self._read_bytes(4)
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 691, in _read_bytes
    data = self._rfile.read(num_bytes)
  File "/usr/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.7/ssl.py", line 1052, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.7/ssl.py", line 911, in read
    return self._sslobj.read(len, buffer)
KeyboardInterrupt

It failed from both cumin1001 and cumin2001 with the same issue, both with and without the FQDN.
It's been years since we ran this script against codfw primary masters, so there might be something there. My guess is that it got stuck while trying to connect or while running a command on the mysql hosts, but I couldn't find anything in SHOW PROCESSLIST on the involved hosts. I didn't have much time to debug, as writes had already been blocked for around 5 minutes.
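
For context, the traceback shows the client blocked in a plain socket read inside pymysql while waiting for the server to answer the SET GLOBAL rpl_semi_sync_master_enabled = 1 statement. The snippet below is a minimal sketch of that failure mode, not the wmfmariadbpy code; host and credentials are placeholders. Without a read_timeout on the connection the read can block indefinitely, whereas with one set the stall surfaces as an error after a bounded wait.

# Sketch only, not the wmfmariadbpy code. With no read_timeout, pymysql
# blocks forever in _read_bytes() if the server never answers (which is
# what the traceback above shows). A read_timeout bounds the wait instead.
import pymysql

conn = pymysql.connect(
    host="db2104.codfw.wmnet",          # placeholder target
    read_default_file="/root/.my.cnf",  # placeholder credentials
    read_timeout=30,
    write_timeout=30,
)
try:
    with conn.cursor() as cur:
        cur.execute("SET GLOBAL rpl_semi_sync_master_enabled = 1")
except pymysql.err.OperationalError as exc:
    # With a read_timeout, a server-side stall raises here (e.g. 2013,
    # "Lost connection to MySQL server during query") instead of hanging.
    print("statement stalled or connection lost:", exc)
finally:
    conn.close()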

db-switchover --timeout=15 --only-slave-move db2107.codfw.wmnet db2104.codfw.wmnet worked without issues.

Event Timeline

Marostegui moved this task from Triage to In progress on the DBA board.

From a side brainstorm: we could potentially run the switchover script on pc1 for low impact test purposes.

I'm a bit iffy about this. It won't currently run against parsercache as the replica isn't read-only, and we don't use semi-sync replication there. There could be other config incompatibilities, too. I'm not super enthused about making the required changes to a production system (even if it's only pc1) so we can test this.

Timeline, for reference:

07:00:11 <@marostegui> going to go ahead
07:00:34 <@marostegui> !log Starting s2 codfw failover from db2107 to db2104 - T287454
...
07:01:06 <@marostegui> ro confirmed
...
07:01:58 <@marostegui> mmm the script is taking ages
...
07:04:01 <@marostegui> it is also not working from cumin in codfw
07:04:48 <@marostegui> Going to abort the maintenance

cumin1001 output:

root@cumin1001:/home/marostegui# db-switchover --skip-slave-move db2107.codfw.wmnet db2104.codfw.wmnet
Starting preflight checks...
* Original read only values are as expected (master: read_only=False, slave: read_only=True)
* The host to fail over is a direct replica of the master
* Replication is up and running between the 2 hosts
* Binary log format is the same: STATEMENT
* The replication lag is acceptable: 0 (lower than the configured or default timeout)
^CTraceback (most recent call last):
  File "/usr/bin/db-switchover", line 11, in <module>
    load_entry_point('wmfmariadbpy==0.7.1', 'console_scripts', 'db-switchover')()
  File "/usr/lib/python3/dist-packages/wmfmariadbpy/cli_admin/switchover.py", line 740, in main
    handle_new_master_semisync_replication(slave)
  File "/usr/lib/python3/dist-packages/wmfmariadbpy/cli_admin/switchover.py", line 665, in handle_new_master_semisync_replication
    result = slave.execute("SET GLOBAL rpl_semi_sync_master_enabled = 1")
  File "/usr/lib/python3/dist-packages/wmfmariadbpy/WMFMariaDB.py", line 199, in execute
    cursor.execute(command)
  File "/usr/lib/python3/dist-packages/pymysql/cursors.py", line 170, in execute
    result = self._query(query)
  File "/usr/lib/python3/dist-packages/pymysql/cursors.py", line 328, in _query
    conn.query(q)
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 517, in query
    self._affected_rows = self._read_query_result(unbuffered=unbuffered)
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 732, in _read_query_result
    result.read()
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 1075, in read
    first_packet = self.connection._read_packet()
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 657, in _read_packet
    packet_header = self._read_bytes(4)
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 691, in _read_bytes
    data = self._rfile.read(num_bytes)
  File "/usr/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.7/ssl.py", line 1052, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.7/ssl.py", line 911, in read
    return self._sslobj.read(len, buffer)
KeyboardInterrupt

strace at: cumin1001:/home/kormat/strace

Last successful use of db-switchover was on 2021-08-05: T287852: Failover m2 master (db1107) to a different host to upgrade its kernel
This was in eqiad.

There have been no package updates on either cumin host since then. I'm also not immediately finding any relevant puppet changes.

Mentioned in SAL (#wikimedia-operations) [2021-08-10T09:17:33Z] <kormat> running non-destructive test against s7/codfw (db2107/db2014) T288500

Ran a test with debugging enabled:

$ sudo DEBUG=1 db-switchover --skip-slave-move db2107.codfw.wmnet db2104.codfw.wmnet                            
Executing 'SET SESSION max_statement_time = 0'                                                                                         
Connected to db2107/(none)
Executing 'SET SESSION max_statement_time = 0'
Connected to db2104/(none)
Starting preflight checks...
Executing 'SELECT @@GLOBAL.read_only'
Executing 'SELECT @@GLOBAL.read_only'
* Original read only values are as expected (master: read_only=False, slave: read_only=True)
Executing 'SET SESSION max_statement_time = 5.0'
Executing 'SHOW SLAVE STATUS'
Executing 'SET SESSION max_statement_time = 0'
Executing 'SET SESSION max_statement_time = 0'
Connected to db2107/(none)
* The host to fail over is a direct replica of the master
Executing 'SET SESSION max_statement_time = 5.0'
Executing 'SHOW SLAVE STATUS'
Executing 'SET SESSION max_statement_time = 0'
* Replication is up and running between the 2 hosts
Executing 'SELECT @@GLOBAL.binlog_format'
Executing 'SELECT @@GLOBAL.binlog_format'
* Binary log format is the same: STATEMENT
Executing 'SET SESSION max_statement_time = 5.0'
Executing 'SHOW SLAVE STATUS'
Executing 'SET SESSION max_statement_time = 0'
* The replication lag is acceptable: 0 (lower than the configured or default timeout)
Executing 'SET SESSION max_statement_time = 5.0'
Executing 'SHOW SLAVE STATUS'
Executing 'SET SESSION max_statement_time = 0'
Executing 'SET GLOBAL rpl_semi_sync_slave_enabled = 0'
Executing 'SELECT @@VERSION'
Executing 'SET GLOBAL rpl_semi_sync_master_enabled = 1'
Are you sure you want to switchover current master db2107.codfw.wmnet and promote db2104.codfw.wmnet instead [yes/no]? no
Aborting switchover without touching anything!

It worked fine. I've reverted the semi-sync changes.
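
For reference, reverting presumably amounts to inverting the two statements the dry run issued on db2104 (re-disabling the semi-sync master role, re-enabling the slave role). A sketch with placeholder connection details, not the exact commands that were run:

# Sketch only: undo the two semi-sync changes the dry run made on the
# candidate. Host and credentials file are placeholders.
import pymysql

conn = pymysql.connect(host="db2104.codfw.wmnet", read_default_file="/root/.my.cnf")
with conn.cursor() as cur:
    cur.execute("SET GLOBAL rpl_semi_sync_master_enabled = 0")
    cur.execute("SET GLOBAL rpl_semi_sync_slave_enabled = 1")
conn.close()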

For the record, the replicas move:

root@cumin1001:/home/marostegui# db-switchover --timeout=15 --only-slave-move db2107.codfw.wmnet db2104.codfw.wmnet
Starting preflight checks...
* Original read only values are as expected (master: read_only=False, slave: read_only=True)
* The host to fail over is a direct replica of the master
* Replication is up and running between the 2 hosts
* Binary log format is the same: STATEMENT
* The replication lag is acceptable: 0 (lower than the configured or default timeout)
Disabling GTID on new master...
Checking if db1122/(none) needs to be moved under the new master...
Disabling GTID on db1122/(none)...
Waiting some seconds for db to catch up...
Reenabling GTID on db1122/(none)...
db1122/(none) was moved successfully under the new master
Checking if db2125/(none) needs to be moved under the new master...
Disabling GTID on db2125/(none)...
Waiting some seconds for db to catch up...
/usr/lib/python3/dist-packages/pymysql/cursors.py:170: Warning: (1278, "It is recommended to use --skip-slave-start when doing step-by-step replication with START SLAVE UNTIL; otherwise, you will get problems if you get an unexpected slave's mysqld restart")
  result = self._query(query)
Reenabling GTID on db2125/(none)...
db2125/(none) was moved successfully under the new master
Checking if db2104/(none) needs to be moved under the new master...
Nope
Checking if db2138:3312/(none) needs to be moved under the new master...
Disabling GTID on db2138:3312/(none)...
Waiting some seconds for db to catch up...
Reenabling GTID on db2138:3312/(none)...
db2138:3312/(none) was moved successfully under the new master
Checking if db2101:3312/(none) needs to be moved under the new master...
Disabling GTID on db2101:3312/(none)...
Waiting some seconds for db to catch up...
Reenabling GTID on db2101:3312/(none)...
db2101:3312/(none) was moved successfully under the new master
Checking if db2097:3312/(none) needs to be moved under the new master...
Disabling GTID on db2097:3312/(none)...
Waiting some seconds for db to catch up...
Reenabling GTID on db2097:3312/(none)...
db2097:3312/(none) was moved successfully under the new master
Checking if db2088:3312/(none) needs to be moved under the new master...
Disabling GTID on db2088:3312/(none)...
Waiting some seconds for db to catch up...
Reenabling GTID on db2088:3312/(none)...
db2088:3312/(none) was moved successfully under the new master
Checking if db2126/(none) needs to be moved under the new master...
Disabling GTID on db2126/(none)...
Waiting some seconds for db to catch up...
Reenabling GTID on db2126/(none)...
db2126/(none) was moved successfully under the new master
Checking if db2148/(none) needs to be moved under the new master...
Disabling GTID on db2148/(none)...
Waiting some seconds for db to catch up...
Reenabling GTID on db2148/(none)...
db2148/(none) was moved successfully under the new master
[WARNING]: Semisync was not enabled on all hosts
SUCCESS: All slaves moved correctly, but not continuing further because --only-slave-move
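
From the log above, the replica-move phase is essentially the same loop for each replica: switch it to binlog-coordinate replication, repoint it under the new master, then re-enable GTID. The sketch below is illustrative only, with hypothetical helpers and placeholder coordinates; it is not the wmfmariadbpy implementation.

# Illustrative sketch of the per-replica pattern visible in the log above.
# Helpers, connection details and coordinates are hypothetical.
import pymysql

def run(host, *statements):
    conn = pymysql.connect(host=host, read_default_file="/root/.my.cnf")
    try:
        with conn.cursor() as cur:
            for stmt in statements:
                cur.execute(stmt)
    finally:
        conn.close()

def move_replica(replica, new_master, log_file, log_pos):
    # "Disabling GTID on <replica>": fall back to binlog coordinates.
    run(replica, "STOP SLAVE", "CHANGE MASTER TO MASTER_USE_GTID = no", "START SLAVE")
    # "Waiting some seconds for db to catch up...", then repoint under the
    # new master. The real tool computes the equivalent position itself;
    # log_file/log_pos here are placeholders.
    run(
        replica,
        "STOP SLAVE",
        f"CHANGE MASTER TO MASTER_HOST = '{new_master}', MASTER_PORT = 3306, "
        f"MASTER_LOG_FILE = '{log_file}', MASTER_LOG_POS = {log_pos}, MASTER_SSL = 1",
        "START SLAVE",
    )
    # "Reenabling GTID on <replica>".
    run(replica, "STOP SLAVE", "CHANGE MASTER TO MASTER_USE_GTID = slave_pos", "START SLAVE")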

Proposal (all commands run from cumin1001):

  • Move all replicas in s2/eqiad beneath the candidate primary: sudo DEBUG=1 db-switchover --replicating-master --read-only-master --timeout=15 --only-slave-move db1122.eqiad.wmnet db1162.eqiad.wmnet
  • Simulate the primary swap, but say 'no' at the prompt: sudo DEBUG=1 db-switchover --skip-slave-move db1122.eqiad.wmnet db1162.eqiad.wmnet
  • Undo changes: sudo DEBUG=1 db-switchover --replicating-master --read-only-master --timeout=15 --only-slave-move db1162.eqiad.wmnet db1122.eqiad.wmnet

Proposal (all commands run from cumin1001):

  • Move all replicas in s2/eqiad beneath the candidate primary: sudo DEBUG=1 db-switchover --timeout=15 --only-slave-move db1122.eqiad.wmnet db1162.eqiad.wmnet
  • Simulate the primary swap, but say 'no' at the prompt: sudo DEBUG=1 db-switchover --skip-slave-move db1122.eqiad.wmnet db1162.eqiad.wmnet
  • Undo changes: sudo DEBUG=1 db-switchover --timeout=15 --only-slave-move db1162.eqiad.wmnet db1122.eqiad.wmnet

We need to run those on codfw hosts:

sudo DEBUG=1 db-switchover --timeout=15 --only-slave-move db2107.codfw.wmnet db2104.codfw.wmnet
sudo DEBUG=1 db-switchover --skip-slave-move db2107.codfw.wmnet db2104.codfw.wmnet
sudo DEBUG=1 db-switchover --timeout=15 --only-slave-move db2107.codfw.wmnet db2104.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2021-08-10T12:23:40Z] <kormat> non-destructive (🤞) testing of db-switchover against s2/eqiad T288500

Proposal (all commands run from cumin1001):

  • Move all replicas in s2/eqiad beneath the candidate primary: sudo DEBUG=1 db-switchover --replicating-master --read-only-master --timeout=15 --only-slave-move db1122.eqiad.wmnet db1162.eqiad.wmnet
  • Simulate the primary swap, but say 'no' at the prompt: sudo DEBUG=1 db-switchover --skip-slave-move db1122.eqiad.wmnet db1162.eqiad.wmnet
  • Undo changes: sudo DEBUG=1 db-switchover --replicating-master --read-only-master --timeout=15 --only-slave-move db1162.eqiad.wmnet db1122.eqiad.wmnet

Results:

  • db-switchover failed multiple times to move db1152 (sanitarium master) under db1162, for no apparent reason
  • Manually moved the remaining nodes over using db-move-replica
  • db-switchover --skip-slave-move worked perfectly fine
  • db-switchover is unable to move replicas from a non-master to a master, so all replicas had to be moved back by hand using db-move-replica.

This time it sort of worked. The sequence:

  • The replicas move went fine.
  • The first switchover attempt:
[05:00:41] marostegui@cumin1001:/home$ sudo DEBUG=1 db-switchover --skip-slave-move db2107 db2104 ; echo "===== db2107 (OLD)"; sudo mysql.py -h db2107 -e 'show slave status\G' ; echo "===== db2104 (NEW)"; sudo mysql.py -h db2104 -e 'show slave status\G'
Executing 'SET SESSION max_statement_time = 0'
Connected to db2107/(none)
Executing 'SET SESSION max_statement_time = 0'
Connected to db2104/(none)
Starting preflight checks...
Executing 'SELECT @@GLOBAL.read_only'
Executing 'SELECT @@GLOBAL.read_only'
* Original read only values are as expected (master: read_only=False, slave: read_only=True)
Executing 'SET SESSION max_statement_time = 5.0'
Executing 'SHOW SLAVE STATUS'
Executing 'SET SESSION max_statement_time = 0'
Executing 'SET SESSION max_statement_time = 0'
Connected to db2107/(none)
* The host to fail over is a direct replica of the master
Executing 'SET SESSION max_statement_time = 5.0'
Executing 'SHOW SLAVE STATUS'
Executing 'SET SESSION max_statement_time = 0'
* Replication is up and running between the 2 hosts
Executing 'SELECT @@GLOBAL.binlog_format'
Executing 'SELECT @@GLOBAL.binlog_format'
* Binary log format is the same: STATEMENT
Executing 'SET SESSION max_statement_time = 5.0'
Executing 'SHOW SLAVE STATUS'
Executing 'SET SESSION max_statement_time = 0'
* The replication lag is acceptable: 0 (lower than the configured or default timeout)
Executing 'SET SESSION max_statement_time = 5.0'
Executing 'SHOW SLAVE STATUS'
Executing 'SET SESSION max_statement_time = 0'
Executing 'SET GLOBAL rpl_semi_sync_slave_enabled = 0'
Executing 'SELECT @@VERSION'
Executing 'SET GLOBAL rpl_semi_sync_master_enabled = 1'

It got stuck there for more than a minute, while db2104's processlist showed:

| 795218262 | root            | 10.64.32.25:39432    | NULL               | Query       |     104 | NULL                                                                        | SET GLOBAL rpl_semi_sync_master_enabled = 1 |    0.000 |

Even though the query was still listed as running and never finished, the change had actually been applied and rpl_semi_sync_master_enabled was already ON.
We decided to kill the connection, and the script continued:

ERROR 2013: Lost connection to MySQL server during query
[WARNING] Semisync could not be enabled on the new master
Are you sure you want to switchover current master db2107 and promote db2104 instead [yes/no]? yes
Stopping heartbeat on db2107/(none)
Setting up original master as read-only
Executing 'SET GLOBAL read_only = 1'
Executing 'SET SESSION max_statement_time = 5.0'
Traceback (most recent call last):
  File "/usr/bin/db-switchover", line 11, in <module>
    load_entry_point('wmfmariadbpy==0.7.1', 'console_scripts', 'db-switchover')()
  File "/usr/lib/python3/dist-packages/wmfmariadbpy/cli_admin/switchover.py", line 764, in main
    wait_for_slave_to_catch_up(master_replication, slave_replication, timeout)
  File "/usr/lib/python3/dist-packages/wmfmariadbpy/cli_admin/switchover.py", line 227, in wait_for_slave_to_catch_up
    while not slave_replication.caught_up_to_master(master_replication.connection):
  File "/usr/lib/python3/dist-packages/wmfmariadbpy/WMFReplication.py", line 465, in caught_up_to_master
    slave_status = self.slave_status()
  File "/usr/lib/python3/dist-packages/wmfmariadbpy/WMFReplication.py", line 29, in slave_status
    result = self.connection.execute("SHOW SLAVE STATUS", timeout=self.timeout)
  File "/usr/lib/python3/dist-packages/wmfmariadbpy/WMFMariaDB.py", line 185, in execute
    self.set_query_limit(timeout)
  File "/usr/lib/python3/dist-packages/wmfmariadbpy/WMFMariaDB.py", line 156, in set_query_limit
    "SET SESSION max_statement_time = {}".format(self.query_limit)
  File "/usr/lib/python3/dist-packages/wmfmariadbpy/WMFMariaDB.py", line 199, in execute
    cursor.execute(command)
  File "/usr/lib/python3/dist-packages/pymysql/cursors.py", line 170, in execute
    result = self._query(query)
  File "/usr/lib/python3/dist-packages/pymysql/cursors.py", line 328, in _query
    conn.query(q)
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 516, in query
    self._execute_command(COMMAND.COM_QUERY, sql)
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 750, in _execute_command
    raise err.InterfaceError("(0, '')")
pymysql.err.InterfaceError: (0, '')

We decided to go ahead and try again:

$ sudo DEBUG=1 db-switchover --skip-slave-move db2107 db2104 ; echo "===== db2107 (OLD)"; sudo mysql.py -h db2107 -e 'show slave status\G' ; echo "===== db2104 (NEW)"; sudo mysql.py -h db2104 -e 'show slave status\G'
Executing 'SET SESSION max_statement_time = 0'
Connected to db2107/(none)
Executing 'SET SESSION max_statement_time = 0'
Connected to db2104/(none)
Starting preflight checks...
Executing 'SELECT @@GLOBAL.read_only'
Executing 'SELECT @@GLOBAL.read_only'
[ERROR]: Initial read_only status check failed: original master read_only: 1 / original slave read_only: 1

The safety check triggered and the script aborted. After manually setting the old master back to read_only=0, we went ahead again, and this time it went fine:

$ sudo DEBUG=1 db-switchover --skip-slave-move db2107 db2104 ; echo "===== db2107 (OLD)"; sudo mysql.py -h db2107 -e 'show slave status\G' ; echo "===== db2104 (NEW)"; sudo mysql.py -h db2104 -e 'show slave status\G'
Executing 'SET SESSION max_statement_time = 0'
Connected to db2107/(none)
Executing 'SET SESSION max_statement_time = 0'
Connected to db2104/(none)
Starting preflight checks...
Executing 'SELECT @@GLOBAL.read_only'
Executing 'SELECT @@GLOBAL.read_only'
* Original read only values are as expected (master: read_only=False, slave: read_only=True)
Executing 'SET SESSION max_statement_time = 5.0'
Executing 'SHOW SLAVE STATUS'
Executing 'SET SESSION max_statement_time = 0'
Executing 'SET SESSION max_statement_time = 0'
Connected to db2107/(none)
* The host to fail over is a direct replica of the master
Executing 'SET SESSION max_statement_time = 5.0'
Executing 'SHOW SLAVE STATUS'
Executing 'SET SESSION max_statement_time = 0'
* Replication is up and running between the 2 hosts
Executing 'SELECT @@GLOBAL.binlog_format'
Executing 'SELECT @@GLOBAL.binlog_format'
* Binary log format is the same: STATEMENT
Executing 'SET SESSION max_statement_time = 5.0'
Executing 'SHOW SLAVE STATUS'
Executing 'SET SESSION max_statement_time = 0'
* The replication lag is acceptable: 0 (lower than the configured or default timeout)
Executing 'SET SESSION max_statement_time = 5.0'
Executing 'SHOW SLAVE STATUS'
Executing 'SET SESSION max_statement_time = 0'
Executing 'SET GLOBAL rpl_semi_sync_slave_enabled = 0'
Executing 'SELECT @@VERSION'
Executing 'SET GLOBAL rpl_semi_sync_master_enabled = 1'
Are you sure you want to switchover current master db2107 and promote db2104 instead [yes/no]? yes
Stopping heartbeat on db2107/(none)
Setting up original master as read-only
Executing 'SET GLOBAL read_only = 1'
Executing 'SET SESSION max_statement_time = 5.0'
Executing 'SHOW SLAVE STATUS'
Executing 'SET SESSION max_statement_time = 0'
Executing 'SHOW MASTER STATUS'
Executing 'SET SESSION max_statement_time = 5.0'
Executing 'SHOW SLAVE STATUS'
Executing 'SET SESSION max_statement_time = 0'
Executing 'SHOW MASTER STATUS'
Slave caught up to the master after waiting 0.2649571895599365 seconds
Executing 'SET SESSION max_statement_time = 5.0'
Executing 'SHOW SLAVE STATUS'
Executing 'SET SESSION max_statement_time = 0'
Executing 'SHOW MASTER STATUS'
Servers sync at master: db2107-bin.002601:354605253 slave: db2104-bin.002568:201726269
Stopping original master->slave replication
Executing 'SET SESSION max_statement_time = 5.0'
Executing 'SHOW SLAVE STATUS'
Executing 'SET SESSION max_statement_time = 0'
Executing 'SET SESSION max_statement_time = 5.0'
Executing 'STOP SLAVE'
Executing 'SET SESSION max_statement_time = 0'
Executing 'SET SESSION max_statement_time = 5.0'
Executing 'SHOW SLAVE STATUS'
Executing 'SET SESSION max_statement_time = 0'
Setting up replica as read-write
Executing 'SET GLOBAL read_only = 0'
Executing 'SELECT @@GLOBAL.read_only'
Executing 'SELECT @@GLOBAL.read_only'
All commands where successful, current status: original master read_only: 1 / original slave read_only: 0
Trying to invert replication direction
Executing 'SET SESSION max_statement_time = 5.0'
Executing 'SHOW SLAVE STATUS'
Executing 'SET SESSION max_statement_time = 0'
Executing 'CHANGE MASTER TO
                   MASTER_HOST = 'db2104.codfw.wmnet',
                   MASTER_PORT = 3306,
                   MASTER_USER = 'repl',
                   MASTER_PASSWORD = 'xxx',
                   MASTER_LOG_FILE = 'db2104-bin.002568',
                   MASTER_LOG_POS = 201726269,
                   MASTER_SSL = 1
                 '
Executing 'SET SESSION max_statement_time = 5.0'
Executing 'SHOW SLAVE STATUS'
Executing 'SET SESSION max_statement_time = 0'
Executing 'SET SESSION max_statement_time = 5.0'
Executing 'START SLAVE'
Executing 'SET SESSION max_statement_time = 0'
Executing 'SET SESSION max_statement_time = 5.0'
Executing 'SHOW SLAVE STATUS'
Executing 'SET SESSION max_statement_time = 0'
Executing 'SET SESSION max_statement_time = 5.0'
Executing 'SHOW SLAVE STATUS'
Executing 'SET SESSION max_statement_time = 0'
Executing 'SET SESSION max_statement_time = 5.0'
Executing 'SHOW SLAVE STATUS'
Executing 'SET SESSION max_statement_time = 0'
Executing 'RESET SLAVE ALL'
Executing 'SET GLOBAL rpl_semi_sync_master_enabled = 0'
Executing 'SELECT @@VERSION'
Executing 'UNINSTALL PLUGIN rpl_semi_sync_master'
/usr/lib/python3/dist-packages/pymysql/cursors.py:170: Warning: (1620, 'Plugin is busy and will be uninstalled on shutdown')
  result = self._query(query)
Executing 'INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so''
Executing 'SET GLOBAL rpl_semi_sync_slave_enabled = 1'
Starting heartbeat on db2104/(none)
Verifying everything went as expected...
Executing 'SELECT @@GLOBAL.read_only'
Executing 'SELECT @@GLOBAL.read_only'
Executing 'SET SESSION max_statement_time = 5.0'
Executing 'SHOW SLAVE STATUS'
Executing 'SET SESSION max_statement_time = 0'
Executing 'SET SESSION max_statement_time = 0'
Connected to db2104/(none)
Executing 'SET SESSION max_statement_time = 5.0'
Executing 'SHOW SLAVE STATUS'
Executing 'SET SESSION max_statement_time = 0'
Executing 'SET SESSION max_statement_time = 5.0'
Executing 'SHOW SLAVE STATUS'
Executing 'SET SESSION max_statement_time = 0'
SUCCESS: Master switch completed successfully
Enabling GTID on old master...
Executing 'SET SESSION max_statement_time = 5.0'
Executing 'SHOW SLAVE STATUS'
Executing 'SET SESSION max_statement_time = 0'
Executing 'SET SESSION max_statement_time = 5.0'
Executing 'SHOW SLAVE STATUS'
Executing 'SET SESSION max_statement_time = 0'
Executing 'SET SESSION max_statement_time = 5.0'
Executing 'STOP SLAVE'
Executing 'SET SESSION max_statement_time = 0'
Executing 'SET SESSION max_statement_time = 5.0'
Executing 'SHOW SLAVE STATUS'
Executing 'SET SESSION max_statement_time = 0'
Executing 'CHANGE MASTER TO MASTER_USE_GTID = slave_pos'
Executing 'SET SESSION max_statement_time = 5.0'
Executing 'SHOW SLAVE STATUS'
Executing 'SET SESSION max_statement_time = 0'
Executing 'SET SESSION max_statement_time = 5.0'
Executing 'START SLAVE'
Executing 'SET SESSION max_statement_time = 0'
Executing 'SET SESSION max_statement_time = 5.0'
Executing 'SHOW SLAVE STATUS'
Executing 'SET SESSION max_statement_time = 0'
Updating tendril...
Executing 'SET SESSION max_statement_time = 0'
Connected to db1115/tendril
Executing 'SELECT name FROM shards WHERE master_id = (SELECT id                    FROM servers                    WHERE host = 'db2107.codfw.wmnet' AND port = 3306)'
Executing 'UPDATE shards SET master_id = (SELECT id FROM servers WHERE host = 'db2104.codfw.wmnet' and port = 3306) WHERE name = 's2' LIMIT 1'
Tendril updated successfully: db2104/(none) is the new master of s2
Updating zarcillo...
Executing 'SET SESSION max_statement_time = 0'
Connected to db1115/zarcillo
Executing 'SELECT section, dc FROM masters WHERE instance = (SELECT name                   FROM instances                   WHERE server = 'db2107.codfw.wmnet' AND port = 3306)'
Executing 'SET STATEMENT binlog_format='ROW' FOR UPDATE masters SET instance = (SELECT name                 FROM instances                 WHERE server = 'db2104.codfw.wmnet' AND port = 3306)WHERE section = 's2' AND dc = 'codfw' LIMIT 1'
Zarcillo updated successfully: db2104/(none) is the new master of s2 at codfw
Please remember to run the following commands as root to update the events if they are Mediawiki databases:
mysql.py -h db2107 < /home/jynus/software/dbtools/events_coredb_slave.sql
mysql.py -h db2104 < /home/jynus/software/dbtools/events_coredb_master.sql

Looking at config differences between db2104 (new s2/codfw primary) and db2123 (s5/codfw primary), the semi-sync config looks suspiciously different:

$ sudo pt-config-diff h=db2104.codfw.wmnet,F=/root/.my.cnf h=db2123.codfw.wmnet,F=/root/.my.cnf
17 config differences
Variable                  db2104                    db2123
========================= ========================= =========================                                                          
general_log_file          db2104.log                db2123.log
gtid_binlog_pos           171966574-171966574-22... 171970593-171970593-29...                                                          
gtid_binlog_state         171966574-171966574-22... 171970593-171970593-29...                                                          
gtid_current_pos          0-180359173-4880477695... 0-180359179-5734605861...                                                          
gtid_domain_id            180355191                 180359180
gtid_slave_pos            0-180359173-4880477695... 0-180359179-5734605861...                                                          
hostname                  db2104                    db2123
log_bin_basename          /srv/sqldata/db2104-bin   /srv/sqldata/db2123-bin                                                            
log_bin_index             /srv/sqldata/db2104-bi... /srv/sqldata/db2123-bi...                                                          
pid_file                  /srv/sqldata/db2104.pid   /srv/sqldata/db2123.pid                                                            
report_host               db2104.codfw.wmnet        db2123.codfw.wmnet                                                                 
rpl_semi_sync_master_t... 10000                     100
rpl_semi_sync_master_w... ON                        OFF
server_id                 180355191                 180359180
slow_query_log_file       db2104-slow.log           db2123-slow.log
thread_pool_stall_limit   100                       10
wsrep_node_name           db2104                    db2123

The semi-sync vars that are different are:

  • rpl_semi_sync_master_timeout: "The timeout value, in milliseconds, for semi-synchronous replication in the primary. If this timeout is exceeded in waiting on a commit for acknowledgement from a replica, the primary will revert to asynchronous replication."
  • rpl_semi_sync_master_wait_no_slave: "If set to ON, the default, the replica count (recorded by Rpl_semi_sync_master_clients) may drop to zero, and the primary will still wait for the timeout period. If set to OFF, the primary will revert to asynchronous replication as soon as the replica count drops to zero."
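
Both variables are dynamic, so they can be inspected and, if needed, aligned at runtime. A minimal sketch with a placeholder host (illustrative, not a command taken from this maintenance):

# Sketch only: check the two semi-sync master settings discussed above and
# align them with the values used on the other codfw primaries
# (timeout = 100 ms, wait_no_slave = OFF). Host/credentials are placeholders.
import pymysql

conn = pymysql.connect(host="db2104.codfw.wmnet", read_default_file="/root/.my.cnf")
with conn.cursor() as cur:
    cur.execute(
        "SHOW GLOBAL VARIABLES WHERE Variable_name IN "
        "('rpl_semi_sync_master_timeout', 'rpl_semi_sync_master_wait_no_slave')"
    )
    print(dict(cur.fetchall()))
    cur.execute("SET GLOBAL rpl_semi_sync_master_timeout = 100")
    cur.execute("SET GLOBAL rpl_semi_sync_master_wait_no_slave = OFF")
conn.close()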

That's an interesting find - though I don't see how that could've happened considering that both files are the same:

root@cumin1001:/home/marostegui/semi# diff -i db2123_my.cnf db2104_my.cnf
18,19c18,19
< server_id          = 180359180
< report_host        = db2123.codfw.wmnet
---
> server_id          = 180355191
> report_host        = db2104.codfw.wmnet
26c26
< gtid_domain_id  = 180359180
---
> gtid_domain_id  = 180355191

Checking the last eqiad master switch (s3 - db1157) against db2104, they do look the same:

root@cumin1001:/home/marostegui/semi# pt-config-diff h=db1157.eqiad.wmnet,F=/root/.my.cnf h=db2104.codfw.wmnet,F=/root/.my.cnf
16 config differences
Variable                  db1157                    db2104
========================= ========================= =========================
general_log_file          db1157.log                db2104.log
gtid_binlog_pos           171966508-171966508-42... 171966574-171966574-22...
gtid_binlog_state         171966508-171966508-42... 171966574-171966574-22...
gtid_current_pos          0-171966669-4075108480... 0-180359173-4880477695...
gtid_domain_id            171966508                 180355191
gtid_slave_pos            0-171966669-4075108480... 0-180359173-4880477695...
hostname                  db1157                    db2104
log_bin_basename          /srv/sqldata/db1157-bin   /srv/sqldata/db2104-bin
log_bin_index             /srv/sqldata/db1157-bi... /srv/sqldata/db2104-bi...
pid_file                  /srv/sqldata/db1157.pid   /srv/sqldata/db2104.pid
read_only                 ON                        OFF
report_host               db1157.eqiad.wmnet        db2104.codfw.wmnet
server_id                 171966508                 180355191
slow_query_log_file       db1157-slow.log           db2104-slow.log
wsrep_cluster_name        eqiad                     codfw
wsrep_node_name           db1157                    db2104

And, even more interesting, db1157 (s3 eqiad) vs db2105 (s3 codfw) are different:

# pt-config-diff h=db1157.eqiad.wmnet,F=/root/.my.cnf h=db2105.codfw.wmnet,F=/root/.my.cnf
19 config differences
Variable                  db1157                    db2105
========================= ========================= =========================
general_log_file          db1157.log                db2105.log
gtid_binlog_pos           171966508-171966508-42... 171966508-171966508-38...
gtid_binlog_state         171966508-171966508-42... 171966508-171966508-38...
gtid_current_pos          0-171966669-4075108480... 0-171966669-4075108480...
gtid_domain_id            171966508                 180355192
gtid_slave_pos            0-171966669-4075108480... 0-171966669-4075108480...
hostname                  db1157                    db2105
log_bin_basename          /srv/sqldata/db1157-bin   /srv/sqldata/db2105-bin
log_bin_index             /srv/sqldata/db1157-bi... /srv/sqldata/db2105-bi...
pid_file                  /srv/sqldata/db1157.pid   /srv/sqldata/db2105.pid
read_only                 ON                        OFF
report_host               db1157.eqiad.wmnet        db2105.codfw.wmnet
rpl_semi_sync_master_t... 10000                     100
rpl_semi_sync_master_w... ON                        OFF
server_id                 171966508                 180355192
slow_query_log_file       db1157-slow.log           db2105-slow.log
thread_pool_stall_limit   100                       10
wsrep_cluster_name        eqiad                     codfw
wsrep_node_name           db1157                    db2105

So I am curious to see whether your s7 switchover has the same issue or not. If it is something strictly related to codfw (i.e. latency, name resolution...) I would expect it to have issues as well.
Tomorrow I am switching m3, which is in eqiad, so it's not really comparable, as the number of replicas is just way lower... we'll see how that goes, although those two hosts are the same:

# pt-config-diff h=db1107.eqiad.wmnet,F=/root/.my.cnf h=db1132.eqiad.wmnet,F=/root/.my.cnf
18 config differences
Variable                  db1107                    db1132
========================= ========================= =========================
general_log_file          db1107.log                db1132.log
gtid_binlog_pos           171970595-171970595-65... 171966562-171966562-77...
gtid_binlog_state         171970595-171970595-65... 171966562-171966562-77...
gtid_domain_id            171966678                 171970595
gtid_slave_pos            0-171970592-200818911,... 0-171970592-200818911,...
hostname                  db1107                    db1132
innodb_buffer_pool_size   270582939648              269509197824
innodb_version            10.4.19                   10.4.18
log_bin_basename          /srv/sqldata/db1107-bin   /srv/sqldata/db1132-bin
log_bin_index             /srv/sqldata/db1107-bi... /srv/sqldata/db1132-bi...
pid_file                  /srv/sqldata/db1107.pid   /srv/sqldata/db1132.pid
read_only                 ON                        OFF
report_host               db1107.eqiad.wmnet        db1132.eqiad.wmnet
server_id                 171966678                 171970595
slow_query_log_file       db1107-slow.log           db1132-slow.log
version                   10.4.19-MariaDB-log       10.4.18-MariaDB-log
version_source_revision   a4139f8d68bd31e80ff620... 53123dfa3e365138591fd2...
wsrep_node_name           db1107                    db1132

Change 711489 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] mariadb: If semi-sync is enabled, always config master settings.

https://gerrit.wikimedia.org/r/711489

New test plan:

  • Configure semi_sync settings on s2/eqiad candidate (db1162) to match those on db2104 (new s2/codfw primary)
  • Move all replicas under db1162: sudo DEBUG=1 db-switchover --replicating-master --read-only-master --only-slave-move db1122.eqiad.wmnet db1162.eqiad.wmnet
  • Wait 30 mins, to simulate a real switchover
  • Monitor semi_sync status vars in the meantime: sudo mysql.py -h db1162 -e "show global variables like '%semi%'; show global status like '%semi%'"
  • Run db-switchover --skip-slave-move against db1162, and see if it works or not: sudo DEBUG=1 db-switchover --skip-slave-move db1122.eqiad.wmnet db1162.eqiad.wmnet
  • Manually revert all of the above (😭 )
    • db-move-replica for each of the replicas
    • re-enable gtid on each replica
    • check semi_sync settings on replicas


Executed this, some output here: P17019

I wasn't able to reproduce the issue, sadly.

Change 711489 merged by Kormat:

[operations/puppet@production] mariadb: If semi-sync is enabled, always config master settings.

https://gerrit.wikimedia.org/r/711489

Buster: All role::mariadb::core instances now have the correct defaults for semi-sync master:

$ cumin "A:buster and P{O:mariadb::core}" "mysql -e \"show global variables where Variable_name in ('rpl_semi_sync_master_timeout', 'rpl_semi_sync_master_wait_no_slave')\""
116 hosts will be targeted:
...
===== NODE GROUP =====                                                                                                                  
(116) db[2071,2075,2081,2083-2084,2091-2092,2096,2104-2109,2111,2113-2117,2119-2125,2127,2129-2131,2136,2140,2142-2150,2152].codfw.wmnet,db[1100,1103,1110-1111,1114,1119-1120,1122-1123,1127,1129-1131,1134-1137,1141-1143,1147-1149,1151-1153,1157,1162,1164,1166,1168-1169,1172-1175,1177-1182,1184].eqiad.wmnet,es[2020-2034].codfw.wmnet,es[1020-1034].eqiad.wmnet
----- OUTPUT of 'mysql -e "show g...wait_no_slave')"' -----                                                                             
Variable_name   Value                                                                                                                   
rpl_semi_sync_master_timeout    100                                                                                                     
rpl_semi_sync_master_wait_no_slave      OFF
================                                                                                                                        

Stretch
Of the remaining 14 instances (in role::mariadb::core), only 7 of them have the semisync-master plugin loaded:

$ cumin "A:stretch and P{O:mariadb::core}" "mysql -e \"show global variables where Variable_name in ('rpl_semi_sync_master_timeout', 'rpl_semi_sync_master_wait_no_slave')\""
14 hosts will be targeted:
db[2079-2080,2090,2103,2110,2112,2118].codfw.wmnet,db[1104,1109,1118,1126,1138,1160,1163].eqiad.wmnet
Ok to proceed on 14 hosts? Enter the number of affected hosts to confirm or "q" to quit 14
===== NODE GROUP =====                                                                                                                  
(7) db[2079,2090,2112,2118].codfw.wmnet,db[1104,1138,1163].eqiad.wmnet                                                                  
----- OUTPUT of 'mysql -e "show g...wait_no_slave')"' -----                                                                             
Variable_name   Value                                                                                                                   
rpl_semi_sync_master_timeout    100                                                                                                     
rpl_semi_sync_master_wait_no_slave      OFF
================

I'm not going to load the plugin on the other 7 to change their defaults, as it doesn't seem worth the risk.

Current status:

  • This seems to be a mariadb bug, but as we can't reproduce it, it's ~impossible to track down.
  • I'm going to patch db-switchover so that it doesn't re-set semisync-master if it's already set. This should at least avoid the situation that triggers the bug (a sketch of the idea follows this list).
  • Our current workflow leaves the primary without a working semisync setup in the interval between db-switchover --only-slave-move and db-switchover --skip-slave-move. This needs to be investigated, bearing in mind that in the past this was handled differently, and that was suspected to be the cause of a switchover failure.
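
A minimal sketch of that guard, assuming a pymysql-style cursor (illustrative only, not the actual wmfmariadbpy patch):

# Illustrative only: skip re-issuing the statement that triggered the hang
# if the candidate already has semi-sync master enabled.
def enable_semisync_master(cursor):
    cursor.execute("SELECT @@GLOBAL.rpl_semi_sync_master_enabled")
    (already_enabled,) = cursor.fetchone()
    if already_enabled:
        return  # already on; nothing to do
    cursor.execute("SET GLOBAL rpl_semi_sync_master_enabled = 1")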

Change 713263 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/software/wmfmariadbpy@master] db-switchover: Only configure semisync once

https://gerrit.wikimedia.org/r/713263

Change 713263 merged by jenkins-bot:

[operations/software/wmfmariadbpy@master] db-switchover: Only configure semisync once

https://gerrit.wikimedia.org/r/713263

The workaround is in wmfmariadbpy 0.7.2. I've built the new packages, will deploy them tomorrow.

Should we close this or maybe wait until Monday's switchover?

Let's wait until after Monday's switchover.

Resolving. We've now done 2 switchovers with the mitigation in place, and haven't had any further issues.