While attempting to fail over db2107 to db2104 (T287454), the script got stuck and I had to Ctrl+C it after a couple of minutes (we were already read-only). This is the output:
root@cumin2001:~# db-switchover --skip-slave-move db2107.codfw.wmnet db2104.codfw.wmnet
Starting preflight checks...
* Original read only values are as expected (master: read_only=False, slave: read_only=True)
* The host to fail over is a direct replica of the master
* Replication is up and running between the 2 hosts
* Binary log format is the same: STATEMENT
* The replication lag is acceptable: 0 (lower than the configured or default timeout)
^CTraceback (most recent call last):
  File "/usr/bin/db-switchover", line 11, in <module>
    load_entry_point('wmfmariadbpy==0.7.1', 'console_scripts', 'db-switchover')()
  File "/usr/lib/python3/dist-packages/wmfmariadbpy/cli_admin/switchover.py", line 740, in main
    handle_new_master_semisync_replication(slave)
  File "/usr/lib/python3/dist-packages/wmfmariadbpy/cli_admin/switchover.py", line 665, in handle_new_master_semisync_replication
    result = slave.execute("SET GLOBAL rpl_semi_sync_master_enabled = 1")
  File "/usr/lib/python3/dist-packages/wmfmariadbpy/WMFMariaDB.py", line 199, in execute
    cursor.execute(command)
  File "/usr/lib/python3/dist-packages/pymysql/cursors.py", line 170, in execute
    result = self._query(query)
  File "/usr/lib/python3/dist-packages/pymysql/cursors.py", line 328, in _query
    conn.query(q)
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 517, in query
    self._affected_rows = self._read_query_result(unbuffered=unbuffered)
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 732, in _read_query_result
    result.read()
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 1075, in read
    first_packet = self.connection._read_packet()
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 657, in _read_packet
    packet_header = self._read_bytes(4)
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 691, in _read_bytes
    data = self._rfile.read(num_bytes)
  File "/usr/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.7/ssl.py", line 1052, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.7/ssl.py", line 911, in read
    return self._sslobj.read(len, buffer)
KeyboardInterrupt
It failed the same way from both cumin1001 and cumin2001, with and without FQDNs.
It's been years since we ran this script against codfw primary masters, so there might be something there. My guess is that it got stuck while trying to connect to, or run a command on, the MySQL hosts, but I couldn't find anything in SHOW PROCESSLIST on the hosts involved. I didn't have much time to debug, as writes had already been blocked for around 5 minutes.
db-switchover --timeout=15 --only-slave-move db2107.codfw.wmnet db2104.codfw.wmnet worked without issues.
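For reference while debugging: the traceback ends inside pymysql's `_read_bytes(4)`, which means the client had already sent `SET GLOBAL rpl_semi_sync_master_enabled = 1` and was blocked waiting for the first bytes of the server's reply, with nothing to bound that wait. The sketch below is not code from wmfmariadbpy; it only illustrates, with the standard library, how a read timeout turns such an indefinite wait into a quick, detectable failure. The function name `read_with_timeout` and the 0.2 s value are made up for illustration.

```python
import socket


def read_with_timeout(sock, num_bytes, timeout_s):
    """Read up to num_bytes from sock, giving up after timeout_s seconds.

    Returns the bytes read, or None if nothing arrived in time.
    (Hypothetical helper, for illustration only.)
    """
    sock.settimeout(timeout_s)
    try:
        return sock.recv(num_bytes)
    except socket.timeout:
        return None


if __name__ == "__main__":
    # A connected socket pair stands in for the client/server link.
    client, server = socket.socketpair()
    try:
        # The "server" stays silent, as when a statement blocks server-side:
        # the read gives up after 0.2 s instead of hanging until Ctrl+C.
        print(read_with_timeout(client, 4, 0.2))
        # Once the "server" replies, the same read returns right away.
        server.sendall(b"\x01\x00\x00\x01")
        print(read_with_timeout(client, 4, 0.2))
    finally:
        client.close()
        server.close()
```

PyMySQL's `connect()` accepts a `read_timeout` argument that has the same effect on its own socket. Whether WMFMariaDB.py sets it, and whether the `--timeout` option of db-switchover is meant to cover this step, would need to be checked in the code.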