Page MenuHomePhabricator

[toolsdb] clouddb1002 stopped replicating from clouddb1001
Closed, ResolvedPublic

Description

Replication stopped on Saturday 29 Oct:

Oct 29 16:20:18 clouddb1002 mysqld[27318]: 2022-10-29 16:20:18 139958015805184 [ERROR] Slave SQL: Could not execute Update_rows_v1 event on table s54518__mw.online; Can't find record in 'online', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log log.301158, end_log_pos 20678372, Gtid 2886731673-2886731673-2226312553, Internal MariaDB error code: 1032
Oct 29 16:20:18 clouddb1002 mysqld[27318]: 2022-10-29 16:20:18 139958015805184 [Warning] Slave: Can't find record in 'online' Error_code: 1032
Oct 29 16:20:18 clouddb1002 mysqld[27318]: 2022-10-29 16:20:18 139958015805184 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'log.301158' position 20678171; GTID position '0-2886731673-33522724637,2886731673-2886731673-2226312552'
Oct 29 16:20:18 clouddb1002 mysqld[27318]: 2022-10-29 16:20:18 139958015805184 [Note] Slave SQL thread exiting, replication stopped in log 'log.301158' at position 20678171; GTID position '0-2886731673-33522724637,2886731673-2886731673-2226312552'

A similar error happened two years ago, see T266587.

I did manually restart mariadb on that host a couple days before, but it seemed to resume correctly:

Oct 27 13:41:41 clouddb1002 systemd[1]: Stopping mariadb database server...
Oct 27 13:41:41 clouddb1002 mysqld[23602]: 2022-10-27 13:41:41 139804210677504 [Note] /opt/wmf-mariadb101/bin/mysqld: Normal shutdown
Oct 27 13:41:41 clouddb1002 mysqld[23602]: 2022-10-27 13:41:41 139804210677504 [Note] Event Scheduler: Killing the scheduler thread, thread id 2
Oct 27 13:41:41 clouddb1002 mysqld[23602]: 2022-10-27 13:41:41 139804210677504 [Note] Event Scheduler: Waiting for the scheduler thread to reply
Oct 27 13:41:41 clouddb1002 mysqld[23602]: 2022-10-27 13:41:41 139804210677504 [Note] Event Scheduler: Stopped
Oct 27 13:41:41 clouddb1002 mysqld[23602]: 2022-10-27 13:41:41 139804210677504 [Note] Event Scheduler: Purging the queue. 0 events
Oct 27 13:41:41 clouddb1002 mysqld[23602]: 2022-10-27 13:41:41 139804415751936 [Note] Slave SQL thread exiting, replication stopped in log 'log.300972' at position 42188486; GTID position '0-2886731673-33522724637,2886731673-2886731673-2193663302'
Oct 27 13:41:41 clouddb1002 mysqld[23602]: 2022-10-27 13:41:41 139804415956736 [Note] Slave I/O thread exiting, read up to log 'log.300972', position 42188486; GTID position 0-2886731673-33522724637,2886731673-2886731673-2193663302
Oct 27 13:41:42 clouddb1002 mysqld[23602]: 2022-10-27 13:41:42 139767312262912 [Note] InnoDB: FTS optimize thread exiting.
Oct 27 13:41:42 clouddb1002 mysqld[23602]: 2022-10-27 13:41:42 139804210677504 [Note] InnoDB: Starting shutdown...
Oct 27 13:41:42 clouddb1002 mysqld[23602]: 2022-10-27 13:41:42 139767287084800 [Warning] InnoDB: Dumping buffer pool(s) to .//ib_buffer_pool
Oct 27 13:41:43 clouddb1002 mysqld[23602]: 2022-10-27 13:41:43 139767287084800 [Warning] InnoDB: Buffer pool(s) dump completed at 221027 13:41:43
Oct 27 13:41:43 clouddb1002 mysqld[23602]: 2022-10-27 13:41:43 139804210677504 [Note] InnoDB: Waiting for page_cleaner to finish flushing of buffer pool
Oct 27 13:41:50 clouddb1002 mysqld[23602]: 2022-10-27 13:41:50 139804210677504 [Note] InnoDB: Shutdown completed; log sequence number 70416120485936
Oct 27 13:41:50 clouddb1002 mysqld[23602]: 2022-10-27 13:41:50 139804210677504 [Note] /opt/wmf-mariadb101/bin/mysqld: Shutdown complete
Oct 27 13:41:50 clouddb1002 systemd[1]: Stopped mariadb database server.
Oct 27 13:42:53 clouddb1002 systemd[1]: Starting mariadb database server...
Oct 27 13:42:53 clouddb1002 mysqld[27318]: 2022-10-27 13:42:53 139958272529088 [Note] /opt/wmf-mariadb101/bin/mysqld (mysqld 10.1.39-MariaDB) starting as process 27318 ...
Oct 27 13:42:53 clouddb1002 mysqld[27318]: 2022-10-27 13:42:53 139958272529088 [ERROR] Plugin 'unix_socket' already installed
Oct 27 13:42:53 clouddb1002 mysqld[27318]: 2022-10-27 13:42:53 139958272529088 [Note] InnoDB: Using mutexes to ref count buffer pool pages
Oct 27 13:42:53 clouddb1002 mysqld[27318]: 2022-10-27 13:42:53 139958272529088 [Note] InnoDB: The InnoDB memory heap is disabled
Oct 27 13:42:53 clouddb1002 mysqld[27318]: 2022-10-27 13:42:53 139958272529088 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
Oct 27 13:42:53 clouddb1002 mysqld[27318]: 2022-10-27 13:42:53 139958272529088 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
Oct 27 13:42:53 clouddb1002 mysqld[27318]: 2022-10-27 13:42:53 139958272529088 [Note] InnoDB: Compressed tables use zlib 1.2.11
Oct 27 13:42:53 clouddb1002 mysqld[27318]: 2022-10-27 13:42:53 139958272529088 [Note] InnoDB: Using Linux native AIO
Oct 27 13:42:53 clouddb1002 mysqld[27318]: 2022-10-27 13:42:53 139958272529088 [Note] InnoDB: Using SSE crc32 instructions
Oct 27 13:42:53 clouddb1002 mysqld[27318]: 2022-10-27 13:42:53 139958272529088 [Note] InnoDB: Initializing buffer pool, size = 31.0G
Oct 27 13:42:55 clouddb1002 mysqld[27318]: 2022-10-27 13:42:55 139958272529088 [Note] InnoDB: Completed initialization of buffer pool
Oct 27 13:42:55 clouddb1002 mysqld[27318]: 2022-10-27 13:42:55 139958272529088 [Note] InnoDB: Highest supported file format is Barracuda.
Oct 27 13:42:57 clouddb1002 mysqld[27318]: 2022-10-27 13:42:57 139958272529088 [Note] InnoDB: 128 rollback segment(s) are active.
Oct 27 13:42:57 clouddb1002 mysqld[27318]: 2022-10-27 13:42:57 139958272529088 [Note] InnoDB: Waiting for purge to start
Oct 27 13:42:57 clouddb1002 mysqld[27318]: 2022-10-27 13:42:57 139958272529088 [Note] InnoDB:  Percona XtraDB (http://www.percona.com) 5.6.43-84.3 started; log sequence number 70416120485936
Oct 27 13:42:57 clouddb1002 mysqld[27318]: 2022-10-27 13:42:57 139921096406784 [Note] InnoDB: Dumping buffer pool(s) not yet started
Oct 27 13:42:57 clouddb1002 mysqld[27318]: 2022-10-27 13:42:57 7f41eb3f7700 InnoDB: Loading buffer pool(s) from .//ib_buffer_pool
Oct 27 13:42:57 clouddb1002 mysqld[27318]: 2022-10-27 13:42:57 139958272529088 [Note] Server socket created on IP: '::'.
Oct 27 13:42:57 clouddb1002 mysqld[27318]: 2022-10-27 13:42:57 139958272529088 [Warning] 'proxies_priv' entry '@% root@labsdb1005' ignored in --skip-name-resolve mode.
Oct 27 13:42:57 clouddb1002 mysqld[27318]: 2022-10-27 13:42:57 139958220961536 [Note] Event Scheduler: scheduler thread started with id 2
Oct 27 13:42:57 clouddb1002 mysqld[27318]: 2022-10-27 13:42:57 139958272529088 [Warning] Neither --relay-log nor --relay-log-index were used; so replication may break when this MySQL server acts as a slave and has his hostname changed!! Please use '--log -basename=#' or '--relay-log=clouddb1002-relay-bin' to avoid this problem.
Oct 27 13:42:58 clouddb1002 mysqld[27318]: 2022-10-27 13:42:58 139958272529088 [Note] /opt/wmf-mariadb101/bin/mysqld: ready for connections.
Oct 27 13:42:58 clouddb1002 mysqld[27318]: Version: '10.1.39-MariaDB'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306  MariaDB Server
Oct 27 13:42:58 clouddb1002 systemd[1]: Started mariadb database server.
Oct 27 13:42:58 clouddb1002 mysqld[27318]: 2022-10-27 13:42:58 139958220551936 [Note] Slave I/O thread: connected to master 'repl@clouddb1001.clouddb-services.eqiad1.wikimedia.cloud:3306',replication starts at GTID position '0-2886731673-33522724637,2886731673-2886731673-2193663302'

Event Timeline

fnegri changed the task status from Open to In Progress.Oct 31 2022, 3:28 PM
fnegri triaged this task as High priority.

I have run manually these commands on clouddb1002 and replication has resumed, though there is a big lag at the moment:

MariaDB [(none)]> set global Replicate_Wild_Ignore_Table='s51412\_\_data.%,s51071\_\_templatetiger\_p.%,s52721\_\_pagecount\_stats\_p.%,s51290\_\_dpl\_p.%,s54518__mw.online';
MariaDB [(none)]> START SLAVE;
MariaDB [(none)]> show slave status\G
[...]     
Seconds_Behind_Master: 170412

Replication lag is slowly but steadily decreasing.

Seconds_Behind_Master: 155790

Screenshot 2022-10-31 at 18.13.23.png (1×1 px, 128 KB)

Replication is back to normal.

Screenshot 2022-11-02 at 10.05.51.png (610×1 px, 94 KB)

I have restarted MariaDB (systemctl restart mariadb) on clouddb1002 to let it pick up https://gerrit.wikimedia.org/r/c/operations/puppet/+/810420 -- that patch adds s54518\_\_mw.% to replicate_wild_ignore_table through the config file, which is similar to what I did manually from the MariaDB shell.

I also updated the list of non-replicated tables in this wiki page.