Today, 2020-05-27, the ToolsDB master server crashed.
The master server, clouddb1001.clouddb-services.eqiad.wmflabs, was completely frozen. After checking that the replica, clouddb1002.clouddb-services.eqiad.wmflabs, was healthy, we decided to hard-reboot the master.
After the reboot, we were able to collect some logs indicating memory exhaustion:
May 27 05:47:14 clouddb1001 kernel: [18638293.176610] mysqld: page allocation stalls for 644076ms, order:0, mode:0x342004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
May 27 05:47:14 clouddb1001 kernel: [18638293.176618] CPU: 1 PID: 8567 Comm: mysqld Not tainted 4.9.0-8-amd64 #1 Debian 4.9.144-3.1
May 27 05:47:14 clouddb1001 kernel: [18638293.176619] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.10.2-1 04/01/2014
May 27 05:47:14 clouddb1001 kernel: [18638293.176620] 0000000000000000 ffffffffb9534524 ffffffffb9c01e18 ffff9f6706f5fa08
May 27 05:47:14 clouddb1001 kernel: [18638293.176623] ffffffffb938bfaa 0342004a00000000 ffffffffb9c01e18 ffff9f6706f5f9a8
May 27 05:47:14 clouddb1001 kernel: [18638293.176625] ffffffff00000010 ffff9f6706f5fa18 ffff9f6706f5f9c8 fee40a0c5d22bfa4
May 27 05:47:14 clouddb1001 kernel: [18638293.176627] Call Trace:
May 27 05:47:14 clouddb1001 kernel: [18638293.176636] [<ffffffffb9534524>] ? dump_stack+0x5c/0x78
May 27 05:47:14 clouddb1001 kernel: [18638293.176641] [<ffffffffb938bfaa>] ? warn_alloc+0x13a/0x160
May 27 05:47:14 clouddb1001 kernel: [18638293.176643] [<ffffffffb938c99d>] ? __alloc_pages_slowpath+0x95d/0xbb0
May 27 05:47:14 clouddb1001 kernel: [18638293.176645] [<ffffffffb938cdf1>] ? __alloc_pages_nodemask+0x201/0x260
May 27 05:47:14 clouddb1001 kernel: [18638293.176648] [<ffffffffb93de751>] ? alloc_pages_current+0x91/0x140
May 27 05:47:14 clouddb1001 kernel: [18638293.176649] [<ffffffffb9383684>] ? pagecache_get_page+0xb4/0x2a0
May 27 05:47:14 clouddb1001 kernel: [18638293.176650] [<ffffffffb938388c>] ? grab_cache_page_write_begin+0x1c/0x40
May 27 05:47:14 clouddb1001 kernel: [18638293.176653] [<ffffffffb94706f5>] ? iomap_write_begin+0x55/0x120
May 27 05:47:14 clouddb1001 kernel: [18638293.176654] [<ffffffffb9470988>] ? iomap_write_actor+0xa8/0x180
May 27 05:47:14 clouddb1001 kernel: [18638293.176657] [<ffffffffb94708e0>] ? iomap_write_end+0x70/0x70
May 27 05:47:14 clouddb1001 kernel: [18638293.176659] [<ffffffffb9470fed>] ? iomap_apply+0xbd/0x140
May 27 05:47:14 clouddb1001 kernel: [18638293.176660] [<ffffffffb94708e0>] ? iomap_write_end+0x70/0x70
May 27 05:47:14 clouddb1001 kernel: [18638293.176661] [<ffffffffb94710d7>] ? iomap_file_buffered_write+0x67/0x90
May 27 05:47:14 clouddb1001 kernel: [18638293.176662] [<ffffffffb94708e0>] ? iomap_write_end+0x70/0x70
May 27 05:47:14 clouddb1001 kernel: [18638293.176716] [<ffffffffc060a4a4>] ? xfs_file_buffered_aio_write+0xe4/0x2f0 [xfs]
May 27 05:47:14 clouddb1001 kernel: [18638293.176736] [<ffffffffc06179b7>] ? xfs_iunlock+0xe7/0x100 [xfs]
May 27 05:47:14 clouddb1001 kernel: [18638293.176754] [<ffffffffc060a7bf>] ? xfs_file_write_iter+0x10f/0x160 [xfs]
May 27 05:47:14 clouddb1001 kernel: [18638293.176757] [<ffffffffb940ac30>] ? new_sync_write+0xe0/0x130
May 27 05:47:14 clouddb1001 kernel: [18638293.176758] [<ffffffffb940b3b0>] ? vfs_write+0xb0/0x190
May 27 05:47:14 clouddb1001 kernel: [18638293.176760] [<ffffffffb940c7f2>] ? SyS_write+0x52/0xc0
May 27 05:47:14 clouddb1001 kernel: [18638293.176762] [<ffffffffb9203b7d>] ? do_syscall_64+0x8d/0xf0
May 27 05:47:14 clouddb1001 kernel: [18638293.176766] [<ffffffffb981924e>] ? entry_SYSCALL_64_after_swapgs+0x58/0xc6
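The trace shows mysqld stalling for over ten minutes waiting for free pages while flushing to XFS, which points at the host running out of memory rather than at a MariaDB bug. As a rough follow-up sketch (generic checks, not necessarily the ones that were run on clouddb1001), the InnoDB memory allocation can be compared against what the VM actually has:

-- Illustrative checks only; not necessarily what was run on clouddb1001.
SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';   -- RAM reserved for the InnoDB buffer pool
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages%';    -- how much of that pool is actually in use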
Once the master was back up, @Marostegui fixed some inconsistencies in several of its tables, but replication on the replica (clouddb1002) is now reporting inconsistencies in some tables as well:
May 27 05:57:30 clouddb1002 mysqld[14210]: 2020-05-27 5:57:30 140124118230784 [ERROR] Error reading packet from server: Lost connection to MySQL server during query (server_errno=2013)
May 27 05:57:30 clouddb1002 mysqld[14210]: 2020-05-27 5:57:30 140124118230784 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'log.209037' at position 83547720; GTID position '0-2886731673-23684796837'
May 27 05:57:30 clouddb1002 mysqld[14210]: 2020-05-27 5:57:30 140124118230784 [ERROR] Slave I/O: error reconnecting to master 'repl@172.16.7.153:3306' - retry-time: 60 maximum-retries: 86400 message: Can't connect to MySQL server on '172.16.7.153' (111 "Connection refused"), Internal MariaDB error code: 2003
May 27 06:10:00 clouddb1002 mysqld[14210]: 2020-05-27 6:10:00 140124118030080 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table s53685__editgroups.store_edit; Duplicate entry '1230218293' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log log.209041, end_log_pos 3041397, Gtid 0-2886731673-23684799152, Internal MariaDB error code: 1062
May 27 06:10:00 clouddb1002 mysqld[14210]: 2020-05-27 6:10:00 140124118030080 [Warning] Slave: Duplicate entry '1230218293' for key 'PRIMARY' Error_code: 1062
May 27 06:10:00 clouddb1002 mysqld[14210]: 2020-05-27 6:10:00 140124118030080 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'log.209041' position 3040679; GTID position '0-2886731673-23684799151'
May 27 06:10:00 clouddb1002 mysqld[14210]: 2020-05-27 6:10:00 140124118030080 [Note] Slave SQL thread exiting, replication stopped in log 'log.209041' at position 3040679; GTID position '0-2886731673-23684799151'
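The slave SQL thread aborted on error 1062 (HA_ERR_FOUND_DUPP_KEY): the Write_rows event for s53685__editgroups.store_edit tried to insert a primary key value the replica already holds, presumably because the replica had applied the original insert before the crash and the master re-used the same value after restarting. As a minimal sketch of how such an event can be handled once the conflicting row is confirmed identical on both sides (illustrative only; the actual repair on clouddb1002 may have been done differently):

-- Minimal sketch, assuming the duplicate row has been verified to match on
-- master and replica; not necessarily the procedure that was actually used.
SHOW SLAVE STATUS\G                       -- inspect Last_SQL_Error and the stopped GTID position
STOP SLAVE;
SET GLOBAL sql_slave_skip_counter = 1;    -- skip the single duplicate Write_rows event
START SLAVE;
SHOW SLAVE STATUS\G                       -- confirm Slave_SQL_Running: Yes and that the lag drains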
Most of the data consistency issues on both servers have already been fixed by @Marostegui.
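For any remaining suspect tables, a rough way to spot-check whether master and replica still diverge is to checksum the table on both hosts and compare the results. This is a coarse check only; a row-level comparison tool such as pt-table-checksum gives a more reliable answer. Using the table from the error above as an example:

-- Run on both clouddb1001 and clouddb1002 and compare the output (illustrative;
-- note that this scans the whole table).
CHECKSUM TABLE s53685__editgroups.store_edit;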