Page MenuHomePhabricator

[toolsdb] Replication stopped because of invalid event
Closed, ResolvedPublic

Description

This happened twice in the last few weeks:

Oct 28 11:31:41 tools-db-2 mysqld[648831]: 2023-10-28 11:31:41 11 [ERROR] Read invalid event from master: 'Found invalid event in binary log', master could be corrupt but a more likely cause of this is a bug
Oct 28 11:31:41 tools-db-2 mysqld[648831]: 2023-10-28 11:31:41 11 [ERROR] Slave I/O: Relay log write failure: could not queue event from master, Internal MariaDB error code: 1595
Oct 28 11:31:41 tools-db-2 mysqld[648831]: 2023-10-28 11:31:41 11 [Note] Slave I/O thread exiting, read up to log 'log.043518', position 4; GTID position 0-2886731673-33522724637,2886731673-2886731673-4887243158,2886731301-2886731301-2985635060
Oct 28 11:31:41 tools-db-2 mysqld[648831]: 2023-10-28 11:31:41 11 [Note] master was tools-db-1.tools.eqiad1.wikimedia.cloud:3306
Nov 16 09:44:20 tools-db-2 mysqld[832013]: 2023-11-16  9:44:20 11 [ERROR] Read invalid event from master: 'Found invalid event in binary log', master could be corrupt but a more likely cause of this is a bug
Nov 16 09:44:20 tools-db-2 mysqld[832013]: 2023-11-16  9:44:20 11 [ERROR] Slave I/O: Relay log write failure: could not queue event from master, Internal MariaDB error code: 1595
Nov 16 09:44:20 tools-db-2 mysqld[832013]: 2023-11-16  9:44:20 11 [Note] Slave I/O thread exiting, read up to log 'log.046905', position 4; GTID position 0-2886731673-33522724637,2886731673-2886731673-4887243158,2886731301-2886731301-3282445549
Nov 16 09:44:20 tools-db-2 mysqld[832013]: 2023-11-16  9:44:20 11 [Note] master was tools-db-1.tools.eqiad1.wikimedia.cloud:3306

The first time I think it was linked to T349695: [toolsdb] MariaDB process is killed by OOM killer (October 2023) but maybe it isn't, the times do not coincide with a OOM crash of the primary.

In both cases, it was enough to run START SLAVE; to resume the replication.

Event Timeline

I have updated the runbook at https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication#If_the_replication_is_NOT_running with instrutions on how to restart the replication.

I will resolve the task for now, I opened it just for tracking the issue and we can reopen if it happens again.

This happened again yesterday. Similar to the previous occurrences, START SLAVE; was enough to resume replication.

Jun 16 14:42:54 tools-db-3 mysqld[3868879]: 2024-06-16 14:42:54 11 [ERROR] Read invalid event from master: 'Found invalid event in binary log', master could be corrupt but a more likely cause of this is a bug
Jun 16 14:42:54 tools-db-3 mysqld[3868879]: 2024-06-16 14:42:54 11 [ERROR] Slave I/O: Relay log write failure: could not queue event from master, Internal MariaDB error code: 1595
Jun 16 14:42:54 tools-db-3 mysqld[3868879]: 2024-06-16 14:42:54 11 [Note] Slave I/O thread exiting, read up to log 'log.095923', position 4; GTID position 0-2886731673-33522724637,2886731673-2886731673-4887243158,2886731301-2886731301-6592285327
Jun 16 14:42:54 tools-db-3 mysqld[3868879]: 2024-06-16 14:42:54 11 [Note] master was tools-db-1.tools.eqiad1.wikimedia.cloud:3306

This happened again today. START REPLICA; was enough to resume replication.

Oct 16 09:50:45 tools-db-6 mysqld[936]: 2025-10-16  9:50:45 6381940 [ERROR] Read invalid event from master: 'Found invalid event in binary log', master could be corrupt but a more likely cause of this is a bug
Oct 16 09:50:45 tools-db-6 mysqld[936]: 2025-10-16  9:50:45 6381940 [ERROR] Slave I/O: Relay log write failure: could not queue event from master, Internal MariaDB error code: 1595
Oct 16 09:50:45 tools-db-6 mysqld[936]: 2025-10-16  9:50:45 6381940 [Note] Slave I/O thread exiting, read up to log 'log.089122', position 4; GTID position 2886729896-2886729896-6219950854, master tools-db-4.tools.eqiad1.wikimedia.cloud:3306

After START REPLICA;:

Oct 16 10:08:28 tools-db-6 mysqld[936]: 2025-10-16 10:08:28 31874903 [Note] Slave I/O thread: Start asynchronous replication to master 'repl@tools-db-4.tools.eqiad1.wikimedia.cloud:3306' in log 'log.089122' at position 4
Oct 16 10:08:28 tools-db-6 mysqld[936]: 2025-10-16 10:08:28 31874903 [Note] Slave I/O thread: connected to master 'repl@tools-db-4.tools.eqiad1.wikimedia.cloud:3306',replication starts at GTID position '2886729896-2886729896-6219950854'