Page MenuHomePhabricator

[toolsdb] Replication stopped because of invalid event
Closed, ResolvedPublic

Description

This happened twice in the last few weeks:

Oct 28 11:31:41 tools-db-2 mysqld[648831]: 2023-10-28 11:31:41 11 [ERROR] Read invalid event from master: 'Found invalid event in binary log', master could be corrupt but a more likely cause of this is a bug
Oct 28 11:31:41 tools-db-2 mysqld[648831]: 2023-10-28 11:31:41 11 [ERROR] Slave I/O: Relay log write failure: could not queue event from master, Internal MariaDB error code: 1595
Oct 28 11:31:41 tools-db-2 mysqld[648831]: 2023-10-28 11:31:41 11 [Note] Slave I/O thread exiting, read up to log 'log.043518', position 4; GTID position 0-2886731673-33522724637,2886731673-2886731673-4887243158,2886731301-2886731301-2985635060
Oct 28 11:31:41 tools-db-2 mysqld[648831]: 2023-10-28 11:31:41 11 [Note] master was tools-db-1.tools.eqiad1.wikimedia.cloud:3306
Nov 16 09:44:20 tools-db-2 mysqld[832013]: 2023-11-16  9:44:20 11 [ERROR] Read invalid event from master: 'Found invalid event in binary log', master could be corrupt but a more likely cause of this is a bug
Nov 16 09:44:20 tools-db-2 mysqld[832013]: 2023-11-16  9:44:20 11 [ERROR] Slave I/O: Relay log write failure: could not queue event from master, Internal MariaDB error code: 1595
Nov 16 09:44:20 tools-db-2 mysqld[832013]: 2023-11-16  9:44:20 11 [Note] Slave I/O thread exiting, read up to log 'log.046905', position 4; GTID position 0-2886731673-33522724637,2886731673-2886731673-4887243158,2886731301-2886731301-3282445549
Nov 16 09:44:20 tools-db-2 mysqld[832013]: 2023-11-16  9:44:20 11 [Note] master was tools-db-1.tools.eqiad1.wikimedia.cloud:3306

The first time I think it was linked to T349695: [toolsdb] MariaDB process is killed by OOM killer (October 2023) but maybe it isn't, the times do not coincide with a OOM crash of the primary.

In both cases, it was enough to run START SLAVE; to resume the replication.

Event Timeline

I have updated the runbook at https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication#If_the_replication_is_NOT_running with instrutions on how to restart the replication.

I will resolve the task for now, I opened it just for tracking the issue and we can reopen if it happens again.