[toolsdb] Replication stopped because of invalid event
Closed, Resolved · Public

Assigned To

Authored By

	fnegri
	Nov 16 2023, 7:40 PM

Description

This happened twice in the last few weeks:

Oct 28 11:31:41 tools-db-2 mysqld[648831]: 2023-10-28 11:31:41 11 [ERROR] Read invalid event from master: 'Found invalid event in binary log', master could be corrupt but a more likely cause of this is a bug
Oct 28 11:31:41 tools-db-2 mysqld[648831]: 2023-10-28 11:31:41 11 [ERROR] Slave I/O: Relay log write failure: could not queue event from master, Internal MariaDB error code: 1595
Oct 28 11:31:41 tools-db-2 mysqld[648831]: 2023-10-28 11:31:41 11 [Note] Slave I/O thread exiting, read up to log 'log.043518', position 4; GTID position 0-2886731673-33522724637,2886731673-2886731673-4887243158,2886731301-2886731301-2985635060
Oct 28 11:31:41 tools-db-2 mysqld[648831]: 2023-10-28 11:31:41 11 [Note] master was tools-db-1.tools.eqiad1.wikimedia.cloud:3306

Nov 16 09:44:20 tools-db-2 mysqld[832013]: 2023-11-16  9:44:20 11 [ERROR] Read invalid event from master: 'Found invalid event in binary log', master could be corrupt but a more likely cause of this is a bug
Nov 16 09:44:20 tools-db-2 mysqld[832013]: 2023-11-16  9:44:20 11 [ERROR] Slave I/O: Relay log write failure: could not queue event from master, Internal MariaDB error code: 1595
Nov 16 09:44:20 tools-db-2 mysqld[832013]: 2023-11-16  9:44:20 11 [Note] Slave I/O thread exiting, read up to log 'log.046905', position 4; GTID position 0-2886731673-33522724637,2886731673-2886731673-4887243158,2886731301-2886731301-3282445549
Nov 16 09:44:20 tools-db-2 mysqld[832013]: 2023-11-16  9:44:20 11 [Note] master was tools-db-1.tools.eqiad1.wikimedia.cloud:3306
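To compare the two occurrences, the binlog file, position, and GTID set can be pulled out of the "Slave I/O thread exiting" note. The following is only a rough sketch: the function name `parse_io_thread_exit` and the regular expression are illustrative assumptions written against the log lines quoted above, not part of any existing tooling.

```python
import re

# Pattern written against the "Slave I/O thread exiting" lines quoted in this task.
# Assumption: other MariaDB versions may word this note differently.
EXIT_RE = re.compile(
    r"Slave I/O thread exiting, read up to log '(?P<log>[^']+)', "
    r"position (?P<pos>\d+); GTID position (?P<gtid>\S+)"
)


def parse_io_thread_exit(line):
    """Return the binlog file, position and GTID list from one log line, or None."""
    match = EXIT_RE.search(line)
    if match is None:
        return None
    return {
        "log": match.group("log"),
        "position": int(match.group("pos")),
        # The GTID position is a comma-separated list of domain-server-sequence triples.
        "gtid": match.group("gtid").split(","),
    }
```

Running this over both excerpts shows that only the last GTID entry (domain 2886731301) differs between the October and November stops, and that in both cases the I/O thread stopped at position 4 of a new binlog file.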

I think the first occurrence was linked to T349695: [toolsdb] MariaDB process is killed by OOM killer (October 2023), but maybe it isn't: the times do not coincide with an OOM crash of the primary.

In both cases, it was enough to run START SLAVE; to resume the replication.
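Before running START SLAVE; it may help to confirm that the replica stopped for this specific reason. Below is a minimal sketch, assuming the row returned by SHOW SLAVE STATUS has already been fetched into a dict by some client; the helper name `is_invalid_event_stop` is hypothetical, and the assumption that `Last_IO_Errno` carries the same 1595 code seen in the logs has not been verified on ToolsDB.

```python
# Error code reported in the "Relay log write failure" log lines quoted in this task.
RELAY_LOG_WRITE_FAILURE = 1595


def is_invalid_event_stop(status):
    """Return True if a SHOW SLAVE STATUS row looks like the failure in this task.

    `status` is a dict keyed by SHOW SLAVE STATUS column names. Values may be
    strings or integers depending on the client library (assumption).
    """
    io_running = str(status.get("Slave_IO_Running", "")).lower()
    try:
        errno = int(status.get("Last_IO_Errno") or 0)
    except (TypeError, ValueError):
        errno = 0
    # The I/O thread must be stopped and the last I/O error must be 1595.
    return io_running == "no" and errno == RELAY_LOG_WRITE_FAILURE
```

If the check returns False, the replica stopped for some other reason, and a plain restart may not be the right response.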

Related Objects

Mentioned In: T357624: [toolsdb] Replica is frequently lagging behind the primary
Mentioned Here: T349695: [toolsdb] MariaDB process is killed by OOM killer (October 2023)

Event Timeline

fnegri created this task. Nov 16 2023, 7:40 PM

Restricted Application added a subscriber: Aklapper. Nov 16 2023, 7:40 PM

I have updated the runbook at https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication#If_the_replication_is_NOT_running with instructions on how to restart the replication.

I will resolve the task for now. I opened it just to track the issue, and we can reopen it if it happens again.

fnegri moved this task from Backlog to Done on the cloud-services-team (FY2023/2024-Q1-Q2) board. Nov 16 2023, 7:52 PM

fnegri mentioned this in T357624: [toolsdb] Replica is frequently lagging behind the primary. Feb 15 2024, 12:28 PM
