
toolsdb: full disk on clouddb1001 broke clouddb1002 (secondary) replication
Closed, Resolved · Public

Description

ToolsDB replication to clouddb1002.clouddb-services.eqiad1.wikimedia.cloud broke on 2022-02-15:

MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
[...]
                Last_IO_Errno: 1236
                Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'binlog truncated in the middle of event; consider out of disk space on master; the first event 'log.275246' at 15499031, the last event read from 'log.275246' at 15499142, the last byte read from 'log.275246' at 15499161.'
[...]

and clouddb1001 indeed somehow went out of disk space:

image.png (670×1 px, 82 KB)


Event Timeline

taavi triaged this task as Unbreak Now! priority. Feb 17 2022, 9:24 AM
taavi created this task.
dcaro changed the task status from Open to In Progress. Feb 17 2022, 9:32 AM
dcaro claimed this task.
dcaro added a project: User-dcaro.
dcaro moved this task from To refine to Doing on the User-dcaro board.

Yep, from the logs:

Feb 15 10:28:30 clouddb1001 mysqld[2018]: 2022-02-15 10:28:30 140283553928960 [Warning] mysqld: Disk is full writing '/srv/labsdb/tmp/#sql_7e2_9.MAD' (Errcode: 28 "No space left on device"). Waiting for someone to free space... (Expect up to 60 secs delay for server to continue after freeing disk space)
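A disk-full warning like the one above can be confirmed and narrowed down with standard tools; a minimal sketch (the /srv/labsdb path is taken from the log line above, adjust as needed):

```shell
# Free space per filesystem; the log above points at the filesystem holding /srv/labsdb filling up
df -h
# Largest entries under the data dir, to see what ate the space
# (path assumed from the log line; errors suppressed in case it differs on your host)
du -sh /srv/labsdb/* 2>/dev/null | sort -rh | head -n 10
```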

According to the graphs, something happens at around 03:39 machine time that triggers the disk usage spike, and the server crashes at 10:28:30. From the journal log I don't see anything suspicious around the start time.

I see a spike of writes at roughly the same time:

image.png (772×1 px, 158 KB)

I'm not sure we can tell which tool is causing that. We have the processlist exporter enabled for prometheus_mysqld_exporter, but I don't see the metrics in Prometheus. Maybe the exporter user is missing some grants:

MariaDB [(none)]> show grants for 'prometheus'@'localhost'\G
*************************** 1. row ***************************
Grants for prometheus@localhost: GRANT PROCESS, REPLICATION CLIENT ON *.* TO 'prometheus'@'localhost' IDENTIFIED VIA unix_socket WITH MAX_USER_CONNECTIONS 5
*************************** 2. row ***************************
Grants for prometheus@localhost: GRANT SELECT ON `heartbeat`.`heartbeat` TO 'prometheus'@'localhost'
2 rows in set (0.00 sec)

but the upstream exporter docs recommend granting SELECT globally.
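A follow-up fix, then, would be to add the global grant the docs call for; a sketch, assuming the upstream recommendation is what we want here:

```
MariaDB [(none)]> GRANT SELECT ON *.* TO 'prometheus'@'localhost';
```

Note that GRANT takes effect immediately; no FLUSH PRIVILEGES is needed when privileges are changed via GRANT rather than by editing the grant tables directly.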

aborrero renamed this task from clouddb1002 (toolsdb secondary) replication broken to toolsdb: full disk on cloudddb1001 broke clouddb1002 (secondary) replication. Feb 17 2022, 11:16 AM

For now I have skipped the broken binlog by jumping to the next working one:

# found the next binlog/position to start from
# was stuck at log.275246 event 15499142
# next event is log.275247 position 4 (binlogs start at 4)
# You can use mysqlbinlog to verify
# on the slave
sudo mysql
mysql> STOP SLAVE;
# verify that it stopped
mysql> SHOW SLAVE STATUS\G
# check that Slave_IO_Running and Slave_SQL_Running both say No
mysql> CHANGE MASTER TO MASTER_LOG_FILE='log.275247', MASTER_LOG_POS=4;
mysql> START SLAVE;
# Then see that the slave picks up
mysql> SHOW SLAVE STATUS\G
# check that Slave_IO_Running and Slave_SQL_Running both say Yes
# and that Seconds_Behind_Master is going down bit by bit
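The mysqlbinlog verification mentioned in the comments above would look roughly like this; a sketch only, assuming the binlogs live in the master's datadir under /srv/labsdb (adjust the path to wherever log.275246 actually sits):

```
# On the master: verify the truncation point of the broken binlog
# (mysqlbinlog exits non-zero when the log is cut off mid-event)
mysqlbinlog --start-position=15499031 /srv/labsdb/log.275246 > /dev/null
# Confirm the next binlog decodes cleanly from position 4
# (every binlog's first event starts at 4, right after the 4-byte magic header)
mysqlbinlog --start-position=4 /srv/labsdb/log.275247 | head
```

Skipping to the next file this way drops whatever events were lost in the truncated tail of log.275246, so the replica may silently diverge for rows touched by those events; that trade-off is accepted here to get replication flowing again.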
aborrero lowered the priority of this task from Unbreak Now! to High. Feb 17 2022, 2:27 PM

The replica is still catching up; I'll leave it for the time being and close the task once it's back in sync with the master.

The replication is caught up \o/

Will close this task, and follow up on the subtasks.

Andrew renamed this task from toolsdb: full disk on cloudddb1001 broke clouddb1002 (secondary) replication to toolsdb: full disk on clouddb1001 broke clouddb1002 (secondary) replication. Feb 23 2022, 3:38 PM