
toolsdb: full disk on clouddb1001 broke clouddb1002 (secondary) replication
Closed, Resolved · Public

Description

ToolsDB replication to clouddb1002.clouddb-services.eqiad1.wikimedia.cloud broke on 2022-02-15:

MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
[...]
                Last_IO_Errno: 1236
                Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'binlog truncated in the middle of event; consider out of disk space on master; the first event 'log.275246' at 15499031, the last event read from 'log.275246' at 15499142, the last byte read from 'log.275246' at 15499161.'
[...]

and clouddb1001 indeed somehow went out of disk space:

image.png (670×1 px, 82 KB)


Event Timeline

taavi triaged this task as Unbreak Now! priority. Feb 17 2022, 9:24 AM
taavi created this task.
dcaro changed the task status from Open to In Progress. Feb 17 2022, 9:32 AM
dcaro claimed this task.
dcaro added a project: User-dcaro.
dcaro moved this task from To refine to Doing on the User-dcaro board.

Yep, from the logs:

Feb 15 10:28:30 clouddb1001 mysqld[2018]: 2022-02-15 10:28:30 140283553928960 [Warning] mysqld: Disk is full writing '/srv/labsdb/tmp/#sql_7e2_9.MAD' (Errcode: 28 "No space left on device"). Waiting for someone to free space... (Expect up to 60 secs delay for server to continue after freeing disk space)
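A disk-full warning like the one above can be confirmed and narrowed down with standard tools; a minimal sketch (the /srv/labsdb path is taken from the log line above, adjust as needed):

```shell
# Free space per filesystem; the log above points at the filesystem holding /srv/labsdb filling up
df -h
# Largest entries under the data dir, to see what ate the space
# (path assumed from the log line; errors suppressed in case it differs on your host)
du -sh /srv/labsdb/* 2>/dev/null | sort -rh | head -n 10
```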

According to the graphs, something happens at around 03:39 machine time that triggers the disk usage spike, and the server crashes at 10:28:30. From the journal log I don't see anything suspicious around the start time.

I see a spike of writes at roughly the same time:

image.png (772×1 px, 158 KB)

I'm not sure we can tell which tool is causing that. We have the processlist exporter enabled for prometheus_mysqld_exporter, but I don't see the metrics in Prometheus. Maybe the exporter user is missing some grants:

MariaDB [(none)]> show grants for 'prometheus'@'localhost'\G
*************************** 1. row ***************************
Grants for prometheus@localhost: GRANT PROCESS, REPLICATION CLIENT ON *.* TO 'prometheus'@'localhost' IDENTIFIED VIA unix_socket WITH MAX_USER_CONNECTIONS 5
*************************** 2. row ***************************
Grants for prometheus@localhost: GRANT SELECT ON `heartbeat`.`heartbeat` TO 'prometheus'@'localhost'
2 rows in set (0.00 sec)

but the upstream exporter docs recommend granting SELECT globally.
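A follow-up fix, then, would be to add the global grant the docs call for; a sketch, assuming the upstream recommendation is what we want here:

```
MariaDB [(none)]> GRANT SELECT ON *.* TO 'prometheus'@'localhost';
```

Note that GRANT takes effect immediately; no FLUSH PRIVILEGES is needed when privileges are changed via GRANT rather than by editing the grant tables directly.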

aborrero renamed this task from clouddb1002 (toolsdb secondary) replication broken to toolsdb: full disk on cloudddb1001 broke clouddb1002 (secondary) replication. Feb 17 2022, 11:16 AM

For now I have skipped the broken binlog by jumping to the next working one:

# found the next binlog/position to start from
# was stuck at log.275246 event 15499142
# next event is log.275247 position 4 (binlogs start at 4)
# You can use mysqlbinlog to verify
# on the slave
sudo mysql
mysql> STOP SLAVE;
# verify that it stopped
mysql> SHOW SLAVE STATUS\G
# check that Slave_IO_Running and Slave_SQL_Running both say No
mysql> CHANGE MASTER TO MASTER_LOG_FILE='log.275247', MASTER_LOG_POS=4;
mysql> START SLAVE;
# Then see that the slave picks up
mysql> SHOW SLAVE STATUS\G
# check that Slave_IO_Running and Slave_SQL_Running both say Yes
# and that Seconds_Behind_Master is going down bit by bit
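The mysqlbinlog verification mentioned in the comments above would look roughly like this; a sketch only, assuming the binlogs live in the master's datadir under /srv/labsdb (adjust the path to wherever log.275246 actually sits):

```
# On the master: verify the truncation point of the broken binlog
# (mysqlbinlog exits non-zero when the log is cut off mid-event)
mysqlbinlog --start-position=15499031 /srv/labsdb/log.275246 > /dev/null
# Confirm the next binlog decodes cleanly from position 4
# (every binlog's first event starts at 4, right after the 4-byte magic header)
mysqlbinlog --start-position=4 /srv/labsdb/log.275247 | head
```

Skipping to the next file this way drops whatever events were lost in the truncated tail of log.275246, so the replica may silently diverge for rows touched by those events; that trade-off is accepted here to get replication flowing again.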
aborrero lowered the priority of this task from Unbreak Now! to High. Feb 17 2022, 2:27 PM

The replica is still catching up; I'll leave it for the time being and close the task once it's back in sync with the master.

The replication is caught up \o/

Will close this task, and follow up on the subtasks.

Andrew renamed this task from toolsdb: full disk on cloudddb1001 broke clouddb1002 (secondary) replication to toolsdb: full disk on clouddb1001 broke clouddb1002 (secondary) replication. Feb 23 2022, 3:38 PM