Page MenuHomePhabricator

[toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2023-08-08
Closed, ResolvedPublic


The replica tools-db-2 is currently lagging 18 hours behind the primary tools-db-1 (Grafana chart). This is (correctly) triggering the alert ToolsToolsDBReplicationLagIsTooHigh.

SHOW SLAVE STATUS\g in the replica shows the replication is active, but it's taking hours to process a single transaction.

This happened before (T341891 and T338031). Both times the replica caught up after a couple of days, without any intervention. I will check if the database causing this issue is the same or a new one.

Event Timeline

fnegri changed the task status from Open to In Progress.Aug 8 2023, 3:14 PM
fnegri triaged this task as High priority.
fnegri moved this task from Backlog to ToolsDB on the Data-Services board.

The long transaction is on a different database than the two ones which caused this issue in the past. This time it's visited_pages_agg in the database s51698__yetkin.

fnegri@tools-db-1:/srv/labsdb/binlogs$ sudo mysqlbinlog --base64-output=decode-rows --verbose --start-position 62423652 log.027343 |less

/*!40019 SET @@session.max_insert_delayed_threads=0*/;
# at 62423652
#230807 20:15:00 server id 2886731301  end_log_pos 62423694 CRC32 0x8c1fcd20    GTID 2886731301-2886731301-1694631045
/*!100101 SET @@session.skip_parallel_replication=0*//*!*/;
/*!100001 SET @@session.gtid_domain_id=2886731301*//*!*/;
/*!100001 SET @@session.server_id=2886731301*//*!*/;
/*!100001 SET @@session.gtid_seq_no=1694631045*//*!*/;
# at 62423694
# at 62423808
#230807 20:15:00 server id 2886731301  end_log_pos 62423808 CRC32 0xc11ab647    Annotate_rows:
#Q> DELETE FROM visited_pages_agg WHERE vpa_year = '2023' AND vpa_month = '8' AND vpa_day = '6'
#230807 20:15:00 server id 2886731301  end_log_pos 62423888 CRC32 0x5ac1947e    Table_map: `s51698__yetkin`.`visited_pages_agg` mapped to number 131359729
# at 62423888


The lag is decreasing as expected, though the decrease rate is slower compared to the one we saw in T338031 and T341891:

Screenshot 2023-08-09 at 15.15.37.png (612×1 px, 160 KB)

I took the chance to update the runbook with more information.

The replica is back in sync, I'm not sure why it took longer than usual to catch up.

Screenshot 2023-08-10 at 16.58.24.png (622×1 px, 134 KB)

I also noticed there are still some small spikes after it reached zero, so it seems there is some kind of increased activity on the primary that's causing the replica to have small lag spikes, but they're very small so I wouldn't worry about it for now.

Screenshot 2023-08-10 at 17.00.59.png (624×1 px, 185 KB)

I checked the various metrics for the primary and I didn't see anything odd there.