[toolsdb] Replica is frequently lagging behind the primary
Open, Medium, Public

Description

The ToolsDBReplicationLagIsTooHigh alert fires about once a month, and the replica usually takes a few days to catch up:

  • T357979: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2024-02-20
  • T357264: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2024-02-12
  • T355411: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2024-01-19
  • T345450: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2023-09-01
  • T343819: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2023-08-08
  • T341891: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2023-07-13
  • T338031: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2023-06-02

Each task identifies the database that caused the issue. So far these have been:

  • s54113__spacemedia (twice)
  • s51434__mixnmatch_p (once)
  • s51698__yetkin (three times)
  • s55593__PAGES (once)

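This is generally caused by large deletes executed in a single transaction. With row-based replication (RBR), such a delete is replicated as individual row changes rather than as a single statement, so it translates into thousands of events for the replica to apply.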
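To illustrate the pattern, here is a minimal sketch of the same cleanup split into several small transactions instead of one big one. This is only an illustration of the problem shape, not something this task has decided on. It uses Python's built-in sqlite3 module as a stand-in so it can run anywhere; the `files` table, the `owner` column, and the helper names are hypothetical, and ToolsDB itself runs a different database engine, so the exact SQL would differ there.

```python
import sqlite3


def delete_in_batches(conn, owner, batch_size=1000):
    """Delete all rows for `owner`, committing after each batch.

    Returns the number of rows deleted in each batch.
    The table and column names are hypothetical.
    """
    deleted_per_batch = []
    while True:
        cur = conn.execute(
            "DELETE FROM files WHERE rowid IN ("
            "  SELECT rowid FROM files WHERE owner = ? LIMIT ?"
            ")",
            (owner, batch_size),
        )
        conn.commit()  # each batch becomes its own small transaction
        if cur.rowcount == 0:
            break
        deleted_per_batch.append(cur.rowcount)
    return deleted_per_batch


def build_demo_db(rows=2500):
    """Create an in-memory table with `rows` stale rows plus one row to keep."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE files (id INTEGER PRIMARY KEY, owner TEXT)")
    conn.executemany(
        "INSERT INTO files (owner) VALUES (?)",
        [("stale",)] * rows + [("keep",)],
    )
    conn.commit()
    return conn
```

For example, `delete_in_batches(build_demo_db(2500), "stale", batch_size=1000)` removes the stale rows in three transactions. The replica would still have to apply the same number of row changes, but each transaction is small, so other replicated writes are not stuck behind one huge one.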

We should investigate how to make this less frequent. One option is to add indexes on the columns used by the deletes. Another is to migrate the databases causing the issue to dedicated Trove instances (related: T291782: Migrate largest ToolsDB users to Trove).
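As a rough way to see what the index option changes, the sketch below compares the query plan of a delete before and after adding an index on the filtered column. It again uses Python's sqlite3 module as a runnable stand-in; the `files` table, the `owner` column, and the index name are hypothetical. ToolsDB's query planner and the replica's row-apply path will behave differently, so treat this as an illustration of the idea rather than a measurement.

```python
import sqlite3


def delete_plan(conn, sql, params=()):
    """Return the planner's human-readable detail lines for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]  # the fourth column holds the detail text


def plans_before_and_after_index():
    """Show how an index on the filtered column changes the delete plan."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE files (id INTEGER PRIMARY KEY, owner TEXT, path TEXT)"
    )
    sql = "DELETE FROM files WHERE owner = ?"  # hypothetical cleanup query

    before = delete_plan(conn, sql, ("stale",))  # no index: full table scan
    conn.execute("CREATE INDEX idx_files_owner ON files (owner)")
    after = delete_plan(conn, sql, ("stale",))  # index lookup on owner

    return before, after
```

Calling `plans_before_and_after_index()` returns two lists of plan lines: the first describes a scan of the whole table, the second a search that uses `idx_files_owner`. The exact wording of the lines depends on the SQLite version.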

Event Timeline

fnegri triaged this task as Medium priority. Feb 15 2024, 12:29 PM
fnegri updated the task description.

One thing we should probably check is how long the problematic queries take to complete on the primary host:

  • Do they take longer to complete on the replica because of RBR, or do they take a very long time on the primary too?
  • Are they getting logged in the slow-query log?
  • Would setting a limit on query execution time on the primary help?