
[toolsdb] Replica is frequently lagging behind the primary
Open, Medium, Public

Description

The alert ToolsDBReplicationLagIsTooHigh triggers about once a month, and usually takes a few days to recover.

See the sub-tasks below for all occurrences. Each sub-task notes the database that caused the issue; so far they were:

  • s51698__yetkin (4 times)
  • s54113__spacemedia (twice)
  • s51434__mixnmatch_p (once)
  • s55593__PAGES (once)
  • s55462__imagehashpublic_p (once)
  • s53685__editgroups (once)

This is generally caused by big deletes performed in a single transaction. With row-based replication (RBR), such deletes translate into thousands of row events that the replica must apply.

We should investigate what we can do to make this less frequent. One option is to add indexes to columns affected by the deletes. Another is to migrate the databases causing this issue to dedicated Trove instances (related: T291782: Migrate largest ToolsDB users to Trove).
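As a sketch of the "add indexes" option: deleting in small batches, with an index on the filter column, keeps each binlog event group small so the replica can keep up. Table and column names below are purely illustrative, not actual ToolsDB schemas.

```sql
-- Hypothetical table: an index on the column used in the DELETE filter
-- avoids a full table scan on every batch.
CREATE INDEX idx_created ON events (created_at);

-- Delete in small batches instead of one huge transaction;
-- repeat until no rows are affected.
DELETE FROM events
WHERE created_at < '2024-01-01'
LIMIT 10000;
```

Each batch replicates as a small, fast-to-apply group of row events instead of one enormous transaction.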

Related Objects

Event Timeline

fnegri triaged this task as Medium priority. Feb 15 2024, 12:29 PM
fnegri updated the task description.

One thing we should probably check is how long the problematic queries take to complete on the primary host:

  • Do they take longer to complete on the replica because of RBR, or do they also take a very long time on the primary?
  • Are they being logged in the slow-query log?
  • Would setting a limit on query execution time on the primary help?
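The last two questions can be checked directly on the primary; these are standard MariaDB variables (the values shown are whatever the server reports, nothing is assumed here about ToolsDB's configuration):

```sql
-- Is the slow-query log enabled, and what is the threshold?
SHOW GLOBAL VARIABLES LIKE 'slow_query_log%';
SHOW GLOBAL VARIABLES LIKE 'long_query_time';

-- MariaDB's server-side query time limit (0 means no limit):
SHOW GLOBAL VARIABLES LIKE 'max_statement_time';
```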

> Do they take longer to complete in the replica because of RBR replication or do they take a very long time in the primary too?
> Are they getting logged in the slow-query log?
> Would setting a limit to query execution time in the primary help?

The latest query to cause replication lag is T368250: [toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2024-06-21, and it is not logged in the primary's slow-query log. We already limit query execution time (introduced in this patch), but that limit is not preventing this issue.

We are currently using binlog_format=ROW. Setting binlog_format=MIXED (the same setting used in production) is likely to help: in mixed mode, MariaDB should log the problematic queries as single statements rather than as thousands of row events. Docs: https://mariadb.com/kb/en/binary-log-formats/
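The change itself is a one-line setting on the primary (this is a sketch; rolling it out would also need the corresponding my.cnf change so it survives a restart):

```sql
-- Switch the primary to mixed binary logging. In MIXED mode MariaDB
-- logs statements by default and falls back to row events only for
-- statements it considers unsafe for statement-based replication.
SET GLOBAL binlog_format = 'MIXED';

-- Verify the new value:
SHOW GLOBAL VARIABLES LIKE 'binlog_format';
```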

We are currently using slave_parallel_mode=conservative. Setting slave_parallel_mode=optimistic (the same setting used in production) is likely to help, because it lets the replica apply more transactions in parallel, rolling back and retrying only when a conflict is detected. Docs: https://mariadb.com/kb/en/parallel-replication/
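For reference, switching the replica over looks like this; slave_parallel_mode can only be changed while the replication threads are stopped:

```sql
-- On the replica: stop replication, change the mode, restart.
STOP SLAVE;
SET GLOBAL slave_parallel_mode = 'optimistic';
START SLAVE;

-- Confirm:
SHOW GLOBAL VARIABLES LIKE 'slave_parallel_mode';
```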

fnegri updated the task description.

> We are currently using slave_parallel_mode=conservative. Setting slave_parallel_mode=optimistic (the same that is used in production) is likely to help because it should lead to more efficient CPU usage.

Since the upgrade to MariaDB 10.6 (T352206: [toolsdb] Upgrade to MariaDB 10.6), we are now using slave_parallel_mode=optimistic as it's the new default in MariaDB.

MariaDB [(none)]> SHOW GLOBAL VARIABLES LIKE "%parallel%";
+-------------------------------+------------+
| Variable_name                 | Value      |
+-------------------------------+------------+
| slave_domain_parallel_threads | 0          |
| slave_parallel_max_queued     | 131072     |
| slave_parallel_mode           | optimistic |
| slave_parallel_threads        | 12         |
| slave_parallel_workers        | 12         |
+-------------------------------+------------+