
[toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2023-09-01
Closed, Resolved · Public

Description

The replica tools-db-2 is currently lagging 2.7 hours behind the primary tools-db-1 (Grafana chart). This is (correctly) triggering the alert ToolsToolsDBReplicationLagIsTooHigh.

SHOW SLAVE STATUS\G in the replica shows that replication is active, but it's taking hours to process a single transaction.
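
For reference, the relevant fields in that output look roughly like this (a minimal sketch; the field names are from the standard MariaDB SHOW SLAVE STATUS output, and the values shown are illustrative rather than copied from tools-db-2):

SHOW SLAVE STATUS\G
    Slave_IO_Running: Yes          -- IO thread keeps reading events from the primary
    Slave_SQL_Running: Yes         -- SQL thread is running, just applying events very slowly
    Seconds_Behind_Master: 9720    -- ~2.7 hours of lag
    Exec_Master_Log_Pos: 15196207  -- stays put while the long transaction is being applied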

This happened before (T341891, T338031, T343819). The replica usually catches up after a couple of days, without any intervention.

I followed the runbook and found the database, the table, and the query causing the issue. The table is s54113__spacemedia.dvids_video_files and the query is:

UPDATE dvids_video_files SET dvids_video_repo_id = 'video' WHERE dvids_video_repo_id = '5'

I have acked the alert until Monday, hoping that this will resolve by itself. On Monday I will check if it's possible to add an index to that table to prevent this from happening in the future.

Event Timeline

fnegri changed the task status from Open to In Progress.Sep 1 2023, 4:41 PM

There was indeed a missing index on the column dvids_video_files.dvids_video_repo_id, but the whole table dvids_video_files has since been deleted in the primary.
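
For reference, the missing index would have been something along these lines (hypothetical statement, never applied since the table had already been dropped; the index name is made up):

-- Hypothetical: an index that would let the UPDATE locate rows by dvids_video_repo_id
-- without a full table scan for every row event
ALTER TABLE s54113__spacemedia.dvids_video_files
    ADD INDEX idx_dvids_video_repo_id (dvids_video_repo_id);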

The replica is still catching up: Exec_Master_Log_Pos hasn't moved and is still at 15196207. I will leave it running for a couple more days, as I expect it will eventually complete that transaction and catch up with the primary.

Looking at the replica CPU usage, I noticed it's at 100% but only using 1 out of 16 CPU cores. We probably want to enable parallel replication, adding the following option to my.cnf:

slave_parallel_threads = 12 # Let's keep 4 cores for non-replication activities
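
For the record, the same setting can also be applied at runtime without a full restart, as long as the replication SQL thread is stopped first; roughly:

-- Sketch: applying the setting dynamically instead of via my.cnf + restart
STOP SLAVE SQL_THREAD;
SET GLOBAL slave_parallel_threads = 12;
START SLAVE SQL_THREAD;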

Change 954722 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] [toolsdb] Enable parallel replication

https://gerrit.wikimedia.org/r/954722

Mentioned in SAL (#wikimedia-cloud) [2023-09-05T11:00:48Z] <dhinus> restarting mariadb on toolsdb-2 (replica) to test slave_parallel_threads (T345450)

I applied the patch manually to tools-db-2 and restarted MariaDB. Both the load and the disk activity increased significantly, but htop shows only 3 CPU cores with a visible load, not 12. It could be related to the fact that slave_parallel_mode defaults to conservative, but I would still expect more parallelism to be possible, given how many independent databases we have.

Change 954722 merged by FNegri:

[operations/puppet@production] [toolsdb] Enable parallel replication

https://gerrit.wikimedia.org/r/954722

SHOW PROCESSLIST; shows that there are now 12 replication threads (as expected) but most of them are idle:

MariaDB [(none)]> SHOW PROCESSLIST;
+-----+-----------------+-----------+------+--------------+-------+-----------------------------------------------+--------------------------------------------------------------------------------------------+----------+
| Id  | User            | Host      | db   | Command      | Time  | State                                         | Info                                                                                       | Progress |
+-----+-----------------+-----------+------+--------------+-------+-----------------------------------------------+--------------------------------------------------------------------------------------------+----------+
|   3 | system user     |           | NULL | Daemon       |  NULL | InnoDB purge worker                           | NULL                                                                                       |    0.000 |
|   1 | system user     |           | NULL | Daemon       |  NULL | InnoDB purge coordinator                      | NULL                                                                                       |    0.000 |
|   2 | system user     |           | NULL | Daemon       |  NULL | InnoDB purge worker                           | NULL                                                                                       |    0.000 |
|   4 | system user     |           | NULL | Daemon       |  NULL | InnoDB purge worker                           | NULL                                                                                       |    0.000 |
|   5 | system user     |           | NULL | Daemon       |  NULL | InnoDB shutdown handler                       | NULL                                                                                       |    0.000 |
|   7 | event_scheduler | localhost | NULL | Daemon       | 10356 | Waiting on empty queue                        | NULL                                                                                       |    0.000 |
|  11 | system user     |           | NULL | Slave_IO     | 10347 | Waiting for master to send event              | NULL                                                                                       |    0.000 |
|  13 | system user     |           | NULL | Slave_worker | 10347 | Waiting for work from SQL thread              | NULL                                                                                       |    0.000 |
|  15 | system user     |           | NULL | Slave_worker | 10347 | Waiting for work from SQL thread              | NULL                                                                                       |    0.000 |
|  14 | system user     |           | NULL | Slave_worker | 10347 | Waiting for work from SQL thread              | NULL                                                                                       |    0.000 |
|  16 | system user     |           | NULL | Slave_worker | 10347 | Waiting for work from SQL thread              | NULL                                                                                       |    0.000 |
|  17 | system user     |           | NULL | Slave_worker | 10347 | Waiting for work from SQL thread              | NULL                                                                                       |    0.000 |
|  19 | system user     |           | NULL | Slave_worker | 10347 | Waiting for work from SQL thread              | NULL                                                                                       |    0.000 |
|  18 | system user     |           | NULL | Slave_worker | 10347 | Waiting for work from SQL thread              | NULL                                                                                       |    0.000 |
|  20 | system user     |           | NULL | Slave_worker | 10347 | Waiting for work from SQL thread              | NULL                                                                                       |    0.000 |
|  21 | system user     |           | NULL | Slave_worker | 10347 | Waiting for work from SQL thread              | NULL                                                                                       |    0.000 |
|  22 | system user     |           | NULL | Slave_worker | 10347 | Waiting for work from SQL thread              | NULL                                                                                       |    0.000 |
|  23 | system user     |           | NULL | Slave_worker | 10347 | Waiting for work from SQL thread              | NULL                                                                                       |    0.000 |
|  24 | system user     |           | NULL | Slave_worker |     2 | Update_rows_log_event::find_row(-1)           | UPDATE dvids_video_files SET dvids_video_repo_id = 'video' WHERE dvids_video_repo_id = '5' |    0.000 |
|  12 | system user     |           | NULL | Slave_SQL    |  4643 | Waiting for room in worker thread event queue | NULL                                                                                       |    0.000 |
| 376 | root            | localhost | NULL | Query        |     0 | Init                                          | SHOW PROCESSLIST                                                                           |    0.000 |
+-----+-----------------+-----------+------+--------------+-------+-----------------------------------------------+--------------------------------------------------------------------------------------------+----------+
21 rows in set (0.001 sec)

I tried tweaking slave_parallel_max_queued as recommended here, but it had no impact. I also tried changing slave_parallel_mode from conservative to optimistic, but that had no visible effect either.
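
For the record, those tweaks were along these lines (a sketch; the values are illustrative, not the exact ones used, and the SQL thread is stopped first because slave_parallel_mode cannot be changed while it is running):

-- Sketch of the tweaks (illustrative values)
STOP SLAVE SQL_THREAD;
SET GLOBAL slave_parallel_max_queued = 1048576;  -- default is 131072 bytes per worker queue
SET GLOBAL slave_parallel_mode = 'optimistic';   -- was 'conservative'
START SLAVE SQL_THREAD;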

fnegri triaged this task as High priority.Sep 6 2023, 12:54 PM

Replication lag is still increasing, so for some reason this occurrence is worse than the previous ones, when the replica eventually caught up after 2 or 3 days. I will keep this task open until I complete T344717: [toolsdb] test creating a new replica host and completely replace the existing replica with a new one.

As an interim solution, I added s54113__spacemedia.dvids_video_files to the tables ignored by replication:

SET GLOBAL replicate_wild_ignore_table = 's51698\_\_yetkin.wanted\_items,s54113__spacemedia.dvids_video_files';

Given that the table no longer exists on the primary, it can be safely removed from replicate_wild_ignore_table once the replica has caught up.
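
Concretely, that cleanup is just resetting the variable to its previous value, roughly (a sketch; replication filters can only be changed while the slave threads are stopped):

-- Sketch: once the replica has caught up, drop the extra entry again
STOP SLAVE;
SET GLOBAL replicate_wild_ignore_table = 's51698\_\_yetkin.wanted\_items';
START SLAVE;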

The replica is back in sync, and I removed s54113__spacemedia.dvids_video_files from replicate_wild_ignore_table.

I will resolve this task, but I still plan to work on T344717: [toolsdb] test creating a new replica host soon.

(Attached screenshot: Screenshot 2023-09-20 at 11.06.34.png, 360 KB)