
[toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2023-09-01
Closed, Resolved · Public

Description

The replica tools-db-2 is currently lagging 2.7 hours behind the primary tools-db-1 (Grafana chart). This is (correctly) triggering the alert ToolsToolsDBReplicationLagIsTooHigh.

SHOW SLAVE STATUS\G in the replica shows that replication is active, but it's taking hours to process a single transaction.
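
For reference, the relevant fields in that output look roughly like this (a minimal sketch; the field names are from the standard MariaDB SHOW SLAVE STATUS output, and the values shown are illustrative rather than copied from tools-db-2):

SHOW SLAVE STATUS\G
    Slave_IO_Running: Yes          -- IO thread keeps reading events from the primary
    Slave_SQL_Running: Yes         -- SQL thread is running, just applying events very slowly
    Seconds_Behind_Master: 9720    -- ~2.7 hours of lag
    Exec_Master_Log_Pos: 15196207  -- stays put while the long transaction is being applied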

This happened before (T341891, T338031, T343819). The replica usually catches up after a couple of days, without any intervention.

I followed the runbook and found the database, the table, and the query causing the issue. The table is s54113__spacemedia.dvids_video_files and the query is:

UPDATE dvids_video_files SET dvids_video_repo_id = 'video' WHERE dvids_video_repo_id = '5'

I have acked the alert until Monday, hoping that this will resolve by itself. On Monday I will check if it's possible to add an index to that table to prevent this from happening in the future.

Event Timeline

fnegri changed the task status from Open to In Progress.Sep 1 2023, 4:41 PM

There was indeed a missing index on the column dvids_video_files.dvids_video_repo_id, but the whole table dvids_video_files has since been deleted in the primary.
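
For reference, the missing index would have been something along these lines (hypothetical statement, never applied since the table had already been dropped; the index name is made up):

-- Hypothetical: an index that would let the UPDATE locate rows by dvids_video_repo_id
-- without a full table scan for every row event
ALTER TABLE s54113__spacemedia.dvids_video_files
    ADD INDEX idx_dvids_video_repo_id (dvids_video_repo_id);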

The replica is still catching up: Exec_Master_Log_Pos hasn't moved and is still at 15196207. I will leave it running for a couple more days, as I expect it will eventually complete that transaction and catch up with the primary.

Looking at the replica CPU usage, I noticed it's at 100% but only using 1 out of 16 CPU cores. We probably want to enable parallel replication, adding the following option to my.cnf:

slave_parallel_threads = 12 # Let's keep 4 cores for non-replication activities
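
For the record, the same setting can also be applied at runtime without a full restart, as long as the replication SQL thread is stopped first; roughly:

-- Sketch: applying the setting dynamically instead of via my.cnf + restart
STOP SLAVE SQL_THREAD;
SET GLOBAL slave_parallel_threads = 12;
START SLAVE SQL_THREAD;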

Change 954722 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] [toolsdb] Enable parallel replication

https://gerrit.wikimedia.org/r/954722

Mentioned in SAL (#wikimedia-cloud) [2023-09-05T11:00:48Z] <dhinus> restarting mariadb on toolsdb-2 (replica) to test slave_parallel_threads (T345450)

I applied the patch manually to tools-db-2 and restarted MariaDB. Both the load and the disk activity increased significantly, but htop shows only 3 CPU cores with a visible load, not 12. It could be related to the fact that slave_parallel_mode defaults to conservative, but I would still expect more parallelism to be possible, given how many independent databases we have.

Change 954722 merged by FNegri:

[operations/puppet@production] [toolsdb] Enable parallel replication

https://gerrit.wikimedia.org/r/954722

SHOW PROCESSLIST; shows that there are now 12 replication threads (as expected) but most of them are idle:

MariaDB [(none)]> SHOW PROCESSLIST;
+-----+-----------------+-----------+------+--------------+-------+-----------------------------------------------+--------------------------------------------------------------------------------------------+----------+
| Id  | User            | Host      | db   | Command      | Time  | State                                         | Info                                                                                       | Progress |
+-----+-----------------+-----------+------+--------------+-------+-----------------------------------------------+--------------------------------------------------------------------------------------------+----------+
|   3 | system user     |           | NULL | Daemon       |  NULL | InnoDB purge worker                           | NULL                                                                                       |    0.000 |
|   1 | system user     |           | NULL | Daemon       |  NULL | InnoDB purge coordinator                      | NULL                                                                                       |    0.000 |
|   2 | system user     |           | NULL | Daemon       |  NULL | InnoDB purge worker                           | NULL                                                                                       |    0.000 |
|   4 | system user     |           | NULL | Daemon       |  NULL | InnoDB purge worker                           | NULL                                                                                       |    0.000 |
|   5 | system user     |           | NULL | Daemon       |  NULL | InnoDB shutdown handler                       | NULL                                                                                       |    0.000 |
|   7 | event_scheduler | localhost | NULL | Daemon       | 10356 | Waiting on empty queue                        | NULL                                                                                       |    0.000 |
|  11 | system user     |           | NULL | Slave_IO     | 10347 | Waiting for master to send event              | NULL                                                                                       |    0.000 |
|  13 | system user     |           | NULL | Slave_worker | 10347 | Waiting for work from SQL thread              | NULL                                                                                       |    0.000 |
|  15 | system user     |           | NULL | Slave_worker | 10347 | Waiting for work from SQL thread              | NULL                                                                                       |    0.000 |
|  14 | system user     |           | NULL | Slave_worker | 10347 | Waiting for work from SQL thread              | NULL                                                                                       |    0.000 |
|  16 | system user     |           | NULL | Slave_worker | 10347 | Waiting for work from SQL thread              | NULL                                                                                       |    0.000 |
|  17 | system user     |           | NULL | Slave_worker | 10347 | Waiting for work from SQL thread              | NULL                                                                                       |    0.000 |
|  19 | system user     |           | NULL | Slave_worker | 10347 | Waiting for work from SQL thread              | NULL                                                                                       |    0.000 |
|  18 | system user     |           | NULL | Slave_worker | 10347 | Waiting for work from SQL thread              | NULL                                                                                       |    0.000 |
|  20 | system user     |           | NULL | Slave_worker | 10347 | Waiting for work from SQL thread              | NULL                                                                                       |    0.000 |
|  21 | system user     |           | NULL | Slave_worker | 10347 | Waiting for work from SQL thread              | NULL                                                                                       |    0.000 |
|  22 | system user     |           | NULL | Slave_worker | 10347 | Waiting for work from SQL thread              | NULL                                                                                       |    0.000 |
|  23 | system user     |           | NULL | Slave_worker | 10347 | Waiting for work from SQL thread              | NULL                                                                                       |    0.000 |
|  24 | system user     |           | NULL | Slave_worker |     2 | Update_rows_log_event::find_row(-1)           | UPDATE dvids_video_files SET dvids_video_repo_id = 'video' WHERE dvids_video_repo_id = '5' |    0.000 |
|  12 | system user     |           | NULL | Slave_SQL    |  4643 | Waiting for room in worker thread event queue | NULL                                                                                       |    0.000 |
| 376 | root            | localhost | NULL | Query        |     0 | Init                                          | SHOW PROCESSLIST                                                                           |    0.000 |
+-----+-----------------+-----------+------+--------------+-------+-----------------------------------------------+--------------------------------------------------------------------------------------------+----------+
21 rows in set (0.001 sec)

I tried tweaking slave_parallel_max_queued as recommended here, but it had no impact. I also tried changing slave_parallel_mode from conservative to optimistic, but that had no visible effect either.
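
For the record, those tweaks were along these lines (a sketch; the values are illustrative, not the exact ones used, and the SQL thread is stopped first because slave_parallel_mode cannot be changed while it is running):

-- Sketch of the tweaks (illustrative values)
STOP SLAVE SQL_THREAD;
SET GLOBAL slave_parallel_max_queued = 1048576;  -- default is 131072 bytes per worker queue
SET GLOBAL slave_parallel_mode = 'optimistic';   -- was 'conservative'
START SLAVE SQL_THREAD;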

fnegri triaged this task as High priority.Sep 6 2023, 12:54 PM

Replication lag is still increasing, so for some reason this occurrence is worse than the previous ones, when the replica eventually caught up after 2 or 3 days. I will keep this task open until I complete T344717: [toolsdb] test creating a new replica host and completely replace the existing replica with a new one.

As an interim solution, I added s54113__spacemedia.dvids_video_files to the tables ignored by replication:

SET GLOBAL replicate_wild_ignore_table = 's51698\_\_yetkin.wanted\_items,s54113__spacemedia.dvids_video_files';

Given that the table no longer exists on the primary, it can be safely removed from replicate_wild_ignore_table once the replica has caught up.
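
Concretely, that cleanup is just resetting the variable to its previous value, roughly (a sketch; replication filters can only be changed while the slave threads are stopped):

-- Sketch: once the replica has caught up, drop the extra entry again
STOP SLAVE;
SET GLOBAL replicate_wild_ignore_table = 's51698\_\_yetkin.wanted\_items';
START SLAVE;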

The replica is back in sync, and I removed s54113__spacemedia.dvids_video_files from replicate_wild_ignore_table.

I will resolve this task, but I still plan to work on T344717: [toolsdb] test creating a new replica host soon.

(Attached screenshot: Screenshot 2023-09-20 at 11.06.34.png, 360 KB)