Page MenuHomePhabricator

[toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2025-07-30
Closed, ResolvedPublic

Description

On tools-db-6 replication is lagging. SHOW FULL PROCESSLIST shows this query:

|      13 | system user     |           | s51434__mixnmatch_p | Slave_worker |      11 | Update_rows_log_event::find_row(-1) on table `auxiliary_broken` | UPDATE auxiliary_broken a
INNER JOIN auxiliary_fix af ON a.aux_p=af.aux_p AND a.aux_name=af.label
SET a.aux_name=af.aux_name |    0.000 |

The table s51434__mixnmatch_p.auxiliary_broken has 1583236 rows and no index:

MariaDB [(none)]> SELECT COUNT(*) FROM s51434__mixnmatch_p.auxiliary_broken;
+----------+
| COUNT(*) |
+----------+
|  1583236 |
+----------+
1 row in set (0.533 sec)

MariaDB [(none)]> SHOW CREATE TABLE  s51434__mixnmatch_p.auxiliary_broken;
+------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table            | Create Table                                                                                                                                                                                                                                                                                                                                                                                                                  |
+------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| auxiliary_broken | CREATE TABLE `auxiliary_broken` (
  `id` int(11) unsigned NOT NULL DEFAULT 0,
  `entry_id` int(11) unsigned NOT NULL,
  `aux_p` int(10) unsigned NOT NULL,
  `aux_name` varchar(255) CHARACTER SET utf8mb3 COLLATE utf8mb3_general_ci NOT NULL DEFAULT '',
  `in_wikidata` tinyint(1) NOT NULL DEFAULT 0,
  `entry_is_matched` tinyint(1) NOT NULL DEFAULT 0
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_swedish_ci |
+------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.000 sec)

Event Timeline

Tagging the tool maintainers: @Edoardolenzi, @Hjfocs, @Magnus, @MaxFrax96. I will sort this out by applying the slow transaction manually, but to avoid future problems I would recommend adding one or more indexes to the table s51434__mixnmatch_p.auxiliary_broken.

Applied that transaction manually. Replication restarted and got stuck again on a different query on the same table:

| 6275807 | system user     |           | s51434__mixnmatch_p | Slave_worker |       3 | Delete_rows_log_event::find_row(-1) on table `auxiliary_broken` | DELETE FROM auxiliary_broken WHERE aux_name RLIKE "^Q\\d+$" |    0.000 |

Applied the second transaction manually and replication resumed.

One more query on the same table:

| 6277543 | system user     |           | s51434__mixnmatch_p | Slave_worker |       1 | Update_rows_log_event::find_row(-1) on table `auxiliary_broken` | UPDATE auxiliary_broken a INNER JOIN auxiliary_fix af ON a.aux_p=af.aux_p AND a.aux_name=af.label SET a.aux_name=af.aux_name |    0.000 |

And one more:

| 6279151 | system user     |           | s51434__mixnmatch_p | Slave_worker |      19 | Delete_rows_log_event::find_row(-1) on table `auxiliary_broken` | DELETE FROM auxiliary_broken WHERE aux_name RLIKE "^Q\\d+$" |    0.000 |

There's more, looks like very similar queries were executed multiple times on that table. I will continue fixing this later today.

fnegri changed the task status from Open to In Progress.Jul 30 2025, 10:24 AM
fnegri triaged this task as High priority.
fnegri moved this task from Backlog to In progress on the cloud-services-team (FY2025/26-Q1-Q2) board.

My bad, I'll throw some indexes on it

@Magnus thank you for your quick intervention!

The replication is back in sync:

Screenshot 2025-07-30 at 15.35.56.png (1×2 px, 366 KB)

Side note: while running one of the manual queries above, I forgot to run SET sql_log_bin = 0; and ended up adding that query to the binlogs of the replica, which is not a big issue but it caused the GTID to show an extra transaction:

MariaDB [(none)]> SHOW GLOBAL VARIABLES LIKE 'gtid%';
+-------------------------+----------------------------------------------------------+
| Variable_name           | Value                                                    |
+-------------------------+----------------------------------------------------------+
| gtid_binlog_pos         | 2886729896-2886729896-4603068434,2886731025-2886731025-1 |
| gtid_binlog_state       | 2886729896-2886729896-4603068434,2886731025-2886731025-1 |
| gtid_cleanup_batch_size | 64                                                       |
| gtid_current_pos        | 2886729896-2886729896-4603068434,2886731025-2886731025-1 |
| gtid_domain_id          | 0                                                        |
| gtid_ignore_duplicates  | OFF                                                      |
| gtid_pos_auto_engines   |                                                          |
| gtid_slave_pos          | 2886729896-2886729896-4603068434                         |
| gtid_strict_mode        | OFF                                                      |
+-------------------------+----------------------------------------------------------+
9 rows in set (0.002 sec)

I fixed this by running STOP REPLICA; RESET MASTER; START REPLICA;. After that we're back to a single domain_id in the gtid variables:

MariaDB [(none)]> SHOW GLOBAL VARIABLES LIKE 'gtid%';
+-------------------------+----------------------------------+
| Variable_name           | Value                            |
+-------------------------+----------------------------------+
| gtid_binlog_pos         | 2886729896-2886729896-4603090634 |
| gtid_binlog_state       | 2886729896-2886729896-4603090634 |
| gtid_cleanup_batch_size | 64                               |
| gtid_current_pos        | 2886729896-2886729896-4603090634 |
| gtid_domain_id          | 0                                |
| gtid_ignore_duplicates  | OFF                              |
| gtid_pos_auto_engines   |                                  |
| gtid_slave_pos          | 2886729896-2886729896-4603090634 |
| gtid_strict_mode        | OFF                              |
+-------------------------+----------------------------------+
9 rows in set (0.001 sec)