
[toolsdb] ToolsToolsDBReplicationLagIsTooHigh - 2025-06-30
Closed, Resolved · Public

Description

tools-db-5 is lagging behind the primary. SHOW FULL PROCESSLIST shows it's busy replicating this query:

|    2677 | system user     |                 | s53993__prc_admin_p | Slave_worker |       3 | Update_rows_log_event::find_row(-1) on table `admin_data_online` | UPDATE admin_data_online
SET process_start_time = NULL AND fetch_status = NULL
WHERE fetch_status = 'processing'                                                                                                                             |    0.000 |

The table s53993__prc_admin_p.admin_data_online has 983,790 rows and no indexes.

Details

Related Changes in GitLab:
| Title | Reference | Author | Source Branch | Dest Branch |
| volumes: fix typo in volume description | repos/cloud/toolforge/tofu-provisioning!62 | fnegri | main-Idf33272d704da9c99ed40a25fc23b3931d49b9c0 | main |
| volumes: remove tools-db-5-data | repos/cloud/toolforge/tofu-provisioning!59 | fnegri | main-e497 | main-a161 |
| volumes: remove unused tools-db-6-data-temp | repos/cloud/toolforge/tofu-provisioning!58 | fnegri | main-a161 | main |
| dns: point tools-readonly.db.svc to new replica | repos/cloud/cloud-vps/tofu-infra!252 | fnegri | T398170 | main |
| toolsdb: new replica host | repos/cloud/toolforge/tofu-provisioning!57 | fnegri | T398170 | main |

Event Timeline

@Hamishcn your tool hamishbot is causing some performance issues on ToolsDB. Please consider adding some indexes to the admin_data_online table.
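
A minimal sketch of the kind of index that could help, based on the WHERE clause of the stuck UPDATE above (the column choice is an assumption, not something the tool owner has confirmed):

/* Sketch only: fetch_status is taken from the stuck query's WHERE clause.
   Even a non-unique index saves both the primary and the replica from
   scanning all ~983k rows when applying these statements. */
ALTER TABLE admin_data_online ADD INDEX idx_fetch_status (fetch_status);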

In the meantime, I'm gonna fix the replication as I did in previous similar instances:

STOP REPLICA;
SET sql_log_bin = 0;  /* keep the manual fix out of the replica's own binlog */
/* run the stuck query manually, then: */
SET GLOBAL sql_slave_skip_counter = 1;  /* skip the transaction that was just applied by hand */
SET sql_log_bin = 1;
START REPLICA;

Replication is flowing again and should catch up soon.
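
To verify it's actually catching up, a quick check on the replica looks roughly like this (a sketch; the REPLICA spelling needs MariaDB 10.5+, older versions use SHOW SLAVE STATUS):

/* Seconds_Behind_Master trending towards 0 and an empty Last_SQL_Error
   mean the replica is catching up */
SHOW REPLICA STATUS\G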

Replication failed shortly after restarting with:

Last_SQL_Error: Could not execute Delete_rows_v1 event on table s53993__prc_admin_p.admin_data_online; Can't find record in 'admin_data_online', Error_code: 1032; handler error HA_ERR_END_OF_FILE; the event's master log log.050265, end_log_pos 97220305

This would be consistent with running SET GLOBAL sql_slave_skip_counter = 1; without first executing the skipped transaction manually, but I'm sure I did run it.

Looking at the MariaDB logs, I realized what went wrong:

The replication was stuck on the transaction with GTID 2886729896-2886729896-3959962231, which looks like this in the binlog:

#250629  2:51:40 server id 2886729896  end_log_pos 70741099 CRC32 0x99e2b3f7    GTID 2886729896-2886729896-3959962231 trans
[...]
#Q> UPDATE admin_data_online
#Q> SET process_start_time = NULL AND fetch_status = NULL
#Q> WHERE fetch_status = 'processing'

I stopped the replication, ran the query manually on the replica and skipped to the next GTID using SQL_SLAVE_SKIP_COUNTER=1:

Jun 30 09:02:00 tools-db-5 mysqld[3565868]: 2025-06-30  9:02:00 7301542 [Note] 'SQL_SLAVE_SKIP_COUNTER=1' executed at relay_log_file='./tools-db-5-relay-bin.000001', relay_log_pos='4', master_log_name='log.050252', master_log_pos='70741057', GTID '2886729896-2886729896-3959962230';  and new position at relay_log_file='./tools-db-5-relay-bin.000002', relay_log_pos='26788291', master_log_name='log.050252', master_log_pos='97528639', GTID '2886729896-2886729896-3959962231';

I restarted replication and saw it was immediately stuck again on what looked like the same query. I incorrectly assumed that SQL_SLAVE_SKIP_COUNTER had had no effect and that the replica was still trying to apply the same transaction, but in fact it was a different transaction, with GTID 2886729896-2886729896-3959975250:

#250629  2:52:38 server id 2886729896  end_log_pos 303621 CRC32 0x54e0c5cd      GTID 2886729896-2886729896-3959975250 trans
[...]
#Q> UPDATE admin_data_online
#Q> SET fetch_status = NULL
#Q> WHERE fetch_status = 'processing'
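
In hindsight, comparing the replica's GTID position before and after the skip would have shown immediately that this was a new transaction (a sketch):

/* the last GTID applied (or skipped) by the SQL thread; if it has advanced
   since the previous skip, the query it is now stuck on belongs to a
   different transaction, even if the SQL text looks identical */
SELECT @@gtid_slave_pos;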

I did issue SQL_SLAVE_SKIP_COUNTER again but this time I did not run the query manually on the replica:

Jun 30 09:08:52 tools-db-5 mysqld[3565868]: 2025-06-30  9:08:52 7301773 [Note] 'SQL_SLAVE_SKIP_COUNTER=1' executed at relay_log_file='./tools-db-5-relay-bin.000001', relay_log_pos='4', master_log_name='log.050253', master_log_pos='303579', GTID '2886729896-2886729896-3959975249';  and new position at relay_log_file='./tools-db-5-relay-bin.000002', relay_log_pos='19268947', master_log_name='log.050253', master_log_pos='19571817', GTID '2886729896-2886729896-3959975250';

That meant the second query was never executed on the replica, and the table on the replica became inconsistent with the primary. This led to replication failing a few minutes later:

Jun 30 09:12:52 tools-db-5 mysqld[3565868]: 2025-06-30  9:12:52 7301779 [ERROR] Slave SQL: Could not execute Delete_rows_v1 event on table s53993__prc_admin_p.admin_data_online; Can't find record in 'admin_data_online', Error_code: 1032; handler error HA_ERR_END_OF_FILE; the event's master log log.050265, end_log_pos 97220305, Gtid 2886729896-2886729896-3961197437, Internal MariaDB error code: 1032

I will recreate the replica from scratch using the standard and well-tested procedure. This will take a few hours; in the meantime the replica is available for querying, but its lag will keep growing.

Mentioned in SAL (#wikimedia-cloud-feed) [2025-06-30T10:43:51Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.vps.create_instance_with_prefix with prefix 'tools-db' (T398170)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-06-30T10:44:11Z] <fnegri@cloudcumin1001> END (FAIL) - Cookbook wmcs.vps.create_instance_with_prefix (exit_code=99) with prefix 'tools-db' (T398170)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-06-30T10:45:10Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.openstack.quota_increase (T398170)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-06-30T10:45:18Z] <fnegri@cloudcumin1001> END (PASS) - Cookbook wmcs.openstack.quota_increase (exit_code=0) (T398170)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-06-30T10:45:37Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.vps.create_instance_with_prefix with prefix 'tools-db' (T398170)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-06-30T10:45:59Z] <fnegri@cloudcumin1001> END (FAIL) - Cookbook wmcs.vps.create_instance_with_prefix (exit_code=99) with prefix 'tools-db' (T398170)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-06-30T10:46:53Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.openstack.quota_increase (T398170)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-06-30T10:46:58Z] <fnegri@cloudcumin1001> END (PASS) - Cookbook wmcs.openstack.quota_increase (exit_code=0) (T398170)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-06-30T10:47:02Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.vps.create_instance_with_prefix with prefix 'tools-db' (T398170)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-06-30T10:51:01Z] <fnegri@cloudcumin1001> END (PASS) - Cookbook wmcs.vps.create_instance_with_prefix (exit_code=0) with prefix 'tools-db' (T398170)

Currently copying the snapshot of tools-db-4 to the new host tools-db-6.
The snapshot was taken at GTID 2886729896-2886729896-3991557998.
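
Once the copy finishes, the new replica gets pointed at the primary starting from that GTID. Roughly like this (a sketch with placeholder host and credentials, not the exact commands used by the cookbook):

/* sketch: the GTID is the one recorded with the snapshot above;
   host, user and password are placeholders */
SET GLOBAL gtid_slave_pos = '2886729896-2886729896-3991557998';
CHANGE MASTER TO
  MASTER_HOST = 'tools-db-4.example.wikimedia.cloud',
  MASTER_USER = 'repl',
  MASTER_PASSWORD = '********',
  MASTER_USE_GTID = slave_pos;
START REPLICA;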

tools-db-6 is the new replica host, replicating from the primary tools-db-4. I updated the DNS record tools-readonly.db.svc.wikimedia.cloud to point to the new host.

DNS propagation should be fairly quick (TTL is 60 seconds), so I will delete the old replica tools-db-5 right away.

Mentioned in SAL (#wikimedia-cloud-feed) [2025-07-01T14:30:07Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.vps.remove_instance for instance tools-db-5 (T398170)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-07-01T14:31:02Z] <fnegri@cloudcumin1001> END (PASS) - Cookbook wmcs.vps.remove_instance (exit_code=0) for instance tools-db-5 (T398170)

The new replica caught up with the primary late yesterday, and then started lagging again:

Screenshot 2025-07-02 at 10.33.13.png (attached image, 78 KB)

This time it's for a different query on a different db:

MariaDB [(none)]> SHOW FULL PROCESSLIST;
[...]
|     19 | system user     |           | s51698__yetkin | Slave_worker |    16 | Delete_rows_log_event::find_row(-1) on table `visited_pages_agg` | DELETE FROM visited_pages_agg WHERE vpa_year = '2025' AND vpa_month = '6' AND vpa_day = '28' |    0.000 |

s51698__yetkin.visited_pages_agg is also missing a primary key:

MariaDB [(none)]> SHOW CREATE TABLE s51698__yetkin.visited_pages_agg;
[...]
| visited_pages_agg | CREATE TABLE `visited_pages_agg` (
  `vpa_year` int(11) NOT NULL,
  `vpa_month` int(11) NOT NULL,
  `vpa_day` int(11) NOT NULL,
  `vpa_project` varbinary(1) NOT NULL,
  `vpa_page` varbinary(200) NOT NULL,
  `vpa_count` int(11) DEFAULT 0,
  `vpa_mobile_count` int(11) DEFAULT 0,
  KEY `idx_page_title` (`vpa_page`),
  KEY `idx_project` (`vpa_year`,`vpa_month`,`vpa_day`)
) ENGINE=Aria DEFAULT CHARSET=binary PAGE_CHECKSUM=1 |
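
A primary key is what would let the replica apply these row events without scanning the whole table. A sketch only, since whether this column combination is actually unique is an assumption only the tool owner can confirm:

/* sketch: assumes (year, month, day, project, page) uniquely identifies a row */
ALTER TABLE visited_pages_agg
  ADD PRIMARY KEY (vpa_year, vpa_month, vpa_day, vpa_project, vpa_page);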

I will run the query manually with the usual procedure:

STOP REPLICA;
SET sql_log_bin = 0;
/* run the stuck query manually, then: */
SET GLOBAL sql_slave_skip_counter = 1;
SET sql_log_bin = 1;
START REPLICA;

Replication has restarted, but there are other heavy queries on the same table (s51698__yetkin.visited_pages_agg). It's currently processing a big INSERT that likely took a long time on the primary as well (though less than 30 minutes, or it would show up in the primary's slow_query_log).
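
The 30-minute figure assumes that's roughly the primary's slow-log threshold; a quick way to double-check it (sketch, run on the primary):

/* queries running longer than long_query_time seconds end up in the slow log */
SHOW GLOBAL VARIABLES LIKE 'long_query_time';
SHOW GLOBAL VARIABLES LIKE 'slow_query_log%';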

I will leave replication running and check back later.

The big INSERT completed, but now there's another big DELETE that will likely take hours because of the missing primary key:

| 399919 | system user     |           | s51698__yetkin | Slave_worker |    35 | Delete_rows_log_event::find_row(-1) on table `visited_pages_agg` | DELETE FROM visited_pages_agg WHERE vpa_year = '2025' AND vpa_month = '6' AND vpa_day = '27' |    0.000 |

I will run this one manually.

This DELETE was so massive that even run manually it took over 10 minutes to complete:

MariaDB [s51698__yetkin]> DELETE FROM visited_pages_agg WHERE vpa_year = '2025' AND vpa_month = '6' AND vpa_day = '27';

Query OK, 533303 rows affected (10 min 35.528 sec)

Replication has now restarted.

The replica is back in sync. 🎉

Screenshot 2025-07-02 at 13.23.59.png (attached image, 85 KB)

fnegri updated the task description.