
[toolsdb] Destroy tools-db-4 and create new host
Closed, Resolved · Public

Description

tools-db-4 is the current primary, and ran out of disk space in T409244: Toolforge outage: toolsdb out of space.

The file ibdata1 grew from 115G to 985G; we are still investigating what caused this increase. The replica host tools-db-6 is in sync and was not affected: ibdata1 on that host is still at 115G.

The plan is:

  • create a new replica tools-db-7, from a snapshot of tools-db-6
  • fail over the primary from tools-db-4 to tools-db-6
  • investigate[0][1] what's in the big ibdata1 in tools-db-4
  • destroy tools-db-4
  • delete volume tools-db-4-data
  • stop using Tofu to manage tools-db data volumes (tofu-provisioning!102)

[0] https://www.percona.com/blog/why-is-the-ibdata1-file-continuously-growing-in-mysql/
[1] https://www.percona.com/blog/deep-dive-mysqls-innochecksum-tool/

Details

Related Changes in GitLab:
  • Fail over tools-db (repos/cloud/cloud-vps/tofu-infra!281, fnegri, T409287 → main)
  • Add new data volume for tools-db-7 (repos/cloud/toolforge/tofu-provisioning!100, fnegri, main-Ie9e70ccc9cfa2260c5efb252991cee095ea37849 → main)

Event Timeline

fnegri renamed this task from [toolsdb] Destroy tools-db-4 and create new to [toolsdb] Destroy tools-db-4 and create new host. Nov 5 2025, 1:21 PM
fnegri changed the task status from Open to In Progress.
fnegri claimed this task.
fnegri triaged this task as Medium priority.

Mentioned in SAL (#wikimedia-cloud-feed) [2025-11-05T13:57:44Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.vps.create_instance_with_prefix with prefix 'tools-db' (T409287)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-11-05T14:02:33Z] <fnegri@cloudcumin1001> END (PASS) - Cookbook wmcs.vps.create_instance_with_prefix (exit_code=0) with prefix 'tools-db' (T409287)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-11-05T14:53:34Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.vps.refresh_puppet_certs on tools-db-7.tools.eqiad1.wikimedia.cloud (T409287)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-11-05T14:55:01Z] <fnegri@cloudcumin1001> END (PASS) - Cookbook wmcs.vps.refresh_puppet_certs (exit_code=0) on tools-db-7.tools.eqiad1.wikimedia.cloud (T409287)

Snapshot created from tools-db-6 with the position below:

MariaDB [(none)]> SELECT @@gtid_current_pos;
+----------------------------------+
| @@gtid_current_pos               |
+----------------------------------+
| 2886729896-2886729896-6675606183 |
+----------------------------------+
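MariaDB GTID positions like the one above are triples of the form domain_id-server_id-sequence_no; within a single domain, a replica has roughly caught up to this snapshot once its own sequence number reaches the snapshot's. A minimal decoding sketch (the function name is mine, not from the task):

```python
# Parse a MariaDB GTID position of the form "domain_id-server_id-sequence_no".
def parse_gtid(gtid: str) -> tuple[int, int, int]:
    domain_id, server_id, seq_no = (int(part) for part in gtid.split("-"))
    return domain_id, server_id, seq_no

# The snapshot position recorded above:
print(parse_gtid("2886729896-2886729896-6675606183"))
# → (2886729896, 2886729896, 6675606183)
```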

Rsync completed in 5h8m, but of course there's a catch: I thought I could upgrade to trixie, but MariaDB 10.6 is only available on bookworm. :( So I have to start over and stick with bookworm.

But maybe there is some good news: it looks like recent versions of Cinder have improved snapshot handling, and I can now delete a snapshot even if there are volumes derived from it. That means I can simplify the procedure and remove the need for rsync.

Mentioned in SAL (#wikimedia-cloud-feed) [2025-11-07T11:33:27Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.vps.remove_instance for instance tools-db-7 (T409287)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-11-07T11:34:14Z] <fnegri@cloudcumin1001> END (PASS) - Cookbook wmcs.vps.remove_instance (exit_code=0) for instance tools-db-7 (T409287)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-11-07T11:35:49Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.vps.create_instance_with_prefix with prefix 'tools-db' (T409287)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-11-07T11:42:41Z] <fnegri@cloudcumin1001> END (PASS) - Cookbook wmcs.vps.create_instance_with_prefix (exit_code=0) with prefix 'tools-db' (T409287)

Created new snapshot with the position below:

MariaDB [(none)]> SELECT @@gtid_current_pos;
+----------------------------------+
| @@gtid_current_pos               |
+----------------------------------+
| 2886729896-2886729896-6687065322 |
+----------------------------------+
1 row in set (0.000 sec)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-11-07T11:45:56Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.vps.refresh_puppet_certs on tools-db-7.tools.eqiad1.wikimedia.cloud (T409287)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-11-07T11:47:29Z] <fnegri@cloudcumin1001> END (PASS) - Cookbook wmcs.vps.refresh_puppet_certs (exit_code=0) on tools-db-7.tools.eqiad1.wikimedia.cloud (T409287)


I updated the process in https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/ToolsDB#Creating_a_new_replica_host to remove the need for the temporary volume and the rsync operation. We can now create a new replica in minutes instead of hours!

tools-db-7 is now replicating from tools-db-6; it took about 20 minutes to catch up and is now in sync.

Screenshot 2025-11-07 at 13.27.57.png (311×622 px, 27 KB)

Next steps (I will do these next week):

Failing over now from tools-db-4 to tools-db-6, following https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/ToolsDB#Changing_a_Replica_to_Become_the_Primary

MariaDB [(none)]> STOP REPLICA;
Query OK, 0 rows affected (0.027 sec)

MariaDB [(none)]> RESET REPLICA ALL;
Query OK, 0 rows affected (0.054 sec)

MariaDB [(none)]> SHOW MASTER STATUS;
+------------+----------+--------------+------------------+
| File       | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+------------+----------+--------------+------------------+
| log.045838 | 76406746 |              |                  |
+------------+----------+--------------+------------------+
1 row in set (0.000 sec)

MariaDB [(none)]> SELECT @@global.gtid_binlog_pos;
+----------------------------------+
| @@global.gtid_binlog_pos         |
+----------------------------------+
| 2886729896-2886729896-6769613942 |
+----------------------------------+
1 row in set (0.001 sec)

tools-db-6 is now the new primary.

Heartbeat is enabled and working. I noticed the last position recorded in the heartbeat table (76360583) is slightly older than the last replicated position (76406746, seen in the comment above). This is expected because the heartbeat table is only updated once per second, so some transactions were committed after the last heartbeat. I have dropped the old server's row from the heartbeat table.

MariaDB [(none)]> SELECT * FROM heartbeat.heartbeat;
+----------------------------+------------+------------+----------+-----------------------+---------------------+---------+------------+
| ts                         | server_id  | file       | position | relay_master_log_file | exec_master_log_pos | shard   | datacenter |
+----------------------------+------------+------------+----------+-----------------------+---------------------+---------+------------+
| 2025-11-11T13:28:06.002880 | 2886729896 | log.103868 | 76360583 | NULL                  |                NULL | toolsdb | eqiad      |
| 2025-11-11T14:09:10.000520 | 2886731025 | log.045842 |  8616968 | NULL                  |                NULL | toolsdb | eqiad      |
+----------------------------+------------+------------+----------+-----------------------+---------------------+---------+------------+
2 rows in set (0.001 sec)

MariaDB [(none)]> DELETE FROM heartbeat.heartbeat WHERE server_id = 2886729896;
Query OK, 1 row affected (0.002 sec)

MariaDB [(none)]> SELECT * FROM heartbeat.heartbeat;
+----------------------------+------------+------------+----------+-----------------------+---------------------+---------+------------+
| ts                         | server_id  | file       | position | relay_master_log_file | exec_master_log_pos | shard   | datacenter |
+----------------------------+------------+------------+----------+-----------------------+---------------------+---------+------------+
| 2025-11-11T14:09:26.001410 | 2886731025 | log.045842 | 11713026 | NULL                  |                NULL | toolsdb | eqiad      |
+----------------------------+------------+------------+----------+-----------------------+---------------------+---------+------------+
1 row in set (0.000 sec)
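The heartbeat table gives replication lag as the difference between the current time and the ts of the primary's row. A small sketch of that calculation, plus decoding the server_id values, which look like IPv4 addresses packed into 32-bit integers (an assumption based on a common MariaDB server_id convention, not confirmed anywhere in this task):

```python
from datetime import datetime

# Replication lag: now minus the ts of the primary's row in heartbeat.heartbeat.
def lag_seconds(heartbeat_ts: str, now_ts: str) -> float:
    fmt = "%Y-%m-%dT%H:%M:%S.%f"
    return (datetime.strptime(now_ts, fmt) - datetime.strptime(heartbeat_ts, fmt)).total_seconds()

# Assumption: server_id is an IPv4 address stored as a 32-bit integer.
def server_id_to_ip(server_id: int) -> str:
    return ".".join(str((server_id >> shift) & 0xFF) for shift in (24, 16, 8, 0))

print(server_id_to_ip(2886729896))  # → 172.16.0.168
```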

investigate[0][1] what's in the big ibdata1 in tools-db-4

I tried using innochecksum but it's taking hours to complete:

root@tools-db-4:~# /opt/wmf-mariadb106/bin/innochecksum --no-check --write /srv/labsdb/data/ibdata1

innochecksum did eventually complete, producing some useful information:

fnegri@tools-db-4:~$ sudo /opt/wmf-mariadb106/bin/innochecksum --count /srv/labsdb/data/ibdata1
Number of pages:66831104
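As a back-of-the-envelope sanity check, the page count reported above times InnoDB's default 16 KiB page size should roughly match the file size (assuming the default innodb_page_size; the result is a bit above the 985G mentioned in the description, which would be consistent with the file having grown further in the meantime):

```python
# Sanity-check: pages reported by innochecksum × default 16 KiB InnoDB page size.
pages = 66831104
page_size = 16 * 1024                # InnoDB default; innodb_page_size may differ
size_bytes = pages * page_size
print(size_bytes)                    # → 1094960807936
print(round(size_bytes / 2**30, 1))  # → 1019.8 (GiB)
```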
fnegri@tools-db-4:~$ sudo /opt/wmf-mariadb106/bin/innochecksum --page-type-summary /srv/labsdb/data/ibdata1
File::/srv/labsdb/data/ibdata1
================PAGE TYPE SUMMARY==============
#PAGE_COUNT     PAGE_TYPE
===============================================
  272720        Index page
56212649        Undo log page
  218982        Inode page
  139409        Insert buffer free list page
 9979067        Freshly allocated page
    4039        Insert buffer bitmap
     130        System page
       1        Transaction system page
       2        File Space Header
    4105        Extent descriptor page
       0        BLOB page
       0        Compressed BLOB page
       0        Page compressed page
       0        Page compressed encrypted page
       0        Other type of page

===============================================
Additional information:
Undo page type: 56212649
Undo page state: 1 active, 261 cached, 18619566 to_purge, 0 prepared, 37592821 other
index_id        #pages          #leaf_pages     #recs_per_page  #bytes_per_page
1               107             106             111             10522
2               469             468             181             11999
3               193             192             151             11368
4               161             160             257             11503
5               398             395             224             11368
11              40              39              58              8828
12              23              22              104             10343
13              19              18              121             11130
14              21              20              112             10667
15              58              57              176             11412
16              63              62              162             10669
3005178         1               1               15              600
-4294967296             269839          267680          94              6115

index_id        page_data_bytes_histgram(empty,...,oversized)
1               0       6       3       3       5       4       15      16      22      25      8       0
2               0       1       1       5       22      26      60      65      84      72      133     0
3               0       0       0       7       7       13      30      37      29      42      28      0
4               0       0       1       5       4       14      24      23      31      29      30      0
5               0       1       0       1       11      39      61      75      84      109     17      0
11              0       4       4       0       2       4       6       8       6       5       1       0
12              0       0       1       0       0       2       9       4       3       3       1       0
13              0       0       1       0       0       3       4       1       2       3       5       0
14              0       1       1       0       0       2       2       5       3       6       1       0
15              0       1       0       1       3       3       7       12      11      11      9       0
16              0       1       0       0       3       7       12      18      11      6       5       0
3005178         0       1       0       0       0       0       0       0       0       0       0       0
-4294967296             1       18463   25913   32915   42618   149923  0       0       0       0       6       0
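The page-type counts in the summary above are easier to interpret as percentages of the whole file; a quick calculation (page types and counts copied from the innochecksum output) shows undo log pages dominate:

```python
# Share of the largest page types in ibdata1, from the innochecksum summary above.
total = 66831104
counts = {
    "Undo log page": 56212649,
    "Freshly allocated page": 9979067,
    "Index page": 272720,
    "Inode page": 218982,
}
for page_type, count in counts.items():
    print(f"{page_type}: {100 * count / total:.1f}%")
# → Undo log page: 84.1%
# → Freshly allocated page: 14.9%
# → Index page: 0.4%
# → Inode page: 0.3%
```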

The output from innochecksum is not very easy to read, but I think it confirms the theory that the increase in size was caused by massive growth of the undo logs, triggered by a transaction that remained active for multiple days. More details about that in T409716: [toolsdb] ibdata1 growing on primary.

Mentioned in SAL (#wikimedia-cloud-feed) [2025-11-14T16:26:48Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.vps.remove_instance for instance tools-db-4 (T409287)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-11-14T16:27:43Z] <fnegri@cloudcumin1001> END (PASS) - Cookbook wmcs.vps.remove_instance (exit_code=0) for instance tools-db-4 (T409287)

I destroyed the tools-db-4 instance; the volume is still there. I will keep this task open until the volume is removed, and I also want to merge https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/102 to decouple the volumes from Tofu.

fnegri updated the task description. (Show Details)
fnegri moved this task from In progress to Done on the cloud-services-team (FY2025/26-Q1-Q2) board.
fnegri edited projects, added Toolforge (Toolforge iteration 25); removed Toolforge.

All done.