
[toolsdb] ibdata1 growing on primary
Closed, Resolved · Public

Description

On the primary host tools-db-4 the file ibdata1 has been growing quickly and caused T409244: Toolforge outage: toolsdb out of space.

After the disk size was increased, the size of ibdata1 remained stable for a few days at 985G, but is now growing again and is currently at 1007G.

Screenshot 2025-11-10 at 12.35.04.png (710×1 px, 53 KB)

A useful explanation of what can cause ibdata1 to grow:
https://www.percona.com/blog/why-is-the-ibdata1-file-continuously-growing-in-mysql/

One possible explanation is that the growth is caused by the "undo logs" generated by some specific user queries running on the db.

EDIT: this part from the Percona blog above turned out to be exactly our situation (more details in the comments below):

This is the most common reason, a pretty old transaction [...] The status is ACTIVE, that means InnoDB has created a snapshot of the data so it needs to maintain old pages in undo to be able to provide a consistent view of the database since that transaction was started. If your database is heavily write loaded that means lots of undo pages are being stored.

Event Timeline

fnegri changed the task status from Open to In Progress. Nov 10 2025, 11:38 AM
fnegri triaged this task as High priority.

Looking at SHOW ENGINE INNODB STATUS I can see some slow transactions with a big "undo log". These transactions seem to last a few minutes and are then replaced by a different transaction.

---TRANSACTION 158141880893, ACTIVE 749 sec
4 lock struct(s), heap size 1128, 1 row lock(s), undo log entries 7771
MariaDB thread id 37634508, OS thread handle 140100130969280, query id 985620608 172.16.5.125 s51138
---TRANSACTION 158143750499, ACTIVE 729 sec
40 lock struct(s), heap size 8312, 25 row lock(s), undo log entries 19937
MariaDB thread id 37972638, OS thread handle 140099791206080, query id 995197960 172.16.5.125 s51138
---TRANSACTION 158144557921, ACTIVE 115 sec
1 lock struct(s), heap size 1128, 0 row lock(s), undo log entries 14596
MariaDB thread id 38058200, OS thread handle 140099816253120, query id 997108294 172.16.5.125 s51138
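Transactions like these can be pulled out of the SHOW ENGINE INNODB STATUS text programmatically. A minimal sketch, assuming the output format shown above (other MariaDB versions may format these lines slightly differently; the `long_transactions` helper is illustrative, not an existing tool on the server):

```python
import re

# Fragment of SHOW ENGINE INNODB STATUS output, in the format seen above.
STATUS = """\
---TRANSACTION 158141880893, ACTIVE 749 sec
4 lock struct(s), heap size 1128, 1 row lock(s), undo log entries 7771
MariaDB thread id 37634508, OS thread handle 140100130969280, query id 985620608 172.16.5.125 s51138
---TRANSACTION 158144557921, ACTIVE 115 sec
1 lock struct(s), heap size 1128, 0 row lock(s), undo log entries 14596
MariaDB thread id 38058200, OS thread handle 140099816253120, query id 997108294 172.16.5.125 s51138
"""

# One transaction block: ACTIVE time, undo log entries, then the thread id.
TRX_RE = re.compile(
    r"---TRANSACTION .*?ACTIVE (?P<active>\d+) sec\n"
    r".*?undo log entries (?P<undo>\d+)\n"
    r"MariaDB thread id (?P<thread>\d+)"
)

def long_transactions(status: str, min_active: int = 600):
    """Return (thread_id, active_sec, undo_entries) for old transactions."""
    return [
        (int(m["thread"]), int(m["active"]), int(m["undo"]))
        for m in TRX_RE.finditer(status)
        if int(m["active"]) >= min_active
    ]

print(long_transactions(STATUS))  # → [(37634508, 749, 7771)]
```

With the default 600-second threshold, only the first transaction above qualifies; lowering `min_active` surfaces the younger one as well.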

They're all on db s51138. SHOW PROCESSLIST is showing only an idle thread on that db:

| 38058200 | s51138          | 172.16.5.125:54381  | s51138__heritage_p              | Sleep       |      0 |                                                               | NULL                                                                                                 |    0.000 |

But the thread id 38058200 matches the currently active transaction shown by SHOW ENGINE INNODB STATUS.

The binary logs confirm there's ongoing heavy activity on the s51138 database, belonging to the heritage tool:

root@tools-db-4:/srv/labsdb/binlogs# for i in $(seq 240 250); do echo -n "$i: "; mysqlbinlog --base64-output=decode-rows --verbose log.103$i |grep -c s51138 ; done
240: 262
241: 84016
242: 70054
243: 1
244: 0
245: 0
246: 0
247: 0
248: 0
249: 0
250: 54506

root@tools-db-4:/srv/labsdb/binlogs# mysqlbinlog --base64-output=decode-rows --verbose log.103250 |grep s51138 |grep -v "mapped" |sort | uniq
### INSERT INTO `s51138__heritage_p`.`monuments_am_(hy)`
### INSERT INTO `s51138__heritage_p`.`monuments_cm_(fr)`
### UPDATE `s51138__heritage_p`.`monuments_am_(hy)`
### UPDATE `s51138__heritage_p`.`monuments_cm_(fr)`

@JeanFred @Multichill it looks like the heritage tool is using very long transactions on ToolsDB (see the one below that has been active for 1467 seconds) and I suspect this is causing MariaDB to grow the disk space used for "undo logs". Did you change something recently in your tool that could explain this behaviour? The disk size has been increasing for the past 30 days; unfortunately I don't have older stats to verify exactly when this issue started.

---TRANSACTION 158144557921, ACTIVE 1467 sec
1 lock struct(s), heap size 1128, 0 row lock(s), undo log entries 143647
MariaDB thread id 38058200, OS thread handle 140099816253120, query id 1001819546 172.16.5.125 s51138

This is another very long active transaction, this time for tool kiranbot4, that might be causing issues:

---TRANSACTION (0x7f739e977e80), ACTIVE 455393 sec
0 lock struct(s), heap size 1128, 0 row lock(s)
MariaDB thread id 13106, OS thread handle 140100661339840, query id 1039210877 172.16.1.100 s54990
Trx read view will not see trx with id >= 157913784979, sees < 157913784979

The thread for the kiranbot4 transaction is in Sleep state:

|    13106 | s54990          | 172.16.1.100:60640  | s54990__ampurls                 | Sleep       |     10 |                                                               | NULL                                                                                                 |    0.000 |

The "Time" usually resets after a few seconds (maybe a new query is run from the same connection?), but the transaction ACTIVE time never resets.

I tried killing the connection with KILL 13106; and that did terminate the long transaction. But a new one was created, and it also remains ACTIVE without ever terminating:

---TRANSACTION (0x7f739e9b6780), ACTIVE 81 sec
0 lock struct(s), heap size 1128, 0 row lock(s)
MariaDB thread id 40105789, OS thread handle 140099684251328, query id 1068984619 172.16.1.100 s54990
Trx read view will not see trx with id >= 158154710030, sees < 158154710030

@Usernamekiran I see that you are listed as the maintainer for kiranbot4. It looks like the tool creates a SQL transaction on ToolsDB that is never released (it had been showing as ACTIVE for 5 days). Can you please check if this can be fixed?

@Usernamekiran I added autocommit=True to the db_connection in enwiki/amp/amp_rc.py, and restarted the amp-rc job, and the tool is no longer holding an open transaction. It's a Toolforge continuous job that runs a loop waiting for "recent changes" on wiki pages, so it never terminates, and keeps an open connection to the database.
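The failure mode here can be reproduced in miniature with Python's sqlite3 module (used below purely as a stand-in for the MariaDB driver; amp_rc.py's actual connection code is not shown in this task). Without autocommit, the first write opens an implicit transaction that stays open for the lifetime of the connection unless the code commits, which is exactly what a never-terminating polling loop fails to do:

```python
import sqlite3

# Without autocommit: the first DML statement opens an implicit transaction
# that remains open until commit() is called. A continuous job that never
# commits therefore holds one transaction open indefinitely.
no_autocommit = sqlite3.connect(":memory:")
no_autocommit.execute("CREATE TABLE changes (id INTEGER)")
no_autocommit.execute("INSERT INTO changes VALUES (1)")
print(no_autocommit.in_transaction)  # True: transaction still open

# With autocommit (isolation_level=None in sqlite3): every statement
# commits immediately, so no transaction is left behind between loop
# iterations.
autocommit = sqlite3.connect(":memory:", isolation_level=None)
autocommit.execute("CREATE TABLE changes (id INTEGER)")
autocommit.execute("INSERT INTO changes VALUES (1)")
print(autocommit.in_transaction)  # False: nothing left open
```

The same idea applies to the MariaDB connection in the tool: enabling autocommit (or committing at the end of each loop iteration) prevents a single transaction from spanning days.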

@JeanFred @Multichill it looks like the heritage tool is using very long transactions on ToolsDB (see the one below that has been active for 1467 seconds) and I suspect this is causing MariaDB to grow the disk space used for "undo logs". Did you change something recently in your tool that could explain this behaviour? The disk size has been increasing for the past 30 days; unfortunately I don't have older stats to verify exactly when this issue started.

@JeanFred @Multichill after further investigation, I think the impact of your tool on disk space for ToolsDB was minimal, if any. The main reason for the disk usage seems to be the separate tool kiranbot4 (see comments above). I would still recommend avoiding long transactions, but it's not an urgent fix.

I am also looking for ways to automatically terminate long transactions on the server side: T409857: [toolsdb] Automatically terminate long transactions.

ibdata1 is slowly growing again, this time on the new primary tools-db-6: it went from 115G to 134G.

@Usernamekiran I added autocommit=True to the db_connection in enwiki/amp/amp_rc.py, and restarted the amp-rc job, and the tool is no longer holding an open transaction.

This change was undone and kiranbot4 is again holding transactions for multiple days:

---TRANSACTION (0x7fd76756dc80), ACTIVE 480405 sec
0 lock struct(s), heap size 1128, 0 row lock(s)
MariaDB thread id 55516093, OS thread handle 140528514496192, query id 3398476324 172.16.6.4 s54990
Trx read view will not see trx with id >= 157665554175, sees < 157665544190

I killed this thread and I tried setting SET GLOBAL idle_transaction_timeout=300; to see if it helps, although if the tool is making at least one query every 5 minutes, it will never reach the timeout and we'll need a different approach, see T409857: [toolsdb] Automatically terminate long transactions.
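The timing argument above can be sketched as a small check: idle_transaction_timeout counts idle time between statements, so a transaction that issues a query at least every `timeout` seconds never looks idle long enough to be killed. The helper and timestamps below are illustrative, not part of MariaDB:

```python
def would_timeout(query_times: list[float], timeout: float) -> bool:
    """True if any gap between consecutive statements exceeds the timeout.

    query_times: statement timestamps (seconds) within one transaction.
    This models idle_transaction_timeout, which resets on every statement.
    """
    gaps = (b - a for a, b in zip(query_times, query_times[1:]))
    return any(gap > timeout for gap in gaps)

# A tool polling every 10 seconds keeps its transaction alive forever
# under a 300-second idle timeout:
polling = [t * 10.0 for t in range(100)]
print(would_timeout(polling, 300))        # False: timeout never fires

# A transaction left genuinely idle for an hour would be killed:
print(would_timeout([0.0, 3600.0], 300))  # True
```

This is why a server-side cap on total transaction age (rather than idle time) is tracked separately in T409857.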

Worth noting that this time the multi-day transaction caused a smaller increase in the size of ibdata1 than the similar transaction a few weeks ago. This is probably because in the past few days there was less concurrent write activity on the same tables: undo records for concurrent writes can only be purged once no open transaction's consistent snapshot still needs them, and it is those unpurged records that make ibdata1 grow.

An interesting video on the internals of InnoDB and undo logs: https://www.youtube.com/watch?v=RnpSQdS1rG8

I check that email every day; the one week I didn't check, this happened.

I had seen "autocommit=True" in my code, but I thought I had accidentally added it myself, so I removed it. I will fix it in a couple of hours and will update here when I do. I apologise for the inconvenience caused.

I added autocommit=True back to the program a few hours ago. I hope this resolves the issue.

I check that email every day; the one week I didn't check, this happened.

LOL! 😆

I added autocommit=True back to the program a few hours ago. I hope this resolves the issue.

Thanks, things are looking good: I no longer see open transactions, and the free space is no longer going down.

I will keep this task open for a few more days, because I'm not 100% sure if kiranbot4 was the only reason that caused ibdata1 to grow.

I created a new debugging dashboard in grafana: https://grafana.wmcloud.org/d/5968f014-0a22-4f48-b6dc-6d1c46636f80/fnegri-toolsdb-debugging

It shows both the history length (size of the undo log) and the difference in disk size between the primary and the replica, which is an approximation of the growth of ibdata1 on the primary:

Screenshot 2025-11-25 at 18.16.21.png (1×1 px, 131 KB)

Please note that ibdata1 cannot decrease in size, but the space in it can be reused after old undo logs are purged.

In the past 2 days, while there was some expected fluctuation in the undo log size, the size difference has remained constant, meaning the ibdata1 file has stopped growing.

History Length is still growing almost linearly. ibdata1 is holding constant at 142G, but I worry it will eventually start growing again if the history is not purged:

Screenshot 2025-11-27 at 11.52.02.png (1×1 px, 172 KB)

I suspect this time it's this expensive query on s51434 (mix-n-match tool, cc @Magnus):

fnegri@tools-db-6:~$ sudo tail /srv/labsdb/data/tools-db-6-slow.log
# Thread_id: 8419404  Schema: s51434__mixnmatch_p  QC_hit: No
# Query_time: 3600.302650  Lock_time: 0.000235  Rows_sent: 0  Rows_examined: 148860511
# Rows_affected: 0  Bytes_sent: 0
SET timestamp=1764234692;
UPDATE entry e1
                    INNER JOIN person_dates p1 ON p1.entry_id=e1.id AND p1.year_born IS NOT NULL
                    INNER JOIN vw_dates p2 ON p2.ext_name=e1.ext_name AND p2.year_born=p1.year_born AND p2.q IS NOT NULL AND p2.user>0 AND p2.entry_id!=e1.id AND p2.user IS NOT NULL
                    SET e1.q=p2.q,e1.user=0,timestamp=date_format(now(),'%Y%m%d%H%i%S')
                    WHERE e1.type='Q5' AND e1.catalog=3849 AND (e1.q is null or e1.user=0)
                    AND NOT EXISTS (SELECT * FROM log WHERE log.entry_id=e1.id AND log.q=p2.q);

The slow-log shows this query running every hour and always timing out when it reaches max_statement_time. The first entry in the slow-log is from Nov 18th.
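Entries like this can be filtered out of the slow log by their Query_time header. A minimal sketch, assuming the log format shown above (the `slow_queries` helper and the threshold are illustrative, not an existing tool on the server):

```python
import re

# Excerpt of a MariaDB slow-log entry, in the format shown above.
SLOW_LOG = """\
# Thread_id: 8419404  Schema: s51434__mixnmatch_p  QC_hit: No
# Query_time: 3600.302650  Lock_time: 0.000235  Rows_sent: 0  Rows_examined: 148860511
# Rows_affected: 0  Bytes_sent: 0
SET timestamp=1764234692;
"""

def slow_queries(log_text: str, min_seconds: float = 1800.0):
    """Yield (schema, query_time) pairs for entries slower than min_seconds."""
    entry_re = re.compile(
        r"# Thread_id: \d+\s+Schema: (?P<schema>\S+).*?\n"
        r"# Query_time: (?P<qt>[\d.]+)"
    )
    for m in entry_re.finditer(log_text):
        if float(m["qt"]) >= min_seconds:
            yield m["schema"], float(m["qt"])

print(list(slow_queries(SLOW_LOG)))  # → [('s51434__mixnmatch_p', 3600.30265)]
```

Grouping the output by schema would make it easy to spot which tool database is producing the hour-long statements.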

@fnegri I have deactivated the query in code, and removed the option to start it.

Just to be sure, would it be helpful if I stop my operations for a few days? That might clear up whether kiranbot4 is causing the growth issue, and would help with narrowing down the possible root causes.

Just to be sure, would it be helpful if I stop my operations for a few days?

@Usernamekiran Thanks for the offer, but right now I don't see any evidence that your tool is affecting either the history length or the size of ibdata1, so please keep it running for now.

The situation seems to be mostly fine at the moment, but I want to monitor the dashboard for a few days, before marking this task as Resolved.

Something happened last night that caused the history length to grow again, but it's already going down, so whatever query caused it has probably completed by now. The slow-log does not contain any query longer than 30 minutes in the past 24 hours. I will keep monitoring this.

Screenshot 2025-11-28 at 15.30.30.png (1×1 px, 202 KB)

fnegri moved this task from In progress to Done on the cloud-services-team (FY2025/2026-Q1-Q2) board.

The transaction history length still shows spikes, but is no longer growing out of control. The size difference between the primary and the replica has been more or less stable since 2025-11-22, when the change to kiranbot4 was deployed.

Screenshot 2025-12-02 at 16.11.56.png (1×1 px, 158 KB)

The size of ibdata1 is also stable at 142G:

fnegri@tools-db-6:~$ ls -lh /srv/labsdb/data/ibdata1
-rw-rw---- 1 mysql mysql 142G Dec  2 15:09 /srv/labsdb/data/ibdata1

I will mark this as Resolved. Some fluctuations of the history length are to be expected with so many tools using ToolsDB, and we cannot track down every problematic transaction. Better long-term strategies are T409857: [toolsdb] Automatically terminate long transactions and T291782: Migrate largest ToolsDB users to Trove.