User Details
- User Since
- Jul 18 2022, 2:39 PM (202 w, 4 d)
- Availability
- Available
- IRC Nick
- dhinus
- LDAP User
- FNegri
- MediaWiki User
- FNegri-WMF
Yesterday
I have just stopped the mixnmatch "cache warmer", I think it's sufficient now, see if that makes a difference.
I tried two approaches to understand which user/db is driving the high rate of "Data Reads".
The history length started to increase around 2026-05-28, which coincides with this increase in InnoDB I/O:
The situation after a few hours has improved:
Thu, Jun 4
I suspect there is also an underlying issue that I haven't discovered yet, and the stuck queries from heritage and dimastbkbot might be a symptom rather than the root cause.
I also lowered idle_transaction_timeout to 60:
I have temporarily stopped the dimastbkbot tool. I sent an email to the maintainer and also posted a message to their user page linking to this Phabricator task.
I can see similar queries in the slow query log since last year (2025-11-18) so I'm not sure why the "Transaction History Length" has only started increasing now.
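One way to track this value outside a dashboard, assuming the "Transaction History Length" metric corresponds to InnoDB's history list length, is to read the `History list length` line from the output of `SHOW ENGINE INNODB STATUS`. The sketch below only parses that line from status text; the function name and the sample text are hypothetical and do not come from ToolsDB.

```python
import re
from typing import Optional


def parse_history_list_length(innodb_status: str) -> Optional[int]:
    """Return the InnoDB history list length found in the status text, or None."""
    match = re.search(
        r"^History list length\s+(\d+)", innodb_status, flags=re.MULTILINE
    )
    return int(match.group(1)) if match else None


# Hypothetical excerpt of the TRANSACTIONS section, for illustration only.
SAMPLE_STATUS = """\
------------
TRANSACTIONS
------------
Trx id counter 123456
History list length 98765
"""

print(parse_history_list_length(SAMPLE_STATUS))
```

Sampling this value at regular intervals would show whether purge is keeping up or whether long-running transactions are holding it back.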
There is something that is compounding the effect of queries from s52323 (dimastbkbot), and it's queries from s51138 (heritage):
@magnusmanske thank you very much!
Wed, Jun 3
Unrelated but also contributing to the disk space growth, history length is growing too much:
No joy. I will deactivate the wikidata-terminator update script for now until I understand what's wrong.
I created T428087: [toolsdb] Add db-level and user-level monitoring to improve our monitoring.
I'm optimistically marking this as Resolved, as this should no longer be an issue now that ToolsDB is running MariaDB 10.6. I also removed the related notes from Wikitech.
DB storage is tracked in the subtask T291782
NFS storage does not seem to be an immediate issue
In https://gerrit.wikimedia.org/r/1297114 I reduced expire_logs_days from 14 days to 10 days, which gives us some breathing room while we continue to investigate:
@magnusmanske thanks for looking! I'm not entirely sure if the number of updates by wikidata-terminator increased recently, or if they are the cause of the increase in binlogs we are seeing starting 2026-06-14. However it's definitely doing a lot of updates.
On 2026-06-02 disk usage started increasing again:
Mon, Jun 1
There's a Phab for that! ™
Judging from this line it looks like we are deliberately not allowing PAWS to connect to ToolsDB.
So this task is to remove any unused certificates from modules/profile/files/ssl/ that are expired
Workaround: running toolforge_load_users_to_ldap.sh fixed it.
I tried to find queries affecting many rows, and I found that we're seeing millions of row updates per day on s51205__terminator_p.items (about 50 UPDATEs per second), which might explain the growth of the space occupied by the binlogs.
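As a rough cross-check of rate figures like the ones above, a sustained per-second rate can be converted into a per-day count. This is a minimal sketch: the function name is hypothetical and the sample rates are illustrative inputs, not measurements.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def rows_per_day(rows_per_second: float) -> float:
    """Convert a sustained per-second rate into a per-day count."""
    return rows_per_second * SECONDS_PER_DAY


# Illustrative rates only, to show the difference in order of magnitude.
for rate in (50, 50_000):
    print(f"{rate:>6} UPDATEs/s ~ {rows_per_day(rate):,.0f} rows/day")
```

A sustained 50 UPDATEs per second works out to about 4.3 million rows per day, whereas 50,000 per second would be about 4.3 billion, so only the former is consistent with "millions of row updates per day".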
Fri, May 29
Took 70 minutes in total:
I'm not sure what's causing the query to be interrupted. We do have a default limit of 1 hour that can be changed with SET SESSION max_statement_time, but when you hit that limit, you get an explicit error:
Thu, May 28
I'm actually not entirely convinced that s51138__heritage_p is the issue here. It definitely does a lot of queries, but I have no evidence this number has increased compared to last month. We could also have some tool generating a small number of very intensive queries that update a lot of rows, and that would also explain the increase in binlog size. This type of query is unfortunately harder to find by scanning the binlog files.
Disk space has stabilized as I expected:
Wed, May 27
An update from today: binlogs are still growing, but they should hopefully stop growing in the next 24 hours as the retention window moves forward. The number of binlogs generated per day remains higher than usual (more than 600 files per day), and we should find out what's causing this.
Happened following this commit https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/commit/9b51acbff6ecb7db733f289d996a517a6f56c596
Mon, May 25
Binlogs are stored in 100M chunks. It looks like we are still generating more binlogs than usual even if the spike in "deletes per day" has ended. There might be other operations that are causing an increased binlog activity.
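To put these figures in perspective, the disk space held by binlogs can be estimated from the chunk size, the number of files generated per day, and the retention window. The sketch below is a back-of-the-envelope calculation only. It assumes "100M" means 100 MiB per file and that every file is full, and it uses 600 files per day with retention windows of 14 and 10 days as illustrative inputs. The function name is hypothetical.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def binlog_disk_usage_gib(
    files_per_day: int, retention_days: int, chunk_mib: int = 100
) -> float:
    """Estimate the space held by binlogs, assuming every file is a full chunk."""
    total_bytes = files_per_day * retention_days * chunk_mib * MIB
    return total_bytes / GIB


# Compare two retention windows at the same (assumed) generation rate.
for days in (14, 10):
    usage = binlog_disk_usage_gib(files_per_day=600, retention_days=days)
    print(f"{days} days of retention: about {usage:.0f} GiB")
```

Because the estimate scales linearly with both inputs, shortening the retention window and reducing the number of files generated per day are interchangeable ways to free space.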
Looking at the ToolsDB debugging dashboard, it's not a repeat of T409716: [toolsdb] ibdata1 growing on primary, because in that case "ToolsDB primary-replica size difference" would be growing.
Fri, May 22
there's something wrong with the looping logic
All hosts have been upgraded and rebooted.
+1
I did run the cookbook with test-cookbook on 3 more hosts and it worked fine (clouddb1015, clouddb1016, clouddb1017).
Agreed, I was thinking of testing another shutdown one or two weeks from now and checking whether it gets stuck again. If it doesn't, I will mark it as resolved.
The cookbook for clouddb1017 took longer than expected because of T427060: clouddb1017 getting stuck during shutdown. It eventually completed, but failed on the last step because replication lag was not zero:
There's a threaddump in /root/shutdown-threaddump if you want to have a look, it was taken before the SIGABRT.
Rebooted the host and restarted mariadb, this is the startup log:
@fgiunchedi attempted a SIGABRT that resulted in:
Yes, it's not the first time this has happened. I created a task to have a paper trail and see if we can find ways to prevent it.
Thu, May 21
I took a shot at adapting the cookbook to support clouddbs. I tested the patch above on clouddb1014 and it worked, you can see the full output in the Paste below:
Wed, May 20
clouddb1015 is running on Trixie and repooled.
The cookbook did actually PASS, but an exception was raised while writing the PASS comment (that you can see above), which caused the following FAIL message to be posted.
wikireplicas-utils was also missing in trixie, in this case a simple copy worked:
I found the source at https://gerrit.wikimedia.org/r/q/project:operations/debs/wmf-pt-kill but I don't know what procedure should be followed for building it, could you or @Marostegui please rebuild the package for trixie?
There's a broken dependency:
fnegri@apt1002:~$ sudo -i reprepro copy trixie-wikimedia bookworm-wikimedia wmf-pt-kill
Reimage completed, Mariadb is running, but puppet is failing with:
I'm reimaging clouddb1015 to trixie today, sorry for the delay.
@Raymond_Ndibe the draft looks good, I'm ok with sending it today. A few minor fixes:
- with toolforge -> with Toolforge
- toolforge jobs -> I would wrap it in quotes to clarify it's a cli command, "toolforge jobs"
- temporal -> temporary
- line length is not consistent, please use the same length for all lines
Mon, May 18
I would like to test this assumption with this patch that updates our config to use the latest available version of heroku:24, both for default builds and for builds using --use-latest-versions.
Verified on lima-kilo on Linux, nuked the VM when ./start-devenv.sh asked and ran the verification commands
Side-note: Heroku automatically restarts all containers every 24 hours, so any updates to the stacks are rolled out automatically:
Fri, May 15
After merging all the patches above, things are working fine on my machine.
Thu, May 14
Re-deploying istio-system usually does not help, since that is a no-op. Today it did, because it made changes to the pod definitions, which triggered the deployment to get re-created.
@taavi after reading https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/IstioGatewayPodMisplaced, my current understanding is:
- whenever we deploy the "istio-gateway" component (which is not often), there’s a possibility that the new pods are misplaced
- misplaced istio-gateway pods can cause all tools to become unreachable
- redeploying istio-gateway won’t help, because that only updates the ConfigMap. The only way to fix it is to redeploy istio-system (like I did today), or manually delete the misplaced pod (as recommended in the runbook above)
This alert recovering thanks to the re-deployment did
The Helm diff above is because this branch was deployed in toolsbeta a few days ago: https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1235/diffs
Redeploying istio-system seems to have fixed it, there was an unexpected Helm difference:

