Page MenuHomePhabricator

ToolsDB disk space usage growing too fast
Closed, ResolvedPublic

Assigned To
Authored By
fnegri
Mon, May 25, 10:00 AM
Referenced Files
F86588316: Screenshot 2026-06-04 at 11.39.36.png
Thu, Jun 4, 10:06 AM
F86588272: Screenshot 2026-06-04 at 11.36.33.png
Thu, Jun 4, 10:06 AM
F86588239: Screenshot 2026-06-04 at 11.36.26.png
Thu, Jun 4, 10:06 AM
F86588223: Screenshot 2026-06-04 at 11.36.22.png
Thu, Jun 4, 10:06 AM
F86460430: Screenshot 2026-06-03 at 20.17.32.png
Wed, Jun 3, 6:54 PM
F86440400: Screenshot 2026-06-03 at 17.02.42.png
Wed, Jun 3, 3:03 PM
F86403619: Screenshot 2026-06-03 at 12.10.12.png
Wed, Jun 3, 10:12 AM
F84807178: Screenshot 2026-05-28 at 08.08.55.png
Thu, May 28, 6:32 AM

Description

The alert ToolsDBAlmostFull triggered yesterday, 2026-05-24.

Graphs show that space usage has started growing linearly on 2026-05-14:

Screenshot 2026-05-25 at 11.59.25.png (2,754×1,584 px, 354 KB)

This could be a repeat of T409716: [toolsdb] ibdata1 growing on primary. EDIT: it was a different issue, see comments below.

Details

Related Changes in Gerrit:

Event Timeline

fnegri changed the task status from Open to In Progress.Mon, May 25, 10:00 AM
fnegri triaged this task as High priority.
fnegri edited projects, added: tools-platform-team; removed: cloud-services-team.
fnegri moved this task from Backlog to In progress on the tools-platform-team board.
fnegri renamed this task from toolsdb disk space growing too fast to ToolsDB disk space growing too fast.Mon, May 25, 10:07 AM

Looking at the ToolsDB debugging dashboard, it's not a repeat of T409716: [toolsdb] ibdata1 growing on primary, because in that case "ToolsDB primary-replica size difference" would be growing.

Looking at the graph "deletes per day" seems to point out to a massive number of deletes causing a temporary increase of the binlog size. We keep binlogs for 14 days, so I expect disk space will start to free up 14 days after the spike in deletes ended on 2016-05-20.

Screenshot 2026-05-25 at 12.14.31.png (1,370×614 px, 86 KB)

But this does not explain why disk usage is still growing after the spike in deletes ended, it should have stabilized.

Screenshot 2026-05-25 at 12.19.36.png (1,362×606 px, 80 KB)

Space utilization right now:

fnegri@tools-db-6:~$ sudo du -hs /srv/labsdb/* |sort -hr |head -10
2.3T	/srv/labsdb/data
914G	/srv/labsdb/binlogs
20K	/srv/labsdb/tmp
16K	/srv/labsdb/lost+found
fnegri@tools-db-6:~$ sudo du -hs /srv/labsdb/data/* |sort -hr |head -10
281G	/srv/labsdb/data/s53220__quickstatements_p
273G	/srv/labsdb/data/s51434__mixnmatch_p
184G	/srv/labsdb/data/s53685__editgroups
173G	/srv/labsdb/data/s51698__yetkin
142G	/srv/labsdb/data/ibdata1
86G	/srv/labsdb/data/s51412__data
70G	/srv/labsdb/data/s51114__enwp10
63G	/srv/labsdb/data/s51156__petscan
58G	/srv/labsdb/data/s53952__freebase_p
57G	/srv/labsdb/data/s51499__wikiminiatlas

Binlogs are stored in 100M chunks. It looks like we are still generating more binlogs than usual even if the spike in "deletes per day" has ended. There might be other operations that are causing an increased binlog activity.

fnegri@tools-db-6:~$ sudo ls -lh /srv/labsdb/binlogs/ |awk '{print $6 $7}' |uniq -c
      1
    101 May11
    455 May12
    307 May13
    742 May14
    749 May15
    720 May16
    659 May17
    740 May18
    704 May19
    673 May20
    747 May21
    686 May22
    692 May23
    662 May24
    505 May25

The good news is that the number of binlog files generate daily is high but not increasing, so disk space usage should stabilize in a couple of days (when the 14-day retention window will start deleting binlogs from May14 and later days).

An update from today, binlogs are still growing but they should hopefully stop to grow in the next 24 hours as the retention window moves forward. The number of binlogs generated per day remains higher than usual (more than 600 files generated per day), we should find out what's causing this.

fnegri@tools-db-6:~$ sudo du -hs /srv/labsdb/* |sort -hr |head -10
2.3T	/srv/labsdb/data
1005G	/srv/labsdb/binlogs
20K	/srv/labsdb/tmp
16K	/srv/labsdb/lost+found
fnegri@tools-db-6:~$ sudo du -hs /srv/labsdb/data/* |sort -hr |head -10
281G	/srv/labsdb/data/s53220__quickstatements_p
274G	/srv/labsdb/data/s51434__mixnmatch_p
184G	/srv/labsdb/data/s53685__editgroups
174G	/srv/labsdb/data/s51698__yetkin
142G	/srv/labsdb/data/ibdata1
86G	/srv/labsdb/data/s51412__data
70G	/srv/labsdb/data/s51114__enwp10
63G	/srv/labsdb/data/s51156__petscan
58G	/srv/labsdb/data/s53952__freebase_p
57G	/srv/labsdb/data/s51499__wikiminiatlas
fnegri@tools-db-6:~$ sudo ls -lh /srv/labsdb/binlogs/ |awk '{print $6 $7}' |uniq -c
      1
    101 May13
    742 May14
    749 May15
    720 May16
    659 May17
    740 May18
    704 May19
    673 May20
    747 May21
    686 May22
    692 May23
    662 May24
    718 May25
    806 May26
    525 May27

Disk space has stabilized as I expected:

Screenshot 2026-05-28 at 08.08.55.png (1,058×646 px, 64 KB)

We're still seeing 700+ binlog files written every day, which is much more than usual. I scanned all the binlog files from yesterday (May 27) and the highest number of queries is for s51138__heritage_p. @Multichill @JeanFred was there any recent change to the heritage tool that could explain an increase in DB activity?

root@tools-db-6:~# for i in $(seq 128050 128717); do echo "log.$i:"; mysqlbinlog --base64-output=decode-rows --verbose /srv/labsdb/binlogs/log.$i |grep Table_map | awk '{print $11}' |uniq -c |sort |tail -n 3 |grep -E -- "^  [0-9]+"; done

[...]

log.128068:
  25825 `s51138__heritage_p`.`monuments_ir_(fa)`
log.128069:
log.128070:
log.128071:
log.128072:
log.128073:
  35264 `s51138__heritage_p`.`monuments_es-ct_(ca)`
log.128074:
log.128075:
  30549 `s51138__heritage_p`.`monuments_ro_(ro)`
log.128076:
log.128077:
log.128078:
  84298 `s51138__heritage_p`.`monuments_be-vlg_(fr)`

[...]

I'm actually not entirely convinced that s51138__heritage_p is the issue here. It does definitely do a lot of queries, but I have no evidence this number has increased compared to last month. We could also have some tool generating a small number of very intensive queries that will update a lot of rows, and that would also explain the increase in binlog size. This type of queries are unfortunately harder to find scanning the binlog files.

fnegri renamed this task from ToolsDB disk space growing too fast to ToolsDB disk space usage growing too fast.Thu, May 28, 1:08 PM

I tried to find queries affecting many rows, and I found that we're seeing millions of row updates per day on s51205__terminator_p.items (about 50k 1.5k UPDATEs per second), that might explain the growth of the space occupied by the binlogs.

That table belongs to the wikidata-terminator tool, @Magnus was there any recent change to that tool that would cause an increase in database activity? The increase in binlog size started on May 14th.

root@tools-db-6:~# for i in $(seq 132000 132041); do echo "log.$i:"; mysqlbinlog --base64-output=decode-rows --verbose /srv/labsdb/binlogs/log.$i |grep 'UPDATE `s5' | sort |uniq -c |sort |tail -n 3; done
log.132000:
   8307 ### UPDATE `s55543__xlinks_p`.`host`
  29998 ### UPDATE `s55078__wdrc_p`.`redirects`
 205000 ### UPDATE `s51205__terminator_p`.`items`
log.132001:
   5262 ### UPDATE `s51055__anomiebot`.`AnomieBOT_Store`
  30000 ### UPDATE `s55078__wdrc_p`.`redirects`
 119997 ### UPDATE `s51205__terminator_p`.`items`
log.132002:
   2195 ### UPDATE `s55902__xlinkstest_p`.`host`
  19998 ### UPDATE `s55078__wdrc_p`.`redirects`
 109996 ### UPDATE `s51205__terminator_p`.`items`
log.132003:
   1896 ### UPDATE `s55543__xlinks_p`.`host`
  20000 ### UPDATE `s55078__wdrc_p`.`redirects`
 125000 ### UPDATE `s51205__terminator_p`.`items`
log.132004:
   1641 ### UPDATE `s55543__xlinks_p`.`host`
  25000 ### UPDATE `s55078__wdrc_p`.`redirects`
 130000 ### UPDATE `s51205__terminator_p`.`items`
log.132005:
   2352 ### UPDATE `s55902__xlinkstest_p`.`host`
  24998 ### UPDATE `s55078__wdrc_p`.`redirects`
 170000 ### UPDATE `s51205__terminator_p`.`items`

[...]

An example query generating thousands of row updates is:

REPLACE INTO items (q,claims,identifiers,sitelinks,labels,descriptions,sites,score,random) VALUES (24703775,20,2,0,',en,',',en,ja,',',,',18,rand()), ...)

On 2026-06-02 disk usage started increasing again:

Screenshot 2026-06-03 at 12.10.12.png (1,076×614 px, 72 KB)

Scanning the binlogs I found high activity from the msynbot tool:

root@tools-db-6:~# mysqlbinlog --base64-output=decode-rows --verbose /srv/labsdb/binlogs/log.132815 |grep "### [A-Z]\+" |grep -v SET |grep -v WHERE |sort |uniq -c |sort | tail -n 3
  15000 ### UPDATE `s51205__terminator_p`.`items`
 657708 ### INSERT INTO `s53222__delsitelinks_ix`.`sitelinks`
1500000 ### INSERT INTO `s53222__delsitelinks_ix`.`pages`

I'll have a look at wikidata-terminator

Change #1297114 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] toolsdb: reduce binlog retention

https://gerrit.wikimedia.org/r/1297114

Trying to fix update flood: https://codeberg.org/magnusmanske/wikidata-terminator/commit/18ce520293ad93078d38c8b63ba8268021a24483

No idea why it started recently though. Have a look the next few hours if anything improves

@magnusmanske thanks for looking! I'm not entirely sure if the number of updates by wikidata-terminator increased recently, or if they are the cause of the increase in binlogs we are seeing starting 2026-06-14. However it's definitely doing a lot of updates.

It's not the only database doing a lot of operations, so it's possible it's not the one that increased recently. Trying to get a better picture, these are the top 10 databases ordered by number of operations in the time range 11:00 UTC to 12:00 UTC today:

4022412 s51205__terminator_p
 933841 s53113__wscontest_p
 591713 s55078__wdrc_p
 557116 s52481__scratch_fr
 465923 s51434__mixnmatch_p
 305134 s53430__nppbrowser_p
 257982 s51138__heritage_p
 216282 s51290__dpl_p
 161137 s55902__xlinkstest_p
 157546 s55543__xlinks_p

The list was generated with this simple Python script: P93712.

Trying to fix update flood: https://codeberg.org/magnusmanske/wikidata-terminator/commit/18ce520293ad93078d38c8b63ba8268021a24483

This unfortunately won't make a difference, because we use row-based replication, so each single row that is updated will add a new line to the binlog, regardless if it's done via many UPDATE statements or via a single big UPDATE.

4 million updates in 1 hour seem quite a lot, especially because it seems to happen constantly and not just for a brief spike. Is that number of constant updates expected?

My tools:

  • s51205__terminator_p cause TBD
  • s55078__wdrc_p has a lot of updates by its nature, but it should clean up after itself, that is, the table(s) will not grow infinitely
  • s51434__mixnmatch_p should have lower usage except imports, but right now I'm running a cache warmer that should finish in a few days, keep an eye

As for wikidata-terminator, looks like it was trying to update from recent changes - since 2023...

I have changed that to today, let's see what happens

Change #1297114 merged by FNegri:

[operations/puppet@production] toolsdb: reduce binlog retention

https://gerrit.wikimedia.org/r/1297114

In https://gerrit.wikimedia.org/r/1297114 I reduced expire_logs_days from 14 days to 10 days, which gives us some breathing room while we continue to investigate:

Screenshot 2026-06-03 at 17.02.42.png (986×618 px, 55 KB)

No joy. I will deactivate the wikidata-terminator update script for now until I understand what's wrong.

No joy. I will deactivate the wikidata-terminator update script for now until I understand what's wrong.

Thanks, we have some room now that I reduced the log retention time, so please stop and start the update script as you need for testing.

If that is helpful, I was collecting some more stats just now:

# 14:00 to 15:00 UTC
     4217533 s51205__terminator_p

# 15:00 to 16:00 UTC
      460283 s51205__terminator_p

# 16:00 to 17:00 UTC
     1515909 s51205__terminator_p

I will check again tomorrow.

Unrelated but also contributing to the disk space growth, history length is growing too much:

Screenshot 2026-06-03 at 20.17.32.png (980×622 px, 77 KB)

I killed a bunch of stuck queries from user s52323 (dimastbkbot) and s53685 (editgroups). dimastbkbot is automatically restarting new queries which is likely preventing the undo log from getting purged.

Update: I rewrote parts of wikidata-terminator to be able to skip the giant update-happy table entirely. the table is already deleted, albeit there is a new, smaller one. [Starship Toopers meme] I'm doing my part!

fnegri closed this task as Resolved.EditedThu, Jun 4, 10:06 AM
fnegri moved this task from In progress to Done on the tools-platform-team board.

@magnusmanske thank you very much!

I added some more visualizations to the ToolsDB debugging dashboard that show a few things:

  • the disk space decrease starting on 2026-05-14 was clearly correlated with the number of "row updates", but there was initially no visible change in the number of SQL UPDATE commands
  • the further dip in disk space starting 2026-06-02 was instead correlated with both an increase in row updates and in the number of SQL UPDATE commands
  • things seem to be back to normal now and free disk space is slowly increasing (the big jump on 2026-06-03 is me changing expire_logs_days from 14 to 10 days)

Screenshot 2026-06-04 at 11.36.22.png (1,274×612 px, 73 KB)

Screenshot 2026-06-04 at 11.36.26.png (1,276×608 px, 119 KB)

Screenshot 2026-06-04 at 11.36.33.png (1,278×620 px, 99 KB)

There is one thing that is still not resolved which is the history length issue, it is still growing but it's probably unrelated:

Screenshot 2026-06-04 at 11.39.36.png (1,270×614 px, 89 KB)

I have opened T428139: [toolsdb] Transaction History Length growing too much to track the history length issue, this task is Resolved.