
20250901 enwiki dump is triple the normal size
Closed, ResolvedPublicBUG REPORT

Assigned To
Authored By
OloffTheMeta
Sep 4 2025, 11:44 PM
Referenced Files
F66026004: image.png
Sep 17 2025, 10:23 AM
F66023663: image.png
Sep 15 2025, 4:41 PM
F66023660: image.png
Sep 15 2025, 4:41 PM
F66021929: rebuild-enwiki-20250901-pages-articles-multistream.tar.gz
Sep 14 2025, 8:46 PM
F66021592: enwiki-20250901-broken-hashsums.tar.gz
Sep 14 2025, 11:54 AM
F66018356: image.png
Sep 12 2025, 4:35 PM
F66018081: image.png
Sep 12 2025, 3:10 PM
F66017621: image.png
Sep 12 2025, 1:02 PM

Description

Steps to replicate the issue (include links if applicable):

Event Timeline

I noticed this issue the other day while downloading various non-English dumps, and I'd like to add a few other symptoms I've noticed:

  • The MD5 and SHA-1 checksum files are completely the wrong sizes for an enwiki dump. I'm seeing 23KB for the SHA-1 (135KB in 2025-08-01) and just 476 bytes for the MD5 (124KB in 2025-08-01). This persists when accessed through two different UK ISPs (one cellular) as well.
  • OloffTheMeta reports the file as being 84,9GB, yet when I've looked at it (again, through two connections on separate ISPs) it shows for me as 74,9GB, though this disparity could be a typo. Making a HEAD request via wget gives an exact length (as reported by the dumps server) of 80.404.958.501 bytes.
  • Unlike the 18 non-English dumps I've downloaded this month, enwiki is persistently 404'ing on the mirror I use, mirror.accum.se. Whatever happened on the dumps server presumably gave accum.se a 404 when it (r)sync'd the September dumps, which might help with timing.
  • The length of the non-multistream enwiki-20250901-pages-articles.xml.bz2 (23GB, 24.373.087.823 bytes) reported through HEAD requests appears sane, but like the multistream dump it is also 404'ing on accum.se.
  • The dump status for enwiki is still showing Recombine logs and All pages w/complete edit history (bz2 & 7z) as pending, so I'm guessing a process has crashed out badly (by the looks of things during log processing, since Log events is stuck at in progress) but a management script is still running without realising the process has failed.
  • The latest timestamp I see on the page is 2025-09-02 21:22 (20250902T212204Z) so that's where I'd start looking if I had access to the server logs.
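The HEAD-request check above is easy to script; a minimal sketch (the URL is illustrative, and the Content-Length parsing is demonstrated against a canned response so it works offline):

```shell
# Check the size a dumps server advertises for a file without downloading it.
# Live check (needs network; URL is illustrative):
#   curl -sI "https://dumps.wikimedia.org/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2" \
#     | tr -d '\r' | awk 'tolower($1)=="content-length:" {print $2}'
# The parsing step, shown on a canned HEAD response:
headers='HTTP/1.1 200 OK
Content-Length: 80404958501
Content-Type: application/octet-stream'
size=$(printf '%s\n' "$headers" | tr -d '\r' | awk 'tolower($1)=="content-length:" {print $2}')
echo "$size"    # 80404958501
```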

The quickest solution might be to kill the dump process and leave it until the next scheduled dump (perhaps keeping the non-multistream dump online?), or see if it's possible to fork a multistream version off the non-multistream one. Unfortunately I'm on a cellular connection with a data quota too small to download the enwiki dump tonight, otherwise I'd have got on it as soon as I noticed the non-multi was a sane size and presumably undamaged.
ᛒᚱᛟᚴᛂᚾ ᚢᛁᚴᛁᚾᚷ (Broken Viking)

Screenshot of what's grabbing my attention attached:

Bildschirmfoto_2025-09-09_02-52-59_enwiki-tasks-still-pending.png (513×885 px, 84 KB)

That was a typo! I also see 74,9GB


Aye, typos happen! What's worrying me about this bug is what it might be doing to WMF's data export bills. enwiki is probably the most downloaded dump of the lot (followed by Spanish, German and Russian, in that order, if the stats for the torrents I made of the 2025-06-01 dumps are anything to go by), and though we can see that the file size indicates a likely problem, chances are 99% of users aren't spotting it at all, even if they've been pulling the dumps and are used to them being ~25GiB-ish.

I see that the non-multistream dump seems to be a sane size (22,7GiB), and I might be able to fork a new multistream version off of that if I can work out some essentials not yet in my skill-set, but would it be worth doing this when the next dump window is about a week away? I'm limited on access here by my capped cellular connection, and all of the work I've been doing relies heavily on Sneakernet and a friend's unlimited cable connection (plus being able to employ IA for web-seeding), and access to this is rather dependent on my health, which has been in some decline over the past six months.

As it is, from some looking around Phabricator it looks like a new framework/mechanism is being brought in for Wikipedia dumps, but it seems to be proving troublesome (see T402626), so we might want to keep hold of the known good dumps we have for the moment, just in case dumps cease to be available, period.

Are you still looking to handle the English language dump, or do you want me to get that IA'd and seeded from my end? (It'll be the non-multi for speed; forking the multi from it will take some time and a lot of trial and error. I might have a look to see if the multistream build files are sane sizes, and if so they might just need some dd'ing together. :-)

Addendum: Through a bit of terminal monkeying and some HEAD requests just now, it appears that the multistream component files shown on the enwiki-20250901 dump page (all 69 of them) total 23,73 GiB (25.481.773.933 bytes), which feels a very sane size for this month's enwiki MS dump when last month's MS was about 130MB smaller. I think it could be recreated by just downloading all of the archive and index file components, then dd'ing them together with the following commands:

dd if=./enwiki-20250901-pages-articles-multistreamX.xml-[//Component//].bz2 of=./enwiki-20250901-pages-articles-multistream-rebuilt.xml.bz2 iflag=fullblock oflag=append conv=notrunc

dd if=./enwiki-20250901-pages-articles-multistream-indexX.txt-[//Component//].bz2 of=./enwiki-20250901-pages-articles-multistream-rebuilt-index.txt.bz2 iflag=fullblock oflag=append conv=notrunc
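For what it's worth, a loop over the components would save repeating the dd invocation per file; a sketch with dummy files standing in for the real components (for straight concatenation like this, the effect is the same as cat'ing the parts in order):

```shell
# Reassemble an archive by appending component files in order.
# Dummy parts stand in for the real enwiki-20250901-...-multistreamN.xml-pXpY.bz2 files.
out=multistream-rebuilt.bin
rm -f "$out"
printf 'component-1' > part1
printf 'component-2' > part2
for part in part1 part2; do
    dd if="$part" of="$out" iflag=fullblock oflag=append conv=notrunc status=none
done
wc -c < "$out"    # 22 bytes: the sum of the two parts
```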

I can do this if that's easier, but it'll need a few days; I have over 40GiB of data still buffered for uploading to Archive.org before I can pull the components (which'll use the disk space freed up by the non-English wikis), and with there being many mirrors for the 2025-08-20 dump I'd like to get the other languages uploaded and shared first. ^_^

ᛒᚱᛟᚴᛂᚾ ᚢᛁᚴᛁᚾᚷ (Broken Viking)

Hi and thanks @OloffTheMeta and @Broken-Viking for reporting this issue and for providing such a lot of details about your findings. We will look into this issue now.

From some looking around Phabricator it looks like a new framework/mechanism is being brought in for Wikipedia dumps, but it seems to be proving troublesome (see T402626) so we might want to keep hold of the known good dumps we have for the moment just in case dumps cease to be available, period.

I can supply some context around this and hopefully allay some of your concerns. You're right that there have been some significant changes to the way dumps work in the last nine months.
We have migrated the dumps processes from bare-metal servers to Kubernetes in T352650: WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes.
This has brought about many benefits to the way that the dumps are managed and monitored because we have switched to using Airflow as the scheduler component.
Screenshot of Airflow managing dumps, just for reference:

image.png (949×1 px, 184 KB)

But you're right, there have been a few teething troubles since the new system went live on July 1st (T397848).

I can assure you, however, that WMF is still very much committed to creating and publishing dumps, so they are not going to disappear from our servers.
That said, we are also very grateful to you for any and all additional torrenting and uploading to archive.org for these files. From our perspective, the more backups and mirrors, the better.

What's worrying me about this bug is what it might be doing to WMF's data export bills

That's good of you to flag the bandwidth concerns for us. In this case, I don't think that the increased file sizes will cause a particular concern to our traffic team, as the dumps file downloads are still a relatively small proportion of our overall network egress traffic from this data centre.
Also, our architecture choices mean that we do not pay a metered charge for bandwidth use.

I think that the more concerning aspects of your report are those around data quality, e.g.

The MD5 and SHA-1 checksum files are completely the wrong sizes for an enwiki dump.

This is pretty serious, so we need to investigate whether there is a more systemic issue around how our Airflow pipelines generate the checksum files, as well as the multistream file generation process.
If necessary, we will define this as a Data Incident and write it up accordingly.
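For reference, the downstream symptom of broken checksum files is that routine verification fails or can't be run at all; the usual check is along these lines (file names here are illustrative stand-ins):

```shell
# Verify a downloaded dump against its published checksum list.
printf 'dump contents\n' > sample-dump.bz2    # stand-in for the real dump file
sha1sum sample-dump.bz2 > sample.sha1         # stand-in for the published sha1sums file
sha1sum -c sample.sha1                        # prints "sample-dump.bz2: OK"
```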

Finally, you might also want to know about some future improvements to the dumps processes, which are not far away. You can read about this here: T366752: Dumps 2.0 Phase III: Production level dumps (SDS 1.2)

Effectively, these two parts of the dumps will be updated:

  • pages-meta-current
  • pages-meta-history

When this change goes into production, these two sets of dump files will be generated by a modern distributed data pipeline using Spark and Iceberg, rather than by legacy MediaWiki maintenance scripts in PHP.
This will be faster and more reliable, but we are currently working on how to integrate this with the current dumps UI here: {T400507}.

I hope that's of some help and/or interest. I will start looking now at why these multistream dumps and the checksums seem to be so broken and whether it affects any other wikis.

BTullis triaged this task as High priority.Sep 12 2025, 1:03 PM

Initial inspection suggests that this might be related to the fact that the dump_articlesmultistreamdumprecombine_full job ran four times.

image.png (936×1 px, 223 KB)

I checked the sizes of the constituent files that are concatenated to produce the pages-articles-multistream.xml.bz2 and pages-articles-multistream-index.txt.bz2 files.

btullis@clouddumps1002:/srv/dumps/xmldatadumps/public/enwiki$ find ./20250801 -type f -name '*multistream-index[0-9]*.bz2' -exec du -ch {} + | grep total
261M	total
btullis@clouddumps1002:/srv/dumps/xmldatadumps/public/enwiki$ find ./20250801 -type f -name '*multistream-index[0-9]*.bz2' -exec du -ch {} + | wc -l
70

btullis@clouddumps1002:/srv/dumps/xmldatadumps/public/enwiki$ find ./20250901 -type f -name '*multistream-index[0-9]*.bz2' -exec du -ch {} + | grep total
262M	total
btullis@clouddumps1002:/srv/dumps/xmldatadumps/public/enwiki$ find ./20250901 -type f -name '*multistream-index[0-9]*.bz2' -exec du -ch {} + | wc -l
70

btullis@clouddumps1002:/srv/dumps/xmldatadumps/public/enwiki$ find ./20250801 -type f -name '*multistream[0-9]*.bz2' -exec du -ch {} + | grep total
24G	total
btullis@clouddumps1002:/srv/dumps/xmldatadumps/public/enwiki$ find ./20250801 -type f -name '*multistream[0-9]*.bz2' -exec du -ch {} + | wc -l
70

btullis@clouddumps1002:/srv/dumps/xmldatadumps/public/enwiki$ find ./20250901 -type f -name '*multistream[0-9]*.bz2' -exec du -ch {} + | wc -l
70
btullis@clouddumps1002:/srv/dumps/xmldatadumps/public/enwiki$ find ./20250901 -type f -name '*multistream[0-9]*.bz2' -exec du -ch {} + | grep total
24G	total

They're all fine.

So I suspect that what's happening is that when the multistreamdumprecombine jobs were running, they crashed out with an OOM error but left a temporary file behind.
Each time the job restarted, lbzip2 started concatenating the constituent files onto the existing temp file, rather than starting again.

So it's not lbzip2 that recombines them; that's just the multi-threaded compression/decompression utility. It turns out that we just use dd to concatenate these files.
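That failure mode is easy to reproduce in miniature, and it also suggests the fix: make the recombine step idempotent by truncating or removing any leftover .inprog file before appending. A hypothetical sketch, not the actual Airflow task code:

```shell
# A recombine step that appends parts to a temp file. If a crashed attempt
# leaves the temp file behind, a naive retry doubles the output (and four
# runs, as seen here, would roughly quadruple it).
tmp=multistream.xml.bz2.inprog
rm -f "$tmp" part1 part2
printf 'AAAA' > part1
printf 'BBBB' > part2

recombine() {
    for part in part1 part2; do
        dd if="$part" of="$tmp" oflag=append conv=notrunc status=none
    done
}

recombine              # first (crashed) attempt: temp file is 8 bytes
recombine              # naive retry appends again: now 16 bytes
wc -c < "$tmp"         # 16

: > "$tmp"             # idempotent retry: truncate the temp file first
recombine
wc -c < "$tmp"         # 8
```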

Here is what the first run of the enwiki.dump_articlesmultistreamdumprecombine_full job did when it got to the heart of it:

[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream1.xml-p1p41242.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=0 count=575 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (29) started...
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 29 with 0
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream1.xml-p1p41242.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=291576064 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (31) started...
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 31 with 0
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream2.xml-p41243p151573.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=391399096 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (32) started...
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 32 with 0
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream3.xml-p151574p311329.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=422453617 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (33) started...
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 33 with 0
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream4.xml-p311330p558391.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=473321304 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (34) started...
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 34 with 0
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream5.xml-p558392p958045.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=507538152 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (35) started...
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 35 with 0
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream6.xml-p958046p1483661.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=542867352 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (36) started...
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 36 with 0
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream7.xml-p1483662p2134111.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=555565373 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (37) started...
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 37 with 0
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream8.xml-p2134112p2936260.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=566995480 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (38) started...
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 38 with 0
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream9.xml-p2936261p4045402.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=612174348 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (39) started...
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 39 with 0
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream10.xml-p4045403p5399366.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=604697144 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (40) started...
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 40 with 0
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream11.xml-p5399367p6899366.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=584143422 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (41) started...
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 41 with 0
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream11.xml-p6899367p7054859.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=55527284 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (42) started...
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 42 with 0
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream12.xml-p7054860p8554859.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=484432593 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (43) started...
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 43 with 0
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream12.xml-p8554860p9172788.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=194858561 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (44) started...
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 44 with 0
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream13.xml-p9172789p10672788.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=394288472 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (45) started...
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 45 with 0
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream13.xml-p10672789p11659682.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=272909994 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (46) started...
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 46 with 0
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream14.xml-p11659683p13159682.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=476604880 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (47) started...
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 47 with 0
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream14.xml-p13159683p14324602.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=324105556 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (48) started...
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 48 with 0
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream15.xml-p14324603p15824602.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=422869565 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (49) started...
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 49 with 0
[2025-09-02, 17:01:33 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream15.xml-p15824603p17324602.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=367824721 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (50) started...
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 50 with 0
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream15.xml-p17324603p17460152.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=33335297 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (51) started...
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 51 with 0
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream16.xml-p17460153p18960152.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=399985796 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (52) started...
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 52 with 0
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream16.xml-p18960153p20460152.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=371647793 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (53) started...
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 53 with 0
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream16.xml-p20460153p20570392.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=27115131 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (54) started...
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 54 with 0
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream17.xml-p20570393p22070392.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=413306195 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (55) started...
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 55 with 0
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream17.xml-p22070393p23570392.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=425745616 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (56) started...
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 56 with 0
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream17.xml-p23570393p23716197.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=47371039 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (57) started...
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 57 with 0
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream18.xml-p23716198p25216197.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=439386652 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (58) started...
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 58 with 0
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream18.xml-p25216198p26716197.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=402738120 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (59) started...
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 59 with 0
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream18.xml-p26716198p27121850.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=103645430 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (60) started...
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 60 with 0
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream19.xml-p27121851p28621850.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=395783380 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (61) started...
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 61 with 0
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream19.xml-p28621851p30121850.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=348261124 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (62) started...
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 62 with 0
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream19.xml-p30121851p31308442.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=328706967 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (63) started...
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 63 with 0
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream20.xml-p31308443p32808442.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=444011684 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (64) started...
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 64 with 0
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream20.xml-p32808443p34308442.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=408883992 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (65) started...
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 65 with 0
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream20.xml-p34308443p35522432.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=312524365 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (66) started...
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 66 with 0
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream21.xml-p35522433p37022432.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=416656185 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (67) started...
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 67 with 0
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream21.xml-p37022433p38522432.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=399829816 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (68) started...
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 68 with 0
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream21.xml-p38522433p39996245.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=408812855 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (69) started...
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 69 with 0
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream22.xml-p39996246p41496245.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=402631223 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (70) started...
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 70 with 0
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream22.xml-p41496246p42996245.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=431749276 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (71) started...
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 71 with 0
[2025-09-02, 17:01:39 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream22.xml-p42996246p44496245.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=418094555 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (72) started...
[2025-09-02, 17:06:14 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] returned from 72 with 0
[2025-09-02, 17:06:14 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] /srv/deployment/dumps/xmldumps-backup/worker: line 193:     9 Killed                  python3 ${pythonargs[@]}
[2025-09-02, 17:06:17 UTC] {pod_manager.py:555} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream22.xml-p44496246p44788941.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=66123856 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (73) started... Dump of wiki enwiki failed.
[2025-09-02, 17:06:17 UTC] {pod_manager.py:582} WARNING - Pod enwiki-sql-xml-enwiki-dump-articlesmultistreamdumprecombine-full-lv7a60q log read interrupted but container mediawiki-dump-sql-xml still running. Logs generated in the last one second might get duplicated.
[2025-09-02, 17:06:18 UTC] {pod_manager.py:536} INFO - [mediawiki-dump-sql-xml] /srv/deployment/dumps/xmldumps-backup/worker: line 193:     9 Killed                  python3 ${pythonargs[@]}
[2025-09-02, 17:06:18 UTC] {pod_manager.py:555} INFO - [mediawiki-dump-sql-xml] command /bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream22.xml-p44496246p44788941.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=575 count=66123856 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k (73) started... Dump of wiki enwiki failed

At the end, it failed. We no longer have the metrics available for this pod, but I suspect that it was an out-of-memory error that did for it.

The task was rescheduled by Airflow and the same command would have been executed again:

/bin/dd if=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream1.xml-p1p41242.bz2 of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog skip=0 count=575 iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k

I believe that the problem arises here due to the use of the oflag=append and conv=notrunc flags to dd. Together, these mean that dd will never truncate an existing output file to zero bytes, but will always append to it.
This causes problems whenever a partially written file is left over from a previous failed attempt to create the same file.
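That behaviour is easy to reproduce in isolation. A minimal sketch with throwaway files (not the real dump paths), showing that a retried dd with these flags appends a second copy instead of rebuilding the output:

```shell
#!/bin/sh
# Demonstrate that oflag=append + conv=notrunc is not idempotent:
# re-running the same dd after a failure appends a duplicate copy.
set -e
cd "$(mktemp -d)"
printf 'chunk-data' > part.bin    # 10-byte stand-in for a dump part
dd if=part.bin of=out.bin oflag=append conv=notrunc bs=256k 2>/dev/null
dd if=part.bin of=out.bin oflag=append conv=notrunc bs=256k 2>/dev/null  # simulated retry
wc -c < out.bin    # prints 20, not 10: the retry doubled the file
```

Repeat the dd a third time and the file grows to 30 bytes, which is consistent with the roughly threefold size of the 20250901 multistream file after repeated restarts.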

I will look to see what options we have to correct this, for 20250901 and in future.

Well, this is not exactly nice. I think that we will have to put something in the code here, to remove any existing file before starting the subsequent dd operations.

This set of dd commands was defined over 10 years ago, so I'm not very keen to mess with it, but I'm not sure that there is any other way.

I can fix the issue for the existing files by:

  • manually removing the concatenated file from the cephfs volume:
www-data@mediawiki-dumps-legacy-toolbox-5fb95c7ff6-5f2vs:/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901$ ls -lh enwiki-20250901-pages-articles-multistream-index.txt.bz2 enwiki-20250901-pages-articles-multistream.xml.bz2
-rw-rw-r-- 1 www-data www-data 262M Sep  2 17:58 enwiki-20250901-pages-articles-multistream-index.txt.bz2
-rw-rw-r-- 1 www-data www-data  75G Sep  2 17:56 enwiki-20250901-pages-articles-multistream.xml.bz2
www-data@mediawiki-dumps-legacy-toolbox-5fb95c7ff6-5f2vs:/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901$
  • then clearing the dump_articlesmultistreamdumprecombine_full and sync_articlesmultistreamdumprecombine_full jobs in Airflow.

That will recreate the files and then update the published copies.

I have renamed the files, so they will still be available for download, should anyone wish to do that.

www-data@mediawiki-dumps-legacy-toolbox-5fb95c7ff6-5f2vs:/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901$ mv enwiki-20250901-pages-articles-multistream-index.txt.bz2 enwiki-20250901-pages-articles-multistream-index.txt.bz2-T403793
www-data@mediawiki-dumps-legacy-toolbox-5fb95c7ff6-5f2vs:/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901$ mv enwiki-20250901-pages-articles-multistream.xml.bz2 enwiki-20250901-pages-articles-multistream.xml.bz2-T403793

Cleared the tasks in Airflow.

image.png (886×1 px, 185 KB)

Run number 5 of enwiki.dump_articlesmultistreamdumprecombine_full is now running, but there is no existing file there to get appended to.

I'll monitor this to make sure that it doesn't get killed. I think that we should be OK for resources, since we merged this last week.

Oh, it looks like it didn't concatenate any files. Maybe it just read the dumpstatus.json file and ascertained that the job had already completed.

I'll try this again on Monday.

Hi @BTullis, and thanks both for the extensive triage and the very useful info on how the dumps are working in the background! There's a lot to unpack here, but I'll work through the triages bit-by-bit - Mostly in posted order, but recognising my habit of writing textwalls I'm going to focus on one possible quick-fix first of all:


So it's not lbzip2 that recombines them, that's just the multi-threaded compression/decompression utility. It turns out that we just use dd to concatenate these files. [snip] I believe that the problem arises here due to the use of the oflag=append and conv=notrunc flags to dd. I believe this means that it will not truncate an existing file to zero bytes, but will always append to it. This will cause issues when a partially written file exists from a previous failed attempt to create this file.

I use dd extensively in my day-to-day life (Everything from creating zero-length marker files to sparse disk images exceeding physical storage by many orders of magnitude) and I can confirm that's exactly what dd will do. Those two flags together are generally essential when dd is being employed to assemble an output file from several smaller input files.
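As a quick aside, both of those uses are one-liners; a sketch with throwaway filenames (GNU dd assumed):

```shell
#!/bin/sh
# Two dd idioms: a zero-length marker file, and a sparse image whose
# apparent size far exceeds the data actually written or allocated.
set -e
cd "$(mktemp -d)"
# Zero-length marker: copy zero blocks from /dev/null.
dd if=/dev/null of=marker bs=1 count=0 2>/dev/null
# Sparse image: with count=0, GNU dd truncates/extends the output file
# to the seek offset, giving a 1 MiB file with next to nothing allocated.
dd if=/dev/zero of=sparse.img bs=1 count=0 seek=1M 2>/dev/null
ls -ls marker sparse.img    # first column shows the (tiny) allocated blocks
```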

Well, this is not exactly nice. I think that we will have to put something in the code here, to remove any existing file before starting the subsequent dd operations. This set of dd commands was defined over 10 years ago, so I'm not very keen to mess with it, but I'm not sure that there is any other way.

I've had a quick look at that and I need to note I have very little experience in Python, but it looks to me a lot like it builds an array of dd commands that are then passed on like (Or as) a shell script. I'll also note that it's had just one revision in the past five years (Before that, the last was 2020-05-14 22:50:23 +0300 20200514T195023Z; About a month into COVID) and I'm worried about an apparent weakness (Coded assumption) I can see at lines 127-130 inclusive.

It might be possible to fix the problem by amending the process slightly, rather than the code. e.g. instead of;

# Job starts:
dd [params] of=/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/enwiki-20250901-pages-articles-multistream.xml.bz2.inprog
...
mv ./enwiki-20250901-pages-articles-multistream.xml.bz2.inprog ./enwiki-20250901-pages-articles-multistream.xml.bz2
# Job ends.

consider;

# Job starts. If the statically-named temp file ./enwiki-pam-xml-bz2.temp already exists, delete it:
if [ -e ./enwiki-pam-xml-bz2.temp ]; then rm ./enwiki-pam-xml-bz2.temp; fi;
dd [params] of=./enwiki-pam-xml-bz2.temp
...
mv ./enwiki-pam-xml-bz2.temp ./enwiki-20250901-pages-articles-multistream.xml.bz2
# Job ends.

Basically; Changing the dump-building scripts to write to a static temp filename, and deleting that file at the very start of the process if it already exists, eliminates this sort of error: the leftover temp file is removed before anything is appended, and the temp file is only moved to its final dated filename at the very end of the process, once the dump has been fully assembled and is known to have been exported correctly.

However, there is an important caveat: Unlike on my systems (Where a failure or induced bug just means 1-2 systems going down, the only inconvenience being to myself, and nobody else noticing) there's always the possibility that some other code/process elsewhere depends on the assumption that this code will continue to work the way it presently does. So prior to switching to a static temp filename, checks should be made to ensure the change won't break something critically important elsewhere; if it did, the impacts could be widespread, far-reaching and high-profile.


Initial inspection suggests that this might be related to the fact that the dump_articlesmultistreamdumprecombine_full job ran [f]our times. So I suspect that what's happening is that when the multistreamdumprecombine jobs were running, they crashed out with an oom error, but left a temporary file. Each time the job restarted, lbzip2 started concatenating the constituent files to the existing temp file, rather than starting again.

I'd wondered if that's what had happened, but not having the resources to download/handle files that large (Plus being worried about WMFs data bills) I wasn't really able to probe for this at my end; I was considering making HTTP RANGE requests for sections close to the assumed size boundary and seeing if an output file had been repeatedly appended to, but not knowing exactly how long the compressed output should have been this would've been a process with a lot more error than trial to it, rather exhaustive, and highly resource-wasteful compared to simply re-running the dump job.

I can assure you, however, that WMF is still very much committed to creating and publishing dumps, so they are not going to disappear from our servers. That said, we are also very grateful to you for any and all additional torrenting and uploading to archive.org for these files. From our perspective, the more backups and mirrors, the better.

No worries! I've long been bothered by the lack of mirroring of non-English Wikipedia content outside of the main mirrors themselves, which is why I've started working on mirroring that content to IA and torrent every few months. I don't have the resources to seed data directly, but with IA's mission aligning closely to WMFs and their service supporting using collection data for web-seeded torrents, I'm effectively employing IA as a substitute for my own torrent seed box (Though I hope to be able to install torrent seeds on a couple of local connections over the next month or two if I can).

Also; I wasn't suggesting that WMF might cease hosting dumps by choice, I meant what if a major technological failure, lasting several months, results in dumps no longer being publicly available from WMF servers for an extended period of time? :-)

What's worrying me about this bug is what it might be doing for WMFs data export bills

That's good of you to flag the bandwidth concerns for us. In this case, I don't think that the increased file sizes will cause a particular concern to our traffic team, as the dumps file downloads are still a relatively small proportion of our overall network egress traffic from this data centre. Also, our architecture choices mean that we do not pay a metered charge for bandwidth use.

That's good to hear, and many thanks for that info! Given WM is probably one of the largest servers of data second only to media-heavy sites like YouTube, you wonder what sort of costs WMF (As a 501(c)(3) nonprofit) incur for data export and the fundraising they have to do to keep those bills paid! As a man who struggles with energy costs to the extent of running his home on just 800Wh/day (And is probably the UKs lowest user of domestic energy) thinking of WMFs bills (Compared to YouTube/Facebook/etc, who can pay those from commercial revenue) scares the living Brexit out of me!

I think that the more concerning aspects of your report are those around data quality. e.g.

The MD5 and SHA-1 checksum files are completely the wrong sizes for an enwiki dump.

This is pretty serious, so we need to investigate whether there is a more systemic issue around how our Airflow pipelines generate the checksum files, as well as the multistream file generation process.

That stood out to me as a red-flag when I noticed the unusually small filesizes, bearing in mind the very first stage of my cloning process is a wget request for the SHA-1 and MD5 hashes directly from the WMF dump server, to validate the integrity of files I'll be pulling from the mirrors. I've uploaded copies of the (truncated/incorrectly produced) hashsums that I was served, just in case that helps with triage. I'll add that individual hash files for each data file are showing in the directory listings on mirror.accum.se and other mirrors - I don't know if those are part of the WMF dataset that's provided for mirrors, or an outcome of software running at the mirror end.

Original comment above / Post-restart addendums below:

This set of dd commands was defined over 10 years ago, so I'm not very keen to mess with it, but I'm not sure that there is any other way.

I want to return to this as it highlights a potentially severe problem with technological legacies and the unexpected loss of essential systems knowledge. Noticing that the last regularly maintained revision of that code was just after COVID broke out (And the sole revision after that seems to have been a quick-fix to meet the new dumps framework) I worry that the person(s) who maintained that code, knew how it worked and what it did, might have been lost to the pandemic. If so, this means WMFs dump process sits heavily on code that was built by people who are no longer with us (RIP), and knowledge essential to its maintenance might have been lost alongside them.
The core foundations of technology were built by people of the generation before mine. My generation has built on that, close enough to the foundations that we could rebuild them if necessary. The generations after mine are building on top of our code in such a way that, if our bricks or the ones below start to crumble, the entire building risks falling down because of the dwindling knowledge of the first and second generations of code that everything else was built on.

Speaking as a person with severe ASD who can understand exactly how his code works but really struggles to communicate those understandings to others in such a way they could continue to maintain my code after I've gone, the likelihood that there's a lot of undocumented essential systems knowledge and understanding that's critical to maintaining WMFs services can't be disregarded.
As an example of a defence against this problem; Linux Mint maintains a Debian-based fork of its distro to cover for any circumstance where the main Ubuntu base becomes unusable or heads in a direction (e.g: Commercially-driven ensh*tification) that LM isn't happy with.

I can fix the issue for the existing files by:

  • manually removing the concatenated file from the cephfs volume:
  • then clearing the dump_articlesmultistreamdumprecombine_full and sync_articlesmultistreamdumprecombine_full jobs in Airflow.

That will recreate the files and then update the published copies.

Run number 5 of enwiki.dump_articlesmultistreamdumprecombine_full is now running, but there is no existing file there to get appended to.

I'll monitor this to make sure that it doesn't get killed. I think that we should be OK for resources, since we merged this last week.

Oh, it looks like it didn't concatenate any files. Maybe it just read the dumpstatus.json file and ascertained that the job had already completed. I'll try this again on Monday.

I have a listing of the .bz2 parts here and a copy of the log file output you provided in T403793#11176752 from the previous dump attempts. Assuming you have access and the necessary permissions (drwxrwxr--) to save and run a script interactively from inside that folder, I'll try to cook-up a BaSh script that should be able to re-build the enwiki-20250901-pages-articles.xml.bz2 using similar dd commands to those produced by the Python script.

If so, I'll attach that to a new comment here. ^_^
ᛒᚱᛟᚴᛂᚾ ᚢᛁᚴᛁᚾᚷ (Broken Viking)

Following-up from earlier - And with huge thanks to @BTullis for the logfile output; Which has given me enough information to understand how the pages-articles-multistream dumps are being dd-assembled from the component files - I've composed a script that should be able to reassemble enwiki-20250901-pages-articles-multistream.xml.bz2 from the components we have. I've attached it to this reply, with the advisory that you must read the accompanying ReadMe-First.txt for essential preparation steps before the script will run. (In addition; As it doesn't need elevated permissions, it will refuse to operate under a root account.)


I was also going to do this for the index files too, but to be honest staring at a text editor for most of Sunday afternoon has turned my brain completely to mush. On the assumption the index file was/is also broken, I'll have to come back to this as soon as my brain has had a break and can handle another BaShIng session. (-:

Multistream dump assembly:

In a nutshell; The way that a single ??wiki-20250901-pages-articles-multistream.xml.bz2 dump is assembled from the individual ./??wiki-20250901-pages-articles-multistreamXX.xml-pXpY.bz2 dump files is as follows:

  1. The first file - Except for the last 51 bytes - Is copied to the output file.
  2. The second to n-1 files - Except for the first 575 and last 51 bytes of each file - Are appended to the output file in numbered order.
  3. The last file - Except for the first 575 bytes - Is appended to the output, completing the file.
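My reading of those three steps as a runnable sketch, using stand-in part files of known sizes (the real parts are bz2 chunks; the 575/51-byte trims are as described above, and the guard rm -f is the step the original process is missing):

```shell
#!/bin/sh
# Recombine loop: keep the first part's 575-byte header, drop it from
# every later part, and drop the 51-byte tail of all but the last part.
set -e
cd "$(mktemp -d)"
head -c 1000 /dev/zero > part1.bin    # throwaway stand-ins for the
head -c  900 /dev/zero > part2.bin    # multistreamNN.xml-pXpY.bz2 parts
head -c  800 /dev/zero > part3.bin
HEAD=575; TAIL=51; OUT=recombined.bin
rm -f "$OUT"    # guard: never append to a leftover file from a failed run
set -- part1.bin part2.bin part3.bin
total=$#; i=0
for f in "$@"; do
  i=$((i + 1))
  size=$(wc -c < "$f")
  if [ "$i" -eq 1 ]; then skip=0; else skip=$HEAD; fi
  if [ "$i" -eq "$total" ]; then count=$((size - skip))
  else count=$((size - skip - TAIL)); fi
  dd if="$f" of="$OUT" skip="$skip" count="$count" \
     iflag=skip_bytes,count_bytes oflag=append conv=notrunc bs=256k 2>/dev/null
done
wc -c < "$OUT"    # prints 1448: (1000-51) + (900-575-51) + (800-575)
```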

Based on the original Wikimedia generated code, some changes have been made in the dd commands that I've employed in the script:

  • Parameter ordering has been changed (In particular; count= and if= moved to end of each line) to make manual compilation of this script easier.
  • The fullblock flag has been added to the iflag= parameter to instruct dd to read a full block of data from input before passing this to the output.
  • Data block size has been quadrupled to 1MiB (1048576 bytes) with the aim of improving speed, efficiency, and reducing memory related exceptions.

While composing the attached script/materials I noticed that the original process failure happened while processing the 42nd input file enwiki-20250901-pages-articles-multistream22.xml-p42996246p44496245.bz2, with a message from the pod management/worker script (line 193) showing the python3 process being Killed (the 9 in that message is the process ID, not a count of nine killed tasks). After this (presumably with that last instance of dd among the casualties) the dumps process continues to process enwiki-20250901-pages-articles-multistream22.xml-p44496246p44788941.bz2, the dump fails and the same Killed message appears again, but after this another message shows enwiki-20250901-pages-articles-multistream22.xml-p44496246p44788941.bz2 being run a second time before the instance fails completely.

I know today's computing is multithread-everything, but given the current dumps process was developed in much better times - When the UK was still an EU Member State and single-thread processing was a thing - might constraining the instance to single-threaded working, to deliver better reliability at the expense of speed, be worth considering? :-)

Anyhow, I'm totally cream-crackered after that lot. Hoping the script attached does what we would like it to do... :-)
ᛒᚱᛟᚴᛂᚾ ᚢᛁᚴᛁᚾᚷ (Broken Viking)

An update on the manual dd route: I've pulled the index parts from accum.se (Which have file dates of 2025-09-02 and validate successfully against the regenerated md5/sha1 hashsum files from the WM dump server) and employed the same approach as in the script attached previously. When run the script produces an output file of sane size (261MiB / 273.606.693 bytes), and barring some lightweight semantic errors runs without a problem.

However the output refuses to validate in BZip2, which gives the following error:
./UEMP-enwiki-20250901-pages-articles-index-rebuild.bz2: data integrity (CRC) error in data
Although if bzip2recover is run over the output file it produces about 2000 files containing recovered bzip2 streams, which is probably about right for the enwiki dump. None of those files will open in File Roller (Which gives a non-specific error) and so the output is - For all intents and purposes - About as much use as a Class 153 in the rush hour for the time being.

@BTullis might it be possible to have some log output from the end of a successful dump (enwiki 2025-08-20 would probably be good), please? Seeing a log of the end of the process will help me fill in the gaps. (I'll also have a closer look at the .py you linked above when I can, but my brain is more mashed than the spuds on a British dining table atm.)

Personal experience suggests the issue is probably with the tail end of the last file; Which might have a bzip2 integrity checksum that'll only take account of the contents of that last file and not the whole stream. To work around this the correct checksum (Likely a CRC-32) for the whole stream needs to be computed and inserted into the tail, or a new tail generated with the correct checksum in it.
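For what it's worth, my understanding of the bzip2 container (from the format as implemented by bzip2 itself, so treat this as a sketch rather than gospel): each compressed block carries its own CRC-32 in its block header, and the 48-bit end-of-stream magic is followed by a combined stream CRC built by rotate-left-then-XOR over the block CRCs in order; the container is also bit-aligned rather than byte-aligned, which is part of what makes tail surgery fiddly. The combining step looks like this (the two block CRC values below are made-up placeholders):

```shell
#!/bin/sh
# bzip2 folds per-block CRCs into the final stream CRC like this:
# combined = rotl32(combined, 1) XOR block_crc, applied in block order.
set -e
combined=0
for block_crc in 0x1A2B3C4D 0x5E6F7081; do    # hypothetical block CRCs
  rotl=$(( ((combined << 1) | (combined >> 31)) & 0xFFFFFFFF ))
  combined=$(( rotl ^ block_crc ))
done
printf 'stream CRC: 0x%08X\n' "$combined"    # prints "stream CRC: 0x6A39081B"
```

So simply splicing parts together leaves a footer whose stream CRC only reflects the last part's blocks, which would explain the data integrity (CRC) error above.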

I'll try and get around to learning more about bzip2 at the file level rather than working on mere assumptions (Which are much quicker than studying) but spending the whole day on this only for it to be showing signs of qualifying for direct UKCA certification is hurting my confidence and motivation a lot. It's bad enough that being LGBT apparently means it's „OK“ for my focussed efforts to persistently fail while people with extensive criminal histories get instant magic and guaranteed virals from a single mouse click...
ᛒᚱᛟᚴᛂᚾ ᚢᛁᚴᛁᚾᚷ (Broken Viking)

Thanks @Broken-Viking for the fascinating update. I'll try to reply more fully when I have more time, but for now I can let you know that I believe I have fixed the corrupted files.

I manually updated the dumpruninfo.txt files like this:

www-data@mediawiki-dumps-legacy-toolbox-5fb95c7ff6-5f2vs:/$ cd /mnt/dumpsdata/xmldatadumps/public/enwiki/20250901/
www-data@mediawiki-dumps-legacy-toolbox-5fb95c7ff6-5f2vs:/mnt/dumpsdata/xmldatadumps/public/enwiki/20250901$ sed -i 's/articlesmultistreamdumprecombine; status:done/articlesmultistreamdumprecombine; status:failed/' dumpruninfo.txt

I had previously removed the originally concatenated files in T403793#11176983 so I could now re-run the task.

I cleared the task in Airflow and it successfully re-concatenated the file.

image.png (969×1 px, 497 KB)

I then cleared the corresponding sync_articlesmultistreamdumprecombine_full task and the file was then published successfully to the clouddumps servers.

image.png (833×1 px, 141 KB)

As far as I know, the checksums of these files were also recalculated, so should be correct.


I'd also just like to put your mind at rest that the engineer who was primarily responsible for crafting the dumps v1 code over many years is still with the WMF and hasn't been lost to COVID, thankfully.
These days, they are working on other projects.

Your concern about the potential fragility of dumps and the dependence on old code is a very valid point, but I would like to draw your attention once again to the Dumps 2 project: T366752
This project is about replacing this legacy system with a more modern data pipeline, using distributed data systems such as Spark, Hadoop, and Iceberg. WMF has already made great strides here and we are doing our best to avoid any breaking changes in how we make the dumps available.

I'll update the ticket again with more information and I will spin off another ticket about the lack of idempotency in the *recombine tasks, so we can look at retro-fitting the safety measures in the dd operations, as you suggest.

Hi again @BTullis, and thanks for giving the server a budge/getting the dumps re-done! That saves me a 69-part download and assembly stage¹ and gets the dump back up much quicker than I could ever have managed! :-)
(¹ - Easy enough in itself; But given I'm doing 95% of my upload work on a 1st gen Raspberry Pi with 256MB RAM and a limitation to 32GB SDHC cards - Equipment low-powered and cheap enough to be left sat unattended on a friends unlimited cable connection with his awareness and consent - A 25GiB dump rebuild is going to come very close to the ceiling! :-D)

Thanks @Broken-Viking for the fascinating update. I'll try to reply more fully when I have more time, but for now I can let you know that I believe I have fixed the corrupted files. [snip] As far as I know, the checksums of these files were also recalculated, so should be correct.

That's good to hear, too. I pulled the previous checksums last night to validate the index parts I'd downloaded and noticed there wasn't a checksum for enwiki-20250901-pages-articles.xml.bz2 in it (Understandable) which is why I didn't pipe those on to the enwiki clone I've queued to go up to the IA. Given those are now stable I'll probably pipe those up tonight and look at getting the main dump up sometime on Thursday or Friday.
(n.b: enwiki's presently queued behind ptwiki and zhwiki of the same date, and having already pulled ~45GiB off accum.se this month I'm in two minds about pulling another ~25GiB off of them, even outside of academic hours. The other mirrors are either too far away in network terms, or are being very slow in sync'ing to the latest dump versions.)

And that's a point: Once I can get my brain back into a clear enough place to get the Data Dump usage guides finished up and translated into the 16 languages I'm presently mirroring, I need to get on it. Until those guides have been written, translated and rendered into portable formats, I can't progress to generating and publishing the torrents! :-o
(The way BitTorrent works means it's not possible to add files to torrents after they've been generated, as you might know.)

I'd also just like to put your mind at rest that the engineer who was primarily responsible for crafting the dumps v1 code over many years is still with the WMF and hasn't been lost to COVID, thankfully. These days, they are working on other projects.

Thanks for that! Seeing the sudden end to a previously consistent contribution history and knowing what was going on at that particular point in history had me fearing for the worst - Both for their welfare, and the possibility they might be the only person who knew the ins/outs of something that a lot of the Wiki community depend upon. Glad to see and hear that they're safe and well! :-)

Your concern about the potential fragility of dumps and the dependence on old code is a very valid point, but I would like to draw your attention once again to the Dumps 2 project: T366752. This project is about replacing this legacy system with a more modern data pipeline, using distributed data systems such as Spark and Hadoop and Iceberg. WMF has already made great strides here and we are doing our best to avoid any breaking changes in how we make the dumps available.

I have had a look at Dumps/2 but I seem to remember there was so much in the project it overwhelmed me very quickly, while digesting a single ticket like this is much easier for my brain to accommodate (And is why I envy ADHD people so much!). I'm probably going to show a generational element here, but I often struggle with newer/higher-level working ways/environments because I always feel that I and the thing I'm working with are much better in the closest proximity compared to; Working through an AI task builder, that passes its output to a task scheduler running on AWS, which then sends task commands in Language A off to an instance on Azure, that converts them to Language B, sends those outputs off to an IoT digest service for Ring devices, (So back to AWS) just to achieve the goal of turning on the hallway light when my phone is detected in that room!
(Me, personally; I just use my finger to flick the switch provided. :-p)

I think this is shown clearly in that I can happily comprehend starting off with 69 BZip2'd files on a local disk and topping/tailing/concatenating them together by hand using dd with a bit of integrity computation at the end (So long as I know which integrity hash is used and where in the tail it needs to go) but ask me to achieve the same thing through an Airflow scheduler running on Kubernetes and I just don't have a clue where to start.

I'll update the ticket again with more information and I will spin off another ticket about the lack of idempotency in the *recombine tasks, so we can look at retro-fitting the safety measures in the dd operations, as you suggest.

That's good, though please take care to remember that dd's nickname of disk destroyer is very well earned, which is a reason why the script I composed yesterday actively inhibits attempts to run under root. dd is an absolutely fantastic tool for data handling and conversion (I've even used it to digest and store audio streams!) but it's no different to a chainsaw; It'll cleave through fallen trees and firewood like they're not even there, but a chainsaw will just as happily cleave through human flesh too and doesn't know the difference between them. Only Odin knows how many people have destroyed their OS and all of their data by getting a dd command wrong by a single byte...
ᛒᚱᛟᚴᛂᚾ ᚢᛁᚴᛁᚾᚷ (Broken Viking)

Hi again @BTullis, and thanks for giving the server a budge/getting the dumps re-done! That saves me a 69-part download and assembly stage¹ and gets the dump back up much quicker than I could ever have managed! :-)

You're very welcome. Thanks again (and also to @OloffTheMeta) for highlighting the data quality issue.

(¹ - Easy enough in itself, but given I'm doing 95% of my upload work on a 1st-gen Raspberry Pi with 256MB of RAM and a 32GB SDHC card limit (equipment low-powered and cheap enough to be left sitting unattended on a friend's unlimited cable connection, with his awareness and consent), a 25GiB dump rebuild is going to come very close to the ceiling! :-D)

Great stuff! We're all in favour of power-efficient and innovative hardware platform designs, with collaboration at their heart.
Perhaps there is scope for a Diff post, or similar, about how you are helping to archive and share these data files in such a resource-conscious way. I'll ask around on this end, if that's OK with you.

I pulled the previous checksums last night to validate the index parts I'd downloaded and noticed there wasn't a checksum for enwiki-20250901-pages-articles.xml.bz2 in it ...

Is this resolved now, as far as you can tell, or is there still a data quality issue regarding the checksums?
I checked to make sure that this filename is listed in the combined md5 and sha1 files, but I haven't checked to see if any of the individual checksum files are missing.
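That per-filename check against the combined checksum files can be done with a quick grep. A minimal sketch follows; the checksum file content here is a stand-in (not the real hash), and in practice you would grep the downloaded enwiki-20250901-sha1sums.txt directly:

```shell
#!/bin/sh
# Check whether a given dump filename is listed in the combined checksum
# file. The entry below is illustrative, not the real published hash.
set -eu

cat > sha1sums.txt <<'EOF'
da39a3ee5e6b4b0d3255bfef95601890afd80709  enwiki-20250901-pages-articles.xml.bz2
EOF

target=enwiki-20250901-pages-articles.xml.bz2
if grep -qF "$target" sha1sums.txt; then
    echo "listed: $target"
else
    echo "MISSING: $target"
fi
```

When the file is present locally too, `sha1sum -c sha1sums.txt` goes one step further and verifies the content, not just the listing.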

image.png (192×837 px, 102 KB)

And that's a point: once I can get my brain back into a clear enough place to finish the Data Dump usage guides and translate them into the 16 languages I'm presently mirroring, I need to get on with it.

Again, great work, thanks! Perhaps this is a comment that would be better made on your article's talk page, but I might suggest that you highlight at an early stage specifically to which element of the dump (i.e. $wiki-$YYYYMMDD-pages-articles-multistream.xml.bz2) you are referring. You have said on that page:

This dump of data from Wikipedia contains all of the article content from the main namespace as it stood on the date the dump was taken.

...but you haven't indicated which of the dump files you mean. There are other files within each monthly dump that do contain user pages, edit histories, talk pages etc. so maybe it would be helpful to signpost any visitors to your page towards these or some of the other user documentation regarding the different dump types. e.g. https://meta.wikimedia.org/wiki/Data_dumps.

I have had a look at Dumps/2 but I seem to remember there was so much in the project it overwhelmed me very quickly...

I agree with you that it would be a daunting prospect to try to learn about the entirety of the way that Dumps 2 works. What I was trying to do by mentioning this is to reassure you that we are actively working to modernise the way that dumps generation works, for everyone's benefit. The new system is based on our Event Platform and uses event-driven architecture and so can scale much more efficiently than the legacy system.

To illustrate this; we used to depend on importing the dumps v1 into our Data Lake every month to support the work of data analysts etc. However, we now maintain an Iceberg table (Mediawiki_content_history_v1) containing the same information, which is updated on a daily cadence. It is this table that will be used to generate the dumps for publication.
This solution is much more scalable and manageable than using a MediaWiki maintenance script and PHP, even though that MediaWiki instance now runs as an Airflow task under Kubernetes.

However, we have only managed to include a subset of the dump types in the current scope of work on dumps2, so the file types that will soon be created this way are:

  • $wiki-$YYYYMMDD-pages-meta-current.xml.bz2
  • $wiki-$YYYYMMDD-pages-meta-history.xml.bz2

You'll be able to see from T400507#11060471 that we are working hard on trying to find non-breaking options for a smooth integration and transition from the legacy to the more modern system.

You'll also notice that the parts of the dumps that you work with, i.e.

  • $wiki-$YYYYMMDD-pages-articles-multistream.xml.bz2

... are not yet included in this phase of the dumps 2 migration work. I'm sure that they will be migrated before long, and when they are, they will likely be sourced from this Iceberg table: Mediawiki_content_current_v1.

Until then, we will have to maintain the dumps 1 code and its use of dd to concatenate the multistream bzip2 files. As such, I am particularly grateful to you for your investigation work in T403793#11179344 and your suggestions in T403793#11178958.

Only Odin knows how many people have destroyed their OS and all of their data by getting a dd command wrong by a single byte...

I am quite certain that I am on that list, as I also have a long and chequered history with using dd, over the decades. :-)

As the original data corruption instance that was reported is now fixed, I'll resolve this issue, but I will create a follow-up ticket to address the non-idempotency of the *recombine jobs and reference this one.