
enwiki-20151201-pages-articles-multistream.xml.bz2 and others are corrupt
Closed, Resolved · Public

Description

From OTRS ticket 2015120510001671

Hi,

I first apologize for contacting you, since I know you might have nothing to do with the Wikipedia dumps; I just did not find any email address associated with the dumps. So I would appreciate it if you could forward my email to the right department.
I downloaded the most recent Wikipedia dump (enwiki-20151201-pages-articles-multistream.xml.bz2, 12.4 GB) both yesterday and today. After the download completed I could not uncompress the file; it gave me an error. I went to check the checksum, but unfortunately you have not put the checksum (for this specific file) on your website as of now. So I was wondering, could you please double-check that the compressed file is not corrupted?

Thank you,
Amir

Will direct them here

Event Timeline

Reedy raised the priority of this task to Needs Triage.
Reedy updated the task description. (Show Details)
Reedy subscribed.
Reedy set Security to None.
ArielGlenn triaged this task as Medium priority. Dec 15 2015, 8:38 PM

I'll be having a look at this tomorrow. Just out of curiosity (though I doubt it's related), what version of bzip2 are you running, and on what OS (linux, macos, that other one)?

$ uname -a
Linux thutmose 4.2.5-1-ARCH #1 SMP PREEMPT Tue Oct 27 08:13:28 CET 2015 x86_64 GNU/Linux
$ bzip2 --version
bzip2, a block-sorting file compressor. Version 1.0.6, 6-Sept-2010.

These are the checksums that I got:

$ md5sum enwiki-20151201-pages-articles-multistream.xml.bz2

68a318c033b3e4e406191bd5343480b5 enwiki-20151201-pages-articles-multistream.xml.bz2

$ sha1sum enwiki-20151201-pages-articles-multistream.xml.bz2

34cd7e3e5bb9b869d21f198c12fd0ce50288a51b enwiki-20151201-pages-articles-multistream.xml.bz2

$ bzip2 -dk enwiki-20151201-pages-articles-multistream.xml.bz2

bzip2: Data integrity error when decompressing.

Input file = enwiki-20151201-pages-articles-multistream.xml.bz2, output file = enwiki-20151201-pages-articles-multistream.xml

It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.

You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.

bzip2: Deleting output file enwiki-20151201-pages-articles-multistream.xml, if it exists.

Looks like I can confirm that such an issue exists. Is there a reason the checksums were not available, or are they skipped when producing the dump?

Indeed, I see the bzip2 error on the December dump but not on the November one. Nothing in the dump run logs indicates an issue with the file's creation. The easiest thing is to just rerun the job, since it takes only about a day. In the meantime I'll inspect the current file and see what can be discerned from it, if anything.

The md5sum should be present for any completed job. That it's not present for this file may be a clue also.

I've started the rerun although you can't tell yet on the dump web page. You should be able to follow its progress later today.

Looking at the old file: though it can only be read up to a certain point, the end of the file is fine, even with a closing </mediawiki> tag. This shows that the job completed but that there was (apparently) corruption in the middle. I'll see about isolating the section so I can at least try looking at the bad data, though I'm not sure whether we'll learn much from that.

I have tracked it down to a corrupt block in the middle. What caused the corruption? No idea. And the checks we might do on the output file would not have caught it, because at most we check that the end of the file is ok; uncompressing the entire file for every file we create is pretty expensive time-wise.

The corruption means lost bits, which means that the offsets for all following blocks are wrong in the index file. When I add a correcting offset of 4618 bytes (found by scanning the compressed file by hand from the bad block onward, looking for the stream-start signature string BZh91AY&SY) to the index entries for all following blocks, they decompress correctly.
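
For reference, here is a rough sketch (my own, not part of the dumps code) of doing that scan programmatically: it prints the byte offset of every byte-aligned BZh91AY&SY signature, i.e. every stream start, in a multistream file. Comparing those offsets with the ones recorded in the index makes a constant drift like the 4618 bytes above easy to spot. The buffer size and argument handling are arbitrary.

#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    /* 10-byte signature at the start of each bz2 stream in the multistream
       dump: "BZh9" (stream header, block size 9) + "1AY&SY" (block magic). */
    static const char sig[] = "BZh91AY&SY";
    const size_t siglen = sizeof(sig) - 1;
    unsigned char buf[1 << 20];
    size_t keep = 0;
    unsigned long long base = 0;
    FILE *f;

    if (argc != 2) { fprintf(stderr, "usage: %s file.bz2\n", argv[0]); return 1; }
    f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    for (;;) {
        size_t got = fread(buf + keep, 1, sizeof(buf) - keep, f);
        size_t len = keep + got;
        if (len < siglen)
            break;
        for (size_t i = 0; i + siglen <= len; i++)
            if (memcmp(buf + i, sig, siglen) == 0)
                printf("%llu\n", base + (unsigned long long)i);  /* byte offset of a stream start */
        /* carry a signature-length overlap so a match spanning two reads isn't missed */
        keep = siglen - 1;
        memmove(buf, buf + len - keep, keep);
        base += len - keep;
    }
    fclose(f);
    return 0;
}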

The new output file is ready; please verify that it works for you.

I imagine it's fine, but "bzip2 -tvv" will tell you for sure...

(I'm now using the last month's dump, which is good, and I don't have the time or disk space to verify the new one - sorry!)

Well that's interesting. The new file has exactly the same issue in exactly the same place. Time to look at the code.

Nemo_bis renamed this task from enwiki-20151201-pages-articles-multistream.xml.bz2 possibly corrupt? to enwiki-20151201-pages-articles-multistream.xml.bz2 and others are corrupt. Jan 11 2016, 7:54 AM
Nemo_bis added a subscriber: Halfak.

I don't know how to unmerge T122682, but I've reopened it; it's a separate issue.

I have managed to reproduce the compression issue on a much smaller file (using the 100 pages that would be contained in the corrupted block that we saw in the en wp multistream file). On to debugging.

Any alteration of the data to be compressed (besides removal of the siteinfo and mediawiki tags) results in uncorrupted compressed output, so it's something about this exact byte sequence. After stripping the siteinfo and mediawiki tags, when I run bzip2recover on the compressed output it claims to find a second block containing just the trailing part of the page close tag. That block uncompresses correctly. But when I look at the file with od -c I don't see the bz2 stream start sequence in there; maybe it's been bit-shifted, ugh. In any case, only one bz2 stream is actually written by the recompressxml program.
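
In case it helps anyone following along, a bit-level search along these lines (roughly what bzip2recover does internally; the sketch is mine, not its code) reports the 48-bit block and end-of-stream magics at any bit alignment, which is how you can tell whether a header is present but shifted off a byte boundary rather than missing:

#include <stdio.h>
#include <stdint.h>

int main(int argc, char **argv)
{
    /* 48-bit magics from the bzip2 format: every compressed block starts with
       0x314159265359 ("1AY&SY") and every stream ends with 0x177245385090. */
    const uint64_t BLOCK_MAGIC = 0x314159265359ULL;
    const uint64_t EOS_MAGIC   = 0x177245385090ULL;
    const uint64_t MASK        = (1ULL << 48) - 1;
    uint64_t window = 0;
    unsigned long long nbits = 0;
    int c;
    FILE *f;

    if (argc != 2) { fprintf(stderr, "usage: %s file.bz2\n", argv[0]); return 1; }
    f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    while ((c = getc(f)) != EOF) {
        for (int bit = 7; bit >= 0; bit--) {
            /* shift the file, one bit at a time, through a 48-bit sliding window */
            window = ((window << 1) | (uint64_t)((c >> bit) & 1)) & MASK;
            nbits++;
            if (nbits < 48)
                continue;
            if (window == BLOCK_MAGIC || window == EOS_MAGIC)
                printf("%s magic at bit %llu (byte %llu + %llu bits)\n",
                       window == BLOCK_MAGIC ? "block" : "end-of-stream",
                       nbits - 48, (nbits - 48) / 8, (nbits - 48) % 8);
        }
    }
    fclose(f);
    return 0;
}

A hit at a nonzero bit remainder means the header is there but bit-shifted.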

Change 264337 had a related patch set uploaded (by ArielGlenn):
dumps multi stream bz2 files: fix corruption in writing end of BZ2 stream

https://gerrit.wikimedia.org/r/264337

Found it! I forgot to reset the output stream's start and available markers on the first pass through in BZ_FINISH mode (when the bz2 stream is ended by the library). I'll test the change as soon as there's some CPU available on one of the snapshot hosts (they're in the middle of the monthly run now).
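
For the record, the correct way to drain the end of a stream through libbz2's low-level API looks roughly like the sketch below; this is not the actual recompressxml code, and outbuf and the error handling are just illustrative. The bug amounted to skipping the next_out/avail_out reset before the first BZ2_bzCompress(..., BZ_FINISH) call.

#include <stdio.h>
#include <bzlib.h>

/* Flush the end of a bz2 stream. The caller is assumed to have initialised
   strm with BZ2_bzCompressInit, fed it all input via BZ_RUN, and to call
   BZ2_bzCompressEnd afterwards. Returns 0 on success, -1 on error. */
static int finish_bz2_stream(bz_stream *strm, FILE *outf)
{
    char outbuf[65536];
    int ret;

    strm->next_in  = NULL;   /* no further input in BZ_FINISH mode */
    strm->avail_in = 0;

    do {
        /* Point the output markers at a fresh buffer on EVERY pass,
           including the very first one (the step that was missing). */
        strm->next_out  = outbuf;
        strm->avail_out = sizeof(outbuf);

        ret = BZ2_bzCompress(strm, BZ_FINISH);
        if (ret != BZ_FINISH_OK && ret != BZ_STREAM_END)
            return -1;   /* library error */

        size_t produced = sizeof(outbuf) - strm->avail_out;
        if (produced > 0 && fwrite(outbuf, 1, produced, outf) != produced)
            return -1;   /* short write */
    } while (ret != BZ_STREAM_END);

    return 0;
}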

I should add that I tested this fix on my standalone file of the 100 problematic pages and got good output.

Actually since this is a small standalone program I am testing it on ms1001.wikimedia.org right now where it won't impact dumps generation.

Test passed; the new file was generated corruption-free, though it's not yet available for download. I'll build packages with the fix shortly.

Change 264337 merged by ArielGlenn:
dumps multi stream bz2 files: fix corruption in writing end of BZ2 stream

https://gerrit.wikimedia.org/r/264337

Packages built for trusty (the only OS we have in use just now on the snapshot hosts), added to the repo, and deployed on all snapshot hosts.