some lbzip2-based XML dumps cannot be processed by 7zip on Windows
Open, High, Public

Description

I download each new ruwiki XML dump (from https://dumps.wikimedia.org/ruwiki/) for use in the AWB database scanner. But the latest dump archives:

are broken; they can't be unpacked with 7zip, see screenshot

(text means "data error"). Earlier dumps were unpacked successfully.

Restricted Application added a subscriber: Base. · Sat, Nov 3, 2:18 PM
MaxBioHazard triaged this task as High priority. · Sat, Nov 3, 2:19 PM

I have had the same problem, starting with https://dumps.wikimedia.org/dewiki/20181020/dewiki-20181020-pages-articles.xml.bz2, so I asked the developer of 7zip about it. He says that the bzip2 decompressor has a bug and that the lbzip2 compressor makes use of it, so he suggests not using lbzip2 for compression. I do not know which tool Wikipedia uses for compression, but it might be related. See the discussion: https://sourceforge.net/p/sevenzip/bugs/2163/

@ArielGlenn can you say something about this problem?

Schnark added a subscriber: Schnark. · Mon, Nov 5, 9:46 AM

I was able to uncompress https://dumps.wikimedia.org/dewiki/20181020/dewiki-20181020-pages-articles.xml.bz2 successfully, using a very old version of bzip2.

Schnark, please try to unpack one of the archives from my initial message. If that succeeds and your bzip2 version runs on Windows, please send it to me.

So this task basically says "Do not use lbzip2 to create dump files due to https://sourceforge.net/p/sevenzip/bugs/1626/ "?

No error output running bzip2 -dk ruwiki-20181020-pages-articles.xml.bz2 with version 1.0.6-28 on Linux so I guess the file was extracted properly?

It would be nice if someone could tell us which tool Wikipedia uses to compress the files and whether anything changed in dump generation between 20181001 and 20181020. And yes, 7zip on Windows has problems uncompressing some files; bzip2 on Linux has no problems. But I haven't tested ALL dumps, and I have never had problems uncompressing dewiktionary-YYYYMMDD-pages-articles.xml.bz2 with 7zip on Windows. So it seems that size matters.

ArielGlenn moved this task from Backlog to Active on the Dumps-Generation board. · Tue, Nov 6, 12:58 PM

For dumps where we recombine a collection of bz2 files into one (the so-called 'big wikis' along with enwiki and wikidatawiki), we now use lbzip2 for compression, because it can use multiple threads for a significant speedup when reading from a single input stream. This was enabled just before the Oct 20th run and is described here: T179059

It looks to me like the issue here is the dummy trees handling, documented both here: https://sourceforge.net/p/sevenzip/bugs/1626/#3f77 (dummy Huffman trees) and here: https://github.com/kjn/lbzip2/issues/17
It seems that most decoders don't do this (extra?) check that 7zip does in the case of dummy trees. PHP's version of bzip2 compression works fine with these files, as do various versions of bzip2.

While we wait to see whether the 7zip dev or the lbzip2 dev will patch first, I recommend using bzip2 for Windows, a command line utility. You can pick that up here: https://github.com/philr/bzip2-windows/releases and it has a link for the one dependency.

I ran a test using bzip2 for Windows (the 64-bit version). It seems that it cannot handle files > 4GB.

bzip2 -dk "dewiki-20181101-pages-articles.xml.bz2"
bzip2: Input file dewiki-20181101-pages-articles.xml.bz2 is not a normal file.

What about this to avoid further user confusion:

  1. Change file extension from .bz2 to .lbzip2 for the files where lbzip2 is used.
  2. Add a notice on the download page that some decompressors have problems with these files.

The error message about it not being a normal file is caused by stat deciding that the file is not a 'regular file', i.e. it is a symbolic link or a directory or some other thing. Can you double check what you were running and the filename (and that it's a file)?
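The same kind of check can be reproduced outside bzip2 to see what the operating system reports for the path. The following is a minimal Python sketch; `is_regular_file` is a hypothetical helper name, not part of bzip2, and Python's `os.stat` may not behave identically to the stat call inside a particular bzip2 Windows build, so a `True` result here does not prove that bzip2 will agree.

```python
import os
import stat


def is_regular_file(path):
    """Return True if stat reports `path` as a regular file.

    Hypothetical helper for illustration only. os.stat follows
    symbolic links, so a link to a regular file also returns True.
    """
    try:
        mode = os.stat(path).st_mode
    except OSError:
        # stat itself failed (missing file, permission problem, ...).
        return False
    return stat.S_ISREG(mode)
```

Calling `is_regular_file("dewiki-20181101-pages-articles.xml.bz2")` in the download directory shows whether the file is seen as a regular file from Python's point of view.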

It is a normal file on a Windows 7 64-bit NTFS filesystem. But I used a trick to avoid bzip2 having to open the file itself: if bzip2 reads from stdin, it works.

bzip2 -d < dewiki-20181101-pages-articles.xml.bz2 > out.xml
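For anyone who prefers a script to shell redirection, the same streaming approach can be sketched in Python with the standard `bz2` module. This is an untested assumption as far as the affected dumps go: the module wraps the reference bzip2 library, which this thread reports as reading these files, but it has not been checked here against an lbzip2-produced dump. `decompress_bz2` and its parameters are hypothetical names chosen for illustration.

```python
import bz2
import shutil


def decompress_bz2(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream-decompress a .bz2 file to dst_path in fixed-size chunks.

    Hypothetical helper for illustration only. The data is copied
    chunk by chunk, so the whole dump is never held in memory.
    """
    # bz2.open returns a file object that decompresses on read and
    # also handles files made of several concatenated bzip2 streams.
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
```

A call such as `decompress_bz2("dewiki-20181101-pages-articles.xml.bz2", "out.xml")` would mirror the command above.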

Great! Thanks for that. I'll add a note on the dumps page about the decoder issue.

Aklapper renamed this task from Last ruwiki XML dump archives are broken to lbzip2-based XML dumps cannot be opened by 7zip or bzip2 on Windows. · Wed, Nov 7, 8:35 AM

Change 472107 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] Add a note about 7zip issues and bug reporting for dumps

https://gerrit.wikimedia.org/r/472107

ArielGlenn renamed this task from lbzip2-based XML dumps cannot be opened by 7zip or bzip2 on Windows to some lbzip2-based XML dumps cannot be processed by 7zip on Windows. · Wed, Nov 7, 9:24 AM

I updated the title because 7zip can open them; it breaks in the middle of decoding, on files with dummy trees.

As for bzip2, I'm not sure whether there's a 64-bit build around for that, or a build with the right open call.

Change 472107 merged by ArielGlenn:
[operations/puppet@production] Add a note about 7zip issues and bug reporting for dumps

https://gerrit.wikimedia.org/r/472107

I'll leave this open until either the 7zip or the lbzip2 developer patches their code, but I don't expect to take any more action on it until then.