Investigate use of bz2 decompression tools on multistream files
Open, High, Public

Description

I'd like to be able to write some of the other bz2 combined files the cheap way, by just concatenating smaller files together, and generating separate header/footer files like we do for some gz files. This will save us a LOT of time during dumps generation. Before we go this route, I want to make sure that we're not disenfranchising a bunch of dumps users; each platform ought to have at least one working program that will decompress these 'multiple stream' files properly.
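
To make the "cheap way" concrete, here is a minimal sketch in Python 3, assuming each smaller file is a complete bz2 stream of its own. The `parts` payloads are made-up stand-ins, not real dump content. Joining the compressed bytes produces a 'multiple stream' file, and a reader that handles concatenated streams gets back the joined original content.

```python
import bz2

# Hypothetical stand-ins for the contents of the smaller per-part files.
parts = [b"<page>one</page>\n", b"<page>two</page>\n", b"<page>three</page>\n"]

# Compress each part independently, then join the compressed bytes.
# Nothing is recompressed: the result is one complete bz2 stream per part.
multistream = b"".join(bz2.compress(part) for part in parts)

# bz2.decompress() in Python 3 (3.3 and later) reads through all
# concatenated streams, so the joined original content comes back.
restored = bz2.decompress(multistream)

print(restored == b"".join(parts))
```

A decompressor that stops after the first stream would return only the first part here, which is the failure mode this task is trying to rule out.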

Here are the utils that need to be checked, some of which I won't have access to because I only have linux here:

  • 7-zip
  • p7zip
  • WinRAR
  • WinZip
  • cygnus bzip2 for windows

We should also check whether mwdumper will read them, so that if not, the docs can be updated to suggest a pipe, i.e. bzcat | mwdumper, as they do for 7z files.

Any other frequently used utils missing from the list?

Event Timeline

ArielGlenn triaged this task as High priority. Dec 4 2019, 9:52 PM
ArielGlenn created this task.
Samat added a subscriber: Samat. Dec 4 2019, 10:10 PM

Any other frequently used utils missing from the list?

Not sure about these. For 7z I just use 7z e -so, while I usually decompress the multistream bz2 with lbzip2, or pbzip2 if that's not available. Just yesterday I noticed that on one of my machines I have a pbzip2 which doesn't decompress the file fully:

$ pbzip2 -dck enwiki-20191120-pages-articles-multistream.xml.bz2 | grep -c "<title>"
23017
$ bzip2 -dck enwiki-20191120-pages-articles-multistream.xml.bz2 | grep -c "<title>"
19799262

On another machine all was well, despite it being the same file and the same version (pbzip2 v1.1.9), both on a Debian-like distribution. There's probably no issue with the more recent version that I have on Fedora (v1.1.12).

@Nemo_bis thanks for testing! Can you ldd the pbzip2 on both boxes and tell me if there's a difference between the one that succeeds and the one that fails?

I think the bzip2 api doesn't handle the multistream transparently, so tools coded using that would probably be affected.
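
As an illustration of what "not transparent" can look like, here is a sketch using Python 3's `bz2.BZ2Decompressor`, which stops at the end of the first stream and leaves the remaining bytes in `unused_data`. This shows the Python wrapper's behaviour only, not a claim about what pbzip2 does internally. The helper name `decompress_all_streams` is made up for this example.

```python
import bz2


def decompress_all_streams(data: bytes) -> bytes:
    """Decompress every concatenated bz2 stream, one decompressor per stream."""
    out = []
    while data:
        decomp = bz2.BZ2Decompressor()
        out.append(decomp.decompress(data))
        if not decomp.eof:
            # Input ran out before the end-of-stream marker was seen.
            raise ValueError("truncated bz2 stream")
        # Whatever follows the finished stream is the start of the next one.
        data = decomp.unused_data
    return b"".join(out)


two_streams = bz2.compress(b"first\n") + bz2.compress(b"second\n")

# A single decompressor object only yields the first stream.
single = bz2.BZ2Decompressor()
first_only = single.decompress(two_streams)

everything = decompress_all_streams(two_streams)
print(first_only, everything)
```

A tool that makes one decompression pass and never looks at the leftover bytes would silently produce truncated output, with no error reported.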

Samat added a comment. Dec 5 2019, 10:15 PM

I decompressed the files using Total Commander's own bzip2 plugin, without any problem.
The only issue I experienced is that, for the multistream file, it cannot show the progress of the process, so I have to wait and trust that it is working and will finish...

@Nemo_bis thanks for testing! Can you ldd the pbzip2 on both boxes and tell me if there's a difference between the one that succeeds and the one that fails?

Sure. This is the one which fails:

$ ldd /usr/bin/pbzip2
        linux-vdso.so.1 (0x000077c123933000) 
        libbz2.so.1.0 => /lib/x86_64-linux-gnu/libbz2.so.1.0 (0x000077c1236e1000) 
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x000077c1236c0000) 
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x000077c12353c000) 
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x000077c1233b9000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x000077c12339f000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x000077c1231de000)
        /lib64/ld-linux-x86-64.so.2 (0x000077c123935000)
$ dpkg -l| grep bz
ii  bzip2                                                       1.0.6-9.2~deb10u1                   amd64        high-quality block-sorting file compressor - utilities
ii  bzip2-doc                                                   1.0.6-9.2~deb10u1                   all          high-quality block-sorting file compressor - documentation
ii  libbz2-1.0:amd64                                            1.0.6-9.2~deb10u1                   amd64        high-quality block-sorting file compressor library - runtime
ii  libbz2-dev:amd64                                            1.0.6-9.2~deb10u1                   amd64        high-quality block-sorting file compressor library - development
ii  libzeroc-ice3.7:amd64                                       3.7.2-4                             amd64        C++ run-time libraries for the Ice framework
ii  libzip4:amd64                                               1.5.1-4                             amd64        library for reading, creating, and modifying zip archives (runtime)
ii  libzmq5:amd64                                               4.3.1-4+deb10u1                     amd64        lightweight messaging kernel (shared library)
ii  libzstd1:amd64                                              1.3.8+dfsg-3                        amd64        fast lossless compression algorithm
ii  libzvbi-common                                              0.2.35-16                           all          Vertical Blanking Interval decoder (VBI) - common files
ii  libzvbi0:amd64                                              0.2.35-16                           amd64        Vertical Blanking Interval decoder (VBI) - runtime files
ii  libzzip-0-13:amd64                                          0.13.62-3.2                         amd64        library providing read access on ZIP-archives - library
ii  pbzip2                                                      1.1.9-1+b1                          amd64        parallel bzip2 implementation

This is the one which works:

$ ldd /usr/bin/pbzip2
        linux-vdso.so.1 (0x00007fff2ddc9000)
        libbz2.so.1.0 => /lib/x86_64-linux-gnu/libbz2.so.1.0 (0x00007f42ed786000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f42ed569000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f42ed1e7000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f42ecee3000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f42ecccc000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f42ec92d000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f42edba7000)
$ dpkg -l| grep bz
ii  bzip2                             1.0.6-8.1                      amd64        high-quality block-sorting file compressor - utilities
ii  lbzip2                            2.5-2                          amd64        fast, multi-threaded bzip2 utility
ii  libbz2-1.0:amd64                  1.0.6-8.1                      amd64        high-quality block-sorting file compressor library - runtime
ii  pbzip2                            1.1.9-1+b1                     amd64        parallel bzip2 implementation

So it could be libbz2 from 1.0.6-9 on? Can someone test on another Debian 10 machine?

Weird, I see nothing in the changelog that looks likely: https://salsa.debian.org/debian/bzip2/blob/master/debian/changelog

In the meantime, here's 1.0.6-9.2 on a host here:

ariel@dumpsdata1003:/data/xmldatadumps/public$ bzcat cewiki/20191201/cewiki-20191201-pages-articles.xml.bz2 | md5sum
b26273295af4e53240b0f66e05625715  - 
ariel@dumpsdata1003:/data/xmldatadumps/public$ bzcat cewiki/20191201/cewiki-20191201-pages-articles-multistream.xml.bz2 | md5sum
b26273295af4e53240b0f66e05625715  -
ariel@dumpsdata1003:/data/xmldatadumps/public$ pbzip2 -dc cewiki/20191201/cewiki-20191201-pages-articles-multistream.xml.bz2 | md5sum
b26273295af4e53240b0f66e05625715  -
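
For anyone without bzcat and md5sum at hand, the same kind of check can be sketched in Python 3, assuming `bz2.open` reads through all concatenated streams (it does in 3.3 and later). The helper name `md5_of_decompressed` and the toy files are made up for this example; on real dumps you would pass the two dump paths instead.

```python
import bz2
import hashlib
import os
import tempfile


def md5_of_decompressed(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of the decompressed content of a bz2 file."""
    digest = hashlib.md5()
    with bz2.open(path, "rb") as stream:
        # Read in chunks so that large dumps never have to fit in memory.
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Toy data: the same content written as one stream and as two streams.
payload_parts = [b"<page>one</page>\n", b"<page>two</page>\n"]
with tempfile.TemporaryDirectory() as tmp:
    single_path = os.path.join(tmp, "single.xml.bz2")
    multi_path = os.path.join(tmp, "multi.xml.bz2")
    with open(single_path, "wb") as f:
        f.write(bz2.compress(b"".join(payload_parts)))
    with open(multi_path, "wb") as f:
        for part in payload_parts:
            f.write(bz2.compress(part))
    single_md5 = md5_of_decompressed(single_path)
    multi_md5 = md5_of_decompressed(multi_path)

print(single_md5 == multi_md5)
```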

ariel@dumpsdata1003:/data/xmldatadumps/public$ dpkg -l | grep bz
ii  bzip2                                1.0.6-9.2~deb10u1           amd64        high-quality block-sorting file compressor - utilities
ii  libbz2-1.0:amd64                     1.0.6-9.2~deb10u1           amd64        high-quality block-sorting file compressor library - runtime
ii  pbzip2                               1.1.9-1+b1                  amd64        parallel bzip2 implementation

ariel@dumpsdata1003:/data/xmldatadumps/public$ ldd /usr/bin/bzip2
	linux-vdso.so.1 (0x00007ffdee17f000)
	libbz2.so.1.0 => /lib/x86_64-linux-gnu/libbz2.so.1.0 (0x00007fe331c6b000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fe331aaa000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fe331c93000)

ariel@dumpsdata1003:/data/xmldatadumps/public$ ldd /usr/bin/pbzip2
	linux-vdso.so.1 (0x00007ffcf9b7d000)
	libbz2.so.1.0 => /lib/x86_64-linux-gnu/libbz2.so.1.0 (0x00007fd86d46d000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fd86d44c000)
	libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fd86d2c8000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fd86d145000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fd86d12b000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fd86cf6a000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fd86d69a000)
ArielGlenn moved this task from Backlog to Active on the Dumps-Generation board. Dec 9 2019, 8:01 AM
ArielGlenn added a comment. Edited Jan 8 2020, 10:49 AM

I'd like to move forward with switching to multiple stream files in all cases. Have sent another email to the list to see if we get any more interest or any pushback.

Tentative timetable: flipping the switch on March 1.

I have just tested the two files from the mailing list and I get weird behaviour, but I don't know if it is just my fault.

I usually use the pages-articles.xml.bz2 dumps in 3 ways:

  1. A Java tool in Toolforge (Java 8 using library Apache Commons Compress, last version)
  2. A Python script to test regex (Python 2 and 3, package bz2)
  3. A bot running locally with pywikibot (Python 3)

I have tested the 3 methods with the BZ2 files:

  • The pages-articles.xml.bz2 works well with the 3 methods
  • The pages-articles-multistream.xml.bz2 fails with methods 1 (Java) and 2 (only with Python 2). It works well with Python 3.

In both failing methods, the error is in line 37.

  • In Java: org.xml.sax.SAXParseException; lineNumber: 37; columnNumber: 1; XML document structures must start and end within the same entity.
  • In Python 2: xml.sax._exceptions.SAXParseException: /Users/benja/Downloads/cewiki-20191201-pages-articles-multistream.xml.bz2:37:0: no element found

Then I decompressed the files (OSX Catalina, Unarchiver) and repeated the tests with the decompressed files; they pass OK with all 3 methods.

ArielGlenn added a comment. Edited Jan 16 2020, 12:34 PM

@Benjavalero Thanks for testing! I think we can handwave about the python2 script, since Python 2 is officially EOL. The Java tool concerns me however; can you give me a link to the tool, or even better, to its source? And also please let me know the exact command you run, with flags. I'll try to duplicate it here and see what's up. Thanks!

@Benjavalero I think you are using BZip2CompressorInputStream in your code? You must tell it that you want it to decompress multiple concatenated streams if there are any. See: https://commons.apache.org/proper/commons-compress/apidocs/org/apache/commons/compress/compressors/bzip2/BZip2CompressorInputStream.html Let me know if this works!

@ArielGlenn Yes, it works! And with this additional parameter, it works for both bzip files, so I can already adapt the tool.

Just one final question: does this task mean that the xxwiki-xxxxxxxx-pages-articles.xml.bz2 dumps will no longer be generated?

It will still be generated, but it will consist of a few bz2 streams concatenated together. It won't be like the multistream file, though, which starts a new stream every 100 pages and comes with an index.
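
For readers wondering what the index is good for, here is a rough Python 3 sketch of pulling one stream out of a multistream file without decompressing everything before it. It assumes index lines of the form `offset:page_id:title`, where `offset` is the byte position at which the stream holding that page starts; that format is not spelled out in this task, so treat it as an assumption. The helper `read_stream_at` and the toy data (two pages per stream rather than 100) are made up for the example.

```python
import bz2
import io


def read_stream_at(fileobj, offset, chunk_size=64 * 1024):
    """Decompress only the bz2 stream that starts at byte `offset`."""
    fileobj.seek(offset)
    decomp = bz2.BZ2Decompressor()
    out = []
    while not decomp.eof:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            raise ValueError("file ended before the bz2 stream did")
        # Bytes past the end of this stream are ignored (kept in unused_data).
        out.append(decomp.decompress(chunk))
    return b"".join(out)


# Build a toy multistream file in memory, with a matching toy index.
groups = [
    [(1, "Alpha"), (2, "Beta")],
    [(3, "Gamma"), (4, "Delta")],
]
blob = io.BytesIO()
index_lines = []
for group in groups:
    offset = blob.tell()
    xml = "".join(
        f"<page><id>{pid}</id><title>{title}</title></page>\n"
        for pid, title in group
    )
    blob.write(bz2.compress(xml.encode("utf-8")))
    index_lines.extend(f"{offset}:{pid}:{title}" for pid, title in group)

# Parse the index; titles may contain ':' so split at most twice.
offsets = {}
for line in index_lines:
    offset_text, _pid, title = line.split(":", 2)
    offsets[title] = int(offset_text)

chunk = read_stream_at(blob, offsets["Gamma"]).decode("utf-8")
print(chunk)
```

The pages-articles file with only a few concatenated streams has no such index, so it is still meant to be read from start to finish.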

I've asked @JAllemandou to check the hadoop import tools too.

Hi @ArielGlenn, I tested reading yowiki-20200101-pages-articles-multistream.xml.bz2 successfully :)
Thanks for the heads-up :)