Provide wikidata JSON dumps compressed with zstd
Open, Needs Triage, Public

Description

At this time, wikidata provides JSON dumps compressed with gzip or bzip2. However, neither is optimal:

  • the gzip dump is quite big (about 100% larger than bzip2)
  • the bzip2 dump takes a lot of time to decompress (estimated 7h on my laptop)

As a consumer of these dumps, it would be nice to have a format that compresses well but also has good decompression speeds. I tested Zstandard and it performs much better than either of those two variants:

  • decompression (of a dump compressed at the default level) is much faster: it takes about 15 minutes on my laptop and is CPU bound. This might even be faster than gzip; I didn't have enough SSD space to test how well gzip performs.
  • the size at default settings is very close to bzip2 (37.7 GB compared to ~35 GB that bzip2 produces)

This directly affects processing speed of tools operating on these dumps.
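
As a minimal sketch of how a consumer might stream such a dump without unpacking it to disk: the helper names `stream_dump` and `first_entities` are hypothetical, and the sketch assumes the dump is one JSON array with one entity per line and a trailing comma on each entity line.

```shell
#!/bin/sh
# Hypothetical helper: write the decompressed dump to stdout,
# choosing the decompressor from the file extension.
stream_dump() {
    case "$1" in
        *.zst) zstd -dc < "$1" ;;
        *.bz2) bzip2 -dc < "$1" ;;
        *.gz)  gzip -dc < "$1" ;;
        *)     cat < "$1" ;;
    esac
}

# Hypothetical helper: print the first N entities, one JSON object per line.
# Assumes the "[" and "]" of the enclosing array sit on their own lines.
first_entities() {
    stream_dump "$2" |
        sed -e '/^\[$/d' -e '/^]$/d' -e 's/,$//' |
        head -n "$1"
}
```

For example, `first_entities 5 dump.json.zst` would print five entities and stop reading early, so only the start of the file is decompressed.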

Event Timeline

Have you tried lbzip2? You can specify a number of threads and get some speedup for compression or decompression, even from pipes.

So I tried lbzip2; here's the result (on a VM server with 2 cores at 2.1GHz; the decompression is CPU bound):

$ time lbzip2 -n2 -v -d -c wikidata-20190506-all.json.bz2 | cat > /dev/null                                                                                                                                             
lbzip2: decompressing "wikidata-20190506-all.json.bz2" to stdout
lbzip2: "wikidata-20190506-all.json.bz2": compression ratio is 1:20.790, space savings is 95.19%

real    228m12.850s
user    444m36.440s
sys     10m48.860s

I will rerun the same test with zstd on the same machine.

Now the same with zstd:

$ time zstdcat -v -d wikidata-20190506-all.json.bz2 | cat > /dev/null 

real    3m48.657s
user    0m3.792s
sys     0m58.768s

Here are the sizes:

35G     wikidata-20190506-all.json.bz2
39G     wikidata-20190506-all.json.zstd-def

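So zstd is about the same size compressed (39G versus 35G) but decompresses far faster: roughly 60 times faster by wall clock time in these runs (under 4 minutes versus about 228 minutes).
Also, during the zstd test the CPU was not maxed out (the test was disk bound), whereas the lbzip2 run was CPU bound.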
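
The two `real` values above can be compared directly. This is a small sketch with hypothetical helper names (`to_seconds`, `speedup`); it only divides the wall clock times, so it says nothing about why one run was slower (CPU bound versus disk bound).

```shell
#!/bin/sh
# Convert a `time`-style value such as 228m12.850s into seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Ratio of two `time`-style values (first divided by second), one decimal place.
speedup() {
    awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
        'BEGIN { printf "%.1f\n", a / b }'
}

# Wall clock times of the lbzip2 and zstd runs quoted above.
speedup 228m12.850s 3m48.657s
```

For these two runs the ratio comes out at roughly 60.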

Impressive. Would you be willing to do a compression timing test too, or is that prohibitive given your available disk space?

I don't have enough disk space for a compression test, that's correct.

But I can do a zstd decompression -> zstd compression test.

$ time zstdcat -v -d wikidata-20190506-all.json.bz2 | zstd > /dev/null                                                                                                                                                  

real    4m5.341s
user    2m22.452s
sys     1m7.912s

I've run some tests using the (nfs-mounted) filesystem to which our dumps are written in production.

ariel@snapshot1008:/mnt/dumpsdata/otherdumps/wikibase/wikidatawiki/20190513$ time (zcat wikidata-20190513-all.json.gz | gzip > /mnt/dumpsdata/temp/ariel/wikidata-20190513-all.json.gz)
real	163m25.709s
user	240m14.524s
sys	8m42.344s
ariel@snapshot1008:/mnt/dumpsdata/otherdumps/wikibase/wikidatawiki/20190513$ time (zcat wikidata-20190513-all.json.gz | zstd -q > /mnt/dumpsdata/temp/ariel/wikidata-20190513-all.json.zst)
real	84m17.266s
user	91m34.532s
sys	9m23.196s
ariel@snapshot1008:/mnt/dumpsdata/otherdumps/wikibase/wikidatawiki/20190513$ time (zcat wikidata-20190513-all.json.gz | lbzip2 -n 1 > /mnt/dumpsdata/temp/ariel/wikidata-20190513-all.json.bz2)
real	554m59.818s
user	653m24.460s
sys	13m49.056s
ariel@snapshot1008:/mnt/dumpsdata/otherdumps/wikibase/wikidatawiki/20190513$ time (zcat wikidata-20190513-all.json.gz | lbzip2 -n 2 > /mnt/dumpsdata/temp/ariel/wikidata-20190513-all.json.bz2)
real	284m41.349s
user	643m26.664s
sys	14m5.700s
ariel@snapshot1008:/mnt/dumpsdata/otherdumps/wikibase/wikidatawiki/20190513$ time (zcat wikidata-20190513-all.json.gz | bzip2 > /mnt/dumpsdata/temp/ariel/wikidata-20190513-all.json.bz2)
real	2000m16.749s
user	2036m42.608s
sys	10m32.968s

Summary:

Wall clock time for recompressing the zcat output, rounded to five minutes:

  • gzip: 2 hours 45 minutes
  • zstd: 1 hour 25 minutes
  • lbzip2 1 thread: 9 hours 15 minutes
  • lbzip2 2 threads: 4 hours 45 minutes
  • bzip2: 33 hours 20 minutes

I need to double check memory usage but barring issues with that, this looks good. What do Wikidata folks think?

I tried zstd some time ago and found that with default settings it's bigger than bz2 and with max settings it's rather slow, so I did not proceed. But I agree that decompression speed might matter too, I did not consider that.

So if there's no problem storing another dump format, then adding it might be OK, but I am not sure about replacing bz2 with it: bz2 still seems to give the best compression overall without being glacially slow. We could maybe even do it in parallel with gz->bz2 if we have enough CPU on that machine.

I am not sure we want zstd versions for every dump, but maybe some of them (biggest ones)?

I don't want to replace existing compression formats; this would be in addition to what we have.

I'll have to look at the graphs to see how we are as far as CPU usage goes.

Let's just do the json dump for now, if we do this.

We need some timing tests on these: is there a happy medium between 'best settings for compression' and 'best settings for speed'? What are we looking at in terms of execution time and space, if we add this step? We'd continue to provide bz2s I guess, since those are handy for processing into Hadoop, being well-suited to parallel processing.
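
One way to look for that happy medium is to sweep a few compression levels over a representative sample and record time and size for each. The sketch below assumes a hypothetical sample file and a hypothetical helper name `sweep_levels`; the levels listed are only illustrative (3 is the zstd default, 19 is the highest level available without `--ultra`).

```shell
#!/bin/sh
# Hypothetical helper: sweep_levels TOOL SAMPLE LEVEL...
# Compresses SAMPLE at each LEVEL with TOOL and prints wall clock
# seconds and compressed size. Works with any tool that accepts
# "-<level> -c" and reads stdin (zstd, gzip, bzip2, lbzip2).
sweep_levels() {
    tool=$1; sample=$2; shift 2
    for level in "$@"; do
        start=$(date +%s)
        bytes=$("$tool" "-$level" -c < "$sample" | wc -c)
        end=$(date +%s)
        printf 'level %s\t%ss\t%s bytes\n' "$level" "$((end - start))" "$((bytes))"
    done
}

# Example (sample.json is a placeholder for a slice of the real dump):
#   sweep_levels zstd sample.json 1 3 9 19
```

Timing uses whole seconds, so the sample has to be large enough for the differences between levels to show up.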

As a reference see also this discussion.

I think the problem with bzip2 is that the dump is currently single-stream, so one cannot really decompress it in parallel. Based on this answer, it seems that this was done on purpose, but since 2016 maybe we do not have to worry about compatibility anymore and can just change bzip2 to be multistream? For example, by using this tool.
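
To illustrate what "multistream" means here: a file made by concatenating independently compressed bzip2 streams is still valid input for `bzip2 -d`, and each stream can be handed to a separate worker. The sketch below is not how the dumps are produced; `make_multistream` and its parameters are hypothetical, and tools such as pbzip2 do the chunking themselves.

```shell
#!/bin/sh
# Hypothetical helper: make_multistream INPUT OUTPUT LINES_PER_STREAM
# Compresses fixed-size line chunks independently and concatenates
# the resulting bzip2 streams into one file.
make_multistream() {
    input=$1; output=$2; lines=$3
    tmpdir=$(mktemp -d) || return 1
    # split names the chunks part.aaaa, part.aaab, ... so the glob
    # below visits them in their original order.
    split -a 4 -l "$lines" "$input" "$tmpdir/part."
    : > "$output"
    for part in "$tmpdir"/part.*; do
        bzip2 -c < "$part" >> "$output"   # each chunk becomes its own stream
    done
    rm -rf "$tmpdir"
}

# Example: make_multistream dump.json dump.multistream.json.bz2 100000
# The result still decompresses with plain: bzip2 -dc dump.multistream.json.bz2
```

Smaller chunks give a parallel decompressor more streams to distribute but cost some compression ratio, since every stream starts from scratch.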

But from my experience (from other contexts), zstd is really good. +1 on providing that as well, if possible from disk space perspective.

I think that supporting parallel decompression could also address issue https://phabricator.wikimedia.org/T115223.

lbzip2 decompresses in parallel as well. We use that for compression of the SQL/XML dumps.

Are you saying that existing wikidata json dumps can be decompressed in parallel if using lbzip2, but not pbzip2?

lbzip2 is format-compatible with bzip2: it can read files compressed with either bzip2 or lbzip2 and use multiple cores to decompress them, indeed. pbzip2 should also work, for that matter.

OK, so it seems the problem is in pbzip2. It is not able to decompress in parallel unless compression was made with pbzip2, too. But lbzip2 can decompress all of them in parallel.

See:

$ time bunzip2 -c -k latest-lexemes.json.bz2 > /dev/null

real	1m0.101s
user	0m59.912s
sys	0m0.180s
$ time pbzip2 -d -k -c latest-lexemes.json.bz2 > /dev/null

real	0m57.662s
user	0m57.792s
sys	0m0.180s
$ time lbunzip2 -c -k latest-lexemes.json.bz2 > /dev/null

real	0m13.346s
user	1m35.951s
sys	0m2.342s
$ lbunzip2 -c -k latest-lexemes.json.bz2 > serial.json
$ pbzip2 -z < serial.json > parallel.json.bz2
$ time lbunzip2 -c -k parallel.json.bz2 > /dev/null

real	0m16.270s
user	1m43.004s
sys	0m2.262s
$ time pbzip2 -d -c -k parallel.json.bz2 > /dev/null

real	0m17.324s
user	1m52.946s
sys	0m0.659s

Size is very similar:

$ ll parallel.json.bz2 latest-lexemes.json.bz2 
-rw-rw-r-- 1 mitar mitar 168657719 Jun 15 20:36 latest-lexemes.json.bz2
-rw-rw-r-- 1 mitar mitar 168840138 Jun 20 07:35 parallel.json.bz2
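
To put a number on "very similar", the relative size difference can be computed from the two byte counts in the listing. `pct_larger` is a hypothetical helper name.

```shell
#!/bin/sh
# Percentage by which NEW exceeds OLD, to two decimal places.
pct_larger() {
    awk -v old="$1" -v new="$2" \
        'BEGIN { printf "%.2f\n", (new - old) * 100 / old }'
}

# latest-lexemes.json.bz2 versus parallel.json.bz2, sizes in bytes.
pct_larger 168657719 168840138
```

For these two files the pbzip2 output is about 0.11% larger than the original.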

> lbzip2 decompresses in parallel as well. We use that for compression of the SQL/XML dumps.

Yes, the problem is that bzip2 is just really slow to decompress in general. You need to use a lot of cores before it gets faster than single-thread gzip.