At this time, Wikidata provides JSON dumps compressed with gzip or bzip2. However, neither is optimal:
- the gzip dump is quite big (about 100% larger than bzip2)
- the bzip2 dump takes a lot of time to decompress (estimated 7h on my laptop)
As a consumer of these dumps, it would be nice to have a format that compresses well but also has good decompression speed. I tested Zstandard and it performs much better than either of the current formats:
- decompression (of a dump compressed at the default level) is much faster: about 15 minutes on my laptop, CPU bound (this might even be faster than gzip; I didn't have enough SSD space to test how well gzip performs)
- the size at the default compression level is very close to bzip2 (37.7 GB compared to the ~35 GB that bzip2 produces)
Decompression speed directly affects the processing speed of any tool operating on these dumps.
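
For illustration, a consumer would not even need to inflate a .zst dump to disk before processing it; it can be streamed and parsed line by line. Below is a minimal sketch (not how I ran the measurements above) assuming the python-zstandard package and a hypothetical local file name, relying on the fact that the dump is a JSON array with one entity per line:

```python
import io
import json

import zstandard  # pip install zstandard

DUMP_PATH = "latest-all.json.zst"  # hypothetical file name

def iter_entities(path):
    """Yield one parsed entity dict per line of a zstd-compressed dump."""
    dctx = zstandard.ZstdDecompressor()
    with open(path, "rb") as fh, dctx.stream_reader(fh) as reader:
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            line = line.rstrip().rstrip(",")
            if line in ("[", "]", ""):
                continue  # skip the enclosing JSON array brackets
            yield json.loads(line)

# Usage example: print the ID of the first entity and stop.
for entity in iter_entities(DUMP_PATH):
    print(entity["id"])
    break
```

Because decompression is streamed, the same pattern works for gzip or bzip2 dumps by swapping in the corresponding module; the difference is only how long the consumer spends waiting on the decompressor.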