
Provide zstd compression algorithm for dumps
Open, Low, Public, Feature Request

Description

Consider zstd compression for dumps such as the Wikidata dump. The compressed size is similar to bzip2, but zstd is much faster to decompress. It would probably make sense to benchmark it against Brotli (another modern compression algorithm) on the Wikidata dump file.

https://facebook.github.io/zstd/
https://github.com/google/brotli
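
A minimal sketch of such a benchmark, assuming the third-party zstandard and brotli Python packages and a local sample of the dump (the file name "wikidata-sample.json" is a hypothetical placeholder):

```
# Rough comparison of zstd and Brotli on a dump sample: compressed size,
# compression time, and decompression time. Not a rigorous benchmark.
import time

import brotli
import zstandard

with open("wikidata-sample.json", "rb") as f:  # hypothetical sample file
    data = f.read()

def measure(name, compress, decompress):
    t0 = time.perf_counter()
    blob = compress(data)
    t1 = time.perf_counter()
    decompress(blob)
    t2 = time.perf_counter()
    print(f"{name}: {len(blob):,} bytes, "
          f"compress {t1 - t0:.1f}s, decompress {t2 - t1:.1f}s")

cctx = zstandard.ZstdCompressor(level=19)
dctx = zstandard.ZstdDecompressor()
measure("zstd level 19", cctx.compress, dctx.decompress)
measure("brotli quality 11",
        lambda d: brotli.compress(d, quality=11),
        brotli.decompress)
```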

Event Timeline

Aklapper renamed this task from Try modern compression algorithm for dumps to Provide zstd compression algorithm for dumps.Dec 12 2022, 2:50 PM

The reason we go with bz2 is that it is block-oriented, allowing easier parallel processing of input files on platforms that rely on massive parallelization, such as Hadoop. It also allows us to recover if only part of a file has been written out and the dump process aborts for some reason. It's possible that other compression algorithms could be adopted in a future re-architected version of the dumps.

Aklapper changed the subtype of this task from "Task" to "Feature Request".

Actually, due to how the zstd format is designed, the current parallelization approach for bzip2 will probably work the same way with zstd. With the right command-line option, the zstd tool already distributes its input across all CPU cores for parallel compression (the reference zstd implementation is similar to pbzip2 in that respect). But one should also be able to split the input oneself, compress the shards in parallel on a set of machines, and then simply concatenate the compressed outputs at the end. Again, this is the same approach pbzip2 uses.
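
A minimal sketch of that shard-and-concatenate idea, assuming the third-party zstandard Python package (the shard contents and the output file name are placeholders):

```
# Compress newline-aligned shards as independent zstd frames, then
# concatenate the frames into one valid .zst file.
import zstandard

cctx = zstandard.ZstdCompressor(level=19)

# Pretend these are large newline-aligned shards of one dump file; in
# practice each shard could be compressed on a different machine.
shards = [b"entity-1\nentity-2\n", b"entity-3\nentity-4\n"]
frames = [cctx.compress(shard) for shard in shards]

# Concatenated frames form a valid zstd stream: the reference decoder
# (e.g. `zstd -d`) decodes a sequence of frames transparently.
with open("dump.json.zst", "wb") as out:  # hypothetical output name
    for frame in frames:
        out.write(frame)

# A consumer can also decompress each frame independently (and in
# parallel), which is what makes this layout attractive.
dctx = zstandard.ZstdDecompressor()
assert b"".join(dctx.decompress(f) for f in frames) == b"".join(shards)
```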

By the way: if/when this gets implemented, would it be possible to split the shards at newlines (for Wikidata: entity boundaries)? Afaik zstd will already try to break its input stream into compression blocks at newlines (again like pbzip2, to my knowledge). But if you do your own splitting before sending the data to zstd, it'd be super nice if these large splits didn't break an entity in the middle. Then a data consumer could easily parallelize the processing (and not just the decompression) by splitting the zstd input at block boundaries. With the current Wikidata dumps, this is super painful because the bzip2 compression blocks aren't necessarily aligned with lines, so entities sometimes get broken apart in the middle.
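
To illustrate the splitting part, here is a minimal sketch of a newline-aligned splitter (the function name and the default shard size are made up for the example):

```
# Hypothetical helper: cut a byte stream into roughly shard_size chunks
# that always end on a newline, so no entity is split across shards.
def newline_aligned_shards(stream, shard_size=128 * 1024 * 1024):
    pending = b""
    while True:
        chunk = stream.read(shard_size)
        if not chunk:
            break
        pending += chunk
        cut = pending.rfind(b"\n") + 1  # end of the last complete line
        if cut == 0:                    # no newline yet, keep buffering
            continue
        yield pending[:cut]
        pending = pending[cut:]
    if pending:                         # trailing data without a newline
        yield pending
```

Each shard from such a generator could then be handed to its own compressor, and consumers could later process the resulting frames line by line without having to stitch entities back together.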

Disclaimer: I’m not a compression specialist, so the above claims will definitely need to be verified.

I’ve tried Wikidata dumps in QuickStatements format with zstd compression, and benchmarked it: https://github.com/brawer/wikidata-qsdump
The file size shrinks to one third, and decompression is 150 times faster than with pbzip2 (on a typical modern cloud server).

@Sascha also added benchmarks for different compression algorithms to https://github.com/brawer/wikidata-qsdump. Thanks for that!