
Wikimedia Enterprise HTML dumps as bzip2 archive
Open, Needs Triage · Public · Feature

Description

Feature summary:

My experience with Wikimedia Enterprise is as a community member using the public dumps at https://dumps.wikimedia.org/other/enterprise_html/.

I would like to ask that the HTML dumps be provided simply as a bzip2 archive of the file contents (instead of, or in addition to, the current unusual tar.gz files, each wrapping a single file in tar). Wikidata dumps are a bzip2 of one JSON file, which allows parallel decompression. Having both tar (why tar a single file at all?) and gzip in there allows only serial decompression before the contents can be processed in parallel.

Another inspiration could be the Wikipedia XML dumps, which are done as multistream bzip2 with an additional index file. That could be nice here too: with an index file one could immediately jump to the JSON line for a corresponding article.
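
For illustration, here is a rough Go sketch of what that would enable on the consumer side; the file name, the byte offset, and the index lookup are all hypothetical, since nothing like this exists for these dumps today:

package main

import (
	"bufio"
	"compress/bzip2"
	"fmt"
	"io"
	"os"
)

func main() {
	// Hypothetical multistream bzip2 dump; no such file is published today.
	f, err := os.Open("enwiki-NS0-ENTERPRISE-HTML.json.bz2")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Byte offset of the bzip2 stream containing the wanted article,
	// looked up beforehand in the hypothetical index file.
	const streamOffset int64 = 123456789
	if _, err := f.Seek(streamOffset, io.SeekStart); err != nil {
		panic(err)
	}

	// Each stream of a multistream archive is a complete bzip2 stream,
	// so decoding can start right at the offset instead of at the beginning.
	scanner := bufio.NewScanner(bzip2.NewReader(f))
	scanner.Buffer(make([]byte, 0, 64*1024), 256*1024*1024) // article JSON lines can be very long
	for scanner.Scan() {
		fmt.Println(len(scanner.Text())) // each line is one article's JSON
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
}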

In any case, I think either of those approaches would be better than the current tar.gz approach. If this format was requested by Enterprise users, I am a bit surprised; were other options offered for them to pick from? JSONL, I think, is a good choice, but how it is then compressed is surprising.

Use case(s):

The main use case is to allow parallel processing of the archive, on one machine (which the Wikidata dump's approach enables) or even across multiple machines (which the Wikipedia XML dump's approach enables).

Benefits:

Besides faster processing, another benefit is that bzip2 generally compresses better than gzip.

Event Timeline

Hey @Mitar - thanks for making this ticket and appreciate your response on the wikitech-l thread. Super helpful feedback.

Some context on the reasoning behind the decisions:

gz vs. bzip2: Our current approach was born out of our need for the fastest compression. In the current product we bundle these dumps daily for each supported wiki project, and for the larger language projects this can take a very long time to complete. We compared bzip2 and gz at the time, and while bzip2 produced smaller files, gz beat it on time. We have optimized for speed here; if we can find a way to use bzip2 at the same speed, I think for the reasons you stated it would be smart. Something we can revisit.

multi-stream v. one-file: In general, we started off with a single-file approach, as we had received feedback that folks would likely decompress and parallel-process the large NDJSON anyway. I want to do a bit more research with the team and bring this back up, but I think your point is a good one and worth revisiting now that our product is a bit more mature.

I'm going to explore this a bit more and keep it in our product backlog to discuss further.

Thank you for your response. I am surprised that you find gzip faster: it is only faster if you look at single-threaded performance. When running on multiple cores (which bzip2 handles more easily, too), bzip2 wins again, not to mention that you can then decompress it in parallel as well. I am not sure which machines you are compressing on, but I would be surprised if they do not have a few cores. Try lbzip2. Or maybe you are already using a parallel gzip implementation? If so, it would be useful to note which one, because using the same tool for decompression might then enable parallel decompression even with gzip.

From my experience, decompression is the slowest part, so having multiple files there is not as useful. :-) It only helps if you are making multiple passes: then you can first decompress all the files (which costs disk space) and afterwards process each file in parallel multiple times.

We use Go to generate those dumps, and we needed full control over the compression process in order to use the language's concurrency features, so we were limited to Go implementations. We use pgzip to compress the files, as it is the fastest implementation we have been able to find so far.

Interesting. I also use Go to parse those dumps (see library here); it would be nice if we could use a shared public Go struct representation of the JSON. I am using pbzip2 to decompress bzip2, which supports parallel decompression, but I also cannot find any parallel bzip2 compression for Go. So it is an interesting mismatch: pgzip has a parallel compression implementation but serial decompression, while pbzip2 has parallel decompression but no parallel compression yet.

Are you sure you need to do compression inside the Go process and cannot just output the JSON to stdout and then compress it with lbzip2? If you are interested in speed, this should probably be the fastest approach: C implementations of compression algorithms are still faster than Go ones.
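
A minimal sketch of what I mean, assuming lbzip2 is installed on the host (the output path and record shape are just placeholders):

package main

import (
	"encoding/json"
	"os"
	"os/exec"
)

func main() {
	out, err := os.Create("dump.json.bz2") // placeholder output path
	if err != nil {
		panic(err)
	}
	defer out.Close()

	// lbzip2 reads the uncompressed stream from stdin, compresses it in
	// parallel on all available cores, and writes the result to stdout.
	cmd := exec.Command("lbzip2", "-9")
	cmd.Stdout = out
	stdin, err := cmd.StdinPipe()
	if err != nil {
		panic(err)
	}
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	// Encode writes one JSON document per line, i.e. NDJSON.
	enc := json.NewEncoder(stdin)
	for _, article := range []map[string]string{{"name": "Example article"}} { // placeholder records
		if err := enc.Encode(article); err != nil {
			panic(err)
		}
	}

	if err := stdin.Close(); err != nil { // EOF lets lbzip2 flush and exit
		panic(err)
	}
	if err := cmd.Wait(); err != nil {
		panic(err)
	}
}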

Have you also evaluated Zstandard? I tested it some time ago and it performs really well. I have not yet tested the Go libraries, though.

A while ago we tried using bzip2; I don't really remember which particular library we used, but we just couldn't make it fast enough to hit our time constraints. I highly appreciate all of the suggestions, thanks. I'll dig into lbzip2 and Zstandard to see if we can make use of them.

It would help if the "correct" way to decompress these dumps were documented; currently the decompression step (using plain tar) takes longer than my script that parses the entire dump.

I don't really understand why compression is prioritized over decompression; that seems totally backwards to me. The files are compressed only once, in a very tightly controlled environment (i.e. you can use whichever tools you want and throw whatever hardware at it as necessary), but they are going to be decompressed by every user, so the most flexible and fastest format on that side will be far more useful.

You should use a parallel gzip decompressor. Just using standard gzip (which is invoked by tar) is not that.

I think it should be evaluated whether parallel gzip decompression is really slower than parallel bzip2 decompression.

I think Zstandard is generally fastest though.

You should use a parallel gzip decompressor. Just using standard gzip (which is invoked by tar) is not that.

Thanks. For reference, I went with tar -I pigz -xf /public/dumps/public/other/enterprise_html/runs/20230220/enwiki-NS0-20230220-ENTERPRISE-HTML.json.tar.gz based on this SO answer.

I agree that zstd is probably going to be the easiest option for fast compression in Go.
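
For reference, a minimal sketch with github.com/klauspost/compress/zstd; the level and concurrency values are only illustrative:

package main

import (
	"io"
	"os"

	"github.com/klauspost/compress/zstd"
)

func main() {
	out, err := os.Create("dump.json.zst") // placeholder output path
	if err != nil {
		panic(err)
	}
	defer out.Close()

	enc, err := zstd.NewWriter(out,
		zstd.WithEncoderLevel(zstd.SpeedFastest), // optimize for dump generation time
		zstd.WithEncoderConcurrency(16),          // spread compression over available cores
	)
	if err != nil {
		panic(err)
	}

	// NDJSON arrives on stdin here just as an example source.
	if _, err := io.Copy(enc, os.Stdin); err != nil {
		panic(err)
	}
	if err := enc.Close(); err != nil { // flushes pending blocks and writes the frame footer
		panic(err)
	}
}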

However I'd like to point out the benefit of using something other than gzip: it just makes simple data analysis through piping (example) much faster and easier.

On a shared machine with 48 cores I convert the dump from gz to bzip2, and the bottleneck is actually pigz, which just doesn't manage to use that many threads and peaks at around 300 MiB/s of output with only a couple of CPUs used:

$ pigz -dc enwiki-NS0-20230301-ENTERPRISE-HTML.json.tar.gz | pv | lbzip2 -c9 > enwiki-NS0-20230301-ENTERPRISE-HTML.json.tar.bz2 
626GiB 0:37:21 [ 286MiB/s]

In less than 600 minutes of CPU time I get a bzip2 file of 64 GB instead of 107 GB (also useful if I happen to be I/O-bound on a slow disk), and I can extract it at about 1500 MiB/s with lbzip2. If I manage to pipe it into something fast enough, like grep, that is a huge difference.

Judging from curl -s https://dumps.wikimedia.org/other/enterprise_html/runs/20230301/ | grep -Eo " [0-9]{3,}" | paste -sd+ | bc, your compressed JSON is about 730 GiB, which uncompresses to about 6 times as much. At the 2085 MiB/s with 16 CPUs stated at https://github.com/klauspost/pgzip#compression-1, the compression step should be taking you less than 1 hour a day on a 16-core VM, while it might take about 10 times as long with lbzip2 at the speed described above (trivial to reduce by throwing some more CPUs at it).

Anyway, even if the daily runs cannot be made fast enough, or if it's too tedious to change your code, it would be worthwhile to convert from gz to bzip2 when you transfer the files to the public dumps server, because that will save a lot of time for everyone who downloads from there. It might even make your transfer faster, because the dumps server usually has a quite slow network.

This ticket should be split into separate discussions: compression vs. the "tar" layer.

I'm here because of the tar layer; it's quite inconvenient for our use case, which is to stream the dumps through a processor without decompressing to disk. We've discovered the --to-stdout argument to tar, which does enable slightly awkward but decent streaming if our processing were simply on the command line. However, we want the processing integrated into an application, and although lots of libraries are available to stream gzip, bzip2, Zstandard, etc., there is very little available to process the tar layer. Building a component that streams tar from an abstract input and streams out the contents of one file puts us in obscure edge-case territory, unfortunately.
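
For what it's worth, in Go the standard library's archive/tar can consume the layer as a stream; a minimal sketch, using github.com/klauspost/pgzip for the gzip layer and with the file name only as an example:

package main

import (
	"archive/tar"
	"bufio"
	"fmt"
	"io"
	"os"

	"github.com/klauspost/pgzip"
)

func main() {
	f, err := os.Open("enwiki-NS0-20230220-ENTERPRISE-HTML.json.tar.gz")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Parallel gzip decompression; compress/gzip from the standard library works too.
	gz, err := pgzip.NewReader(f)
	if err != nil {
		panic(err)
	}
	defer gz.Close()

	// The tar layer is consumed as a stream; nothing is written to disk.
	tr := tar.NewReader(gz)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
		fmt.Fprintln(os.Stderr, "processing", hdr.Name)

		// tr now reads the contents of the current entry, one JSON article per line.
		scanner := bufio.NewScanner(tr)
		scanner.Buffer(make([]byte, 0, 64*1024), 256*1024*1024) // article HTML makes lines very long
		for scanner.Scan() {
			_ = scanner.Text() // hand the line off to the processor here
		}
		if err := scanner.Err(); err != nil {
			panic(err)
		}
	}
}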

I see that the tar layer was included to accommodate multiple files, but judging by the current dump file directory listing it seems that we're going in the opposite direction anyway? This is consistent with the other dump files and makes sense to me: I'm processing NS0 and have absolutely no use for pages from other namespaces, so it would be a waste to bundle the files together.

Also, thank you for this fantastic new data source! It's incredibly valuable for my current project; nothing else would have included the information I need (other than running the parser on every page, as our forebears have done).

I will respond about the tar layer in T332045, which you made.