Feature summary:
My experience with Wikimedia Enterprise is as a community member using the public dumps at https://dumps.wikimedia.org/other/enterprise_html/.
I would like to ask that the HTML dumps be provided simply as a bzip2 of the file contents, instead of (or in addition to) the current, unusual tar.gz files, each wrapping a single file in tar. Wikidata dumps are a bzip2 of one JSON file, and that allows parallel decompression. Wrapping the contents in both tar (why tar around a single file at all?) and gzip allows only serial decompression before the contents can be processed in parallel.
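To illustrate, here is a minimal sketch of how a plain-bzip2 JSONL dump could be consumed with parallel decompression on one machine; the dump filename is hypothetical, and it assumes a multi-core bzip2 tool such as lbzip2 is installed (plain gzip has no equivalent, since a single gzip stream can only be decompressed serially):

```python
import json
import subprocess

DUMP = "enwiki-NS0-ENTERPRISE-HTML.json.bz2"  # hypothetical filename

# lbzip2 -dc decompresses independent bzip2 blocks on all cores
# and writes the result to stdout.
proc = subprocess.Popen(["lbzip2", "-dc", DUMP], stdout=subprocess.PIPE)
for line in proc.stdout:      # one JSON object per line (JSONL)
    article = json.loads(line)
    # ... process the article dict here ...
proc.stdout.close()
proc.wait()
```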
Another source of inspiration could be the Wikipedia XML dumps, which are published as multistream bzip2 with an additional index file. That could be nice here too: with an index file, one could immediately jump to the JSON line for the corresponding article.
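For example, here is a minimal sketch of that random-access pattern, modelled on how the Wikipedia multistream XML dumps are typically consumed; the index format assumed here (one "byte_offset:article_name" entry per line) is hypothetical, but decompressing a single stream from a byte offset works with the standard library:

```python
import bz2

def read_stream(path: str, offset: int) -> bytes:
    """Decompress the single bzip2 stream starting at byte `offset`."""
    decomp = bz2.BZ2Decompressor()
    parts = []
    with open(path, "rb") as f:
        f.seek(offset)
        # A BZ2Decompressor stops at the end of one stream (decomp.eof),
        # so only the streams we actually need are ever decompressed.
        while not decomp.eof:
            chunk = f.read(64 * 1024)
            if not chunk:
                break  # truncated file
            parts.append(decomp.decompress(chunk))
    return b"".join(parts)

# With an index mapping article -> stream offset, one could jump straight
# to the stream containing that article's JSON line:
#   data = read_stream("dump.json.bz2", offset_from_index)
#   for line in data.splitlines(): ...
```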
In any case, I think any of those approaches would be better than the current tar.gz approach. If this format was an ask from Enterprise users, I am a bit surprised. Were other options offered to them to pick from? JSONL itself I think is a good choice; it is how it is then compressed that surprises me.
Use case(s):
The main use case is to allow parallel processing of the archive, both on one machine (which is what the Wikidata dump's approach enables) and across multiple machines (which is what the Wikipedia XML dump's approach enables).
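To make the multi-machine case concrete, here is a minimal sketch of sharding work by stream offsets, reusing the hypothetical index format and the read_stream() helper from the sketch above; each machine can start decompressing its share immediately, with no serial pass over a tar.gz first:

```python
NUM_MACHINES = 4
MACHINE_ID = 0  # 0..NUM_MACHINES-1, e.g. taken from an environment variable

# Hypothetical index: one "byte_offset:article_name" entry per line;
# several articles may share one stream, so deduplicate the offsets.
with open("dump-index.txt") as idx:
    offsets = sorted({int(line.split(":", 1)[0]) for line in idx})

my_offsets = offsets[MACHINE_ID::NUM_MACHINES]
# Each offset can now be fed to read_stream() from the earlier sketch.
```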
Benefits:
Besides faster processing, another benefit is that bzip2 generally achieves better compression than gzip.