[Story] Compress JSON data dumps in Bzip2
Closed, Resolved · Public

Description

Request: Currently the JSON dumps are compressed using gzip. I propose to also provide a file compressed with bzip2.

Reason: I've been working with people who would like to process Wikidata data in Hadoop/Spark. In that environment, bzip2 is better supported than gzip because of the block compression strategy it uses. Currently, we need to recompress the JSON dumps in order to take full advantage of these distributed processing frameworks. It would be very helpful for us and our workflow if the dumps could also be provided in a bzip2-compressed format.

See http://stackoverflow.com/questions/6511255/why-cant-hadoop-split-up-a-large-text-file-and-then-compress-the-splits-using-g and http://stackoverflow.com/questions/14820450/best-splittable-compression-for-hadoop-input-bz2.
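
For context on the recompression step mentioned above, here is a minimal sketch of what we currently do before loading a dump, assuming Python 3 and a local copy of the gzip dump (file names are illustrative, not the real dump names):

```
import bz2
import gzip
import shutil

# Illustrative file names; the real dumps live on dumps.wikimedia.org.
SRC = "wikidata-all.json.gz"
DST = "wikidata-all.json.bz2"

# Stream-decompress the gzip dump and re-compress it as bzip2, so that
# Hadoop/Spark can split the file into blocks instead of reading the
# whole thing in a single task.
with gzip.open(SRC, "rb") as src, bz2.open(DST, "wb") as dst:
    shutil.copyfileobj(src, dst, length=1024 * 1024)
```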

Halfak created this task. Oct 11 2015, 6:51 PM
Halfak updated the task description.
Halfak raised the priority of this task to Needs Triage.
Halfak added a project: Wikidata.
Halfak moved this task to incoming on the Wikidata board.
Halfak added a subscriber: Halfak.
Restricted Application added a subscriber: Aklapper. Oct 11 2015, 6:51 PM
Halfak set Security to None. Oct 11 2015, 6:51 PM
Halfak added a subscriber: NealMcB.
hoo added a subscriber: hoo. Oct 11 2015, 6:53 PM
hoo triaged this task as Normal priority. Oct 11 2015, 7:00 PM

We discussed this before (I actually tested a few compression algorithms) and came to the conclusion that it was not yet needed. It might be time to revisit that decision, although, as far as I recall, xz performed considerably better than bzip2, so that might also be something to consider.
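
For reference, a rough way to reproduce that kind of size comparison on a small sample of the dump (this is only a sketch with a hypothetical sample file, not the benchmark that was actually run at the time):

```
import bz2
import gzip
import lzma

# Hypothetical sample file; real results depend on the data and on the
# compression levels chosen.
with open("sample.json", "rb") as f:
    data = f.read()

for name, compress in [("gzip", gzip.compress),
                       ("bzip2", bz2.compress),
                       ("xz", lzma.compress)]:
    print(name, len(compress(data)))
```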

hoo added a subscriber: daniel.

xz does not have the nice built-in support in distributed processing frameworks that bz2 has.

It may be worth reiterating that I am not concerned about compression ratio. The purpose of this task is to make the Wikidata JSON dumps easy to process in Hadoop/Spark.

A quick reading suggests that Hadoop/Spark has no native support at all for xz.
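
To illustrate the point, here is a minimal PySpark sketch of how the bz2 dump would be consumed (path and app name are placeholders): Hadoop's BZip2 codec is splittable, so a single .json.bz2 file is read in parallel across many partitions, whereas a .gz file ends up being processed by a single task.

```
from pyspark import SparkContext

sc = SparkContext(appName="wikidata-dump")

# With bzip2, the one big dump file is automatically split into many
# partitions; with gzip it would all land in a single partition.
lines = sc.textFile("wikidata-all.json.bz2")
print(lines.getNumPartitions())
```

The same holds for plain Hadoop MapReduce jobs, which is why bz2 is the more convenient format here.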

Change 245850 had a related patch set uploaded (by Hoo man):
Publish bzip2 compressed Wikidata json dumps

https://gerrit.wikimedia.org/r/245850

Lydia_Pintscher renamed this task from "Compress JSON data dumps in Bzip2" to "[Story] Compress JSON data dumps in Bzip2". Oct 14 2015, 12:28 PM

Change 245850 merged by Dzahn:
Publish bzip2 compressed Wikidata json dumps

https://gerrit.wikimedia.org/r/245850

hoo closed this task as Resolved. Oct 22 2015, 8:47 PM
hoo removed a project: Patch-For-Review.
hoo moved this task from Review to Done on the Wikidata-Sprint-2015-10-13 board.
Hydriz added a subscriber: Hydriz. Oct 28 2015, 7:36 AM

@hoo Will this also be done for the beta TTL dumps?