[Story] Compress JSON data dumps in Bzip2
Closed, Resolved (Public)

Description

Request: Currently the JSON dumps are compressed using gzip. I propose to also provide a file compressed with bzip2.

Reason: I've been working with people who would like to process Wikidata dumps in Hadoop/Spark. In that environment, bzip2 is better supported than gzip because of its block-based compression: a bzip2 file can be split at block boundaries and processed in parallel, whereas a gzip file has to be read as a single stream. Currently, we need to recompress the JSON dumps in order to take full advantage of these distributed processing frameworks. It would be very helpful for our workflow if the dumps were also provided in a bzip2-compressed format.

See http://stackoverflow.com/questions/6511255/why-cant-hadoop-split-up-a-large-text-file-and-then-compress-the-splits-using-g and http://stackoverflow.com/questions/14820450/best-splittable-compression-for-hadoop-input-bz2.
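
For illustration, the recompression step mentioned above could look roughly like the following. This is a minimal sketch using only the Python standard library, not the script anyone in this task actually runs; the function name and the file names in the usage note are hypothetical.

```python
import bz2
import gzip
import shutil


def recompress_gzip_to_bzip2(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream a gzip file into a bzip2 file without loading it into memory.

    The data is decompressed and recompressed chunk by chunk, so this
    works for dumps that are much larger than available RAM.
    """
    with gzip.open(src_path, "rb") as src, bz2.open(dst_path, "wb") as dst:
        # copyfileobj reads chunk_size bytes at a time from src and
        # writes them to dst until src is exhausted.
        shutil.copyfileobj(src, dst, chunk_size)
```

A call such as `recompress_gzip_to_bzip2("dump.json.gz", "dump.json.bz2")` (placeholder file names) would produce a bzip2 file with the same uncompressed content. On a full dump this takes a long time, which is the cost that publishing a bzip2 file directly would avoid.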

Event Timeline

Halfak raised the priority of this task from to Needs Triage.
Halfak updated the task description.
Halfak added a project: Wikidata.
Halfak moved this task to incoming on the Wikidata board.
Halfak subscribed.
Halfak added a subscriber: NealMcB.
hoo triaged this task as Medium priority. Oct 11 2015, 7:00 PM

We discussed this before (I actually tested a few compression algorithms at the time) and concluded that it was not yet needed. It might be time to revisit that decision. As far as I recall, xz performed considerably better than bzip2, so that might also be worth considering.

xz does not have the nice built-in support in distributed processing frameworks that bz2 has.

It may be worth reiterating that I am not concerned about compression ratio. The purpose of this task is to make Wikidata JSON dumps easy to process in Hadoop/Spark.

A quick reading suggests that Hadoop/Spark has no native support at all for xz.

Change 245850 had a related patch set uploaded (by Hoo man):
Publish bzip2 compressed Wikidata json dumps

https://gerrit.wikimedia.org/r/245850

Lydia_Pintscher renamed this task from Compress JSON data dumps in Bzip2 to [Story] Compress JSON data dumps in Bzip2. Oct 14 2015, 12:28 PM

Change 245850 merged by Dzahn:
Publish bzip2 compressed Wikidata json dumps

https://gerrit.wikimedia.org/r/245850

hoo removed a project: Patch-For-Review.
hoo moved this task from Review to Done on the Wikidata-Sprint-2015-10-13 board.

@hoo Will this also be done for the beta ttl dumps?