[Story] Compress JSON data dumps in Bzip2
Closed, Resolved · Public

Description

Request: Currently the JSON dumps are compressed using gzip. I propose to also provide a file compressed with bzip2.

Reason: I've been working with people who would like to process Wikidata data in Hadoop/Spark. In that environment, bzip2 is better supported than gzip because of the block compression strategy it uses. Currently, we need to recompress the JSON dumps in order to take full advantage of these distributed processing frameworks. It would be very helpful for us and our workflow if the dumps could also be provided in a bzip2-compressed format.

See http://stackoverflow.com/questions/6511255/why-cant-hadoop-split-up-a-large-text-file-and-then-compress-the-splits-using-g and http://stackoverflow.com/questions/14820450/best-splittable-compression-for-hadoop-input-bz2.
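
For context on the recompression step mentioned above, here is a minimal sketch of what we currently do before loading a dump, assuming Python 3 and a local copy of the gzip dump (file names are illustrative, not the real dump names):

```
import bz2
import gzip
import shutil

# Illustrative file names; the real dumps live on dumps.wikimedia.org.
SRC = "wikidata-all.json.gz"
DST = "wikidata-all.json.bz2"

# Stream-decompress the gzip dump and re-compress it as bzip2, so that
# Hadoop/Spark can split the file into blocks instead of reading the
# whole thing in a single task.
with gzip.open(SRC, "rb") as src, bz2.open(DST, "wb") as dst:
    shutil.copyfileobj(src, dst, length=1024 * 1024)
```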

Halfak created this task. Oct 11 2015, 6:51 PM
Halfak updated the task description.
Halfak raised the priority of this task to Needs Triage.
Halfak added a project: Wikidata.
Halfak moved this task to incoming on the Wikidata board.
Halfak added a subscriber: Halfak.
Restricted Application added a subscriber: Aklapper. Oct 11 2015, 6:51 PM
Halfak set Security to None. Oct 11 2015, 6:51 PM
Halfak added a subscriber: NealMcB.
hoo added a subscriber: hoo. Oct 11 2015, 6:53 PM
hoo triaged this task as Normal priority. Oct 11 2015, 7:00 PM

We discussed this before (I actually tested a few compression algorithms) and came to the conclusion that it was not yet needed. It might be time to revisit that decision, although, as far as I recall, xz performed considerably better than bzip2, so that might also be something to consider.
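
For reference, a rough way to reproduce that kind of size comparison on a small sample of the dump (this is only a sketch with a hypothetical sample file, not the benchmark that was actually run at the time):

```
import bz2
import gzip
import lzma

# Hypothetical sample file; real results depend on the data and on the
# compression levels chosen.
with open("sample.json", "rb") as f:
    data = f.read()

for name, compress in [("gzip", gzip.compress),
                       ("bzip2", bz2.compress),
                       ("xz", lzma.compress)]:
    print(name, len(compress(data)))
```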

hoo added a subscriber: daniel.

xz does not have the nice built-in support in distributed processing frameworks that bz2 has.

It may be worth reiterating that I am not concerned about compression ratio. The purpose of this task is to make the Wikidata JSON dumps easy to process in Hadoop/Spark.

A quick reading suggests that Hadoop/Spark has no native support at all for xz.
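
To illustrate the point, here is a minimal PySpark sketch of how the bz2 dump would be consumed (path and app name are placeholders): Hadoop's BZip2 codec is splittable, so a single .json.bz2 file is read in parallel across many partitions, whereas a .gz file ends up being processed by a single task.

```
from pyspark import SparkContext

sc = SparkContext(appName="wikidata-dump")

# With bzip2, the one big dump file is automatically split into many
# partitions; with gzip it would all land in a single partition.
lines = sc.textFile("wikidata-all.json.bz2")
print(lines.getNumPartitions())
```

The same holds for plain Hadoop MapReduce jobs, which is why bz2 is the more convenient format here.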

Change 245850 had a related patch set uploaded (by Hoo man):
Publish bzip2 compressed Wikidata json dumps

https://gerrit.wikimedia.org/r/245850

Lydia_Pintscher renamed this task from "Compress JSON data dumps in Bzip2" to "[Story] Compress JSON data dumps in Bzip2". Oct 14 2015, 12:28 PM

Change 245850 merged by Dzahn:
Publish bzip2 compressed Wikidata json dumps

https://gerrit.wikimedia.org/r/245850

hoo closed this task as Resolved. Oct 22 2015, 8:47 PM
hoo removed a project: Patch-For-Review.
hoo moved this task from Review to Done on the Wikidata-Sprint-2015-10-13 board.
Hydriz added a subscriber: Hydriz. Oct 28 2015, 7:36 AM

@hoo Will this also be done for the beta TTL dumps?