Consider using pigz (Zopfli) for Wikidata JSON dump
Open, Needs Triage · Public

Assigned To

None

Authored By

	hoo
	Nov 29 2016, 1:55 PM

Description

We should consider using pigz for the gzip-compressed Wikidata dumps (both JSON and TTL).

At compression level 11, pigz uses Google's Zopfli code internally, which produces output roughly 7-8% smaller than gzip -9 (measured with pigz -11 on a small testwikidata JSON dump).

Given we compress the data stream during dump creation, it doesn't matter much that Zopfli is quite a bit slower than plain gzip.
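
A rough way to reproduce this kind of size comparison is sketched below. It is only an illustration: the sample records are made up, `percent_smaller`, `gzip9_size` and `pigz11_size` are hypothetical helper names, and the pigz measurement is skipped when no `pigz` binary is on the PATH. Python's `gzip` module at level 9 stands in for `gzip -9`.

```python
import gzip
import shutil
import subprocess


def percent_smaller(baseline_size: int, candidate_size: int) -> float:
    """Return how much smaller the candidate is than the baseline, in percent."""
    return 100.0 * (baseline_size - candidate_size) / baseline_size


def gzip9_size(data: bytes) -> int:
    """Compressed size at the highest standard gzip level (comparable to gzip -9)."""
    return len(gzip.compress(data, compresslevel=9))


def pigz11_size(data: bytes):
    """Compressed size from `pigz -11` (Zopfli), or None if pigz is not installed."""
    if shutil.which("pigz") is None:
        return None
    result = subprocess.run(
        ["pigz", "-11", "-c"],  # -c writes the compressed stream to stdout
        input=data,
        stdout=subprocess.PIPE,
        check=True,
    )
    return len(result.stdout)


# Made-up stand-in for a small JSON dump; substitute real test data.
sample = b"".join(
    b'{"id":"Q%d","type":"item","labels":{"en":"entity %d"}}\n' % (i, i)
    for i in range(500)
)

baseline = gzip9_size(sample)
zopfli = pigz11_size(sample)
print("gzip -9 equivalent:", baseline, "bytes")
if zopfli is None:
    print("pigz not found, skipping the Zopfli measurement")
else:
    print("pigz -11:", zopfli, "bytes,", round(percent_smaller(baseline, zopfli), 1), "% smaller")
```

The numbers from synthetic data like this will not match those from a real dump; the sketch only shows how the percentage is derived.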

Related Objects

Status	Assigned	Task
Open	None	T88728 Improve Wikimedia dumping infrastructure
Open	None	T88991 improve Wikidata dumps [tracking]
Open	None	T151876 Consider using pigz (Zopfli) for Wikidata JSON dump

Event Timeline

hoo created this task. Nov 29 2016, 1:55 PM

Restricted Application added a subscriber: Aklapper. Nov 29 2016, 1:55 PM

hoo updated the task description. Nov 29 2016, 1:57 PM

Given the length of time Wikidata weekly dumps take to run, do we still want to do this? What sort of cpu/memory requirements will it have, compared to gzip?

In T151876#3901029, @ArielGlenn wrote:

Given the length of time Wikidata weekly dumps take to run, do we still want to do this? What sort of cpu/memory requirements will it have, compared to gzip?

Memory will probably be about the same (gzip uses tiny compression windows, by today's standards at least). CPU usage will probably go up quite a bit (I would expect this to raise the job's overall CPU need by at least 10%).

Do we still want this? I'm not sure… but looking into saving space/bandwidth for the gzip dumps sounds sensible to me (though I can't say whether we have the resources at hand for this).

10% is not absurd. I'd say we could go ahead with this now, but what impact will it have on the length of time Wikidata weekly dumps take to run?

hoo added a parent task: T88991: improve Wikidata dumps [tracking]. Apr 10 2018, 2:14 PM

I just tested this with a JSON dump containing the first 2,500 Wikidata entries: the file size was reduced by about 9% (going from gzip -9 to pigz --iterations 1 -11 -p1), but pigz --iterations 1 -11 -p1 is about 50 times slower than gzip -9 (user time 0m7.977s for gzip -9 versus 6m41.077s for pigz).
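
For reference, the slowdown factor follows directly from the two user times quoted above. The snippet below is just that arithmetic; `parse_time` is a hypothetical helper for `time`-style durations, not part of any tool mentioned here.

```python
import re


def parse_time(text: str) -> float:
    """Convert a `time`-style duration such as '6m41.077s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


gzip_user = parse_time("0m7.977s")   # gzip -9
pigz_user = parse_time("6m41.077s")  # pigz --iterations 1 -11 -p1
slowdown = pigz_user / gzip_user

print(f"gzip -9: {gzip_user:.3f}s, pigz -11: {pigz_user:.3f}s, factor: {slowdown:.1f}x")
```

This gives a factor of roughly 50, matching the figure above.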

At some point we should actually test piping dumpJson.php output into both and see how much of a difference it makes when the data is fed "slowly" from the dumper.
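
As a very rough sketch of what such a test would measure, the snippet below feeds a compressor from a deliberately slow producer and records how much of the wall-clock time is actually spent compressing. Everything here is an assumption made for illustration: `slow_chunks` is a made-up stand-in for the dumper, the delay and records are arbitrary, and Python's `zlib` at level 9 is used instead of gzip or pigz, so it says nothing about Zopfli's real cost.

```python
import time
import zlib


def slow_chunks(n_chunks: int, delay_s: float):
    """Stand-in for the dumper: yield a chunk of fake records after each pause."""
    for i in range(n_chunks):
        time.sleep(delay_s)  # simulates time spent producing the next chunk
        yield (b'{"id":"Q%d","type":"item","labels":{}}\n' % i) * 200


def compress_stream(chunks, level: int = 9):
    """Compress chunks as they arrive.

    Returns (compressed_bytes, seconds_spent_compressing, wall_clock_seconds).
    """
    # wbits = 16 + MAX_WBITS makes zlib emit a gzip container.
    compressor = zlib.compressobj(level, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
    out_size = 0
    busy = 0.0
    start = time.perf_counter()
    for chunk in chunks:
        t0 = time.perf_counter()
        out_size += len(compressor.compress(chunk))
        busy += time.perf_counter() - t0
    t0 = time.perf_counter()
    out_size += len(compressor.flush())
    busy += time.perf_counter() - t0
    wall = time.perf_counter() - start
    return out_size, busy, wall


out_size, busy, wall = compress_stream(slow_chunks(20, 0.01))
print(f"{out_size} bytes out, compressing for {busy:.4f}s of {wall:.4f}s wall time")
```

If the compressor is busy for only a small share of the wall time, a slower compressor might be absorbed by the producer's pauses; if not, the dump run gets longer. A real test would have to use the actual dumper and pigz.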
