
Consider using pigz (Zopfli) for Wikidata JSON dump
Open, Needs Triage · Public

Description

We should consider using pigz for the gzip compressed Wikidata dumps (both json and ttl).

At compression level 11 (pigz -11), pigz uses Google's Zopfli code internally, which produces output roughly 7-8% smaller than gzip -9 (tested with a small testwikidata JSON dump).
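The size comparison described above could be reproduced with a small harness along these lines. This is only a sketch: `compressed_size` and `reduction_percent` are hypothetical helper names introduced here, and it assumes the compressor reads from stdin and writes to stdout (as `gzip -c` and `pigz -c` do).

```python
import subprocess
from typing import List


def compressed_size(cmd: List[str], data: bytes) -> int:
    """Run a stdin-to-stdout compressor command and return its output size in bytes."""
    result = subprocess.run(cmd, input=data, stdout=subprocess.PIPE, check=True)
    return len(result.stdout)


def reduction_percent(baseline: int, candidate: int) -> float:
    """Percentage by which `candidate` is smaller than `baseline`."""
    return 100.0 * (baseline - candidate) / baseline
```

With both tools installed, one would call `compressed_size(["gzip", "-9", "-c"], data)` and `compressed_size(["pigz", "-11", "-c"], data)` on the same dump bytes and pass the two sizes to `reduction_percent`.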

Given we compress the data stream during dump creation, it doesn't matter much that Zopfli is quite a bit slower than plain gzip.

Event Timeline

hoo created this task. · Nov 29 2016, 1:55 PM
Restricted Application added a subscriber: Aklapper. · Nov 29 2016, 1:55 PM
hoo updated the task description. · Nov 29 2016, 1:57 PM

Given the length of time Wikidata weekly dumps take to run, do we still want to do this? What sort of cpu/memory requirements will it have, compared to gzip?

hoo added a comment. · Jan 16 2018, 4:00 AM

> Given the length of time Wikidata weekly dumps take to run, do we still want to do this? What sort of cpu/memory requirements will it have, compared to gzip?

Memory will probably be the same (gzip uses tiny compression windows… by today's standards at least). CPU usage will probably go up quite a bit (I would expect this to raise the overall CPU needs of the job by at least 10%).

Do we still want this? I'm not sure… but looking into saving space/bandwidth for the gzip dumps sounds sensible to me (though I can't say whether we have the resources for this at hand).

10% is not absurd. I'd say we could go ahead with this now, but what impact will it have on the length of time Wikidata weekly dumps take to run?

hoo added a comment. · Jul 20 2018, 8:11 AM

I just tested this with a JSON dump containing the first 2,500 Wikidata entries: the file size was reduced by about 9% (going from gzip -9 to pigz --iterations 1 -11 -p1), but pigz --iterations 1 -11 -p1 is about 50 times slower than gzip -9 (0m7.977s user time for gzip versus 6m41.077s for pigz).
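The "about 50 times" figure can be checked directly from the two user times quoted above. A minimal sketch; `parse_time` is a helper name introduced here for illustration and only handles the `XmY.YYYs` format that `time` printed in this comment.

```python
def parse_time(t: str) -> float:
    """Convert a `time`-style duration such as '6m41.077s' to seconds."""
    minutes, seconds = t.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


gzip_user = parse_time("0m7.977s")   # gzip -9
pigz_user = parse_time("6m41.077s")  # pigz --iterations 1 -11 -p1
slowdown = pigz_user / gzip_user     # ratio of user CPU time
print(f"pigz used {slowdown:.1f}x the user CPU time of gzip")
```

The ratio comes out slightly above 50, consistent with the estimate in the comment.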

At some point we should actually test piping dumpJson.php output into both and see how much of a difference it makes when the data is fed "slowly" from the dumper.
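One way such a test might be approximated without running the real dumper is to feed a compressor from a deliberately throttled producer and measure wall-clock time. The sketch below is an assumption-laden stand-in, not the actual dump pipeline: `feed_slowly` is a hypothetical helper, and the fixed `delay` between chunks only imitates the dumper's per-entity cost.

```python
import os
import subprocess
import tempfile
import time
from typing import Iterable, List, Tuple


def feed_slowly(cmd: List[str], chunks: Iterable[bytes], delay: float) -> Tuple[int, float]:
    """Pipe chunks into `cmd` with a pause after each one.

    Returns (compressed output size in bytes, wall-clock seconds).
    Output goes to a temporary file so the pipe cannot fill up and block.
    """
    with tempfile.TemporaryFile() as out:
        start = time.perf_counter()
        proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=out)
        for chunk in chunks:
            proc.stdin.write(chunk)
            proc.stdin.flush()
            time.sleep(delay)  # stand-in for the time the dumper spends per entity
        proc.stdin.close()
        returncode = proc.wait()
        elapsed = time.perf_counter() - start
        if returncode != 0:
            raise RuntimeError(f"{cmd[0]} exited with status {returncode}")
        return os.fstat(out.fileno()).st_size, elapsed
```

Calling it once with `["gzip", "-9", "-c"]` and once with `["pigz", "-11", "--iterations", "1", "-p", "1", "-c"]` on the same chunks would show whether the compressor or the producer is the bottleneck: if both wall-clock times are close to `delay` times the number of chunks, the slower compressor is hidden behind the dumper.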