
Consider compressing uncompressed dump files (abstracts, siteinfo-namespaces)
Closed, Resolved, Public

Description

The abstracts dump is quite large for some sites; e.g. enwiki's is 5G, which would be only 660M with gzip compression. Similarly, wikidatawiki abstracts are 59GB now, but only 4.1G gzipped.

Compression would also be nice even for small files such as the siteinfo-namespaces dump because we could then easily distinguish between status files (html/json/txt) and dump content (gz/bz2/7z) without a hardcoded list.
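As a rough illustration of that second point (a minimal sketch; the helper name and the exact extension sets are assumptions, not code from operations/dumps): once all dump content is compressed, status files and content files can be told apart by extension alone.

```python
import os

# Status/metadata files keep uncompressed extensions (html/json/txt),
# while dump content always carries a compression suffix (gz/bz2/7z).
STATUS_EXTS = {".html", ".json", ".txt"}
CONTENT_EXTS = {".gz", ".bz2", ".7z"}

def is_dump_content(filename):
    """Return True if the file looks like dump content, False if it looks
    like a status file; raise on anything unexpected."""
    ext = os.path.splitext(filename)[1].lower()
    if ext in CONTENT_EXTS:
        return True
    if ext in STATUS_EXTS:
        return False
    raise ValueError("unrecognized dump directory entry: %s" % filename)
```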

Event Timeline

Making a command decision and Just Doing This. Email sent to the xmldatadumps-l list. Unless I hear strenuous objections, it's going to happen for the next run on the 20th (or the 1st, if we don't quite get done in time for a second run this month; we were delayed a couple of days by the move of the xml/sql dumps to the new servers).

Change 392455 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/dumps@master] gzip compress output from api jobs and abstracts dumps

https://gerrit.wikimedia.org/r/392455
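For the general idea behind the patch (a sketch only, not the merged operations/dumps code; the producer command and paths are placeholders): instead of writing the abstracts/api job output to disk uncompressed, the output stream is piped through gzip.

```python
import subprocess

def run_job_with_gzip(producer_cmd, outfile_path):
    """Run producer_cmd and gzip its stdout into outfile_path + '.gz'."""
    with open(outfile_path + ".gz", "wb") as outfile:
        # Producer writes uncompressed XML/JSON to stdout; gzip compresses it
        # on the fly so the uncompressed data never hits the filesystem.
        producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
        gzipper = subprocess.Popen(["gzip", "-c"], stdin=producer.stdout,
                                   stdout=outfile)
        producer.stdout.close()  # let gzip see EOF when the producer exits
        gzipper.wait()
        producer.wait()
```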

Change 392455 merged by ArielGlenn:
[operations/dumps@master] gzip compress output from api jobs and abstracts dumps

https://gerrit.wikimedia.org/r/392455

Done, deployed, and it ran as expected; closing.