
Consider compressing uncompressed dump files (abstracts, siteinfo-namespaces)
Closed, Resolved, Public

Description

The abstracts dump is quite large for some sites; e.g. enwiki's is 5G, which would be only 660M with gzip compression. Similarly, wikidatawiki abstracts are 59GB now, but only 4.1G gzipped.

Compression would also be nice even for small files such as the siteinfo-namespaces dump because we could then easily distinguish between status files (html/json/txt) and dump content (gz/bz2/7z) without a hardcoded list.
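As a rough illustration of that second point (a minimal sketch; the helper name and the exact extension sets are assumptions, not code from operations/dumps): once all dump content is compressed, status files and content files can be told apart by extension alone.

```python
import os

# Status/metadata files keep uncompressed extensions (html/json/txt),
# while dump content always carries a compression suffix (gz/bz2/7z).
STATUS_EXTS = {".html", ".json", ".txt"}
CONTENT_EXTS = {".gz", ".bz2", ".7z"}

def is_dump_content(filename):
    """Return True if the file looks like dump content, False if it looks
    like a status file; raise on anything unexpected."""
    ext = os.path.splitext(filename)[1].lower()
    if ext in CONTENT_EXTS:
        return True
    if ext in STATUS_EXTS:
        return False
    raise ValueError("unrecognized dump directory entry: %s" % filename)
```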

Event Timeline

Making a command decision and Just Doing This. Email sent to the xmldatadumps-l list. Unless I hear strenuous objections, it's going to happen for the next run on the 20th (or the 1st, if we don't quite get done in time for a second run this month; we were delayed a couple of days by the move of the xml/sql dumps to the new servers).

Change 392455 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/dumps@master] gzip compress output from api jobs and abstracts dumps

https://gerrit.wikimedia.org/r/392455
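For the general idea behind the patch (a sketch only, not the merged operations/dumps code; the producer command and paths are placeholders): instead of writing the abstracts/api job output to disk uncompressed, the output stream is piped through gzip.

```python
import subprocess

def run_job_with_gzip(producer_cmd, outfile_path):
    """Run producer_cmd and gzip its stdout into outfile_path + '.gz'."""
    with open(outfile_path + ".gz", "wb") as outfile:
        # Producer writes uncompressed XML/JSON to stdout; gzip compresses it
        # on the fly so the uncompressed data never hits the filesystem.
        producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
        gzipper = subprocess.Popen(["gzip", "-c"], stdin=producer.stdout,
                                   stdout=outfile)
        producer.stdout.close()  # let gzip see EOF when the producer exits
        gzipper.wait()
        producer.wait()
```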

Change 392455 merged by ArielGlenn:
[operations/dumps@master] gzip compress output from api jobs and abstracts dumps

https://gerrit.wikimedia.org/r/392455

Done, deployed, and it ran as expected; closing.