
Invalid description of Wikimedia Enterprise HTML Dumps
Closed, Resolved | Public | 1 Estimated Story Points

Description

Reading here: https://dumps.wikimedia.org/other/enterprise_html/

It says:

Each dump output file consists of a tar.gz archive which, when uncompressed and untarred, contains one file

Emphasis mine. That is not true: the tar archive contains multiple files. E.g.:

enwiki_0.ndjson
enwiki_10.ndjson
enwiki_11.ndjson
enwiki_12.ndjson
enwiki_13.ndjson
enwiki_14.ndjson
enwiki_15.ndjson
enwiki_16.ndjson
enwiki_17.ndjson
enwiki_18.ndjson
enwiki_19.ndjson
enwiki_1.ndjson
enwiki_20.ndjson
enwiki_21.ndjson
enwiki_22.ndjson
enwiki_23.ndjson
enwiki_24.ndjson
enwiki_25.ndjson
enwiki_26.ndjson
enwiki_27.ndjson
enwiki_28.ndjson
enwiki_2.ndjson
enwiki_3.ndjson
enwiki_4.ndjson
enwiki_5.ndjson
enwiki_6.ndjson
enwiki_7.ndjson
enwiki_8.ndjson
enwiki_9.ndjson
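
For anyone consuming these archives, here is a minimal Python sketch that reads every `.ndjson` member instead of assuming a single file. The names `iter_dump_records` and `chunk_index` are illustrative only and not part of any Wikimedia tooling. The `<wiki>_<number>.ndjson` naming is assumed from the listing above, and each line is assumed to be one JSON object.

```python
import json
import re
import tarfile


def iter_dump_records(fileobj):
    """Yield (member_name, record) for every NDJSON line in a tar.gz archive."""
    # "r:gz" opens a gzip-compressed tar; members are visited in archive order.
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        for member in archive:
            # Skip directories and anything that is not an NDJSON chunk.
            if not member.isfile() or not member.name.endswith(".ndjson"):
                continue
            extracted = archive.extractfile(member)
            for line in extracted:
                if line.strip():  # ignore blank lines
                    yield member.name, json.loads(line)


def chunk_index(name):
    """Return the numeric suffix of a name like 'enwiki_10.ndjson' (assumed pattern)."""
    match = re.search(r"_(\d+)\.ndjson$", name)
    return int(match.group(1)) if match else -1
```

To use it, pass a file opened in binary mode, e.g. `with open(path, "rb") as f: ...`, where `path` is wherever the dump was downloaded. The listing above is in lexicographic order (`enwiki_1` after `enwiki_19`); `sorted(names, key=chunk_index)` gives numeric order if that matters.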

Event Timeline

Protsack.stephan raised the priority of this task from Low to Needs Triage. Oct 12 2022, 9:42 AM
creynolds changed the task status from Open to In Progress. Edited Apr 9 2025, 11:53 PM
creynolds closed this task as Resolved.
creynolds claimed this task.

A lot has happened since this ticket but to update:

With all that said, I'm going to close this ticket, as the dumps page will be updated pending Gerrit approval(s) and deploy.

Oh, thanks. I didn't know that they are not made available anymore.

BTW, why has this directory been created after the cut-off date? I hope this does not mean that old snapshots will slowly get removed by some cron job which is creating these empty directories?

Hey @Mitar, those empty dirs shouldn't be created; a fix for the bug that creates them is actively being worked on. There's also a talk page update about this here.

Those empty dirs are still present: https://dumps.wikimedia.org/other/enterprise_html/runs/ Could they be removed, please, so that it is easy to find the latest of those dumps that remain available in the archive?