
Invalid description of Wikimedia Enterprise HTML Dumps
Open, Needs Triage · Public · 1 Estimated Story Points

Description

Reading here: https://dumps.wikimedia.org/other/enterprise_html/

It says:

Each dump output file consists of a tar.gz archive which, when uncompressed and untarred, contains one file

Emphasis mine. That is not true: the tar archive contains multiple files. For example, the enwiki archive contains:

enwiki_0.ndjson
enwiki_10.ndjson
enwiki_11.ndjson
enwiki_12.ndjson
enwiki_13.ndjson
enwiki_14.ndjson
enwiki_15.ndjson
enwiki_16.ndjson
enwiki_17.ndjson
enwiki_18.ndjson
enwiki_19.ndjson
enwiki_1.ndjson
enwiki_20.ndjson
enwiki_21.ndjson
enwiki_22.ndjson
enwiki_23.ndjson
enwiki_24.ndjson
enwiki_25.ndjson
enwiki_26.ndjson
enwiki_27.ndjson
enwiki_28.ndjson
enwiki_2.ndjson
enwiki_3.ndjson
enwiki_4.ndjson
enwiki_5.ndjson
enwiki_6.ndjson
enwiki_7.ndjson
enwiki_8.ndjson
enwiki_9.ndjson
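
To reproduce the listing above without unpacking the whole archive, here is a minimal sketch using Python's standard `tarfile` module. The helper names `list_ndjson_members` and `natural_key` are my own, and the archive path is whatever dump file you downloaded, passed on the command line. The natural sort is only there so that `enwiki_2.ndjson` is shown before `enwiki_10.ndjson`.

```python
import re
import sys
import tarfile


def list_ndjson_members(archive_path):
    """Return the names of the regular files inside a tar.gz archive."""
    # "r:gz" opens a gzip-compressed tar for reading; the member headers
    # are scanned without extracting any file contents to disk.
    with tarfile.open(archive_path, mode="r:gz") as archive:
        return [member.name for member in archive.getmembers() if member.isfile()]


def natural_key(name):
    """Sort key that compares digit runs as numbers, not as text."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python list_members.py <path to a downloaded dump .tar.gz>
    names = sorted(list_ndjson_members(sys.argv[1]), key=natural_key)
    print(f"{len(names)} files in archive")
    for name in names:
        print(name)
```

On a large dump this still has to read through the entire compressed stream to find every member header, so it can take a while, but it does not need extra disk space.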