
Provide wikidata downloads as multiple files to make access more robust and efficient
Open, Low, Public

Description

Currently the only download option for Wikidata in JSON format is a single gzipped file (see e.g. the files under https://dumps.wikimedia.org/wikidatawiki/entities/), which is 5.4 GB compressed.

This makes it hard to reliably download the whole dump, to get just a subset, to download in parallel, or to mirror it on infrastructures designed for highly parallel downloads (e.g. clusters). In addition, 5.4 GB is too large to easily get into Amazon S3, which has a 5 GB limit for many of the most convenient forms of upload (a single PUT; anything larger requires multipart upload).

Note that the enwiki downloads, for example, are split into as many as 182 pieces, which makes them much easier to process.
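
Until split dumps are produced server-side, a mirror operator could split the single file client-side. Below is a minimal Python sketch of that idea (the file name, part size, and block size are illustrative assumptions, not the dump producer's tooling); it writes binary parts of roughly 4 GiB, safely under the 5 GB limit, which can be re-joined with `cat` to recover the original gzip file byte for byte.

```python
# Sketch: split a local copy of the single gzipped dump into ~4 GiB binary
# parts so each part can be uploaded or mirrored independently.
# File name and sizes below are assumptions for illustration.
DUMP = "latest-all.json.gz"
PART_SIZE = 4 * 1024 ** 3   # ~4 GiB per part, below the 5 GB single-PUT limit
BLOCK = 64 * 1024 ** 2      # stream in 64 MiB blocks to keep memory use low

with open(DUMP, "rb") as src:
    part, dst, written = 0, None, 0
    while True:
        block = src.read(BLOCK)
        if not block:
            break
        # Start a new part file when none is open or the current one is full.
        if dst is None or written >= PART_SIZE:
            if dst:
                dst.close()
            dst = open(f"{DUMP}.{part:03d}", "wb")
            part += 1
            written = 0
        dst.write(block)
        written += len(block)
    if dst:
        dst.close()
```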

Event Timeline

NealMcB created this task. Oct 11 2015, 6:56 PM
NealMcB raised the priority of this task from to Needs Triage.
NealMcB updated the task description.
NealMcB added a project: Wikidata.
NealMcB moved this task to incoming on the Wikidata board.
NealMcB added subscribers: NealMcB, Halfak.
Restricted Application added a subscriber: Aklapper. Oct 11 2015, 6:56 PM
Halfak updated the task description. Oct 11 2015, 6:57 PM
Halfak set Security to None.
hoo triaged this task as Low priority. Oct 11 2015, 6:57 PM
hoo added a subscriber: hoo.
Hydriz added a subscriber: Hydriz. Oct 17 2015, 8:08 AM

Note that the enwiki dump has split parts that are e.g. 7.5 GB in size. HTTP copes fine with downloading only a part of the full file, starting at an arbitrary offset.
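
For illustration, a minimal Python sketch of such a ranged download (the exact URL and byte range are placeholders, and the `requests` library is assumed to be available):

```python
import requests

# Fetch only a slice of the single large dump file via an HTTP range request.
url = "https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz"
headers = {"Range": "bytes=0-1048575"}   # first 1 MiB of the file

resp = requests.get(url, headers=headers, timeout=60)
resp.raise_for_status()
print(resp.status_code)    # 206 Partial Content if the server honours ranges
print(len(resp.content))   # number of bytes actually returned
```

Several such requests with non-overlapping ranges could run in parallel, which is essentially what split dump files would make trivial.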

Addshore moved this task from incoming to monitoring on the Wikidata board. Dec 4 2015, 1:19 PM
abian added a subscriber: abian. Mar 6 2017, 4:35 PM