
Provide Wikidata downloads as multiple files to make access more robust and efficient
Open, Low, Public

Description

Currently the only download option for Wikidata in JSON format is a single gzipped file (see e.g. the files under https://dumps.wikimedia.org/wikidatawiki/entities/), which is 5.4 GB compressed.

This makes it hard to reliably get it all, to get just a subset, to download it in parallel, or to mirror it on other infrastructures designed for highly parallel downloads (e.g. clusters). In addition, 5.4 GB is too large to easily get into Amazon S3, which has a 5 GB limit for many of the most convenient forms of upload.

Note that e.g. the enwiki downloads are split into up to 182 pieces, which makes them much easier to process.

Event Timeline

NealMcB raised the priority of this task to Needs Triage.
NealMcB updated the task description.
NealMcB added a project: Wikidata.
NealMcB moved this task to incoming on the Wikidata board.
NealMcB added subscribers: NealMcB, Halfak.
Halfak set Security to None.
hoo triaged this task as Low priority. Oct 11 2015, 6:57 PM
hoo added a subscriber: hoo.

Note that the enwiki dump has split parts that are e.g. 7.5 GB in size. HTTP copes fine with downloading only a part of a file, starting at a given offset (via range requests).
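For illustration, a minimal sketch of such a range request in Python (standard library only; the file name and the 100 MiB range are arbitrary examples):

```python
from urllib.request import Request, urlopen

# Fetch only the first 100 MiB of the dump via an HTTP range request.
url = "https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2"
req = Request(url, headers={"Range": "bytes=0-104857599"})

with urlopen(req) as resp, open("latest-all.part0", "wb") as out:
    assert resp.status == 206  # 206 Partial Content: the server honoured the range
    while chunk := resp.read(1 << 20):
        out.write(chunk)
```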

The recommended download format continues to be JSON as discussed at
https://www.wikidata.org/wiki/Wikidata:Database_download

Since this was reported in 2015, the smallest version of the "latest-all" database has grown more than tenfold from 5.4 GB to 64 GB in size, making the usage challenges far greater. From https://dumps.wikimedia.org/wikidatawiki/entities/:

latest-all.json.bz2 31-Mar-2021 17:03 64697800080

Others are running into these issues, motivating the duplicate task T278204, which was recently merged into this one. They note that

dumps are currently in fact already produced by multiple shards and then combined into one file

and

There are already no guarantees on the order of documents in dumps

making it seem even more reasonable to provide them as multiple files rather than a single file.

What would it take to resolve this issue? How can we help?

Thank you for redirecting me to this issue. As I mentioned in T278204, my main motivation is in fact not downloading in parallel but processing in parallel. Just decompressing that large file takes half a day on my machine. If I could instead use 12 machines on 12 splits, for example, I could do that decompression (or some other processing) in about an hour instead.
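To make that concrete, here is a rough sketch of the desired workflow, assuming hypothetical split files named latest-all-part*.json.bz2 (no such files are published today) and one JSON entity per line, as in the current dump layout:

```python
import bz2
import glob
from concurrent.futures import ProcessPoolExecutor

def count_entities(path: str) -> int:
    """Decompress one split and count its entities (one JSON object per line)."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        return sum(1 for line in f if line.lstrip().startswith("{"))

if __name__ == "__main__":
    parts = sorted(glob.glob("latest-all-part*.json.bz2"))
    with ProcessPoolExecutor(max_workers=12) as pool:
        total = sum(pool.map(count_entities, parts))
    print(f"{total} entities across {len(parts)} splits")
```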

I am realizing that maybe the problem is just that bzip2 compression is not multistream but single-stream. Moreover, using newer compression algorithms like zstd might reduce decompression time even further, removing the need for multiple files altogether. See https://phabricator.wikimedia.org/T222985#7163885

In fact, this is not a problem; see https://phabricator.wikimedia.org/T222985#7164507

pbzip2 is problematic: it cannot decompress in parallel files that were not compressed with pbzip2. But lbzip2 can. So using lbzip2 makes decompression of single-file dumps fast, and I am not sure multiple files would actually be faster.
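A sketch of that single-file route, assuming lbzip2 is installed (-d decompresses, -c writes to stdout, -n sets the thread count), streaming the decompressed dump straight into a consumer:

```python
import subprocess

# Stream-decompress the single-file dump with lbzip2 so all cores are used,
# and consume the JSON lines as they arrive (here just counting entities).
proc = subprocess.Popen(
    ["lbzip2", "-d", "-c", "-n", "12", "latest-all.json.bz2"],
    stdout=subprocess.PIPE,
)
entities = sum(1 for line in proc.stdout if line.lstrip().startswith(b"{"))
proc.wait()
print(entities, "entities")
```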

I learned today that Wikipedia has a nice approach with a multistream bz2 archive and an additional index file, which tells you the offset into the bz2 archive of the chunk you have to decompress to access a particular page. Wikidata could do the same, just for items and properties. This would allow one to extract only the entities one cares about. Multistream also enables one to decompress parts of the file in parallel on multiple machines, by distributing offsets among them. Wikipedia also provides the same multistream archive as multiple files, so that one can distribute the whole dump over multiple machines even more easily. I like that approach.
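For reference, a sketch of how the existing Wikipedia multistream dumps can be used this way (not Wikidata-specific; it assumes index lines of the form offset:page_id:title, each offset marking the start of an independent bz2 stream):

```python
import bz2

def read_block(dump_path, start, end=None):
    """Decompress the independent bz2 stream that begins at byte offset `start`.

    `end` is the next offset listed in the index; if it is unknown, read to EOF
    and let the decompressor stop at the end of its own stream.
    """
    with open(dump_path, "rb") as f:
        f.seek(start)
        raw = f.read(-1 if end is None else end - start)
    return bz2.BZ2Decompressor().decompress(raw)

# Hypothetical usage, with offsets taken from the accompanying index file:
# xml = read_block("enwiki-latest-pages-articles-multistream.xml.bz2",
#                  start=614, end=632461).decode("utf-8")
```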