
latest-all.json.bz2 does not contain a record for Charlies Bunion (Q5085764)
Open, Needs Triage, Public, BUG REPORT

Description

Some records appear to be missing from the dumps. Is it possible the dumps are incomplete, or am I doing something wrong? I'm new, so my apologies if I'm missing something obvious!

Steps to replicate the issue (include links if applicable):

  1. Download the latest bz2 dump from https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2. At the time of this writing, that was the file produced on 25-Oct-2023 19:30, 85,161,509,921 bytes in size.
  2. Search it for Q5085764: lbzcat latest-all.json.bz2 | grep Q5085764
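For context on why a line-oriented grep should work at all: the documented layout of the JSON dumps is a single JSON array with one entity per line, so each line (minus its trailing comma) can be parsed independently. A minimal Python sketch of the same search, streaming the dump rather than decompressing it to disk (the filename and QID are just the ones from this report):

```python
import bz2
import json

def find_entity(dump_path, qid):
    """Stream a Wikidata JSON dump (one entity per line, wrapped in
    "[" ... "]") and return the parsed record for `qid`, or None."""
    with bz2.open(dump_path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if line in ("[", "]") or not line:
                continue  # skip the array brackets and blank lines
            entity = json.loads(line)
            if entity.get("id") == qid:
                return entity
    return None

# e.g. find_entity("latest-all.json.bz2", "Q5085764")
```

Matching on the parsed `"id"` field also avoids the false positives a plain grep can produce when the QID appears inside another entity's claims.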

What happens?:

No records are found.

What should have happened instead?:

A record corresponding to https://www.wikidata.org/wiki/Q5085764 should have been emitted.

Software version (skip for WMF-hosted wikis like Wikipedia):

I'm using lbzip2 on Ubuntu 22.04. I also tried stock bzip2.

(Aside: https://www.wikidata.org/wiki/Wikidata:Database_download says: "Note that the files are using parallel compression, which means that some decompressors cannot reliably unpack the files." It was not clear to me what this means: that some decompressors will decompress the files correctly but serially, or that some will corrupt the output. In any event, lbzip2 is one of the blessed packages for Unix, I think.)

Thanks for any pointers you can offer!

Event Timeline

As an alternative, I tried the XML dumps. The record does seem to be present there, if that's useful information:

$ wget https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream6.xml-p4469005p5969004.bz2
$ bzcat wikidatawiki-latest-pages-articles-multistream6.xml-p4469005p5969004.bz2 | grep Q5085764
[ ... spew of XML, ultimately containing this JSON for that record: https://gist.github.com/cldellow/eed20dc70bd13f24b32c1d8d4728a0f7 ]
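Since this is a multistream dump, it should also be possible to avoid decompressing the whole file: multistream dumps ship a companion index file whose lines have the form "offset:pageid:title", and each offset marks the start of an independent bz2 stream. A sketch, assuming that layout (offset lookup from the index is left out here):

```python
import bz2

def read_stream_at(dump_path, offset, chunk_size=1 << 20):
    """Decompress the single bz2 stream that starts at byte `offset` of a
    multistream dump, returning it as text. Offsets come from the
    companion -index.txt file ("offset:pageid:title" lines)."""
    out = []
    decomp = bz2.BZ2Decompressor()
    with open(dump_path, "rb") as f:
        f.seek(offset)
        # A BZ2Decompressor stops at the end of one stream, so this yields
        # just the block of pages containing the target, not the whole dump.
        while not decomp.eof:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            out.append(decomp.decompress(chunk))
    return b"".join(out).decode("utf-8")
```

The returned text is an XML fragment containing a batch of `<page>` elements, which can then be searched for the QID.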

This will allow me to work around the issue in the larger dumps.

The only wrinkle is that the format is not quite as well normalized. As https://www.wikidata.org/wiki/Wikidata:Database_download says:

The format of the JSON data embedded in the XML dumps is subject to change without notice, and may be inconsistent between revisions. It should be treated as opaque binary data. It is strongly recommended to use the JSON or RDF dumps instead, which use canonical representations of the data!

And indeed, Wikipedia links are present in the sitelinks key, not as property P4656.

Indeed, there are quite a few differences between the different pipelines. When the Wikidata folks look at this, do ping us: we have been working on a new dumps process and migrating other dumps to our Airflow scheduler. cc @VirginiaPoundstone