Some records appear to be missing from the dumps. Are the dumps incomplete, or am I doing something wrong? I'm new to this, so apologies if I've missed something obvious!
Steps to replicate the issue (include links if applicable):
- Download the latest bz2 dump from https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2 (at the time of writing, this was the file produced on 25-Oct-2023 19:30, 85,161,509,921 bytes in size).
- Search it for Q5085764: lbzcat latest-all.json.bz2 | grep Q5085764
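As a cross-check that does not depend on lbzip2 or grep, the same search could be sketched in Python. This is only a minimal sketch, not something from the dump tooling. It assumes the dump holds one JSON entity per line, each line ending in a comma, wrapped in a "[" line and a "]" line. The helper name `find_entity` is made up for illustration. Python's standard `bz2` module reads multi-stream files (Python 3.3 and later), decompressing them serially.

```python
import bz2
import json


def find_entity(dump_path, entity_id):
    """Return the parsed entity whose "id" equals entity_id, or None.

    Assumes one JSON object per line, as in the Wikidata JSON dumps.
    """
    # bz2.open handles concatenated (multi-stream) bz2 data, one stream after another.
    with bz2.open(dump_path, mode="rt", encoding="utf-8") as dump:
        for raw in dump:
            # Drop the newline and the trailing comma that separates array elements.
            line = raw.strip().rstrip(",")
            # Skip blank lines and the enclosing "[" / "]" lines.
            if line in ("", "[", "]"):
                continue
            # Cheap substring filter so that only candidate lines are parsed.
            if entity_id not in line:
                continue
            entity = json.loads(line)
            # The substring also matches longer IDs and mentions inside other
            # entities, so compare the top-level "id" field exactly.
            if entity.get("id") == entity_id:
                return entity
    return None
```

Calling `find_entity("latest-all.json.bz2", "Q5085764")` would return the entity if it is present and `None` otherwise. Unlike the plain grep, it only counts a line whose own "id" is Q5085764, so mentions of that ID inside other entities do not count as a hit. Expect it to be much slower than lbzcat on a file of this size.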
What happens?:
No matching records are found.
What should have happened instead?:
A record corresponding to https://www.wikidata.org/wiki/Q5085764 should have been emitted.
Software version (skip for WMF-hosted wikis like Wikipedia):
I'm using lbzip2 on Ubuntu 22.04. I also tried stock bzip2.
(Aside: https://www.wikidata.org/wiki/Wikidata:Database_download says: "Note that the files are using parallel compression, which means that some decompressors cannot reliably unpack the files." It is not clear to me which of two things this means: that some decompressors will unpack the files correctly but serially, or that some will corrupt the output. In any event, I believe lbzip2 is one of the packages recommended there for Unix.)
Thanks for any pointers you can offer!