
wikidata-20250707-all.json.gz is corrupted
Closed, ResolvedPublicBUG REPORT

Description

The latest wikidata entities dump (at least the file mounted on the Toolforge hosts; I have not tried this with a downloaded copy, though the SHA1 hash of the affected file matches the published sum) does not decompress cleanly:

gzip: /public/dumps/public/wikidatawiki/entities/20250707/wikidata-20250707-all.json.gz: invalid compressed data--format violated

Reading it with the Rust flate2 crate reports a “corrupt deflate stream” error.
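For anyone wanting to check a copy before streaming the whole 100+ GB file, `gzip -t` walks the entire compressed stream and reports exactly this class of format violation without writing any output. A small self-contained sketch (the file names here are synthetic stand-ins, not the real dump paths):

```shell
# Create a valid gzip file, then truncate it to simulate the corruption.
printf 'some dump content\n' | gzip > good.json.gz
head -c 12 good.json.gz > truncated.json.gz

# gzip -t decompresses to nowhere, only checking stream integrity.
gzip -t good.json.gz && echo "good: OK"
gzip -t truncated.json.gz 2>/dev/null || echo "truncated: corrupt"
```

On the real dump, this reports the same "invalid compressed data" error as a full decompression, but can be run directly against the mounted file.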

Event Timeline

Thanks for the report.

The internal file doesn't seem to match the public one.

Internally we have:

$ hostname -f
dumpsdata1003.eqiad.wmnet
$ pwd
/data/otherdumps/wikidata
$ md5sum 20250707.json.gz
07660efc917c754cf00429fda3c1ca0e  20250707.json.gz

while our publicly advertised md5 reads:

da634f14ea8194efcb90454e914ef018  wikidata-20250707-all.json.gz
2e2ea8541eb32a7f15365e2de14e59e2  wikidata-20250707-all.json.bz2

But the internal md5 sums file does match the internal file:

xcollazo@dumpsdata1003:/data/otherdumps/wikibase/wikidatawiki/20250707$ cat wikidata-20250707-md5sums.txt 
07660efc917c754cf00429fda3c1ca0e  wikidata-20250707-all.json.gz
a90a4cd1f751c500c5025ae05b31da88  wikidata-20250707-all.json.bz2

I know there were recent issues because we internally switched the way we produce these dumps. @BTullis could this be an rsync issue?

Thanks for the report! We have identified an issue with the job that generates the wikibase dumps, which could lead to file corruption when it runs out of memory. I am going to remove the corrupted file, backfill it from another location, and fix the root cause.

I have backfilled the non-corrupted dumps to https://dumps.wikimedia.org/other/wikibase/wikidatawiki/20250707/

brouberol@dumpsdata1003:~$ rsync -a /data/otherdumps/wikibase/wikidatawiki/20250707/wikidata-20250707-all.json.gz clouddumps1002.wikimedia.org::data/xmldatadumps/public/other/wikibase/wikidatawiki/20250707/wikidata-20250707-all.json.gz

I will now backfill them on clouddumps1001 as well (the standby host for dumps).

The backfill to clouddumps1001 is done as well.

Change #1170459 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/dumps@master] dumpwikibasejson: ensure the dump script exits after any error

https://gerrit.wikimedia.org/r/1170459

Change #1170459 merged by Brouberol:

[operations/dumps@master] dumpwikibasejson: ensure the dump script exits after any error

https://gerrit.wikimedia.org/r/1170459

Not sure if this is related, but the latest Lexemes dump also appears to be empty:

https://dumps.wikimedia.org/wikidatawiki/entities/20250723/

Change #1172682 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/dumps@master] Use 'set -o pipefail' instead of 'set -e' in wikibase scripts

https://gerrit.wikimedia.org/r/1172682

Change #1172682 merged by Btullis:

[operations/dumps@master] Use 'set -o pipefail' instead of 'set -e' in wikibase scripts

https://gerrit.wikimedia.org/r/1172682

Thanks @DVrandecic - I have now back-filled these from the backup copy of the wikibase dumps that are still running on snapshot1016. It's detailed here: T400383: Recent wikibase RDF dumps on Airflow have failed

I think that https://gerrit.wikimedia.org/r/1172682 should fix the corrupted-files issue, although unfortunately I have had to disable the set -e that was added last week.
While it would be ideal to keep that set, I think the scripts that carry out the wikibase backups aren't quite robust enough to use it yet.

At least setting set -o pipefail at the top level of the script should cause the compression part of the script...

gzip -dc "$targetDir/$filename.$dumpFormat.gz" | "$lbzip2" -n $nthreads -c > $tempDir/$projectName$dumpFormat-$dumpName.bz2

...to be included in the scope of that option. Previously it had only been set in the sub-shells.
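To illustrate why pipefail matters for that pipeline: in bash, a plain pipeline's exit status is that of the last command, so a failing decompression feeding a compressor that exits cleanly goes unnoticed even under `set -e`. A minimal sketch, with `cat` standing in for the lbzip2 stage and a deliberately missing input file:

```shell
#!/usr/bin/env bash
# Without pipefail: the pipeline's status is `cat`'s (0), masking the
# gzip failure. With pipefail: the rightmost non-zero status (1) wins.
# `no-such-dump.json.gz` is purely illustrative.

without_pipefail() (
  gzip -dc no-such-dump.json.gz 2>/dev/null | cat > /dev/null
  echo $?   # status of `cat`, which succeeded
)

with_pipefail() (
  set -o pipefail
  gzip -dc no-such-dump.json.gz 2>/dev/null | cat > /dev/null
  echo $?   # rightmost non-zero status in the pipeline
)

echo "without pipefail: $(without_pipefail)"  # prints 0
echo "with pipefail:    $(with_pipefail)"     # prints 1
```

The same masking happens if gzip dies partway through a real stream (e.g. under memory pressure): lbzip2 happily compresses the truncated input and the pipeline "succeeds", which matches the corruption pattern seen in this task.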