
bzip2 wikidata dump incomplete
Closed, Resolved | Public | BUG REPORT

Description

Hello,

The bzip2 Wikidata dump for this week is here: https://dumps.wikimedia.org/wikidatawiki/entities/20250707/
Compare its size with the previous week's bzip2 dump here: https://dumps.wikimedia.org/wikidatawiki/entities/20250630/
This week's file is far too small (about 4.9 GB versus about 97 GB):

wikidata-20250707-all.json.bz2 09-Jul-2025 04:26 4913992760
wikidata-20250630-all.json.bz2 02-Jul-2025 08:28 96958600959

The gzip dumps, by contrast, are about the same size in both weeks.
The only plausible conclusion is that this week's bzip2 dump is incomplete.
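
The size comparison above can be made explicit with a small calculation. This is only an illustrative sketch: the byte counts are copied from the two listing lines in this report, while the helper names and the 0.9 threshold are assumptions chosen for the example, not part of any Wikimedia tooling.

```python
# Sizes in bytes, copied from the directory listings quoted in this report.
PREVIOUS_BZ2 = 96958600959  # wikidata-20250630-all.json.bz2
CURRENT_BZ2 = 4913992760    # wikidata-20250707-all.json.bz2


def size_ratio(current: int, previous: int) -> float:
    """Return the current dump size as a fraction of the previous one."""
    return current / previous


def looks_truncated(current: int, previous: int, min_ratio: float = 0.9) -> bool:
    """Flag a dump that shrank below min_ratio of the previous week's size.

    min_ratio is an arbitrary, hypothetical threshold used for illustration.
    """
    return size_ratio(current, previous) < min_ratio


if __name__ == "__main__":
    ratio = size_ratio(CURRENT_BZ2, PREVIOUS_BZ2)
    print(f"This week's bz2 dump is {ratio:.1%} of last week's size")
    print("Looks truncated:", looks_truncated(CURRENT_BZ2, PREVIOUS_BZ2))
```

The new file is only about 5% of the previous week's size, which is far outside normal week-to-week variation for a dump that otherwise grows slowly.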

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title: Increase the memory for wikibase->wikidata dumps
Reference: repos/data-engineering/airflow-dags!1582
Author: btullis
Source Branch: bump_wikibase_dump_resources
Dest Branch: main

Event Timeline

As @xcollazo mentioned, these issues are indeed the same. I'll post progress reports on T399077. Note that I have backfilled the data to https://dumps.wikimedia.org/other/wikibase/wikidatawiki/20250707/ so you should be able to use these files while we address the root cause.

Change #1172682 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/dumps@master] Use 'set -o pipefail' instead of 'set -e' in wikibase scripts

https://gerrit.wikimedia.org/r/1172682

Change #1172682 merged by Btullis:

[operations/dumps@master] Use 'set -o pipefail' instead of 'set -e' in wikibase scripts

https://gerrit.wikimedia.org/r/1172682
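
For readers unfamiliar with the two shell options named in the commit title, here is a minimal sketch of how they differ in general. It does not reproduce the actual wikibase scripts: `false | cat` is a hypothetical stand-in for a pipeline whose first stage fails while the last stage succeeds, and the sketch assumes `bash` is available on the PATH.

```python
import subprocess

# Hypothetical stand-in for a dump pipeline: the first stage fails,
# the last stage succeeds. The echo reports the pipeline's exit status.
PIPELINE = 'false | cat\necho "pipeline status: $?"\n'


def run_with(options: str) -> tuple:
    """Run the stand-in pipeline in bash under the given `set` options.

    Returns (script exit code, captured stdout).
    """
    script = options + "\n" + PIPELINE
    result = subprocess.run(
        ["bash", "-c", script], capture_output=True, text=True
    )
    return result.returncode, result.stdout.strip()


if __name__ == "__main__":
    for options in ("set -e", "set -o pipefail", "set -e -o pipefail"):
        print(f"{options!r:24} -> {run_with(options)}")
```

With `set -e` alone, the pipeline's status is that of its last command, so the failure of the first stage goes unnoticed and the script carries on. With `set -o pipefail`, the pipeline reports a non-zero status if any stage fails, which lets the script detect the failure.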

BTullis subscribed.

I think this should now be fixed. I'll move this to waiting while we monitor the latest dumps.