Recent wikibase RDF dumps on Airflow have failed
Closed, Resolved (Public)

Description

We have recently seen some wikibase RDF dumps fail. The /usr/local/bin/dumpwikibaserdf.sh script exits with status 1 and produces no output.

www-data@mediawiki-dumps-legacy-toolbox-f58f6fd45-rw86r:~$ /usr/local/bin/dumpwikibaserdf.sh --project wikidata --dump all --continue --format ttl --extra nt
www-data@mediawiki-dumps-legacy-toolbox-f58f6fd45-rw86r:~$ echo $?
1
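
A silent exit with status 1 is what a bash script running under set -e produces when any unguarded command fails. One way to surface the failing command is an ERR trap. The following is only a sketch: the demo script is a hypothetical stand-in, not the contents of dumpwikibaserdf.sh.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a script that dies silently under `set -e`.
demo_script=$(mktemp)
cat > "$demo_script" <<'EOF'
set -eE
# Report the failing command and its status on stderr before the shell exits.
trap 'echo "failed: ${BASH_COMMAND} (status $?)" >&2' ERR
true
false
echo "not reached"
EOF

# Run the stand-in and capture both its combined output and its exit status.
if trace=$(bash "$demo_script" 2>&1); then
  exit_status=0
else
  exit_status=$?
fi
rm -f "$demo_script"

echo "exit status: $exit_status"
echo "trace: $trace"
```

Without the trap line the stand-in would exit with the same status and print nothing at all, which matches the symptom above.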

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title: Dumps_v1: Disable automatic continuation for the wikibase dumps
Reference: repos/data-engineering/airflow-dags!1580
Author: btullis
Source Branch: wikibase_disable_continue
Dest Branch: main

Event Timeline

BTullis triaged this task as High priority. Jul 24 2025, 3:45 PM
BTullis added a subscriber: HShaikh.

This behaviour is not seen when the --continue flag is omitted from the command-line, so I believe that it is something to do with this continuation logic.
I will remove it and re-run the jobs.

I have not been able to get these to work, either by removing the --continue option or by increasing the memory.

I'm currently back-filling the 20250723 directory with the following command on dumpsdata1003.

dumpsgen@dumpsdata1003:/data/otherdumps/wikibase/wikidatawiki/20250723$ rsync -av --info=progress2 ./ clouddumps1001.wikimedia.org::data/xmldatadumps/public/other/wikibase/wikidatawiki/20250723/
sending incremental file list
./
wikidata-20250723-lexemes.json.bz2
    382,765,811   0%  332.43MB/s    0:00:01 (xfr#1, to-chk=5/7)
wikidata-20250723-lexemes.json.gz
    901,237,041   0%  361.72MB/s    0:00:02 (xfr#2, to-chk=4/7)
wikidata-20250723-md5sums.txt
    901,237,319   0%  361.57MB/s    0:00:02 (xfr#3, to-chk=3/7)
wikidata-20250723-sha1sums.txt
    901,237,629   0%  361.57MB/s    0:00:02 (xfr#4, to-chk=2/7)
wikidata-20250723-truthy-BETA.nt.bz2
  5,496,916,861   4%  718.75MB/s    0:02:23  
...

I have also done the same for clouddumps1002.
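
After a manual back-fill like this, it may be worth checking the copied files against the checksum list that ships in the same directory. This is a minimal sketch, assuming the list uses the standard "<md5>  <filename>" layout that md5sum -c reads (the real format of wikidata-20250723-md5sums.txt is not shown in this task). The function name verify_dump_dir and the demo files are hypothetical.

```shell
#!/usr/bin/env bash
# Verify every file named in a checksum list, relative to the dump directory.
# Assumption: the list is in the "<md5>  <filename>" format understood by `md5sum -c`.
verify_dump_dir() {
  local dir=$1 sums=$2
  ( cd "$dir" && md5sum -c "$sums" >/dev/null 2>&1 )
}

# Self-contained demo on throwaway files rather than real dumps.
demo_dir=$(mktemp -d)
printf 'example payload\n' > "$demo_dir/example.json.gz"
( cd "$demo_dir" && md5sum example.json.gz > md5sums.txt )

if verify_dump_dir "$demo_dir" md5sums.txt; then verify_ok=yes; else verify_ok=no; fi

# Simulate a damaged transfer by overwriting the file after the list was made.
printf 'truncated' > "$demo_dir/example.json.gz"
if verify_dump_dir "$demo_dir" md5sums.txt; then verify_damaged=yes; else verify_damaged=no; fi

echo "intact copy verifies: $verify_ok"
echo "damaged copy verifies: $verify_damaged"
rm -rf "$demo_dir"
```

On a real host this would be called as something like verify_dump_dir 20250723 wikidata-20250723-md5sums.txt from the wikidatawiki directory, assuming the list format matches.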

I have fixed up the symlinks in https://dumps.wikimedia.org/other/wikibase/wikidatawiki/

lrwxrwxrwx 1 dumpsgen dumpsgen    39 Jul 23 09:37 latest-all.json.bz2 -> 20250721/wikidata-20250721-all.json.bz2
lrwxrwxrwx 1 dumpsgen dumpsgen    38 Jul 23 03:14 latest-all.json.gz -> 20250721/wikidata-20250721-all.json.gz
lrwxrwxrwx 1 dumpsgen dumpsgen    42 Jul 17 00:55 latest-all.nt.bz2 -> 20250714/wikidata-20250714-all-BETA.nt.bz2
lrwxrwxrwx 1 dumpsgen dumpsgen    41 Jul 17 00:55 latest-all.nt.gz -> 20250714/wikidata-20250714-all-BETA.nt.gz
lrwxrwxrwx 1 dumpsgen dumpsgen    43 Jul 17 00:45 latest-all.ttl.bz2 -> 20250714/wikidata-20250714-all-BETA.ttl.bz2
lrwxrwxrwx 1 dumpsgen dumpsgen    42 Jul 16 21:10 latest-all.ttl.gz -> 20250714/wikidata-20250714-all-BETA.ttl.gz
lrwxrwxrwx 1 dumpsgen dumpsgen    43 Jul 16 03:50 latest-lexemes.json.bz2 -> 20250716/wikidata-20250716-lexemes.json.bz2
lrwxrwxrwx 1 dumpsgen dumpsgen    42 Jul 16 03:49 latest-lexemes.json.gz -> 20250716/wikidata-20250716-lexemes.json.gz
lrwxrwxrwx 1 dumpsgen dumpsgen    46 Jul 18 23:51 latest-lexemes.nt.bz2 -> 20250718/wikidata-20250718-lexemes-BETA.nt.bz2
lrwxrwxrwx 1 dumpsgen dumpsgen    45 Jul 18 23:51 latest-lexemes.nt.gz -> 20250718/wikidata-20250718-lexemes-BETA.nt.gz
lrwxrwxrwx 1 dumpsgen dumpsgen    47 Jul 18 23:51 latest-lexemes.ttl.bz2 -> 20250718/wikidata-20250718-lexemes-BETA.ttl.bz2
lrwxrwxrwx 1 dumpsgen dumpsgen    46 Jul 18 23:49 latest-lexemes.ttl.gz -> 20250718/wikidata-20250718-lexemes-BETA.ttl.gz
lrwxrwxrwx 1 dumpsgen dumpsgen    45 Jun 27 21:07 latest-truthy.nt.bz2 -> 20250625/wikidata-20250625-truthy-BETA.nt.bz2
lrwxrwxrwx 1 dumpsgen dumpsgen    44 Jun 27 18:09 latest-truthy.nt.gz -> 20250625/wikidata-20250625-truthy-BETA.nt.gz
dumpsgen@clouddumps1001:/srv/dumps/xmldatadumps/public/other/wikibase/wikidatawiki$ rm latest-lexemes.json.bz2 latest-lexemes.json.gz latest-truthy.nt.bz2 latest-truthy.nt.gz
dumpsgen@clouddumps1001:/srv/dumps/xmldatadumps/public/other/wikibase/wikidatawiki$ ln -s 20250723/wikidata-20250723-lexemes.json.bz2 latest-lexemes.json.bz2 ; ln -s 20250723/wikidata-20250723-lexemes.json.gz latest-lexemes.json.gz ; ln -s 20250723/wikidata-20250723-truthy-BETA.nt.bz2 latest-truthy.nt.bz2 ; ln -s 20250723/wikidata-20250723-truthy-BETA.nt.gz latest-truthy.nt.gz
dumpsgen@clouddumps1001:/srv/dumps/xmldatadumps/public/other/wikibase/wikidatawiki$ ls -l latest*
lrwxrwxrwx 1 dumpsgen dumpsgen 39 Jul 23 09:37 latest-all.json.bz2 -> 20250721/wikidata-20250721-all.json.bz2
lrwxrwxrwx 1 dumpsgen dumpsgen 38 Jul 23 03:14 latest-all.json.gz -> 20250721/wikidata-20250721-all.json.gz
lrwxrwxrwx 1 dumpsgen dumpsgen 42 Jul 17 00:55 latest-all.nt.bz2 -> 20250714/wikidata-20250714-all-BETA.nt.bz2
lrwxrwxrwx 1 dumpsgen dumpsgen 41 Jul 17 00:55 latest-all.nt.gz -> 20250714/wikidata-20250714-all-BETA.nt.gz
lrwxrwxrwx 1 dumpsgen dumpsgen 43 Jul 17 00:45 latest-all.ttl.bz2 -> 20250714/wikidata-20250714-all-BETA.ttl.bz2
lrwxrwxrwx 1 dumpsgen dumpsgen 42 Jul 16 21:10 latest-all.ttl.gz -> 20250714/wikidata-20250714-all-BETA.ttl.gz
lrwxrwxrwx 1 dumpsgen dumpsgen 43 Jul 25 16:39 latest-lexemes.json.bz2 -> 20250723/wikidata-20250723-lexemes.json.bz2
lrwxrwxrwx 1 dumpsgen dumpsgen 42 Jul 25 16:39 latest-lexemes.json.gz -> 20250723/wikidata-20250723-lexemes.json.gz
lrwxrwxrwx 1 dumpsgen dumpsgen 46 Jul 18 23:51 latest-lexemes.nt.bz2 -> 20250718/wikidata-20250718-lexemes-BETA.nt.bz2
lrwxrwxrwx 1 dumpsgen dumpsgen 45 Jul 18 23:51 latest-lexemes.nt.gz -> 20250718/wikidata-20250718-lexemes-BETA.nt.gz
lrwxrwxrwx 1 dumpsgen dumpsgen 47 Jul 18 23:51 latest-lexemes.ttl.bz2 -> 20250718/wikidata-20250718-lexemes-BETA.ttl.bz2
lrwxrwxrwx 1 dumpsgen dumpsgen 46 Jul 18 23:49 latest-lexemes.ttl.gz -> 20250718/wikidata-20250718-lexemes-BETA.ttl.gz
lrwxrwxrwx 1 dumpsgen dumpsgen 45 Jul 25 16:39 latest-truthy.nt.bz2 -> 20250723/wikidata-20250723-truthy-BETA.nt.bz2
lrwxrwxrwx 1 dumpsgen dumpsgen 44 Jul 25 16:39 latest-truthy.nt.gz -> 20250723/wikidata-20250723-truthy-BETA.nt.gz
dumpsgen@clouddumps1001:/srv/dumps/xmldatadumps/public/other/wikibase/wikidatawiki$

And the same on the active web host.

dumpsgen@clouddumps1002:/srv/dumps/xmldatadumps/public/other/wikibase/wikidatawiki$ ls -l latest*
lrwxrwxrwx 1 dumpsgen dumpsgen 39 Jul 23 09:37 latest-all.json.bz2 -> 20250721/wikidata-20250721-all.json.bz2
lrwxrwxrwx 1 dumpsgen dumpsgen 38 Jul 23 03:14 latest-all.json.gz -> 20250721/wikidata-20250721-all.json.gz
lrwxrwxrwx 1 dumpsgen dumpsgen 42 Jul 17 00:55 latest-all.nt.bz2 -> 20250714/wikidata-20250714-all-BETA.nt.bz2
lrwxrwxrwx 1 dumpsgen dumpsgen 41 Jul 17 00:55 latest-all.nt.gz -> 20250714/wikidata-20250714-all-BETA.nt.gz
lrwxrwxrwx 1 dumpsgen dumpsgen 43 Jul 17 00:45 latest-all.ttl.bz2 -> 20250714/wikidata-20250714-all-BETA.ttl.bz2
lrwxrwxrwx 1 dumpsgen dumpsgen 42 Jul 16 21:10 latest-all.ttl.gz -> 20250714/wikidata-20250714-all-BETA.ttl.gz
lrwxrwxrwx 1 dumpsgen dumpsgen 43 Jul 16 03:50 latest-lexemes.json.bz2 -> 20250716/wikidata-20250716-lexemes.json.bz2
lrwxrwxrwx 1 dumpsgen dumpsgen 42 Jul 16 03:49 latest-lexemes.json.gz -> 20250716/wikidata-20250716-lexemes.json.gz
lrwxrwxrwx 1 dumpsgen dumpsgen 46 Jul 18 23:51 latest-lexemes.nt.bz2 -> 20250718/wikidata-20250718-lexemes-BETA.nt.bz2
lrwxrwxrwx 1 dumpsgen dumpsgen 45 Jul 18 23:51 latest-lexemes.nt.gz -> 20250718/wikidata-20250718-lexemes-BETA.nt.gz
lrwxrwxrwx 1 dumpsgen dumpsgen 47 Jul 18 23:51 latest-lexemes.ttl.bz2 -> 20250718/wikidata-20250718-lexemes-BETA.ttl.bz2
lrwxrwxrwx 1 dumpsgen dumpsgen 46 Jul 18 23:49 latest-lexemes.ttl.gz -> 20250718/wikidata-20250718-lexemes-BETA.ttl.gz
lrwxrwxrwx 1 dumpsgen dumpsgen 45 Jun 27 21:07 latest-truthy.nt.bz2 -> 20250625/wikidata-20250625-truthy-BETA.nt.bz2
lrwxrwxrwx 1 dumpsgen dumpsgen 44 Jun 27 18:09 latest-truthy.nt.gz -> 20250625/wikidata-20250625-truthy-BETA.nt.gz
dumpsgen@clouddumps1002:/srv/dumps/xmldatadumps/public/other/wikibase/wikidatawiki$ rm latest-lexemes.json.bz2 latest-lexemes.json.gz latest-truthy.nt.bz2 latest-truthy.nt.gz
dumpsgen@clouddumps1002:/srv/dumps/xmldatadumps/public/other/wikibase/wikidatawiki$ ln -s 20250723/wikidata-20250723-lexemes.json.bz2 latest-lexemes.json.bz2 ; ln -s 20250723/wikidata-20250723-lexemes.json.gz latest-lexemes.json.gz ; ln -s 20250723/wikidata-20250723-truthy-BETA.nt.bz2 latest-truthy.nt.bz2 ; ln -s 20250723/wikidata-20250723-truthy-BETA.nt.gz latest-truthy.nt.gz
dumpsgen@clouddumps1002:/srv/dumps/xmldatadumps/public/other/wikibase/wikidatawiki$ ls -l latest*
lrwxrwxrwx 1 dumpsgen dumpsgen 39 Jul 23 09:37 latest-all.json.bz2 -> 20250721/wikidata-20250721-all.json.bz2
lrwxrwxrwx 1 dumpsgen dumpsgen 38 Jul 23 03:14 latest-all.json.gz -> 20250721/wikidata-20250721-all.json.gz
lrwxrwxrwx 1 dumpsgen dumpsgen 42 Jul 17 00:55 latest-all.nt.bz2 -> 20250714/wikidata-20250714-all-BETA.nt.bz2
lrwxrwxrwx 1 dumpsgen dumpsgen 41 Jul 17 00:55 latest-all.nt.gz -> 20250714/wikidata-20250714-all-BETA.nt.gz
lrwxrwxrwx 1 dumpsgen dumpsgen 43 Jul 17 00:45 latest-all.ttl.bz2 -> 20250714/wikidata-20250714-all-BETA.ttl.bz2
lrwxrwxrwx 1 dumpsgen dumpsgen 42 Jul 16 21:10 latest-all.ttl.gz -> 20250714/wikidata-20250714-all-BETA.ttl.gz
lrwxrwxrwx 1 dumpsgen dumpsgen 43 Jul 25 16:42 latest-lexemes.json.bz2 -> 20250723/wikidata-20250723-lexemes.json.bz2
lrwxrwxrwx 1 dumpsgen dumpsgen 42 Jul 25 16:42 latest-lexemes.json.gz -> 20250723/wikidata-20250723-lexemes.json.gz
lrwxrwxrwx 1 dumpsgen dumpsgen 46 Jul 18 23:51 latest-lexemes.nt.bz2 -> 20250718/wikidata-20250718-lexemes-BETA.nt.bz2
lrwxrwxrwx 1 dumpsgen dumpsgen 45 Jul 18 23:51 latest-lexemes.nt.gz -> 20250718/wikidata-20250718-lexemes-BETA.nt.gz
lrwxrwxrwx 1 dumpsgen dumpsgen 47 Jul 18 23:51 latest-lexemes.ttl.bz2 -> 20250718/wikidata-20250718-lexemes-BETA.ttl.bz2
lrwxrwxrwx 1 dumpsgen dumpsgen 46 Jul 18 23:49 latest-lexemes.ttl.gz -> 20250718/wikidata-20250718-lexemes-BETA.ttl.gz
lrwxrwxrwx 1 dumpsgen dumpsgen 45 Jul 25 16:42 latest-truthy.nt.bz2 -> 20250723/wikidata-20250723-truthy-BETA.nt.bz2
lrwxrwxrwx 1 dumpsgen dumpsgen 44 Jul 25 16:42 latest-truthy.nt.gz -> 20250723/wikidata-20250723-truthy-BETA.nt.gz
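
The rm followed by ln -s used above leaves a short window in which the latest-* links do not exist on the web host. A possible alternative is to create the new link under a temporary name and rename it over the old one. This is a sketch, assuming GNU coreutils (mv -T); the helper name relink_latest is hypothetical, and the demo runs in a throwaway directory.

```shell
#!/usr/bin/env bash
# Replace a symlink without a gap: build the new link under a temporary name,
# then rename it over the old link (rename(2) swaps the entry in one step).
relink_latest() {
  local target=$1 link=$2
  local tmp="${link}.tmp.$$"
  ln -s "$target" "$tmp" && mv -T "$tmp" "$link"
}

# Demo in a temporary directory that mimics the layout seen in the listings.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/20250625" "$demo_dir/20250723"
touch "$demo_dir/20250625/wikidata-20250625-truthy-BETA.nt.gz" \
      "$demo_dir/20250723/wikidata-20250723-truthy-BETA.nt.gz"
ln -s 20250625/wikidata-20250625-truthy-BETA.nt.gz "$demo_dir/latest-truthy.nt.gz"

relink_latest 20250723/wikidata-20250723-truthy-BETA.nt.gz "$demo_dir/latest-truthy.nt.gz"
new_target=$(readlink "$demo_dir/latest-truthy.nt.gz")

echo "latest-truthy.nt.gz -> $new_target"
rm -rf "$demo_dir"
```

The same helper could be looped over the four links that were recreated by hand, once per host.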

Change #1172682 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/dumps@master] Use 'set -o pipefail' instead of 'set -e' in wikibase scripts

https://gerrit.wikimedia.org/r/1172682

Change #1172682 merged by Btullis:

[operations/dumps@master] Use 'set -o pipefail' instead of 'set -e' in wikibase scripts

https://gerrit.wikimedia.org/r/1172682

Change #1172687 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/dumps@master] Restore 'set -o pipefail' behaviour for sub-shells

https://gerrit.wikimedia.org/r/1172687

Change #1172687 merged by Btullis:

[operations/dumps@master] Restore 'set -o pipefail' behaviour for sub-shells

https://gerrit.wikimedia.org/r/1172687

Mentioned in SAL (#wikimedia-operations) [2025-07-28T10:01:30Z] <btullis@deploy1003> Started scap build-images: Updating mediawiki-cli image for T400383

Mentioned in SAL (#wikimedia-operations) [2025-07-28T10:17:30Z] <btullis@deploy1003> Finished scap build-images: Updating mediawiki-cli image for T400383 (duration: 16m 00s)

All of these wikibase dumps are now running again.

I have removed the set -e from the top of the scripts for now, as they are not yet robust enough for it.
However, I have added set -o pipefail to the top of each script, which should help to prevent the generation of corrupted files that we saw in T399077 and T399119.
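
For anyone following along, the difference between the two options can be seen in a small bash experiment. This is an illustration only: the pipelines are trivial stand-ins (false and cat), not the real dump and compression commands.

```shell
#!/usr/bin/env bash
# Without pipefail, a pipeline reports the status of its LAST command only,
# so a failing producer is masked by a consumer that succeeds.
without_pipefail=$(bash -c 'false | cat; echo $?')

# With pipefail, a failure in any stage becomes the status of the pipeline.
with_pipefail=$(bash -c 'set -o pipefail; false | cat; echo $?')

# With set -e, the first unguarded failure ends the script at once,
# silently, with that command's status; nothing after it runs.
if errexit_output=$(bash -c 'set -e; false; echo "not reached"'); then
  errexit_status=0
else
  errexit_status=$?
fi

echo "pipeline status without pipefail: $without_pipefail"
echo "pipeline status with pipefail:    $with_pipefail"
echo "set -e exit status: $errexit_status, output: '$errexit_output'"
```

The first case is how a failed dump stage piped into a compressor can still look successful and leave a truncated file behind. The last case reproduces the "exits with 1 and no output" symptom from the description.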

BTullis removed a project: Patch-For-Review.

I'll tentatively resolve this issue, as the dumps seem to be working correctly now.
If we see further errors or corrupted files, we can reopen it.

We may also want to consider reverting this, depending on the level of reliability we see over the next few weeks.