
No wikidata dumps last week (20250203)
Closed, Duplicate (Public)

Description

Hello,
In https://dumps.wikimedia.org/wikidatawiki/entities/ , last week's directory (20250203) is empty, and this week's JSON dumps are already late.
Could you check why, so that next week's dumps have a chance to be generated?

Thanks

Related Objects

Event Timeline

T384625 should have been fixed for a week now (I backported the fix last Monday), and https://dumps.wikimedia.org/wikidatawiki/entities/ has a truthy RDF dump from 25 February, but the full dumps are still missing, both in JSON and in RDF format… I have no idea why :/

20250301 dump in progress and looking good, although a bit delayed.

Sorry for the confusion, this refers to wikidatawiki dumps...

The wikidata entity dump is making progress; however, other dumps that use similar code, like wikidatardf-truthy-dump, have now been running for 6 days since we cleaned them up in https://phabricator.wikimedia.org/T386255#10601324.

We have little confidence any of these will finish correctly.

Ahoelzl renamed this task from No wikidata dumps last week to No wikidata dumps last week (20250203). — Mar 12 2025, 3:58 PM

@Ahoelzl The latest RDF dump for latest-all is from 2025-01-29, so it is already one and a half months old; see https://dumps.wikimedia.org/wikidatawiki/entities

Do you have a suggestion for a workaround until this works again? We would very much like to update https://qlever.cs.uni-freiburg.de/wikidata, because people are using and relying on it.

My latest comment here is relevant to this ticket: T386255#10646186

The upshot is that we have identified a performance regression affecting the wikidata entity dumps and have put a workaround in place. We expect today's dump to complete in around 16 hours.

We haven't identified the root cause of the performance regression, but that work will be carried out in T389199: Fix a performance regression affecting wikibase dumps when using mediawiki analytics replica of s8 - dbstore1009

It might be possible to merge this in as a duplicate of T386255: wmf.wikidata_item_page_link and wmf.wikidata_entity snapshots stuck at 2025-01-20 but I will leave it open for now.

I see a new latest-all.json.gz (from 20-Mar-2025 04:42).

Does this mean that there will be new latest-all.nt.* and latest-all.ttl.* there soon, too?

(Any of these four is fine if that helps, but probably not)


Yes, I believe so. The .gz files have already been created on the intermediary NFS server (dumpsdata1003):

btullis@snapshot1016:/mnt/dumpsdata/otherdumps/wikibase/wikidatawiki/20250318$ ls -l
total 681874460
-rw-r--r-- 1 dumpsgen dumpsgen 285771515173 Mar 20 04:37 wikidata-20250318-all-BETA.ttl.gz
-rw-r--r-- 1 dumpsgen dumpsgen 144658836759 Mar 20 04:42 wikidata-20250318-all.json.gz
-rw-r--r-- 1 dumpsgen dumpsgen          202 Mar 20 11:38 wikidata-20250318-md5sums.txt
-rw-r--r-- 1 dumpsgen dumpsgen          150 Mar 20 10:21 wikidata-20250318-sha1sums.txt
-rw-r--r-- 1 dumpsgen dumpsgen 267808869705 Mar 20 03:32 wikidata-20250318-truthy-BETA.nt.gz

So they are waiting to be synced to the distribution servers (clouddumps100[1-2]), which should happen reasonably soon.

It looks like the stages that write the md5sums and sha1sums are still running, and so is the process that creates the .bz2 files (and then their checksums).

These will all be symlinked from the latest-all.nt.* and latest-all.ttl.* files that you mentioned.
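As a side note for anyone mirroring these files: once the checksum files are synced, a downloaded dump can be verified against them. A minimal sketch, assuming the usual `<hash>  <filename>` format of the md5sums files (the function name and chunked reading are my own; these files run to hundreds of GB, so never read them whole):

```python
import hashlib
import os

def verify_md5(checksum_file, data_dir="."):
    """Check files against an md5sums.txt in the usual '<hash>  <name>' format.

    Returns a dict mapping each listed filename to True (checksum matches)
    or False (mismatch).
    """
    results = {}
    with open(checksum_file) as f:
        for line in f:
            expected, name = line.split(maxsplit=1)
            name = name.strip()
            h = hashlib.md5()
            with open(os.path.join(data_dir, name), "rb") as data:
                # Read in 1 MiB chunks: the dumps are far too large for one read.
                for chunk in iter(lambda: data.read(1 << 20), b""):
                    h.update(chunk)
            results[name] = (h.hexdigest() == expected)
    return results
```

The same approach works for the sha1sums files by swapping in `hashlib.sha1`.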


Does this mean that things like https://humaniki.wmcloud.org/ will be working again?

I have updated https://qlever.cs.uni-freiburg.de/wikidata based on the latest data from https://dumps.wikimedia.org/wikidatawiki/entities and it worked, thanks a lot. However, I noticed the following:

The previous version of latest-all.ttl.bz2, from January 29, contained around 20 billion triples.

The latest version of latest-all.ttl.bz2, from March 20, contains around 40 billion triples.

However, the end result after loading has a similar number of triples as before, so it seems that the latest version of latest-all.ttl.bz2 contains each triple twice (on average, or possibly literally).

Any idea how that happened, and can you fix it for the next dump?
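One way to quantify such duplication without loading the whole dump into a triple store is to stream the file and compare total versus distinct triples (in N-Triples, each line is one triple in canonical form). A rough sketch under my own assumptions, with a `limit` parameter because counting all ~40 billion triples in memory is not feasible, so in practice one would sample a prefix:

```python
import gzip
from collections import Counter

def triple_counts(path, limit=None):
    """Return (total, distinct) triple counts for an N-Triples file.

    A total/distinct ratio near 2.0 would confirm that each triple
    appears twice on average. Accepts plain or gzip-compressed input.
    """
    opener = gzip.open if path.endswith(".gz") else open
    counts = Counter()
    total = 0
    with opener(path, "rt") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            counts[line] += 1
            total += 1
            if limit is not None and total >= limit:
                break
    return total, len(counts)
```

Note that a Turtle (.ttl) dump uses prefixes and abbreviated syntax, so the same triple can be serialized differently there; this line-based check is only reliable on the .nt variants.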

@Hannah_Bast I have filed T389787: The latest wikidata entity dump (latest-all.ttl.bz2) contains each triple twice based on your observations. I am not sure whether we will be able to validate your observations and fix it before the next dump run on 20250401, but we can try.