
No Wikidata dumps for Week 40 of 2025 (recurring issue)
Closed, Resolved · Public · BUG REPORT

Authored By: Hannah_Bast (Oct 5 2025, 3:41 PM)

Description

The last dumps of latest-all.* on https://dumps.wikimedia.org/wikidatawiki/entities/ are from two weeks ago (Week 39 of 2025). There are no new dumps from last week (Week 40 of 2025).

This has occurred several times over the last few months; see https://phabricator.wikimedia.org/T398756

Details

Related Changes in GitLab:
Title: Switch the wikibase dumps to use the core database servers
Reference: repos/data-engineering/airflow-dags!1747
Author: btullis
Source Branch: update_wikibase_servergroup
Dest Branch: main

Event Timeline

Hi @Hannah_Bast! We've checked and are working together with the WMF to figure out what's happening.

Hi @Hannah_Bast - I'm sorry that you're affected by this issue. I can confirm that we're also seeing a significant slowdown of these dumps in the last couple of weeks, as well as some errors.

Let's look at the behaviour of the RDF dumps first.
Here we can see the runtime of the mediawiki_wikidata_all_rdf_dump DAG. - You can see that the most recent run is approaching 7 days' duration, which means that it may time out. However, last week's dump claims to have finished successfully.

image.png (767×1 px, 164 KB)

Similarly, here is the mediawiki_wikidata_truthy_rdf_dump - It is showing similar behaviour, with last week's dump having timed out and this week's having taken over 4 days so far.

image.png (764×1 px, 136 KB)

The next is: mediawiki_commons_mediainfo_rdf_dump - This DAG shows similar behaviour for last week, although this week's run has not been going for long enough to tell whether it will exhibit the same behaviour.

image.png (620×1 px, 94 KB)

The mediawiki_wikidata_lexemes_rdf_dump does not show the same pattern, but it does have a slightly longer runtime for the last two weeks.

image.png (799×1 px, 145 KB)

Those are all RDF dumps, but the JSON dumps show similar behaviour too.

This one is: mediawiki_wikidata_all_json_dump

image.png (659×1 px, 122 KB)

This one is: mediawiki_commons_mediainfo_json_dump

image.png (701×1 px, 131 KB)

This one is: mediawiki_wikidata_lexemes_json_dump - This shows extended runtime for last week, but no errors.

image.png (669×1 px, 115 KB)

One change with potentially important consequences is that we recently re-added the serdi binary to the mediawiki-cli image that is used to run the dumps. However, I don't believe that it affects all of these dumps, so I am not sure that it can account for all of the slow behaviour.

serdi is used to generate the extra format dumps, and it is called in the wikibase dump scripts here: https://gerrit.wikimedia.org/g/operations/dumps/+/master/scripts/wikibase/dumpwikibaserdf.sh#232

I believe that this was merged on or around September 16th.
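For reference, the extra-format conversion that serdi performs here is (to my understanding) Turtle to N-Triples. A minimal sketch of that kind of invocation, in Python for illustration; the -i/-o syntax flags are standard serdi options, but the exact arguments and file names used by dumpwikibaserdf.sh may differ:

import subprocess

# Sketch only: convert a Turtle dump to N-Triples via the serdi binary.
# File names are placeholders; the production script's arguments may differ.
def convert_ttl_to_nt(ttl_path: str, nt_path: str) -> None:
    with open(nt_path, "wb") as out:
        subprocess.run(
            ["serdi", "-i", "turtle", "-o", "ntriples", ttl_path],
            stdout=out,
            check=True,  # fail loudly if serdi exits non-zero
        )

convert_ttl_to_nt("latest-all.ttl", "latest-all.nt")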

Oh, I have discovered some error messages here:

image.png (940×1 px, 374 KB)

The first set of errors looks as if there is a problem with moving the generated files to the right location.

The second looks as if it is a problem with the DCAT functionality, as it cannot call the mediawiki/wikidata API to download a configuration for this.
I'll continue to look into whether either of these sets of errors could explain the delays.

The mediawiki_wikidata_lexemes_json_dump completed successfully this morning.
Its runtime was slightly shorter than last week's, but still above the baseline that had been established over the preceding weeks.

image.png (605×1 px, 107 KB)

It's not looking good for the mediawiki_wikidata_truthy_rdf_dump - This is approaching 6.5 days of runtime and will likely time out at 7 days.

image.png (612×839 px, 75 KB)

We can see that this dump has had two attempts; the second attempt has already been running for 1 day and 20 hours.
image.png (899×1 px, 304 KB)

The first attempt was likely interrupted by the rolling restart of the dse-k8s-eqiad kubernetes cluster that occurred on Monday this week (T405361#11245238).
Unfortunately, the wikibase dumps do not automatically continue where they left off when a dump is restarted; they start again from the beginning.

There is an option for doing so, by passing a --continue flag to the scripts, but although we tried to use this by default when launching the dumps, we had to switch it off:
See: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/commit/fe0369f978930ae68055a012b8cfba3ab619c6bd and T400383: Recent wikibase RDF dumps on Airflow have failed

If it were possible to fix the scripts so that we could use the --continue flag by default, that would likely help with the availability of these dumps.
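To illustrate what defaulting to it could look like, here is a rough sketch of a launcher that only passes --continue when a previous attempt left partial output behind. Everything except the --continue flag itself (the script path, the output layout, the *.part naming) is an assumption for illustration:

import glob
import subprocess

def launch_rdf_dump(output_dir: str) -> None:
    # Assumed naming for partial output left behind by an interrupted attempt.
    partial_files = glob.glob(f"{output_dir}/*.part")
    cmd = ["/srv/dumps/wikibase/dumpwikibaserdf.sh"]  # placeholder path
    if partial_files:
        cmd.append("--continue")  # resume rather than restart from the first batch
    subprocess.run(cmd, check=True)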

There is also the matter of enabling more robust error handling in the scripts, in general: T406044: Re-enable `set -e` in Wikibase entity dump scripts

One problem, from my perspective, is that the log files give me very little idea of the overall progress of these dumps. We have messages like this scrolling past in the logs:

[2025-10-08, 10:58:01 UTC] {pod_manager.py:536} INFO - [base] (2025-10-08T10:58+00:00) Starting batch 44
[2025-10-08, 10:58:01 UTC] {pod_manager.py:536} INFO - [base] Dumping entities of type item, property
[2025-10-08, 10:58:02 UTC] {pod_manager.py:536} INFO - [base] Dumping shard 6/8

But this doesn't mean much to me if I am trying to gauge how complete this dump process is.
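To make that concrete: the only progress signal in those lines is the batch and shard counters, and because the total number of batches is not known up front, they cannot be turned into a completion percentage. A rough sketch that assumes the log format quoted above:

import re

BATCH_RE = re.compile(r"Starting batch (\d+)")
SHARD_RE = re.compile(r"Dumping shard (\d+)/(\d+)")

def summarise_progress(log_lines):
    batch = shard = total_shards = 0
    for line in log_lines:
        if m := BATCH_RE.search(line):
            batch = int(m.group(1))
        if m := SHARD_RE.search(line):
            shard, total_shards = int(m.group(1)), int(m.group(2))
    if not total_shards:
        return "no shard information seen yet"
    # Without a known total batch count, this is the best we can report -
    # which is exactly the visibility problem described above.
    return f"latest counters seen: batch {batch}, shard {shard}/{total_shards}"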

On a side note, should these wikibase dumps be running under the airflow-wmde or airflow-wikidata instances, where you might be able to interact with them more fully?
I'm not trying to shirk responsibility with this question, but more to facilitate better self-service for consumers of these dumps and ascertain what the correct boundaries of responsibilities should be.


The Wikidata Platform, Data Platform SRE, and WMDE teams met to discuss this question and agreed that there should not be any change to the current data dump ownership. WMDE, as owners of Wikibase, will continue to own the dumps and escalate issues to Data Platform SREs where needed.

Dear all, any update on this? In particular, is there a chance that there will be a new dump this week?

PS: We have been creating our own dumps from https://qlever.dev/wikidata (which is synced via the Wikidata Public Update Stream) in the meantime. One such dump takes a few hours when using a simple CONSTRUCT WHERE { ?s ?p ?o } query, and just one hour when using QLever's own internal format. However, there will be small deviations because of imperfections in the update process, so starting from a fresh dump once per week would be good. Also, we currently do not yet "munge" the dump like WDQS does, see https://phabricator.wikimedia.org/T406436
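For anyone curious, a dump of that kind boils down to streaming the result of a single CONSTRUCT query to disk. A minimal sketch, assuming a SPARQL endpoint that follows the SPARQL 1.1 protocol, can return N-Triples, and imposes no server-side timeout (the endpoint URL in the usage comment is a placeholder):

import requests

def dump_all_triples(endpoint: str, out_path: str) -> None:
    # Form-encoded query and Accept header follow the SPARQL 1.1 protocol.
    with requests.post(
        endpoint,
        data={"query": "CONSTRUCT WHERE { ?s ?p ?o }"},
        headers={"Accept": "application/n-triples"},
        stream=True,
        timeout=None,  # the download itself may take hours
    ) as resp:
        resp.raise_for_status()
        with open(out_path, "wb") as out:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                out.write(chunk)

# Usage (placeholder endpoint URL):
# dump_all_triples("https://example.org/sparql", "wikidata-snapshot.nt")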

I'm sorry to say that I have very little positive news to report on this.

We can see that the behaviour trend from the last two weeks is continuing.
For example, here is the current mediawiki_wikidata_all_rdf_dump run - It is approaching 7 days' runtime, at which point it will fail and start again.

image.png (458×1 px, 81 KB)

As another example, mediawiki_commons_mediainfo_json_dump has failed in this manner for the last two weeks.

image.png (545×1 px, 72 KB)

We can also see that it's not a matter of a shortage of CPU or RAM in the containers.
This dashboard shows all currently running wikibase dump containers.

They all show the same behaviour: neither CPU nor RAM is anywhere near being fully utilised, let alone exhausted.

image.png (2×1 px, 342 KB)

Yet the logs for the jobs show no errors. They're just not fast enough.

image.png (707×913 px, 234 KB)

I have escalated the issue to the Data-Engineering team, but at this point I am short of ideas for how to troubleshoot further.
I would be looking for regressions that could have been introduced by mediawiki deployments or by changes to the wikibase extension.

I haven't ruled out MariaDB, but it seems a little unlikely given that both commonswiki and wikidatawiki are affected. These are on the s4 and s8 sections respectively, so something would have to be affecting both sections to have this effect.

It is very good to hear that you are making such good progress with QLever. I'll try to keep you updated on changes here, but for now I am going to have to seek additional input to address this performance issue with the wikibase dumps.

It's possible that the performance regression is similar to that observed here:
T389199: Fix a performance regression affecting wikibase dumps when using mediawiki analytics replica of s8 - dbstore1009

Just noting it for now to establish the link between the two tickets.

@BTullis how do you think Data-Engineering should best help here? Who knows the most and might be able to help? @xcollazo? @pfischer?

@BTullis @dcausse In the meantime, would it be an option to just ask Blazegraph the query CONSTRUCT WHERE { ?s ?p ?o } without a timeout? I understand that this would probably take a long time in Blazegraph, but (a) it would finish eventually, and (b) assuming that Blazegraph implements the SPARQL standard correctly, it would actually provide a snapshot of the data, unlike the weekly dumps.

Good news! The patch to switch database servers seems to have worked.
The latest runs of mediawiki_commons_mediainfo_json_dump, mediawiki_wikidata_all_rdf_dump, and mediawiki_wikidata_truthy_rdf_dump are all back to their normal duration.

image.png (669×1 px, 95 KB)

image.png (734×1 px, 105 KB)

image.png (806×1 px, 124 KB)

I would say that we have now identified this as a problematic code path. It's very unlikely to be related to the database servers themselves.

There is still an issue around the publishing of these dumps, because the job that syncs them from the intermediate data store (cephfs) to the servers behind https://dumps.wikimedia.org (clouddumps100[1-2]) is timing out at 24 hours.

image.png (583×830 px, 72 KB)

However, this is a different issue and one that is well within our control. It's related to a lack of housekeeping done by the wikibase dumps.
I will handle this housekeeping manually for now, then sync the latest dumps and create a ticket to follow up on this.
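For context, the missing housekeeping is conceptually simple: prune dump runs older than some retention window from the intermediate store, so that the 24-hour sync job only has to copy recent files. A sketch of that idea, with placeholder paths and retention (not the real cephfs layout):

import time
from pathlib import Path

def prune_old_dumps(root: Path, retention_days: int = 28) -> int:
    """Delete dump files older than the retention window; return bytes freed."""
    cutoff = time.time() - retention_days * 86400
    freed = 0
    for f in root.rglob("*"):
        if f.is_file() and f.stat().st_mtime < cutoff:
            freed += f.stat().st_size
            f.unlink()
    return freed

# Placeholder mount point for the cephfs volume:
# prune_old_dumps(Path("/mnt/dumps/wikibase"), retention_days=28)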

I have manually removed 5.1 TB of old dumps from the cephfs volume in T407735#11289038 and I have manually triggered a new run of the sync_wikibase_wikidatawiki_dumps DAG.

image.png (636×1 px, 136 KB)

This reminds me of T389199 (which ended up not being reproducible, just noting for reference).


Agreed. Thanks, @hoo. Indeed the workaround that we have in place is the same, although triggered slightly differently by a SERVERGROUP environment variable rather than the presence of a /etc/wikimedia-servergroup file.
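Purely as a hypothetical illustration of the difference between the two triggers (this is not MediaWiki's or the dump scripts' actual code, and reading the file's contents is an assumption):

import os

def effective_server_group() -> str | None:
    # Current mechanism: a SERVERGROUP environment variable set for the
    # dump containers.
    group = os.environ.get("SERVERGROUP")
    if group:
        return group
    # Earlier mechanism (as in T389199): the presence of an
    # /etc/wikimedia-servergroup file on the host.
    try:
        with open("/etc/wikimedia-servergroup") as f:
            return f.read().strip()
    except FileNotFoundError:
        return None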

The strange thing is that the wikibase dumps had been working at (seemingly) full speed from March until September. They were migrated into Kubernetes in July and the speed was unaffected.
Then towards the end of September, the speed dropped again suddenly, just like in T389199.

I can't find any related change that might have caused it.

I'm going to resolve this now, as the wikibase dumps are back up to speed with the workaround in place.
Further investigation can continue under T408090, and Data-Platform-SRE and Data-Engineering will help out as needed.

I have a few follow-up tickets to create, based on observations we made during the investigation that aren't directly related to the availability of the dumps.
We already know that there is some missing housekeeping of the intermediate storage location, as described in: T407735

Other issues that I will raise in new tickets include:

Finally, we also know that we want to do this: T406044: Re-enable `set -e` in Wikibase entity dump scripts, but we need to be reasonably confident that it won't break things before we do.