
wmf.wikidata_item_page_link and wmf.wikidata_entity snapshots stuck at 2025-01-20
Closed, Resolved (Public)

Description

The latest snapshots of wmf.wikidata_item_page_link and wmf.wikidata_entity are stuck at 2025-01-20. That is a problem for the services that consume them.

Downstream tracking task: T385865: Resume data pipeline operations

T385865#10538331 yielded zero section topics, which should also entail zero SLIS. The intuition is that mismatched input snapshots seem to disrupt SLIS.
SLIS is still at 2024-12-23, so I reset its Cassandra TTL.
ALIS is at 2025-01-20.

As of now, the latest available snapshots of wmf.wikidata_item_page_link and wmf.wikidata_entity are still 2025-01-20. I don't think it makes sense to resume normal operations if weekly inputs are missing.
I have paused all pipelines again.

The wmf.wikidata_item_page_link and wmf.wikidata_entity data is used by the Structured Data team to generate image suggestions. These image suggestions are used by the Growth Team to recommend edits to Newcomers. This is done to address the decline of active editors in our projects.

Details

Related Objects

Event Timeline


I did a cursory check on the server that generates the wikidata JSON and RDF dumps, snapshot1016.eqiad.wmnet. That process is upstream of our Airflow pipelines. All looks nominal over there:

xcollazo@snapshot1016:~$ systemctl --type=service | grep dump
  adds-changes.service                       loaded activating start   start Regular jobs to generate misc dumps
  cirrussearch-dump-s1.service               loaded activating start   start Regular jobs to build snapshot of cirrus search
  cirrussearch-dump-s3.service               loaded activating start   start Regular jobs to build snapshot of cirrus search
  cirrussearch-dump-s4.service               loaded activating start   start Regular jobs to build snapshot of cirrus search
  cirrussearch-dump-s7.service               loaded activating start   start Regular jobs to build snapshot of cirrus search
  cirrussearch-dump-s8.service               loaded activating start   start Regular jobs to build snapshot of cirrus search
  commonsjson-dump.service                   loaded activating start   start Regular jobs to build json snapshot of commons structured data
  commonsrdf-dump.service                    loaded activating start   start Regular jobs to build rdf snapshot of commons structured data
  wikidatajson-dump.service                  loaded activating start   start Regular jobs to build json snapshot of wikidata
  wikidatardf-all-dumps.service              loaded activating start   start Regular jobs to build rdf snapshot of wikidata
  wikidatardf-truthy-dumps.service           loaded activating start   start Regular jobs to build rdf snapshot of wikidata truthy statements

There is, however, a currently open task T386401: No wikidata dumps last week (20250203), which has been acknowledged by the Wikidata folks:

Probably caused by T384625 (see esp. T384625#10544272).

TL;DR: This current issue looks like a MediaWiki bug introduced recently. The timing matches. T384625 is currently marked as high priority.

(Not introduced recently, just triggered by recent on-wiki edits. The problematic items have been cleaned up in the meantime, so in theory the dumps should start working again this week even if we didn’t get around to fixing the bug yet.)

(A fix for T384625 has been merged. We will now wait till it is deployed; looks like the ETA is 2025-02-25.)


(A fix for T384625 has been merged. We will now wait till it is deployed; looks like the ETA is 2025-02-25.)

T384625 was backported today. This means that the current production version of MW should have the fix.

I have blocked some time tomorrow to take a look at what we can do to expedite an entities dump.

I suspected that, due to the failures from T384625, the job for json entity dumps was stuck, since it had been running for 1+ week while it should not take that long:

xcollazo@snapshot1016:~$ hostname -f
snapshot1016.eqiad.wmnet

xcollazo@snapshot1016:~$ systemctl status wikidatajson-dump.service | head -n 10
● wikidatajson-dump.service - Regular jobs to build json snapshot of wikidata
     Loaded: loaded (/lib/systemd/system/wikidatajson-dump.service; static)
     Active: activating (start) since Sun 2025-02-16 01:11:55 UTC; 1 weeks 2 days ago
TriggeredBy: ● wikidatajson-dump.timer
       Docs: https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
   Main PID: 297668 (systemd-timer-m)
      Tasks: 26 (limit: 76560)
     Memory: 3.1G
     CGroup: /system.slice/wikidatajson-dump.service
             ├─297668 /usr/bin/python3 /usr/local/bin/systemd-timer-mail-wrapper --subject wikidatajson-dump --mail-to root@snapshot1016.eqiad.wmnet --only-on-error /usr/local/bin/dumpwikibasejson.sh -p wikidata -d all

So maybe we needed to systemctl restart wikidatajson-dump.service? Unfortunately, I have no privileges to do that.

@RKemper and @bking helped do that. As per Slack, the job is now running again with no failures:

Feb 03 03:15:00 snapshot1016 systemd[1]: Starting Regular jobs to build json snapshot of wikidata...
Feb 16 01:11:55 snapshot1016 dumpwikibasejson.sh[2973260]: mv: cannot move '/mnt/dumpsdata/xmldatadumps/temp/wikidata-all.json.gz' to '/mnt/dumpsdata/otherdumps/wikibase/wikidatawiki/20250203/wikidata-20250203-all.json.gz': No such file or directory
Feb 16 01:11:55 snapshot1016 dumpwikibasejson.sh[2973260]: md5sum: /mnt/dumpsdata/otherdumps/wikibase/wikidatawiki/20250203/wikidata-20250203-all.json.gz: No such file or directory
Feb 16 01:11:55 snapshot1016 dumpwikibasejson.sh[2973260]: /usr/local/bin/wikibasedumps-shared.sh: line 52: /mnt/dumpsdata/otherdumps/wikibase/wikidatawiki/20250203/wikidata-20250203-md5sums.txt: No such file or directory
Feb 16 01:11:55 snapshot1016 dumpwikibasejson.sh[2973260]: sha1sum: /mnt/dumpsdata/otherdumps/wikibase/wikidatawiki/20250203/wikidata-20250203-all.json.gz: No such file or directory
Feb 16 01:11:55 snapshot1016 dumpwikibasejson.sh[2973260]: /usr/local/bin/wikibasedumps-shared.sh: line 55: /mnt/dumpsdata/otherdumps/wikibase/wikidatawiki/20250203/wikidata-20250203-sha1sums.txt: No such file or directory
Feb 16 01:11:55 snapshot1016 dumpwikibasejson.sh[2973260]: gzip: /mnt/dumpsdata/otherdumps/wikibase/wikidatawiki/20250203/wikidata-20250203-all.json.gz: No such file or directory
Feb 16 01:11:55 snapshot1016 dumpwikibasejson.sh[2973260]: mv: cannot move '/mnt/dumpsdata/xmldatadumps/temp/wikidata-all.json.bz2' to '/mnt/dumpsdata/otherdumps/wikibase/wikidatawiki/20250203/wikidata-20250203-all.json.bz2': No such file or directo>
Feb 16 01:11:55 snapshot1016 dumpwikibasejson.sh[2973260]: md5sum: /mnt/dumpsdata/otherdumps/wikibase/wikidatawiki/20250203/wikidata-20250203-all.json.bz2: No such file or directory
Feb 16 01:11:55 snapshot1016 dumpwikibasejson.sh[2973260]: /usr/local/bin/wikibasedumps-shared.sh: line 52: /mnt/dumpsdata/otherdumps/wikibase/wikidatawiki/20250203/wikidata-20250203-md5sums.txt: No such file or directory
Feb 16 01:11:55 snapshot1016 dumpwikibasejson.sh[2973260]: sha1sum: /mnt/dumpsdata/otherdumps/wikibase/wikidatawiki/20250203/wikidata-20250203-all.json.bz2: No such file or directory
Feb 16 01:11:55 snapshot1016 dumpwikibasejson.sh[2973260]: /usr/local/bin/wikibasedumps-shared.sh: line 55: /mnt/dumpsdata/otherdumps/wikibase/wikidatawiki/20250203/wikidata-20250203-sha1sums.txt: No such file or directory
Feb 16 01:11:55 snapshot1016 systemd[1]: wikidatajson-dump.service: Succeeded.
Feb 16 01:11:55 snapshot1016 systemd[1]: Finished Regular jobs to build json snapshot of wikidata.
Feb 16 01:11:55 snapshot1016 systemd[1]: Starting Regular jobs to build json snapshot of wikidata...

Will monitor this run.

FWIW, I wouldn’t have expected the RDF dumps issue to affect the JSON dumps at all. (But you’re right that the timing is pretty suspicious.)

We have an Airflow job waiting for this dump to finish. The job has failed because of a timeout. If we manage to get a working dump at some point for later dates, please either restart the Airflow job if you know how to, or let us know so that we do it :) Many thanks.
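(For whoever picks this up: a minimal sketch of re-running the stuck DAG run from the Airflow CLI, assuming shell access to the relevant Airflow instance; clearing the failed task from the web UI is equivalent. The dates below are illustrative.)

# Clear the failed sensor/task so the scheduler retries the run (dates illustrative).
airflow tasks clear wikidata_dump_to_hive_weekly \
    --start-date 2025-02-24 --end-date 2025-02-24 --yes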

FYI structured_data.commons_entity is also stuck at 2025-01-20

Which job will generate a new wmf.wikidata_entity snapshot? Looking at the wikidata_dump_to_hive_weekly DAG, there have been four failed runs since the last successful one (20250120), and the current run is waiting for a sensor.

Looking at the last failed run, it is waiting for a json dump at /wmf/data/raw/wikidata/dumps/all_json/20250224/_IMPORTED (a Monday), but the dumps in that directory land weekly on Wednesdays.

drwxr-x---   - analytics analytics-privatedata-users          0 2025-03-03 01:00 /wmf/data/raw/wikidata/dumps/all_json/20250120
drwxr-x---   - analytics analytics-privatedata-users          0 2025-03-03 01:00 /wmf/data/raw/wikidata/dumps/all_json/20250122
drwxr-x---   - analytics analytics-privatedata-users          0 2025-03-03 01:00 /wmf/data/raw/wikidata/dumps/all_json/20250127
drwxr-x---   - analytics analytics-privatedata-users          0 2025-03-03 01:00 /wmf/data/raw/wikidata/dumps/all_json/20250129
drwxr-x---   - analytics analytics-privatedata-users          0 2025-03-03 01:00 /wmf/data/raw/wikidata/dumps/all_json/20250205
drwxr-x---   - analytics analytics-privatedata-users          0 2025-03-03 01:00 /wmf/data/raw/wikidata/dumps/all_json/20250212
drwxr-x---   - analytics analytics-privatedata-users          0 2025-03-03 01:00 /wmf/data/raw/wikidata/dumps/all_json/20250219
drwxr-x---   - analytics analytics-privatedata-users          0 2025-03-03 01:00 /wmf/data/raw/wikidata/dumps/all_json/20250226

@fkaelin the Airflow DAG is ok, it has not been touched in more than a year.

The issue is upstream of Airflow. These dumps are generated on snapshot1016.eqiad.wmnet (see T386255#10561312). After they are generated, and available at https://dumps.wikimedia.org/other/wikibase/wikidatawiki/, there is a process that rsyncs them to HDFS. And only then will the Airflow sensor be able to succeed.
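For reference, a rough sketch of checking each stage from a shell with HDFS client access (paths as per the sensor mentioned above; the exact rsync schedule is not covered here):

# Stage 1 is the systemctl output shown earlier. Stage 2: published dumps on the web:
curl -s https://dumps.wikimedia.org/other/wikibase/wikidatawiki/ | grep -oE '[0-9]{8}' | sort -u | tail
# Stage 3: rsync'd copies on HDFS, including the _IMPORTED flag the Airflow sensor waits for:
hdfs dfs -ls /wmf/data/raw/wikidata/dumps/all_json/
hdfs dfs -test -e /wmf/data/raw/wikidata/dumps/all_json/20250224/_IMPORTED && echo present || echo missing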

The problem is that snapshot1016.eqiad.wmnet has failed to generate these dumps for a while now. My speculation was that T384625 was the culprit, but as noted elsewhere:

T384625 should be fixed for a week now (I backported the fix last Monday), and https://dumps.wikimedia.org/wikidatawiki/entities/ has a truthy RDF dump from 25 February, but the full dumps are still missing, both in JSON and in RDF format… I have no idea why :/

We restarted the systemd service in T386255#10582712, but the restarted dump process is still running 5 days in. It would typically run for 2-3 days.

This is unfortunate, as these dumps have typically been reliable. Perhaps, for some reason, we are not picking up the more recent MediaWiki version? I can't tell, unfortunately, as I am not a MW expert.

CC @BTullis, just in case he has any suggestions.

Agreed, this is likely related to T386255, as all the wikibase-related dumps appear to be unable to finish. They would typically take 2-3 days, but they've all been running for much longer than that:

xcollazo@snapshot1016:~$  systemctl status *wikidata*.service | grep Active -B 3
● wikidatajson-dump.service - Regular jobs to build json snapshot of wikidata
     Loaded: loaded (/lib/systemd/system/wikidatajson-dump.service; static)
     Active: activating (start) since Tue 2025-02-25 22:49:47 UTC; 6 days ago
--

● wikidatardf-all-dumps.service - Regular jobs to build rdf snapshot of wikidata
     Loaded: loaded (/lib/systemd/system/wikidatardf-all-dumps.service; static)
     Active: activating (start) since Mon 2025-02-17 23:00:00 UTC; 2 weeks 0 days ago
--
Warning: some journal files were not opened due to insufficient permissions.
● wikidatardf-truthy-dumps.service - Regular jobs to build rdf snapshot of wikidata truthy statements
     Loaded: loaded (/lib/systemd/system/wikidatardf-truthy-dumps.service; static)
     Active: activating (start) since Tue 2025-02-25 05:39:12 UTC; 1 weeks 0 days ago


xcollazo@snapshot1016:~$  systemctl status *common*.service | grep Active -B 3
● commonsjson-dump.service - Regular jobs to build json snapshot of commons structured data
     Loaded: loaded (/lib/systemd/system/commonsjson-dump.service; static)
     Active: activating (start) since Thu 2025-02-20 22:25:21 UTC; 1 weeks 4 days ago
--

● commonsrdf-dump.service - Regular jobs to build rdf snapshot of commons structured data
     Loaded: loaded (/lib/systemd/system/commonsrdf-dump.service; static)
     Active: activating (start) since Sat 2025-02-22 15:07:27 UTC; 1 weeks 2 days ago

I want to try the following:

  1. Stop all offending wikibase-related dumps.
  2. Make sure that on their next run they pick up the latest MW version, and that they start a new dump.

So perhaps we should stop them all, and then let the associated systemd timers restart them later on.
Thus we want:

systemctl stop wikidatajson-dump.service

systemctl stop wikidatardf-all-dumps.service

systemctl stop wikidatardf-truthy-dumps.service

systemctl stop commonsjson-dump.service

systemctl stop commonsrdf-dump.service

Will ping SRE for help on this.

@bking ran the following:

bking@snapshot1016:~$ for n in wikidatajson-dump.service wikidatardf-all-dumps.service wikidatardf-truthy-dumps.service commonsjson-dump.service  commonsrdf-dump.service; do sudo systemctl stop ${n}; done
Warning: Stopping wikidatajson-dump.service, but it can still be activated by:
  wikidatajson-dump.timer
Warning: Stopping wikidatardf-all-dumps.service, but it can still be activated by:
  wikidatardf-all-dumps.timer
Warning: Stopping wikidatardf-truthy-dumps.service, but it can still be activated by:
  wikidatardf-truthy-dumps.timer
Warning: Stopping commonsjson-dump.service, but it can still be activated by:
  commonsjson-dump.timer
Warning: Stopping commonsrdf-dump.service, but it can still be activated by:
  commonsrdf-dump.timer

But then all the dumps came back, presumably because there was a backlog of systemd timers for each of these dumps.

@bking and I huddled; Brian helped kill all the stray processes, and we should now be back to the regular timed runs for each dump:

systemctl status *wikidata*.timer
● wikidatardf-truthy-dumps.timer - Periodic execution of wikidatardf-truthy-dumps.service
     Loaded: loaded (/lib/systemd/system/wikidatardf-truthy-dumps.timer; enabled; vendor preset: enabled)
     Active: active (waiting) since Thu 2024-09-19 09:28:29 UTC; 5 months 14 days ago
    Trigger: Wed 2025-03-05 23:00:00 UTC; 1 day 7h left
   Triggers: ● wikidatardf-truthy-dumps.service

Warning: some journal files were not opened due to insufficient permissions.
● wikidatajson-lexemes-dump.timer - Periodic execution of wikidatajson-lexemes-dump.service
     Loaded: loaded (/lib/systemd/system/wikidatajson-lexemes-dump.timer; enabled; vendor preset: enabled)
     Active: active (waiting) since Thu 2024-09-19 09:28:28 UTC; 5 months 14 days ago
    Trigger: Wed 2025-03-05 03:15:00 UTC; 11h left
   Triggers: ● wikidatajson-lexemes-dump.service

Warning: some journal files were not opened due to insufficient permissions.
● wikidatardf-all-dumps.timer - Periodic execution of wikidatardf-all-dumps.service
     Loaded: loaded (/lib/systemd/system/wikidatardf-all-dumps.timer; enabled; vendor preset: enabled)
     Active: active (waiting) since Thu 2024-09-19 09:28:28 UTC; 5 months 14 days ago
    Trigger: Mon 2025-03-10 23:00:00 UTC; 6 days left
   Triggers: ● wikidatardf-all-dumps.service

Warning: some journal files were not opened due to insufficient permissions.
● wikidatajson-dump.timer - Periodic execution of wikidatajson-dump.service
     Loaded: loaded (/lib/systemd/system/wikidatajson-dump.timer; enabled; vendor preset: enabled)
     Active: active (waiting) since Thu 2024-09-19 09:28:27 UTC; 5 months 14 days ago
    Trigger: Mon 2025-03-10 03:15:00 UTC; 5 days left
   Triggers: ● wikidatajson-dump.service

Warning: some journal files were not opened due to insufficient permissions.
● wikidatardf-lexemes-dumps.timer - Periodic execution of wikidatardf-lexemes-dumps.service
     Loaded: loaded (/lib/systemd/system/wikidatardf-lexemes-dumps.timer; enabled; vendor preset: enabled)
     Active: active (waiting) since Thu 2024-09-19 09:28:29 UTC; 5 months 14 days ago
    Trigger: Fri 2025-03-07 23:00:00 UTC; 3 days left
   Triggers: ● wikidatardf-lexemes-dumps.service


systemctl status *commons*.timer
● commonsrdf-dump.timer - Periodic execution of commonsrdf-dump.service
     Loaded: loaded (/lib/systemd/system/commonsrdf-dump.timer; enabled; vendor preset: enabled)
     Active: active (waiting) since Thu 2024-09-19 09:28:30 UTC; 5 months 14 days ago
    Trigger: Sun 2025-03-09 19:00:00 UTC; 5 days left
   Triggers: ● commonsrdf-dump.service

Warning: some journal files were not opened due to insufficient permissions.
● commonsjson-dump.timer - Periodic execution of commonsjson-dump.service
     Loaded: loaded (/lib/systemd/system/commonsjson-dump.timer; enabled; vendor preset: enabled)
     Active: active (waiting) since Thu 2024-09-19 09:28:30 UTC; 5 months 14 days ago
    Trigger: Mon 2025-03-10 03:15:00 UTC; 5 days left
   Triggers: ● commonsjson-dump.service

This does mean that we need to wait for a while for the next runs of these dumps as per above. Specifically, wikidatajson-dump.service will run next on Mon 2025-03-10 03:15:00 UTC; 5 days left.

This does mean that we need to wait for a while for the next runs of these dumps as per above. Specifically, wikidatajson-dump.service will run next on Mon 2025-03-10 03:15:00 UTC; 5 days left.

wikidatajson-dump.service seems to be running again – I see some promising temporary output files in /mnt/dumpsdata/xmldatadumps/temp/ on snapshot1016, and no errors in the logs at /var/log/wikidatadump/dumpwikidata-wikidata-20250310-*. Let’s see if it manages to successfully finish the dump and merge and publish the output.

Still stuck at 2025-01-20

We'd expect the run to be done by now if it was going to be successful, right?

This does mean that we need to wait for a while for the next runs of these dumps as per above. Specifically, wikidatajson-dump.service will run next on Mon 2025-03-10 03:15:00 UTC; 5 days left.

wikidatajson-dump.service seems to be running again – I see some promising temporary output files in /mnt/dumpsdata/xmldatadumps/temp/ on snapshot1016, and no errors in the logs at /var/log/wikidatadump/dumpwikidata-wikidata-20250310-*. Let’s see if it manages to successfully finish the dump and merge and publish the output.

While super slow compared to prior dumps, the current wikidatajson-dump.service run is still making progress. Here is a listing of its temporary files, ordered by last touched, descending (full results on P74222):

xcollazo@snapshot1016:/mnt/dumpsdata/xmldatadumps/temp$ ls -lsha -t *wikidata*json*
151M -rw-r--r-- 1 dumpsgen dumpsgen 151M Mar 13 14:09 wikidata-all.7-batch70.json.gz
151M -rw-r--r-- 1 dumpsgen dumpsgen 151M Mar 13 14:09 wikidata-all.6-batch70.json.gz
150M -rw-r--r-- 1 dumpsgen dumpsgen 150M Mar 13 14:09 wikidata-all.5-batch70.json.gz
150M -rw-r--r-- 1 dumpsgen dumpsgen 150M Mar 13 14:09 wikidata-all.4-batch70.json.gz
152M -rw-r--r-- 1 dumpsgen dumpsgen 152M Mar 13 14:09 wikidata-all.3-batch70.json.gz
149M -rw-r--r-- 1 dumpsgen dumpsgen 149M Mar 13 14:09 wikidata-all.2-batch70.json.gz
150M -rw-r--r-- 1 dumpsgen dumpsgen 150M Mar 13 14:09 wikidata-all.1-batch70.json.gz
152M -rw-r--r-- 1 dumpsgen dumpsgen 152M Mar 13 14:09 wikidata-all.0-batch70.json.gz
162M -rw-r--r-- 1 dumpsgen dumpsgen 162M Mar 13 12:56 wikidata-all.2-batch69.json.gz
161M -rw-r--r-- 1 dumpsgen dumpsgen 161M Mar 13 12:56 wikidata-all.4-batch69.json.gz
159M -rw-r--r-- 1 dumpsgen dumpsgen 159M Mar 13 12:55 wikidata-all.5-batch69.json.gz
159M -rw-r--r-- 1 dumpsgen dumpsgen 159M Mar 13 12:55 wikidata-all.6-batch69.json.gz
162M -rw-r--r-- 1 dumpsgen dumpsgen 162M Mar 13 12:54 wikidata-all.1-batch69.json.gz
158M -rw-r--r-- 1 dumpsgen dumpsgen 158M Mar 13 12:53 wikidata-all.3-batch69.json.gz
162M -rw-r--r-- 1 dumpsgen dumpsgen 162M Mar 13 12:53 wikidata-all.0-batch69.json.gz
161M -rw-r--r-- 1 dumpsgen dumpsgen 161M Mar 13 12:52 wikidata-all.7-batch69.json.gz
217M -rw-r--r-- 1 dumpsgen dumpsgen 217M Mar 13 11:32 wikidata-all.5-batch68.json.gz
221M -rw-r--r-- 1 dumpsgen dumpsgen 221M Mar 13 11:31 wikidata-all.2-batch68.json.gz
218M -rw-r--r-- 1 dumpsgen dumpsgen 218M Mar 13 11:31 wikidata-all.4-batch68.json.gz
218M -rw-r--r-- 1 dumpsgen dumpsgen 218M Mar 13 11:30 wikidata-all.6-batch68.json.gz
218M -rw-r--r-- 1 dumpsgen dumpsgen 218M Mar 13 11:29 wikidata-all.3-batch68.json.gz
221M -rw-r--r-- 1 dumpsgen dumpsgen 221M Mar 13 11:29 wikidata-all.1-batch68.json.gz
220M -rw-r--r-- 1 dumpsgen dumpsgen 220M Mar 13 11:29 wikidata-all.0-batch68.json.gz
219M -rw-r--r-- 1 dumpsgen dumpsgen 219M Mar 13 11:26 wikidata-all.7-batch68.json.gz
...

So the dumps are indeed moving forward, and each batch is finishing fully. Super slowly, but fully.

This got me thinking that this behavior started happening when we switched the dumps from hitting the production replicas to the analytics replicas. dbstore1009, which serves wikidatawiki, seems quite idle as well over the last 2 days.

Perhaps this job is being throttled somehow? CC @BTullis.

So the dumps are indeed moving forward, and each batch is finishing fully. Super slowly, but fully.

This got me thinking that this behavior started happening when we switched the dumps from hitting the production replicas to the analytics replicas. dbstore1009, which serves wikidatawiki, seems quite idle as well over the last 2 days.

Just for reference, that MariaDB graph is showing the performance of the s6 shard, which is running on port 3316. If you drop down the port number and select 3318, this will show s8, which is the wikidata section.
https://grafana.wikimedia.org/goto/UejX3dhNg?orgId=1

This isn't idle; it's currently doing about 750 ops/s.

I also thought I'd check whether there were any differences in the indexes between dbstore1009 and db1167, which is where it would be running if we hadn't overridden it.

I ran the following mariadb query on both servers:

SELECT DISTINCT TABLE_NAME, INDEX_NAME FROM INFORMATION_SCHEMA.STATISTICS WHERE TABLE_SCHEMA = 'wikidatawiki' ORDER BY TABLE_NAME;

The short answer is that both servers returned identical lists of indexes.

There might be a difference in the my.cnf settings for these two servers, but based on the host overview, I don't think that it's a database host saturation problem.
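(For reproducibility, a sketch of that comparison, assuming direct mysql client access to both hosts; the connection details are illustrative, with dbstore1009 serving s8 on port 3318 as noted above and db1167 on the default port. INDEX_NAME is added to the ORDER BY so the diff is deterministic.)

QUERY="SELECT DISTINCT TABLE_NAME, INDEX_NAME FROM INFORMATION_SCHEMA.STATISTICS WHERE TABLE_SCHEMA = 'wikidatawiki' ORDER BY TABLE_NAME, INDEX_NAME;"
# Adjust hosts, ports, and credentials to the local setup.
mysql -h dbstore1009.eqiad.wmnet -P 3318 -e "$QUERY" > /tmp/dbstore1009-indexes.txt
mysql -h db1167.eqiad.wmnet -e "$QUERY" > /tmp/db1167-indexes.txt
diff /tmp/dbstore1009-indexes.txt /tmp/db1167-indexes.txt && echo "indexes identical"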

If you drop down the port number and select 3318, this will show s8, which is the wikidata section.

Ah what a silly mistake on my part, my bad!

I don't think that it's a database host saturation problem.

Agreed. Even at 750 ops/s that seems quite nominal.

I also thought I'd check whether there were any differences in the indexes between dbstore1009 and db1167

@BTullis Would the wikidata XML dumps have also run against db1167 in the old scenario? I ask because they are also running concurrently against dbstore1009.

I also thought I'd check whether there were any differences in the indexes between dbstore1009 and db1167

@BTullis Would the wikidata XML dumps have also run against db1167 in the old scenario? I ask because they are also running concurrently against dbstore1009.

Yes, I believe that both types would also have run concurrently against db1167 prior to the switch. I'm getting that information from https://noc.wikimedia.org/dbconfig/eqiad.json and looking at groupLoadsBySection:s8.
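(A quick way to pull that section out of the dbconfig JSON, assuming jq is available; the recursive lookup avoids depending on the exact nesting of the document.)

curl -s https://noc.wikimedia.org/dbconfig/eqiad.json \
  | jq '.. | objects | select(has("groupLoadsBySection")) | .groupLoadsBySection.s8'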

Change #1128386 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] dumps: Stop using the analytics replicas for misc dumps

https://gerrit.wikimedia.org/r/1128386

Thank you for further looking into this.
We have several people asking about the dumps. Do we have any ETA for the current run?

Change #1128386 merged by Btullis:

[operations/puppet@production] dumps: Stop using the analytics replicas for misc dumps

https://gerrit.wikimedia.org/r/1128386

Thank you for further looking into this.
We have several people asking about the dumps. Do we have any ETA for the current run?

With @BTullis's https://gerrit.wikimedia.org/r/1128386 patch, we think the performance regression is fixed.

Elsewhere, @dcausse had shared a way to estimate:

roughly estimated: a 64k chunk done in 75 min over 8 parallel threads; given 116M items to export, this is roughly 11.7 days

Now that the patch is effective, recent batches take ~20 mins, and we expect a total of 116000000 / (64000 * 8) ≈ 227 batches.

We are currently at batch 149. Thus ((227 total batches - 149 done batches) * 20 min/batch) / 60 min/hour ≈ 26 hours.

(That is for the dump to finish, *not* for the downstream pipelines to be done).
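(The same back-of-the-envelope arithmetic as a one-liner, using the rough figures above as inputs:)

awk 'BEGIN { batches = 116000000 / (64000 * 8); printf "total batches ~%d, ETA ~%.0f hours\n", batches + 0.5, (batches - 149) * 20 / 60 }'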

I have good news and bad news.

The good news is that the wikibase dumps completed even more rapidly than @xcollazo had estimated above.
The 20250310 dump of wikidata-all finished at around 05:10 this morning, having completed 243 batches without error.

btullis@snapshot1016:/var/log/wikidatadump$ for i in $(ls /var/log/wikidatadump/dumpwikidata-wikidata-20250310-all-?.json.log); do echo $i ; grep Starting $i | tail -n 1 ; tail -n 1 $i ; done
/var/log/wikidatadump/dumpwikidata-wikidata-20250310-all-0.json.log
(2025-03-18T05:24+00:00) Starting batch 243
Processed 97848 entities.
/var/log/wikidatadump/dumpwikidata-wikidata-20250310-all-1.json.log
(2025-03-18T05:24+00:00) Starting batch 243
Processed 97955 entities.
/var/log/wikidatadump/dumpwikidata-wikidata-20250310-all-2.json.log
(2025-03-18T05:24+00:00) Starting batch 243
Processed 98302 entities.
/var/log/wikidatadump/dumpwikidata-wikidata-20250310-all-3.json.log
(2025-03-18T05:24+00:00) Starting batch 243
Processed 97909 entities.
/var/log/wikidatadump/dumpwikidata-wikidata-20250310-all-4.json.log
(2025-03-18T05:24+00:00) Starting batch 243
Processed 98167 entities.
/var/log/wikidatadump/dumpwikidata-wikidata-20250310-all-5.json.log
(2025-03-18T05:24+00:00) Starting batch 243
Processed 97316 entities.
/var/log/wikidatadump/dumpwikidata-wikidata-20250310-all-6.json.log
(2025-03-18T05:24+00:00) Starting batch 243
Processed 98597 entities.
/var/log/wikidatadump/dumpwikidata-wikidata-20250310-all-7.json.log
(2025-03-18T05:29+00:00) Starting batch 243
Processed 98153 entities.

The bad news is that the dump seems to have been deleted before being synchronized to https://dumps.wikimedia.org/wikidatawiki/entities/20250310/

There are only three directories shown on the intermediate NFS server: 20250312, 20250314, and 20250318.

(Screenshot: directory listing on the intermediate NFS server.)

I have yet to track down what that housekeeping process is, but I will continue to do so.
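(A few generic places one could look for such a cleanup job on the dumps hosts; these are guesses, not confirmed locations:)

# Timers or cron entries that might prune old dump directories (illustrative searches).
systemctl list-timers --all | grep -iE 'clean|prune|dump'
sudo crontab -l -u dumpsgen 2>/dev/null
grep -rniE 'find .* -mtime|rm -rf' /etc/cron.d/ /etc/cron.daily/ 2>/dev/null | grep -i dump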

Today's dump (20250318) is proceeding at the proper speed, and it seems to be completing a batch roughly every 5 minutes.

btullis@snapshot1016:/var/log/wikidatadump$ for i in $(ls /var/log/wikidatadump/dumpwikidata-wikidata-20250318-all-?.json.log); do echo $i ; grep Starting $i | tail -n 5 ; done
/var/log/wikidatadump/dumpwikidata-wikidata-20250318-all-0.json.log
(2025-03-18T11:28+00:00) Starting batch 40
(2025-03-18T11:35+00:00) Starting batch 41
(2025-03-18T11:40+00:00) Starting batch 42
(2025-03-18T11:46+00:00) Starting batch 43
(2025-03-18T11:50+00:00) Starting batch 44
/var/log/wikidatadump/dumpwikidata-wikidata-20250318-all-1.json.log
(2025-03-18T11:28+00:00) Starting batch 40
(2025-03-18T11:35+00:00) Starting batch 41
(2025-03-18T11:40+00:00) Starting batch 42
(2025-03-18T11:46+00:00) Starting batch 43
(2025-03-18T11:50+00:00) Starting batch 44
/var/log/wikidatadump/dumpwikidata-wikidata-20250318-all-2.json.log
(2025-03-18T11:28+00:00) Starting batch 40
(2025-03-18T11:35+00:00) Starting batch 41
(2025-03-18T11:40+00:00) Starting batch 42
(2025-03-18T11:46+00:00) Starting batch 43
(2025-03-18T11:50+00:00) Starting batch 44
/var/log/wikidatadump/dumpwikidata-wikidata-20250318-all-3.json.log
(2025-03-18T11:28+00:00) Starting batch 40
(2025-03-18T11:35+00:00) Starting batch 41
(2025-03-18T11:40+00:00) Starting batch 42
(2025-03-18T11:46+00:00) Starting batch 43
(2025-03-18T11:50+00:00) Starting batch 44
/var/log/wikidatadump/dumpwikidata-wikidata-20250318-all-4.json.log
(2025-03-18T11:28+00:00) Starting batch 40
(2025-03-18T11:35+00:00) Starting batch 41
(2025-03-18T11:40+00:00) Starting batch 42
(2025-03-18T11:46+00:00) Starting batch 43
(2025-03-18T11:50+00:00) Starting batch 44
/var/log/wikidatadump/dumpwikidata-wikidata-20250318-all-5.json.log
(2025-03-18T11:28+00:00) Starting batch 40
(2025-03-18T11:35+00:00) Starting batch 41
(2025-03-18T11:40+00:00) Starting batch 42
(2025-03-18T11:46+00:00) Starting batch 43
(2025-03-18T11:50+00:00) Starting batch 44
/var/log/wikidatadump/dumpwikidata-wikidata-20250318-all-6.json.log
(2025-03-18T11:28+00:00) Starting batch 40
(2025-03-18T11:35+00:00) Starting batch 41
(2025-03-18T11:40+00:00) Starting batch 42
(2025-03-18T11:46+00:00) Starting batch 43
(2025-03-18T11:50+00:00) Starting batch 44
/var/log/wikidatadump/dumpwikidata-wikidata-20250318-all-7.json.log
(2025-03-18T11:28+00:00) Starting batch 40
(2025-03-18T11:35+00:00) Starting batch 41
(2025-03-18T11:40+00:00) Starting batch 42
(2025-03-18T11:46+00:00) Starting batch 43
(2025-03-18T11:50+00:00) Starting batch 44

If we estimate that there are 200 batches remaining, at 5 minutes per batch, that would give us ~17 hours until 20250318 is completed.

The bad news is that the dump seems to have been deleted before being synchronized to https://dumps.wikimedia.org/wikidatawiki/entities/20250310/

For internal purposes though, we did find what appears to be a legit copy of the final artifact for the 20250310 dump at dumpsdata1003:/data/xmldatadumps/temp/wikidata-all.json.gz.

@BTullis copied it to my home folder on stat1011, and from there I did the following:

From stat1011:

xcollazo@stat1011:~$ hdfs dfs -put wikidata-all.json.gz /user/xcollazo/artifacts

xcollazo@stat1011:~$ hdfs dfs -ls /user/xcollazo/artifacts | grep wikidata
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
-rw-r-----   3 xcollazo xcollazo 144592690000 2025-03-18 17:37 /user/xcollazo/artifacts/wikidata-all.json.gz

Now let's move it to its final spot, add the _IMPORTED flag, and set the proper permissions. But we need to be the hdfs user for all of this:

xcollazo@an-launcher1002:~$ hostname -f
an-launcher1002.eqiad.wmnet

xcollazo@an-launcher1002:~$ sudo -u hdfs hdfs dfs -mkdir /wmf/data/raw/wikidata/dumps/all_json/20250310

xcollazo@an-launcher1002:~$ sudo -u hdfs hdfs dfs -mv /user/xcollazo/artifacts/wikidata-all.json.gz /wmf/data/raw/wikidata/dumps/all_json/20250310/

xcollazo@an-launcher1002:~$ sudo -u hdfs hdfs dfs -touchz /wmf/data/raw/wikidata/dumps/all_json/20250310/_IMPORTED

xcollazo@an-launcher1002:~$ sudo -u hdfs hdfs dfs -chown -R analytics /wmf/data/raw/wikidata/dumps/all_json/20250310

xcollazo@an-launcher1002:~$ sudo -u hdfs hdfs dfs -chgrp -R analytics-privatedata-users /wmf/data/raw/wikidata/dumps/all_json/20250310

xcollazo@an-launcher1002:~$ sudo -u hdfs hdfs dfs -ls /wmf/data/raw/wikidata/dumps/all_json/20250310
Found 2 items
-rw-r-----   3 analytics analytics-privatedata-users            0 2025-03-18 17:47 /wmf/data/raw/wikidata/dumps/all_json/20250310/_IMPORTED
-rw-r-----   3 analytics analytics-privatedata-users 144592690000 2025-03-18 17:37 /wmf/data/raw/wikidata/dumps/all_json/20250310/wikidata-all.json.gz

The Airflow sensor was successful.

Now the import-to-Hive job is running in YARN. This job typically takes ~1 hour to run. Once it finishes, I will do some manual verifications to check whether it looks ok compared to a previous snapshot.
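(A minimal sketch of the kind of sanity check meant here, comparing the new snapshot's row count against the last good one via the spark-sql CLI; the exact checks may end up being different.)

spark-sql -e "SELECT snapshot, COUNT(1) AS entities FROM wmf.wikidata_entity WHERE snapshot IN ('2025-01-20', '2025-03-10') GROUP BY snapshot ORDER BY snapshot;"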

Now the import-to-Hive job is running in YARN. This job typically takes ~1 hour to run. Once it finishes, I will do some manual verifications to check whether it looks ok compared to a previous snapshot.

The job looks like it will take way more time than expected, as it is currently 2 hours in with 4.5M rows processed. Since we have ~116M rows, it should take at least ~51 hours more. In hindsight, I suspect this is probably because the regular job would hdfs-rsync the bzip2 version of the dump, which is Hadoop-splittable, instead of the gzip file, which is not. Thus, there is only one executor reading the file right now instead of several.

I am going to kill the job now and attempt to recompress the file accordingly.

Running the following now on stat1011:

gunzip -c wikidata-all.json.gz | bzip2 > wikidata-all.json.bz2

Will check later today for progress.

For reference, the current wikidata-20250318-all dump is at around batch 162 of the 243 expected.

Every 5.0s: for i in $(ls /var/log/wikidatadump/dumpwikidata-wikidata-20250318-all-?.json.log); do echo $i ; grep Starting $i | tail -n 1 ; tail -n 1 $i ; done              snapshot1016: Wed Mar 19 12:55:54 2025

/var/log/wikidatadump/dumpwikidata-wikidata-20250318-all-0.json.log
(2025-03-19T12:46+00:00) Starting batch 161
Processed 53770 entities.
/var/log/wikidatadump/dumpwikidata-wikidata-20250318-all-1.json.log
(2025-03-19T12:46+00:00) Starting batch 161
Processed 53454 entities.
/var/log/wikidatadump/dumpwikidata-wikidata-20250318-all-2.json.log
(2025-03-19T12:51+00:00) Starting batch 161
Processed 23072 entities.
/var/log/wikidatadump/dumpwikidata-wikidata-20250318-all-3.json.log
(2025-03-19T12:48+00:00) Starting batch 158
Processed 46269 entities.
/var/log/wikidatadump/dumpwikidata-wikidata-20250318-all-4.json.log
(2025-03-19T12:47+00:00) Starting batch 155
Processed 51129 entities.
/var/log/wikidatadump/dumpwikidata-wikidata-20250318-all-5.json.log
(2025-03-19T12:50+00:00) Starting batch 159
Processed 32579 entities.
/var/log/wikidatadump/dumpwikidata-wikidata-20250318-all-6.json.log
(2025-03-19T12:48+00:00) Starting batch 156
Processed 43228 entities.
/var/log/wikidatadump/dumpwikidata-wikidata-20250318-all-7.json.log
(2025-03-19T12:45+00:00) Starting batch 158
Processed 57115 entities.

Running the following now on stat1011:

gunzip -c wikidata-all.json.gz | bzip2 > wikidata-all.json.bz2

Will check later today for progress.

It was silly of me to think that a local run on a stat machine would be any faster, as decompressing a .gz file can only saturate one CPU core. The conversion is currently at 22GB/135GB. I will let it finish, but it looks like the regular run for 20250318 that @BTullis describes in T386255#10651754 will finish first.

It probably won't matter as things are in flight, but it may be possible to speed up the decompression with pigz. We don't seem to have pbzip2 right now (it seems like it may have been used years ago, unless I'm searching incorrectly), although that might be helpful for speeding along the creation of the bz2 (bzip2 will be slow) if it would be an acceptable package.
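(For the record, a sketch of what that pipeline could look like if pigz and a parallel bzip2 implementation such as lbzip2 or pbzip2 were installed, which, as noted, they may not be:)

# pigz decompression is still effectively single-threaded (see the manual excerpt below),
# but a parallel bzip2 compressor would remove the bigger bottleneck on the compression side.
pigz -dc wikidata-all.json.gz | lbzip2 -c > wikidata-all.json.bz2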

It probably won't matter as things are in flight, but it may be possible to speed up the decompression with pigz

Thanks @dr0ptp4kt. Took a quick look; it seems pigz can compress in parallel, but can only decompress in one thread.

My experiment is still at 30G/135GB, so best bet here is to follow up on T386255#10651754.

Thanks @dr0ptp4kt. Took a quick look; it seems pigz can compress in parallel, but can only decompress in one thread.

Oh, that's right, I see in the manual:

Decompression can't be parallelized, at least not without specially prepared deflate streams for that purpose. As a result, pigz uses a single thread (the main thread) for decompression, but will create three other threads for reading, writing, and check calculation, which can speed up decompression under some circumstances.

Sorry for the bother.

My experiment is still at 30G/135GB, so best bet here is to follow up on T386255#10651754.

Yeah.

Gzipped download files landed on https://dumps.wikimedia.org/other/wikibase/wikidatawiki/20250318/
Re-zipping for other compression formats is in progress.

OK, good news is starting to trickle in.

The 20250310 run of the commonswiki rdf dumps, which was also previously running very, very slowly, has also finished successfully: https://dumps.wikimedia.org/other/wikibase/commonswiki/20250310/

The corresponding Airflow DAG has also run, and the data is now available in Hive as well:

spark-sql (default)> show partitions structured_data.commons_entity;
partition
snapshot=2024-12-16
snapshot=2024-12-23
snapshot=2024-12-30
snapshot=2025-01-06
snapshot=2025-01-13
snapshot=2025-01-20
snapshot=2025-03-03
Time taken: 0.385 seconds, Fetched 7 row(s)

Interestingly, the Airflow job loads the dump as 2025-03-03 instead of 2025-03-10. But that is existing behavior and out of scope for this ticket.

CC @Cparle @mfossati

xcollazo opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1167

analytics: Allow dump_location override for wikidata_dump_to_hive_weekly DAG.

xcollazo merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1167

analytics: Allow dump_location override for wikidata_dump_to_hive_weekly DAG.

Partition snapshot=2025-03-10 for wmf.wikidata_entity now available in the datalake:

spark-sql (default)> show partitions wmf.wikidata_entity;
partition
snapshot=2024-12-16
snapshot=2024-12-23
snapshot=2024-12-30
snapshot=2025-01-06
snapshot=2025-01-13
snapshot=2025-01-20
snapshot=2025-03-10

Partition snapshot=2025-03-10 for wmf.wikidata_item_page_link now available in the datalake:

spark-sql (default)> show partitions wmf.wikidata_item_page_link;
partition
snapshot=2024-12-16
snapshot=2024-12-23
snapshot=2024-12-30
snapshot=2025-01-06
snapshot=2025-01-13
snapshot=2025-01-20
snapshot=2025-03-10

Ok, to recap:

I think this concludes this particular Dumps 1 saga. Closing.

Interestingly, the Airflow job loads the dump as 2025-03-03 instead of 2025-03-10. But that is existing behavior and out of scope for this ticket.

That's odd.
wmf.wikidata_item_page_link and wmf.wikidata_entity, but also structured_data.commons_entity, all have historical data up until 2025-01-20, which makes it look like the snapshot dates used to match in the past (or structured_data.commons_entity managed to complete one more run while the others had failed).

I filed T389601 to fix those date discrepancies.

Ok, to recap:

I think this concludes this particular Dumps 1 saga. Closing.

Sounds good to me, thanks a lot!

@Hannah_Bast: Please file new separate tickets for new problems - thanks.

@Aklapper As a "Bug Report" or as a "Production Error"?

Change #1120647 merged by jenkins-bot:

[analytics/refinery/source@master] Delete dead code that used to generate wmf.wikidata_item_page_link.

https://gerrit.wikimedia.org/r/1120647