
structured_data.commons_entity stuck at 2025-01-20
Closed, Resolved (Public)

Description

spark.sql('show partitions structured_data.commons_entity').show()

+-------------------+
|          partition|
+-------------------+
|snapshot=2024-12-16|
|snapshot=2024-12-23|
|snapshot=2024-12-30|
|snapshot=2025-01-06|
|snapshot=2025-01-13|
|snapshot=2025-01-20|
+-------------------+
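
To put a number on how stale the table is, the partition strings above can be parsed and compared against a reference date. This is only a minimal sketch in plain Python: the helper names `latest_snapshot` and `snapshot_age_days` are made up for illustration, and the reference date of 2025-03-04 is an assumption, not something stated in this task.

```python
from datetime import date

# Partition values copied from the `show partitions` output above.
partitions = [
    "snapshot=2024-12-16",
    "snapshot=2024-12-23",
    "snapshot=2024-12-30",
    "snapshot=2025-01-06",
    "snapshot=2025-01-13",
    "snapshot=2025-01-20",
]


def latest_snapshot(partition_names):
    """Return the newest date found in 'snapshot=YYYY-MM-DD' strings."""
    return max(date.fromisoformat(p.split("=", 1)[1]) for p in partition_names)


def snapshot_age_days(partition_names, reference):
    """Return how many days the newest snapshot lags behind `reference`."""
    return (reference - latest_snapshot(partition_names)).days


# Assumed reference date, chosen only for illustration.
reference_date = date(2025, 3, 4)
print(latest_snapshot(partitions), snapshot_age_days(partitions, reference_date))
```

With a live Spark session, the same list could presumably be built from the `partition` column returned by the query above.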

Maybe related to T386255: wmf.wikidata_item_page_link and wmf.wikidata_entity snapshots stuck at 2025-01-20? Later data seems to have been dumped successfully, though (see https://dumps.wikimedia.org/commonswiki/entities/20250203/), so maybe it's just a problem with ingestion into Hive.

Downstream tracking task: T385865: Resume data pipeline operations

Event Timeline

Agreed, this is likely related to T386255, as none of the Wikibase-related dumps appear to be able to finish. They typically take 2-3 days, but they have all been running for much longer than that:

xcollazo@snapshot1016:~$  systemctl status *wikidata*.service | grep Active -B 3
● wikidatajson-dump.service - Regular jobs to build json snapshot of wikidata
     Loaded: loaded (/lib/systemd/system/wikidatajson-dump.service; static)
     Active: activating (start) since Tue 2025-02-25 22:49:47 UTC; 6 days ago
--

● wikidatardf-all-dumps.service - Regular jobs to build rdf snapshot of wikidata
     Loaded: loaded (/lib/systemd/system/wikidatardf-all-dumps.service; static)
     Active: activating (start) since Mon 2025-02-17 23:00:00 UTC; 2 weeks 0 days ago
--
Warning: some journal files were not opened due to insufficient permissions.
● wikidatardf-truthy-dumps.service - Regular jobs to build rdf snapshot of wikidata truthy statements
     Loaded: loaded (/lib/systemd/system/wikidatardf-truthy-dumps.service; static)
     Active: activating (start) since Tue 2025-02-25 05:39:12 UTC; 1 weeks 0 days ago


xcollazo@snapshot1016:~$  systemctl status *common*.service | grep Active -B 3
● commonsjson-dump.service - Regular jobs to build json snapshot of commons structured data
     Loaded: loaded (/lib/systemd/system/commonsjson-dump.service; static)
     Active: activating (start) since Thu 2025-02-20 22:25:21 UTC; 1 weeks 4 days ago
--

● commonsrdf-dump.service - Regular jobs to build rdf snapshot of commons structured data
     Loaded: loaded (/lib/systemd/system/commonsrdf-dump.service; static)
     Active: activating (start) since Sat 2025-02-22 15:07:27 UTC; 1 weeks 2 days ago
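
The check done by eye above (units still "activating" well past the usual 2-3 days) could be scripted roughly as follows. This is a sketch under stated assumptions: `overdue_units` is a hypothetical helper, the 3-day limit is taken from the "2-3 days" estimate above, the `now` timestamp is assumed, and the regexes only cover the `systemctl status` line shapes pasted in this comment.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches the unit header line, e.g. "● commonsjson-dump.service - ...".
UNIT_RE = re.compile(r"^●\s+(\S+\.service)\b")
# Matches the start timestamp of a unit that is still activating.
SINCE_RE = re.compile(
    r"Active: activating \(start\) since \w{3} "
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) UTC"
)


def overdue_units(status_text, now, limit=timedelta(days=3)):
    """Map unit name -> time spent activating, for units exceeding `limit`."""
    overdue = {}
    unit = None
    for raw_line in status_text.splitlines():
        line = raw_line.strip()
        unit_match = UNIT_RE.match(line)
        if unit_match:
            unit = unit_match.group(1)
            continue
        since_match = SINCE_RE.search(line)
        if since_match and unit:
            started = datetime.strptime(
                since_match.group(1), "%Y-%m-%d %H:%M:%S"
            ).replace(tzinfo=timezone.utc)
            running = now - started
            if running > limit:
                overdue[unit] = running
    return overdue


# Two entries copied from the output above.
sample_status = """
● commonsjson-dump.service - Regular jobs to build json snapshot of commons structured data
     Loaded: loaded (/lib/systemd/system/commonsjson-dump.service; static)
     Active: activating (start) since Thu 2025-02-20 22:25:21 UTC; 1 weeks 4 days ago
--
● commonsrdf-dump.service - Regular jobs to build rdf snapshot of commons structured data
     Loaded: loaded (/lib/systemd/system/commonsrdf-dump.service; static)
     Active: activating (start) since Sat 2025-02-22 15:07:27 UTC; 1 weeks 2 days ago
"""

# Assumed "current" time, for illustration only.
assumed_now = datetime(2025, 3, 4, 12, 0, 0, tzinfo=timezone.utc)
overdue = overdue_units(sample_status, assumed_now)
for name, running in overdue.items():
    print(f"{name}: activating for {running.days} days")
```

The date is captured without the weekday so that parsing does not depend on the locale's day names.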

show partitions commons_entity;
OK
partition
snapshot=2025-01-06
snapshot=2025-01-13
snapshot=2025-01-20
snapshot=2025-03-03
snapshot=2025-03-10
snapshot=2025-03-17
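
The listing above jumps from 2025-01-20 straight to 2025-03-03. A small sketch for enumerating the snapshots that were skipped, assuming a strictly weekly cadence (inferred from the dates listed, not confirmed anywhere in this task); `missing_weekly_snapshots` is a hypothetical helper name:

```python
from datetime import date, timedelta

# Partition values copied from the listing above.
current_partitions = [
    "snapshot=2025-01-06",
    "snapshot=2025-01-13",
    "snapshot=2025-01-20",
    "snapshot=2025-03-03",
    "snapshot=2025-03-10",
    "snapshot=2025-03-17",
]


def missing_weekly_snapshots(partition_names, step=timedelta(days=7)):
    """Return expected snapshot dates that are absent between existing ones."""
    dates = sorted(
        date.fromisoformat(p.split("=", 1)[1]) for p in partition_names
    )
    missing = []
    for previous, following in zip(dates, dates[1:]):
        expected = previous + step
        # Walk forward one step at a time until the next existing snapshot.
        while expected < following:
            missing.append(expected)
            expected += step
    return missing


gaps = missing_weekly_snapshots(current_partitions)
for gap in gaps:
    print(f"snapshot={gap.isoformat()}")
```

Under that weekly assumption, this lists the snapshot dates between 2025-01-20 and 2025-03-03 that never landed in the table.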