As part of SDS 2.6.2, I've been investigating the data dependencies of the movement metrics. Our critical path takes around 25 days and goes:
- XML dumps generation
- loading XML dumps to HDFS (Python script, template for running script, Puppet management of SystemD timers running script)
- mediawiki_wikitext_history
- research_article_quality (Airflow DAG, code)
- knowledge_gaps (Airflow DAG, code)
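For concreteness, the downstream DAGs in that chain can't start until the mediawiki_wikitext_history snapshot (built from the XML dumps) lands in Hive. Below is a minimal sketch of what that gating looks like, assuming a stock Airflow NamedHivePartitionSensor; the real DAGs may use different sensor wrappers, and the DAG id, partition name, and schedule here are illustrative only.

```python
# Sketch only: how a downstream DAG might block on the monthly
# mediawiki_wikitext_history snapshot built from the XML dumps.
# DAG id, partition name, schedule, and task layout are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.providers.apache.hive.sensors.named_hive_partition import (
    NamedHivePartitionSensor,
)

with DAG(
    dag_id="research_article_quality_sketch",
    start_date=datetime(2023, 1, 1),
    schedule="@monthly",
    catchup=False,
):
    # Wait (possibly for weeks) until the snapshot partition exists in Hive.
    wait_for_wikitext_history = NamedHivePartitionSensor(
        task_id="wait_for_mediawiki_wikitext_history",
        partition_names=[
            "wmf.mediawiki_wikitext_history/"
            "snapshot={{ data_interval_start.strftime('%Y-%m') }}"
        ],
        poke_interval=60 * 60,      # re-check hourly
        mode="reschedule",          # free the worker slot between checks
        timeout=60 * 60 * 24 * 30,  # give up after roughly a month
    )

    compute_article_quality = EmptyOperator(task_id="compute_article_quality")

    wait_for_wikitext_history >> compute_article_quality
```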
By far the longest portion (~19 days) is waiting for the XML dumps to be generated, but after the first ~7 days (when the English Wikipedia dump arrives), the remaining ~12 days are spent waiting only on the Wikidata dump. I doubt anyone is regularly using the Wikidata XML dump, since wmf.wikidata_entity (which comes from the JSON dump) is much better and faster to work with. The XML dump is apparently the only one that contains non-current revisions, but needing those is probably a very rare case.
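To illustrate the usual alternative: anyone who needs current entity data can read it straight from wmf.wikidata_entity in Spark. A quick sketch; the snapshot value and the labels/claims column names are from memory and worth double-checking.

```python
# Sketch: read current Wikidata entities from the Hive table built from the
# JSON dump rather than parsing the Wikidata XML dump. The snapshot value
# and the labels/claims column names are assumptions to verify.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("wikidata-entity-example").getOrCreate()

entities = (
    spark.table("wmf.wikidata_entity")
    .where(F.col("snapshot") == "2024-05-06")  # weekly JSON-dump snapshot
    .select("id", "labels", "claims")
)

# For example: count items that have an English label.
english_labelled = entities.where(F.col("labels").getItem("en").isNotNull())
print(english_labelled.count())
```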
Can we skip loading the Wikidata XML dump altogether? Other strategies, like splitting it out into a separate job, would also work, but simply skipping it would be much easier and, since no one appears to be using the data, likely safe.
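If we do go the skipping route, the change could be as small as excluding wikidatawiki from whatever wiki list the loader script iterates over. A hypothetical sketch; the function and variable names below are invented, not the real script's.

```python
# Hypothetical sketch of the "just skip it" option. The real loader script
# is structured differently; names here are made up for illustration.
SKIP_WIKIS = {"wikidatawiki"}

def wikis_to_load(all_wikis):
    """Return the dump wikis whose XML should still be copied to HDFS."""
    return [wiki for wiki in all_wikis if wiki not in SKIP_WIKIS]

# e.g. wikis_to_load(["enwiki", "wikidatawiki", "dewiki"]) -> ["enwiki", "dewiki"]
```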