Automate the loading of canonical data tables to the Data Lake
Open, Medium, Public

Assigned To: None

Authored By: Antoine_Quhen, Jun 20 2023, 2:13 PM

Description

The canonical data tables have been widely adopted by data engineers, researchers, and analysts.

The current workflow for changes has two steps: first the TSV file in the Git repository is updated, and then a user with the appropriate permissions manually runs a Jupyter notebook that uploads the contents of the TSV to the Data Lake table.

This is unnecessary toil and creates the potential for subtle issues if a change is merged to the repo but not loaded to the Hive table, which is almost always what is used (including by automated jobs).
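To make the drift risk concrete, here is a minimal sketch of a check that compares a TSV's contents with the rows currently in the table. It is only an illustration: the function names and the `iso_code`/`name` columns are hypothetical, and fetching `table_rows` from Hive is left out because it depends on the cluster setup.

```python
import csv
import hashlib
import io


def read_tsv_rows(text):
    """Parse TSV text with a header row into a list of dicts (all values are strings)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row) for row in reader]


def fingerprint(rows):
    """Return an order-independent SHA-256 fingerprint of a list of row dicts."""
    digest = hashlib.sha256()
    # Sort both the columns within a row and the rows themselves,
    # so that neither column order nor row order affects the result.
    for line in sorted(repr(sorted(row.items())) for row in rows):
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def is_stale(tsv_text, table_rows):
    """True if the table rows no longer match the TSV contents."""
    return fingerprint(read_tsv_rows(tsv_text)) != fingerprint(table_rows)
```

A check like this could run as a scheduled alert, but it only detects the problem; automating the load removes it.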

@Antoine_Quhen has already drafted an Airflow pipeline for loading the countries dataset, and this could easily be expanded to cover the wiki dataset as well:

Some optional ideas:

Switch the pyspark script to an HQL script (e.g. with a tmp view)

CREATE TEMPORARY VIEW tmp_countries USING csv OPTIONS (path 'file:///countries.tsv', header true, sep '\t');

Format the table as Iceberg (to keep the history of the table, including schema & data)
Consider whether the loading can be done as a post-commit hook (rather than at a fixed frequency)
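For the commit-triggered idea, one possible shape is a small function that maps the files changed in a commit to the tables that need reloading. This is a sketch under assumptions: the repository paths and table names in `TABLES_BY_TSV` are hypothetical placeholders, and the actual load step is not shown.

```python
# Hypothetical mapping from TSV paths in the repository to Data Lake tables.
# Neither the paths nor the table names come from the task; adjust to the real layout.
TABLES_BY_TSV = {
    "countries/countries.tsv": "canonical_data.countries",
    "wiki/wikis.tsv": "canonical_data.wikis",
}


def tables_to_reload(changed_paths, tables_by_tsv=TABLES_BY_TSV):
    """Return the sorted table names whose source TSV appears in changed_paths."""
    changed = set(changed_paths)
    return sorted(table for tsv, table in tables_by_tsv.items() if tsv in changed)
```

In a hook, `changed_paths` could come from something like `git diff --name-only HEAD~1 HEAD`; a commit that touches no TSV would then trigger no load at all.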

Related Objects

Status	Subtype	Assigned	Task
Open		nshahquinn-wmf	T369207 Canonical Data | Update, Maintain, Improve
Open		None	T339928 Automate the loading of canonical data tables to the Data Lake

Event Timeline

Antoine_Quhen created this task. Jun 20 2023, 2:13 PM

Restricted Application added a subscriber: Aklapper. Jun 20 2023, 2:13 PM

Antoine_Quhen updated the task description. Jun 20 2023, 3:46 PM

JArguello-WMF moved this task from Incoming (new tickets) to Data Products & Metrics on the Data-Engineering board. Jun 29 2023, 11:14 PM

nshahquinn-wmf subscribed. Aug 16 2023, 2:31 AM

nshahquinn-wmf added a project: Movement-Insights. Aug 16 2023, 2:41 AM

JAnstee_WMF moved this task from Incoming to Backlog on the Movement-Insights board. Aug 22 2023, 6:29 PM

nshahquinn-wmf mentioned this in T190700: Automate creation of sqoop list of wikis to import data for from sitematrix. Oct 25 2023, 6:07 PM

lbowmaker moved this task from Data Products & Metrics to Icebox (not considered in current quarter) on the Data-Engineering board. Nov 10 2023, 2:40 PM

JAnstee_WMF mentioned this in T352686: Canonical Requests. Dec 4 2023, 4:08 PM

JAnstee_WMF triaged this task as High priority. Dec 6 2023, 9:33 PM

JAnstee_WMF lowered the priority of this task from High to Medium.

For now, I don't think there are any serious questions about stewardship of the content of the canonical data tables. I've been doing that since the start, and now that I'm part of Movement Insights, the team is inheriting it from me. I doubt anyone objects! If a particular classification comes from another team (like the country protection list coming from Security), we will naturally defer to them.

I think the key issue here is that the data should be automatically loaded to the Data Lake, so some team needs to own that process. I'll update the task to reflect that.

@WDoranWMF @VirginiaPoundstone, could Data Products take it on? @Antoine_Quhen has already written most (all?) of the code to do it with Airflow, so the responsibility should be pretty limited.

nshahquinn-wmf renamed this task from "Canonical-data ownership, definition and update" to "Automate the loading of canonical data tables to the Data Lake". Dec 9 2023, 2:40 AM

nshahquinn-wmf updated the task description.

nshahquinn-wmf added a project: Analytics-Canonical-Data. Feb 22 2024, 10:19 PM

OSefu-WMF mentioned this in T369207: Canonical Data | Update, Maintain, Improve. Jul 3 2024, 6:28 PM

OSefu-WMF added a parent task: T369207: Canonical Data | Update, Maintain, Improve.
