
Automate the loading of canonical data tables to the Data Lake
Open, Medium, Public

Description

The canonical data tables have been widely adopted by data engineers, researchers, and analysts.

The current workflow for changes is that first the TSV file in the Git repository is updated, and then a user with the appropriate permissions manually runs a Jupyter notebook that uploads the contents of the TSV to the Data Lake table.

This is unnecessary toil, and it creates the potential for subtle issues if a change is merged to the repo but never loaded to the Hive table, which is what consumers (including automated jobs) almost always query.

@Antoine_Quhen has already drafted an Airflow pipeline for loading the countries dataset, and this could easily be expanded to cover the wiki dataset as well.

Some optional ideas:

  • Switch the pyspark script to an HQL script (e.g. with a tmp view)
CREATE TEMPORARY VIEW tmp_countries USING csv OPTIONS (path 'file:///countries.tsv', header true, delimiter '\t');
  • Format the table as Iceberg (to keep the history of the table, including schema & data)
  • Consider whether the loading can be done as a post-commit hook (rather than at a fixed frequency)
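The first two ideas could be combined into a single Spark SQL script. A sketch only: the target table name canonical_data.countries is an assumption (it is not specified in this task), and the Iceberg table is assumed to already exist.

```sql
-- Read the TSV through a temporary view (note the tab delimiter).
CREATE TEMPORARY VIEW tmp_countries
USING csv
OPTIONS (path 'file:///countries.tsv', header true, delimiter '\t');

-- Overwrite the (assumed) Iceberg table. Each overwrite creates a new
-- Iceberg snapshot, so the table's schema and data history are retained.
INSERT OVERWRITE canonical_data.countries
SELECT * FROM tmp_countries;
```

With Iceberg, a bad load could then be undone by rolling back to the previous snapshot, which is one reason the format fits this use case.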

Event Timeline

JAnstee_WMF lowered the priority of this task from High to Medium.

For now, I don't think there are any serious questions about stewardship of the content of the canonical data tables. I've been doing that since the start, and now that I'm part of Movement Insights, the team is inheriting it from me. I doubt anyone objects! If a particular classification comes from another team (like the country protection list coming from Security), we will naturally defer to them.

I think the key issue here is that the data should be automatically loaded to the Data Lake, so some team needs to own that process. I'll update the task to reflect that.

@WDoranWMF @VirginiaPoundstone, could Data Products take it on? @Antoine_Quhen has already written most (all?) of the code to do it with Airflow, so the responsibility should be pretty limited.

nshahquinn-wmf renamed this task from Canonical-data ownership, definition and update to Automate the loading of canonical data tables to the Data Lake. Dec 9 2023, 2:40 AM
nshahquinn-wmf updated the task description.