Page MenuHomePhabricator

Automatically update the canonical data tables
Open, MediumPublic

Description

Current state

The canonical data tables have been widely adopted by data engineers, research, and analysts.

The current workflow for changes is that first the TSV file in the Git repository is updated, and then a user with the appropriate permissions manually runs a Jupyter notebook that uploads the contents to the TSV to the Data Lake table.

This is unnecessary toil and creates the potential for subtle issues if a change is merged to the repo but not loaded to the Hive table, which is almost always what is used (including by automated jobs).

Proposal

First, we should convert wiki/generate.ipynb to a pure Python script rather than a Jupyter notebook. I've already done one piece of preparatory work: logging a warning rather than having space for a manual corrections for null language names. The remaining work is:

  • Switching to warning when data from the sites table is null (sometimes the table doesn't get populated for a long time a wiki is created, e.g. T381379#10427762—I don't know if there's another, more-reliable place to source the data from)
  • Switching to warning when the English site name is null and upstreaming the existing 10 manual values
  • Converting to a Python file (very easy)

Then, we should write an Airflow job that periodically:

  1. Re-runs wiki/generate.ipynb to re-generate the dataset. We should make sure that the jobs surfaces any emitted warnings (I don't think this will happen by default).
  2. Loads wikis.tsv and countries.tsv to the Data Lake.

The one weird point is how the Airflow job should interact with the Git repo. There's an important difference between the two datasets. For the country dataset, the TSV file is the source of truth; if you want to change it, you change the TSV file directly using whatever tool you like. So it's important that the country TSV lives in the repo so it can be version controlled.

However, for the wiki dataset, the generation script is the source of truth; if you change the TSV, it will get overwritten by the script when it next runs. So we either need to remove the TSV from the repo and find another place to host a static copy online, or have the Airflow job merge the updated TSV to the repo every time there's a change. Not sure how hard that would be.

At one point @Antoine_Quhen made a draft Airflow pipeline for loading the countries dataset, which could be a good starting point:

Some optional ideas:

  • Switch the pyspark script to an HQL script (e.g. with a tmp view)
CREATE TEMPORARY VIEW tmp_countries USING csv OPTIONS (path 'file:///countries.tsv', header true);
  • Format the table as Iceberg (to keep the history of the table, including schema & data)
  • Consider whether the loading can be done as a post-commit hook (rather than at a fixed frequency)

Event Timeline

JAnstee_WMF lowered the priority of this task from High to Medium.

For now, I don't think there are any serious questions about stewardship of the content of the canonical data tables. I've been doing that since the start, and now that I'm part of Movement Insights, the team is inheriting it from me. I doubt anyone objects! If a particular classification comes from another team (like the country protection list coming from Security), we will naturally defer to them.

I think the key issue here is that the data should be automatically loaded to the Data Lake, so some team needs to own that process. I'll update the task to reflect that.

@WDoranWMF @VirginiaPoundstone, could Data Products take it on? @Antoine_Quhen has already written most (all?) of the code to do it with Airflow, so the responsibility should be pretty limited.

nshahquinn-wmf renamed this task from Canonical-data ownership, definition and update to Automate the loading of canonical data tables to the Data Lake.Dec 9 2023, 2:40 AM
nshahquinn-wmf updated the task description. (Show Details)

FYI: this icebox task: T241741 was trying to do the same thing quite a while ago. I think this sore point seems minor but is quite central and has caused multiple problems

Should we close one of these tasks as a duplicate?

nshahquinn-wmf renamed this task from Automate the loading of canonical data tables to the Data Lake to Automatically update the canonical data tables.Mar 6 2025, 12:18 AM
nshahquinn-wmf updated the task description. (Show Details)

I've updated the description so it has a pretty full description about how I would approach doing this (as well as a few issues that a likely to be a bit tricky).

Hm, I wonder if some of the work here (interacting with git repos using airflow jobs) could be shared with {T380626}.

For wikis.tsv, what about moving that out of the GitLab repo and into analytics.wikimedia.org/published/ and then having an Airflow job generate the wikis table for the data lake and then publish it to web?

For wikis.tsv, what about moving that out of the GitLab repo and into analytics.wikimedia.org/published/ and then having an Airflow job generate the wikis table for the data lake and then publish it to web?

That's a good idea! Would certainly be easier than trying to get an Airflow job to push to the Git repo. We'd lose the ability to see historical changes in the dataset, but I think we could accept that.

Do you know of any existing Airflow jobs that publish files there?

May I recommend splitting this into two sub-tasks? One for countries, one for wikis.