Current state
The canonical data tables have been widely adopted by data engineers, research, and analysts.
The current workflow for changes is that first the TSV file in the Git repository is updated, and then a user with the appropriate permissions manually runs a Jupyter notebook that uploads the contents to the TSV to the Data Lake table.
This is unnecessary toil and creates the potential for subtle issues if a change is merged to the repo but not loaded to the Hive table, which is almost always what is used (including by automated jobs).
Proposal
First, we should convert wiki/generate.ipynb to a pure Python script rather than a Jupyter notebook. I've already done one piece of preparatory work: logging a warning rather than having space for a manual corrections for null language names. The remaining work is:
- Switching to warning when data from the sites table is null (sometimes the table doesn't get populated for a long time a wiki is created, e.g. T381379#10427762—I don't know if there's another, more-reliable place to source the data from)
- Switching to warning when the English site name is null and upstreaming the existing 10 manual values
- Converting to a Python file (very easy)
Then, we should write an Airflow job that periodically:
- Re-runs wiki/generate.ipynb to re-generate the dataset. We should make sure that the jobs surfaces any emitted warnings (I don't think this will happen by default).
- Loads wikis.tsv and countries.tsv to the Data Lake.
The one weird point is how the Airflow job should interact with the Git repo. There's an important difference between the two datasets. For the country dataset, the TSV file is the source of truth; if you want to change it, you change the TSV file directly using whatever tool you like. So it's important that the country TSV lives in the repo so it can be version controlled.
However, for the wiki dataset, the generation script is the source of truth; if you change the TSV, it will get overwritten by the script when it next runs. So we either need to remove the TSV from the repo and find another place to host a static copy online, or have the Airflow job merge the updated TSV to the repo every time there's a change. Not sure how hard that would be.
At one point @Antoine_Quhen made a draft Airflow pipeline for loading the countries dataset, which could be a good starting point:
- https://github.com/wikimedia-research/canonical-data/pull/7
- https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/438
Some optional ideas:
- Switch the pyspark script to an HQL script (e.g. with a tmp view)
CREATE TEMPORARY VIEW tmp_countries USING csv OPTIONS (path 'file:///countries.tsv', header true);
- Format the table as Iceberg (to keep the history of the table, including schema & data)
- Consider whether the loading can be done as a post-commit hook (rather than at a fixed frequency)