The canonical data tables have been widely adopted by data engineers, researchers, and analysts.
The current workflow for changes is to first update the TSV file in the Git repository, and then have a user with the appropriate permissions manually run a Jupyter notebook that uploads the contents of the TSV to the Data Lake table.
This is unnecessary toil and creates the potential for subtle issues if a change is merged to the repo but never loaded into the Hive table, which is what is almost always used in practice (including by automated jobs).
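For concreteness, the load itself is a small PySpark job along the lines of the sketch below; the file path and table name are assumptions for illustration, not the actual notebook's contents:

```python
# Minimal sketch of the TSV -> Data Lake load; path and table name are assumed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

countries = (
    spark.read
    .option("header", True)  # first line of the file holds the column names
    .option("sep", "\t")     # the canonical files are tab-separated
    .csv("countries.tsv")    # path to the file in the checked-out repo (assumed)
)

# Overwrite the table so it always mirrors the file in the Git repository.
countries.write.mode("overwrite").saveAsTable("canonical_data.countries")
```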
@Antoine_Quhen has already drafted an Airflow pipeline for loading the countries dataset, and this could easily be expanded to cover the wiki dataset as well:
- https://github.com/wikimedia-research/canonical-data/pull/7
- https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/438
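For readers who haven't worked with Airflow, the scheduling side looks roughly like the generic sketch below. It uses the upstream SparkSubmitOperator with placeholder names; it is not the code from the merge request above, and the airflow-dags repository has its own DAG conventions and helpers:

```python
# Generic Airflow sketch; the DAG id, schedule, script, and table are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="canonical_data_load",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # placeholder frequency; see the post-commit idea below
    catchup=False,
):
    load_countries = SparkSubmitOperator(
        task_id="load_countries",
        application="load_canonical_tsv.py",  # hypothetical wrapper around a load like the one above
        application_args=["--table", "canonical_data.countries"],
    )
```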
Some optional ideas:
- Switch the pyspark script to an HQL script (e.g. with a tmp view):
  CREATE TEMPORARY VIEW tmp_countries USING csv OPTIONS (path 'file:///countries.tsv', sep '\t', header true);
- Format the table as Iceberg (to keep the history of the table, including schema & data); a rough sketch follows this list
- Consider whether the loading can be done as a post-commit hook (rather than at a fixed frequency)
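On the Iceberg idea, here is a rough PySpark sketch (again with an assumed path and table name, and assuming the Spark session is configured with an Iceberg catalog). With the DataFrameWriterV2 API, each load produces a new Iceberg snapshot, so earlier versions of the schema and data stay queryable until the snapshots are expired:

```python
# Sketch of loading the TSV into an Iceberg table; names are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

countries = (
    spark.read
    .option("header", True)
    .option("sep", "\t")
    .csv("countries.tsv")
)

(
    countries
    .writeTo("canonical_data.countries")  # assumed Iceberg table name
    .using("iceberg")
    .createOrReplace()  # replaces the contents; previous snapshots remain until expired
)
```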