Use case
As a developer, I need to convert the original notebook that gathers Wikidata data relevant to Commons images into a Python file, so that it can be run via Airflow.
The notebook https://github.com/cormacparle/commons_wikidata_links/blob/main/gather_data.ipynb gathers Wikidata data relevant to Commons images and writes it to a parquet file.
We need to convert the notebook into a Python file so that it can be run via Airflow, as follows:
- Transform the Jupyter notebook into a script as shown here: Airflow_Coding_Convention#Jupyter_Notebooks
- Remove the use of wmfdata and use Spark directly: Airflow_Coding_Convention#Spark
- Instead of reading from hdfs:/user/gmodena/image_placeholders, call the script that generates the placeholder images for the airflow user in the Data Platform team's Airflow DAG before running the notebook script. The code can then simply read from image_placeholders.
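
The steps above could look roughly like the following sketch. It is not the notebook's actual logic: the argument names, the `gather_data` helper, and the app name are hypothetical, and it assumes that image_placeholders is a relative HDFS path stored as parquet (the ticket does not say which format is used). The real queries from gather_data.ipynb would replace the body of `gather_data`.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line arguments (names are hypothetical)."""
    parser = argparse.ArgumentParser(
        description="Gather Wikidata data relevant to Commons images."
    )
    # Relative path, so it resolves under the home directory of the user
    # running the job instead of hdfs:/user/gmodena/...
    parser.add_argument("--placeholders-path", default="image_placeholders")
    parser.add_argument("--output", required=True,
                        help="Destination for the parquet output")
    return parser.parse_args(argv)


def gather_data(spark, placeholders_path):
    """Stand-in for the notebook's queries.

    Assumption: the placeholder data is stored as parquet.
    """
    return spark.read.parquet(placeholders_path)


def main(argv=None):
    args = parse_args(argv)

    # Plain PySpark session instead of wmfdata; imported here so that
    # argument parsing can be exercised without a Spark installation.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("gather_data").getOrCreate()
    try:
        df = gather_data(spark, args.placeholders_path)
        df.write.mode("overwrite").parquet(args.output)
    finally:
        spark.stop()
```

To run this as a script, `main()` would be called under the usual `if __name__ == "__main__":` guard. Keeping the Spark session creation in `main` and the queries in a separate function means the Airflow task only needs to pass paths as arguments.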