Use case
---
As a developer, I need to transform the original notebook used to gather Wikidata data relevant to Commons images into a Python file, so that it can be run via Airflow
---
The notebook https://github.com/cormacparle/commons_wikidata_links/blob/main/gather_data.ipynb gathers Wikidata data relevant to Commons images and writes it to a parquet file.
We need to transform the notebook into a Python script so it can be run via Airflow, as follows:
* Transform the Jupyter notebook to a script as shown here: [[ https://www.mediawiki.org/wiki/User:CAndrew_(WMF)/Airflow_Coding_Convention#Jupyter_Notebooks | Airflow_Coding_Convention#Jupyter_Notebooks ]]
* Remove use of wmfdata and use Spark directly: [[ https://www.mediawiki.org/wiki/User:CAndrew_(WMF)/Airflow_Coding_Convention#Spark | Airflow_Coding_Convention#Spark ]]
* Instead of reading from `hdfs:/user/gmodena/image_placeholders`, have the Data Platform team's [[ https://gitlab.wikimedia.org/gmodena/platform-airflow-dags/-/blob/multi-project-dags-repo/dags/ima.py#L83-L86 | Airflow DAG ]] run the script that generates the placeholder images for the Airflow user before the notebook script runs. The code then simply reads from `image_placeholders` in the running user's home directory.