User story
---
As a user of Image Suggestions I want to be able to get an image suggestion with a confidence score for an article with a particular Wikidata ID. To make this possible we need to provide linkages between Wikidata IDs and the Commons files relevant to them in the `commonswiki_file` search index.
---
https://github.com/cormacparle/commons_wikidata_links/blob/main/gather_data.ipynb gathers the Wikidata items relevant for Commons search and stores them
https://github.com/cormacparle/commons_wikidata_links/blob/main/push_data_to_elastic.py pushes that data to Elasticsearch, but it was written as a once-off and doesn't handle updating existing data
This ticket is to replicate what `push_data_to_elastic.py` does, but instead of updating Elasticsearch over HTTP we want to write the data to a Hive table and then use Search's Airflow process to update the `commonswiki_file` index
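A minimal sketch of what the Hive-writing half of that job could look like. The database and table names below are illustrative assumptions, not specified in this ticket, and the job would need a Spark session with Hive support:

```python
# Sketch only: database/table names and the source path are assumptions,
# not taken from the ticket. Requires a Spark session with Hive support.


def fq_table(db: str, table: str) -> str:
    """Build a fully-qualified Hive table name like 'db.table'."""
    return f"{db}.{table}"


def write_links_to_hive(spark, source_path: str, db: str, table: str) -> None:
    """Read the gathered wikidata/commons links and write them to a Hive
    table that the Search Airflow process can then pick up."""
    df = spark.read.load(source_path)  # reads parquet by default
    (df.write
       .mode("overwrite")  # full refresh on each run; simplest scheme
       .saveAsTable(fq_table(db, table)))


if __name__ == "__main__":
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("commons-wikidata-links-to-hive")
             .enableHiveSupport()
             .getOrCreate())
    write_links_to_hive(
        spark,
        "commons_files_related_wikidata_items",
        "platform_eng",            # hypothetical database name
        "commons_wikidata_links",  # hypothetical table name
    )
```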
Two other things to note while working on this ticket:
* assuming the wikidata collection code in the notebook has been modified as described in T300045, we can safely change the hardcoded path 'hdfs:/user/cparle/commons_files_related_wikidata_items' to simply 'commons_files_related_wikidata_items', as the data will live under the platform Airflow username
* the script uses spark.read.load() to pull data down locally and run computations on it, which the Data Platform team strongly advises against. Use Spark user-defined functions instead [[ https://www.mediawiki.org/wiki/User:CAndrew_(WMF)/Airflow_Coding_Convention#Computation | Airflow_Coding_Convention#Computation ]]
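To illustrate the second point, a hedged sketch of keeping per-row computation on the executors via a UDF rather than collecting data to the driver. The transform shown (extracting the numeric part of a Wikidata ID) and the column name `wikidata_id` are illustrative assumptions, not the script's actual computation:

```python
# Sketch: run row-level computation inside Spark via a UDF instead of
# pulling the data down locally. The transform and column name here are
# illustrative assumptions, not taken from push_data_to_elastic.py.
from typing import Optional


def wikidata_numeric_id(qid: Optional[str]) -> Optional[int]:
    """Return the numeric part of a Wikidata ID, e.g. 'Q42' -> 42."""
    if qid and qid.startswith("Q") and qid[1:].isdigit():
        return int(qid[1:])
    return None


if __name__ == "__main__":
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import LongType

    spark = SparkSession.builder.getOrCreate()
    numeric_id_udf = F.udf(wikidata_numeric_id, LongType())

    df = spark.read.load("commons_files_related_wikidata_items")
    # The computation runs on the executors; nothing is collected locally.
    df = df.withColumn("numeric_id", numeric_id_udf(F.col("wikidata_id")))
```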