NOTE: This work will need to be done in collaboration with the Data Platform team, as we'll be using their Airflow platform.
Now that tests on an experimental index indicate that pushing Wikidata information into the `weighted_tags` field of the Commons index improves image search, we need to do the same for the production commonswiki_file index.
At the same time we also need to gather all data relevant to image suggestions, and push it to various persistence layers for consumption by clients.
The individual data gathering and pushing tasks have their own tickets. This ticket is to write an Airflow job, running **every week**, that orchestrates them. The steps are as follows:
Part 1
--
* Populate the Commons search index with relevant data from Wikidata
** The notebook that gathers the necessary data and writes it to a parquet file is here: https://github.com/cormacparle/commons_wikidata_links/blob/main/gather_data.ipynb
** Subtask T299408 covers gathering additional data and writing it to the same parquet file
* The script that pushes the data from the parquet file into the Commons search index is here: https://github.com/cormacparle/commons_wikidata_links/blob/main/push_data_to_elastic.py
** It will need to be modified so that it deletes the data pushed by the previous run of the job. Covered by subtask T299787
Part 2
--
* Gather the list of unillustrated articles with their suggestions
** T299789
Part 3
--
* Push suggestion flags to individual search indices
** T299884
Part 4
--
* Push unillustrated articles with their suggestions, suggestion reasons and confidence scores to Cassandra
** T299885
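
As a rough illustration of the ordering the job has to enforce, here is a minimal sketch in plain Python (standard library only, not Airflow code). The task names are hypothetical, and the dependencies are an assumption: Part 1 is taken to run first, Part 2 after it, and Parts 3 and 4 are both taken to depend only on Part 2. The actual dependencies should be confirmed against the subtasks.

```python
from graphlib import TopologicalSorter

# Hypothetical task names. Each task maps to the set of tasks it is
# ASSUMED to depend on; none of these edges are confirmed by the subtasks.
PIPELINE = {
    # Part 1: gather data (notebook, extended by T299408)
    "gather_wikidata_links": set(),
    # Part 1: push parquet data into the Commons search index (T299787)
    "push_to_commons_index": {"gather_wikidata_links"},
    # Part 2: unillustrated articles with their suggestions (T299789)
    "gather_unillustrated_articles": {"push_to_commons_index"},
    # Part 3: suggestion flags to individual search indices (T299884)
    "push_suggestion_flags": {"gather_unillustrated_articles"},
    # Part 4: suggestions, reasons and confidence scores to Cassandra (T299885)
    "push_suggestions_to_cassandra": {"gather_unillustrated_articles"},
}


def execution_order(pipeline):
    """Return one valid run order in which every task follows its dependencies."""
    return list(TopologicalSorter(pipeline).static_order())


if __name__ == "__main__":
    for step in execution_order(PIPELINE):
        print(step)
```

Under these assumptions Parts 3 and 4 do not depend on each other, so the orchestrator could run them in parallel once Part 2 has finished. In the real job each entry would become an Airflow task on a weekly schedule.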