NOTE: This work will need to be done in collaboration with the Data Platform team, as it's their Airflow platform we'll be using.
User story
---
As a developer, I need to schedule and orchestrate the various data gathering and storing tasks for image suggestions, so that image suggestions data is available to users.
---
Now that we're reasonably confident that pushing Wikidata information into the `weighted_tags` field of the Commons index improves image search on an experimental index, we need to do the same for the production `commonswiki_file` index.
At the same time we also need to gather all data relevant to image suggestions and push it to various persistence layers for consumption by clients.
The individual data gathering and pushing tasks have their own tickets - this ticket is to write an Airflow job that orchestrates them and runs **every week**. The steps are as follows:
Part 1
--
* Gather relevant data from Wikidata for Commons files
** Our original notebook, which gathers the necessary data and writes it to a parquet file, is here: https://github.com/cormacparle/commons_wikidata_links/blob/main/gather_data.ipynb
** Subtask T300045 covers transforming it so it can be run by airflow
** Subtask T299408 covers gathering additional data
* Push the data into the `commonswiki_file` search index
** {T299787}
Part 2
--
* Gather list of unillustrated articles with their suggestions
** {T299789}
Part 3
--
* Push suggestions flags to individual search indices
** {T299884}
Part 4
--
* Push unillustrated articles with their suggestions, suggestion reasons and confidence scores to Cassandra
** {T299885}
Part 5
--
* Orchestrate all of the above in Airflow - write an Airflow job that runs the steps from Parts 1-4 **every week**
** This ticket
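The orchestration in Part 5 could be sketched as an Airflow DAG roughly like the following. This is a minimal sketch only: the `dag_id`, task ids, and `bash_command` placeholders are hypothetical (the real tasks live in the subtasks above), and the actual DAG will need to follow the Data Platform team's conventions for their Airflow instance.

```python
# Sketch of a weekly DAG for the image suggestions pipeline.
# All names and commands are placeholders, not the real scripts.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="image_suggestions",   # hypothetical id
    schedule_interval="@weekly",  # the ticket asks for a weekly run
    start_date=datetime(2022, 1, 3),
    catchup=False,
) as dag:
    # Part 1: gather Wikidata data for Commons files, then push it
    # into the production commonswiki_file search index
    gather_commons_data = BashOperator(
        task_id="gather_commons_data", bash_command="echo placeholder")
    push_to_commonswiki_file = BashOperator(
        task_id="push_to_commonswiki_file", bash_command="echo placeholder")

    # Part 2: gather unillustrated articles with their suggestions
    gather_suggestions = BashOperator(
        task_id="gather_suggestions", bash_command="echo placeholder")

    # Part 3: push suggestion flags to individual search indices
    push_suggestion_flags = BashOperator(
        task_id="push_suggestion_flags", bash_command="echo placeholder")

    # Part 4: push suggestions with reasons and confidence scores to Cassandra
    push_to_cassandra = BashOperator(
        task_id="push_to_cassandra", bash_command="echo placeholder")

    # Parts 3 and 4 both consume the output of Part 2 and can run in parallel
    gather_commons_data >> push_to_commonswiki_file >> gather_suggestions
    gather_suggestions >> [push_suggestion_flags, push_to_cassandra]
```

The fan-out at the end assumes Parts 3 and 4 are independent once the suggestions data exists; if one depends on the other, the last line would become a linear chain instead.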