NOTE: This work will need to be done in collaboration with the Data Platform Team, because we'll be using their Generated Data Platform.
Now that we're reasonably confident that pushing Wikidata information into the weighted_tags field improves image search on an experimental Commons index, we need to do the same for the production commonswiki_file index.
At the same time, we also need to gather all data relevant to image suggestions and push it to various persistence layers for consumption by clients.
Part 1
- Gather relevant data from Wikidata for Commons files
- Our original notebook, which gathers the necessary data and writes it to a Parquet file, is here: https://github.com/cormacparle/commons_wikidata_links/blob/main/gather_data.ipynb
- Subtask T299408 covers gathering additional data
- Subtask T300045 refactors the notebook and places it in the Generated Data Platform scaffolding
- Subtask T302095 makes it compliant with Search's update process
- Push the data into the commonswiki_file search index
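To make the shape of the pushed data concrete, here is a minimal sketch of turning gathered rows into weighted_tags values. It assumes the `prefix/name|score` string form with an integer score from 1 to 1000; the prefix, the row layout and the function name `to_weighted_tags` are hypothetical and would need to be agreed with Search. It does not write to the index, since the actual push has to go through Search's update process.

```python
from collections import defaultdict

# Hypothetical prefix, for illustration only; the real one is for Search to decide.
EXAMPLE_PREFIX = "image.linked.from.wikidata.example"


def to_weighted_tags(rows, prefix=EXAMPLE_PREFIX):
    """Group (page_id, wikidata_item_id, score) rows into weighted_tags per page.

    Assumes each tag is serialised as "<prefix>/<item_id>|<score>" with an
    integer score between 1 and 1000 inclusive.
    """
    tags_by_page = defaultdict(list)
    for page_id, item_id, score in rows:
        if not isinstance(score, int) or not 1 <= score <= 1000:
            raise ValueError(f"score out of range for page {page_id}: {score!r}")
        tags_by_page[page_id].append(f"{prefix}/{item_id}|{score}")
    return dict(tags_by_page)


if __name__ == "__main__":
    # Made-up rows standing in for what the notebook writes to Parquet.
    sample = [(101, "Q42", 500), (101, "Q5", 1000), (202, "Q42", 250)]
    for page_id, tags in to_weighted_tags(sample).items():
        print(page_id, tags)
```

Validating scores up front keeps malformed tags out of the update payload rather than relying on the index to reject them.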
Part 2
- Gather a list of unillustrated articles with their suggestions
Part 3
- Push suggestion flags to individual search indices
Part 4
- Push unillustrated articles with their suggestions, suggestion reasons and confidence scores to Cassandra
Part 5
- Write an Airflow job that orchestrates all of the scripts above and runs every week
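As a rough sketch of the ordering the weekly job would need to respect, the steps from Parts 1 to 4 can be written as a dependency graph. This uses only the Python standard library, not Airflow itself. The step names are hypothetical, and the assumption that Parts 3 and 4 both depend only on Part 2 (and that Part 1 is independent of Part 2) would need to be confirmed.

```python
from graphlib import TopologicalSorter

# Each key is a step; its value is the set of steps that must finish first.
# Step names are illustrative and do not refer to existing scripts.
STEPS = {
    "gather_wikidata_for_commons": set(),                             # Part 1
    "push_to_commonswiki_file": {"gather_wikidata_for_commons"},      # Part 1
    "gather_unillustrated_articles": set(),                           # Part 2
    "push_suggestion_flags": {"gather_unillustrated_articles"},       # Part 3
    "push_suggestions_to_cassandra": {"gather_unillustrated_articles"},  # Part 4
}


def run_order(steps=STEPS):
    """Return one valid execution order in which every step follows its prerequisites."""
    return list(TopologicalSorter(steps).static_order())


if __name__ == "__main__":
    for step in run_order():
        print(step)
```

In an Airflow DAG the same edges would become task dependencies, with the weekly schedule set on the DAG.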