NOTE: This work will need to be done in collaboration with the [[https://www.mediawiki.org/wiki/Platform_Engineering_Team|Platform Engineering Team]] (PET), as it is their [[https://www.mediawiki.org/wiki/Platform_Engineering_Team/Data_Value_Stream/Data_Pipeline_Onboarding/|Generated Data Platform]] that we'll be using.
Now that we have good evidence, from an experimental index, that pushing Wikidata information into the `weighted_tags` field of the Commons index improves image search, we need to do the same for the production `commonswiki_file` index.
At the same time we also need to gather all the data relevant to image suggestions and push it to various persistence layers for consumption by clients.
Part 1
--
* Gather relevant data from Wikidata for Commons files
** Our original notebook, which gathers the necessary data and writes it to a Parquet file, is here: https://github.com/cormacparle/commons_wikidata_links/blob/main/gather_data.ipynb
** Subtask T299408 covers gathering additional data
** Subtask T300045 covers transforming it so it can be run by airflow
** Subtask T302095 makes it compliant with Search's update process
* ~~Push the data into the `commonswiki_file` search index~~
** {T299787}
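As a reminder of what the update step produces: CirrusSearch `weighted_tags` values are strings of the form `prefix/name|score`, with an integer score in the 1–1000 range. A minimal sketch of formatting one tag and wrapping it in an Elasticsearch bulk-update action (the tag prefix here is illustrative only, not necessarily the one the production job uses):

```python
# Sketch of formatting weighted_tags values for a CirrusSearch bulk update.
# The "prefix/name|score" format and 1-1000 score range follow the documented
# weighted_tags convention; the tag prefix below is a hypothetical example.
import json


def format_weighted_tag(prefix: str, name: str, score: float) -> str:
    """Encode one tag as "prefix/name|score", scaling a [0, 1] score to 1-1000."""
    scaled = max(1, min(1000, round(score * 1000)))
    return f"{prefix}/{name}|{scaled}"


def build_bulk_update(page_id: int, tags: list[str]) -> list[str]:
    """Build the two NDJSON lines of one Elasticsearch bulk update action."""
    return [
        json.dumps({"update": {"_id": str(page_id)}}),
        json.dumps({"doc": {"weighted_tags": tags}}),
    ]


tag = format_weighted_tag("image.linked.from.wikidata.p18", "exists", 0.87)
print(tag)  # image.linked.from.wikidata.p18/exists|870
```

The real pipeline goes through Search's update process (T302095) rather than writing to Elasticsearch directly; this only shows the shape of the data.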
Part 2
--
* Gather a list of unillustrated articles together with their suggestions
** {T299789}
Part 3
--
* Push suggestions flags to individual search indices
** {T299884}
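Because each wiki has its own search index, the flags need to be batched per wiki before being pushed. A trivial sketch of that grouping (the record fields and the `recommendation.image/exists` tag name are assumptions for illustration):

```python
# Illustrative sketch: group suggestion flags by wiki so each batch can be
# sent to that wiki's own search index. Record fields are hypothetical.
from collections import defaultdict

suggestions = [
    {"wiki": "enwiki", "page_id": 123, "tag": "recommendation.image/exists"},
    {"wiki": "frwiki", "page_id": 456, "tag": "recommendation.image/exists"},
    {"wiki": "enwiki", "page_id": 789, "tag": "recommendation.image/exists"},
]

by_wiki = defaultdict(list)
for rec in suggestions:
    by_wiki[rec["wiki"]].append(rec)

for wiki, batch in sorted(by_wiki.items()):
    print(wiki, len(batch))
# enwiki 2
# frwiki 1
```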
Part 4
--
* Push unillustrated articles, with their suggestions, suggestion reasons, and confidence scores, to Cassandra
** {T299885}
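To make the persistence step concrete, here is a sketch of what the Cassandra write might look like. The keyspace, table, and column names are hypothetical; the real schema is whatever gets agreed in T299885.

```python
# Hypothetical sketch of the Cassandra persistence step. Keyspace, table, and
# column names are illustrative only; the real schema is defined in T299885.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS image_suggestions.suggestions (
    wiki text,
    page_id int,
    image text,
    confidence float,
    reasons set<text>,
    PRIMARY KEY ((wiki, page_id), image)
)
"""


def insert_statement(row: dict) -> tuple[str, tuple]:
    """Build a parameterised CQL INSERT for one suggestion row."""
    cql = (
        "INSERT INTO image_suggestions.suggestions "
        "(wiki, page_id, image, confidence) VALUES (%s, %s, %s, %s)"
    )
    params = (row["wiki"], row["page_id"], row["image"], row["confidence"])
    return cql, params
```

Partitioning by `(wiki, page_id)` would let a client fetch all suggestions for one article in a single read, which matches the expected access pattern.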
Part 5
--
* Orchestrate all of the above scripts in Airflow: write an Airflow job that runs them **every week**
** {T302434}
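The dependency structure the weekly DAG needs to encode follows from the parts above: the Commons/Wikidata gather feeds both the `commonswiki_file` push and the suggestions gather, which in turn feeds the per-wiki flag push and the Cassandra push. A sketch using stdlib `graphlib` just to illustrate a valid run order (task names are hypothetical; in Airflow this would be a `@weekly`-scheduled DAG):

```python
# Sketch of the task dependency graph for the weekly pipeline. Task names are
# hypothetical; graphlib only illustrates that the ordering is a valid DAG.
from graphlib import TopologicalSorter

# task -> set of tasks it depends on
deps = {
    "gather_commons_wikidata_data": set(),
    "push_weighted_tags_to_commonswiki_file": {"gather_commons_wikidata_data"},
    "gather_unillustrated_articles": {"gather_commons_wikidata_data"},
    "push_suggestion_flags_to_wiki_indices": {"gather_unillustrated_articles"},
    "push_suggestions_to_cassandra": {"gather_unillustrated_articles"},
}

order = list(TopologicalSorter(deps).static_order())
print(order)
```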