User story
As a user, I want image search to give me the best possible results, and I want to be able to retrieve good image suggestions for unillustrated articles. At the moment, data on whether an image is used as a lead image in a Wikipedia article is not stored in the Commons search index, and we believe both search and image suggestions can be improved using this information. This ticket adds that information to the dataset that is subsequently used to add Wikidata data to the Commons search index.
See https://github.com/cormacparle/commons_wikidata_links for the current Jupyter notebook that gathers Wikidata data so we can use it to populate weighted_tags in the Commons search index.
Add a field to the output data written to the parquet file, containing the Wikidata items of any wiki articles for which the image is the lead image.
- e.g. if Image_X is the lead image on https://ga.wikipedia.org/Page_Y - i.e. Image_X's title is the value of either the 'page_image_free' page_prop or the 'page_image' page_prop for https://ga.wikipedia.org/Page_Y
- AND https://ga.wikipedia.org/Page_Y has a corresponding wikidata id Q12345
- then for Image_X we'll set the field image.linked.from.wikidata.lead_image/Q12345|<score>
- <score> will be an integer between 0 and 1000, proportional to the importance of all pages with wikidata id Q12345 across all wikis (using incoming links via the pagelinks table to give a measure of "importance")
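The rule above can be sketched in plain Python as follows. This is only an illustration, not the notebook's code: the function name, the input dictionaries, and the (wiki, title) page keys are hypothetical, and the linear scaling against the most-linked Wikidata item is an assumption, since the ticket only says the score is proportional to importance.

```python
from collections import defaultdict

TAG_PREFIX = "image.linked.from.wikidata.lead_image"
MAX_SCORE = 1000


def lead_image_tags(lead_images, page_to_item, incoming_links):
    """Build lead-image tags per image.

    lead_images:    {(wiki, page_title): image_title} taken from the
                    'page_image_free' / 'page_image' page_props
    page_to_item:   {(wiki, page_title): wikidata_id}
    incoming_links: {(wiki, page_title): incoming link count from pagelinks}
    All three structures are hypothetical stand-ins for the real tables.
    """
    # Sum incoming links over every page sharing a Wikidata ID, across all wikis.
    links_per_item = defaultdict(int)
    for page, item in page_to_item.items():
        links_per_item[item] += incoming_links.get(page, 0)

    # Assumption: scale linearly so the most-linked item gets MAX_SCORE.
    max_links = max(links_per_item.values(), default=0)

    tags = defaultdict(set)
    for page, image in lead_images.items():
        item = page_to_item.get(page)
        if item is None:
            continue  # page has no Wikidata ID, so there is nothing to tag
        score = round(MAX_SCORE * links_per_item[item] / max_links) if max_links else 0
        tags[image].add(f"{TAG_PREFIX}/{item}|{score}")
    return {image: sorted(values) for image, values in tags.items()}


example = lead_image_tags(
    lead_images={("gawiki", "Page_Y"): "Image_X", ("enwiki", "Page_Z"): "Image_W"},
    page_to_item={
        ("gawiki", "Page_Y"): "Q12345",
        ("enwiki", "Page_Y_en"): "Q12345",
        ("enwiki", "Page_Z"): "Q2",
    },
    incoming_links={
        ("gawiki", "Page_Y"): 10,
        ("enwiki", "Page_Y_en"): 40,
        ("enwiki", "Page_Z"): 200,
    },
)
print(example)
```

In this made-up example Q12345 collects 50 incoming links across two wikis against 200 for Q2, so Image_X gets image.linked.from.wikidata.lead_image/Q12345|250. A logarithmic scale may suit heavily skewed link counts better; the ticket leaves that choice open.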
The extra search data should not be added to any image that is excluded by the current Image Suggestions Algorithm (the existing notebook code already covers this; it is noted here for completeness).