The PET would like to have one data source for the image-suggestion-api, rather than the two we have at the moment. Our options are to incorporate the search signal data into the output from Research's data pipeline, or to incorporate Research's data into the search indices on Commons.
This is a proposal for an experiment to investigate whether doing the latter will give us acceptable results for image recommendations.
At the moment, the Image Matching Algorithm returns the following as suggested images for an article:
- the image for the Wikidata item corresponding to the article
- images from the Commons category associated with the Wikidata item corresponding to the article
- lead images from articles in other wikis that correspond to the same Wikidata item
I propose that we inject the above three pieces of data as property-value pairs into the weighted_tags field:
- image.linked.from.wikidata.P18 can store any Wikidata item for which this image is the primary image
- image.linked.from.wikidata.P373 can store the Wikidata item for any Commons category that the image belongs to
- image.linked.from.wikidata.sitelink can store Wikidata item IDs for any wiki page this image is included on
So, for example, this image of the Mona Lisa will gain the following elements in its weighted_tags array:
image.linked.from.wikidata.P18/Q12418
image.linked.from.wikidata.P18/Q13369744
image.linked.from.wikidata.sitelink/Q12418
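As a rough illustration of how those elements could be assembled, here is a minimal Python sketch. The function name `build_weighted_tags` and its parameters are hypothetical and not part of any existing pipeline; it only reproduces the `prefix/Q-id` shape shown in the example above, and it assumes no per-tag weight is attached.

```python
from typing import Iterable, List

# Common prefix for the three proposed tag families.
TAG_PREFIX = "image.linked.from.wikidata"


def build_weighted_tags(
    p18_items: Iterable[str],
    p373_items: Iterable[str],
    sitelink_items: Iterable[str],
) -> List[str]:
    """Build weighted_tags entries for one image (hypothetical helper).

    Each argument is a collection of Wikidata item IDs (e.g. "Q12418")
    linked to the image via the corresponding route.
    """
    tags: List[str] = []
    for suffix, items in (
        ("P18", p18_items),
        ("P373", p373_items),
        ("sitelink", sitelink_items),
    ):
        for qid in items:
            tags.append(f"{TAG_PREFIX}.{suffix}/{qid}")
    return tags


# The Mona Lisa example from the text: two P18 links, one sitelink.
mona_lisa_tags = build_weighted_tags(
    p18_items=["Q12418", "Q13369744"],
    p373_items=[],
    sitelink_items=["Q12418"],
)
for tag in mona_lisa_tags:
    print(tag)
```

How the item IDs for each family are actually gathered (from a Wikidata dump, from Research's pipeline output, or elsewhere) is left open here.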
The experiment:
- Create a new search index (on relforge) based on a Commons dump and enhanced with the data described above (T286562)
- Incorporate searching for the new data in Search/ASTQueryBuilder/WordsQueryNodeHandler.php or its successor, using a new search profile (T286563)
- Tune the new search profile based on a training dataset extracted from our labeled data
- Calculate a balanced accuracy score for the labeled data we already have from the existing image-recommendations API via image-recommendation-test
- Run equivalent searches on the new search index, and calculate the balanced accuracy score for that
- Iterate on tuning if necessary
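For reference, balanced accuracy is the mean of the true positive rate and the true negative rate. The sketch below shows one way to compute it in Python. It assumes that each labeled (article, image) pair carries a boolean human rating, and that a system "predicts" a pair as good when it returns that image for that article; both the data shape and the function name are assumptions for illustration, not the format of the existing labeled data.

```python
from typing import Sequence


def balanced_accuracy(labels: Sequence[bool], predictions: Sequence[bool]) -> float:
    """Return (true positive rate + true negative rate) / 2.

    labels[i] is the human rating for pair i (True = good match);
    predictions[i] is whether the system suggested that pair.
    """
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")

    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    tn = sum(1 for y, p in zip(labels, predictions) if not y and not p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative label")

    true_positive_rate = tp / (tp + fn)
    true_negative_rate = tn / (tn + fp)
    return (true_positive_rate + true_negative_rate) / 2


# Toy data only: three good pairs (one missed) and one bad pair (correctly skipped).
toy_labels = [True, True, True, False]
toy_predictions = [True, True, False, False]
toy_score = balanced_accuracy(toy_labels, toy_predictions)
```

Running the same function over the output of the existing API and over the new search index would give the two scores to compare. Unlike plain accuracy, this measure is not dominated by whichever class is more common in the labeled data.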
Expected outcome:
If the balanced accuracy score using the new search index is better than, or at least no worse than, the score from the existing image-recommendations API, then we can make a case for injecting the new data into the production Commons indices (probably using events) and for using the new search profile in production.