The PET would like to have one data source for the image-suggestion-api, rather than the two we have at the minute. Our options are to incorporate the search signal data into the output from research's data pipeline, or to incorporate research's data into the search indices on commons.
This is a proposal for how we might do the latter, just for the purposes of discussion.
At the moment the Image Matching Algorithm returns the following as suggested images for an article:
1. the image for the wikidata item corresponding to the article
2. images from the commons category associated with the wikidata item corresponding to the article
3. lead images from articles in other wikis that correspond to the same wikidata item
1 is given a confidence score of "high", while 2 and 3 are given a confidence score of "medium". According to our (limited) user testing, a score of "high" corresponds to an 85% likelihood that the recommendation is good, while a score of "medium" corresponds to a 58% likelihood.

I propose that we inject the above three pieces of data as property-value pairs into the `statement_keywords` field:
* `isImageForItem` can store the wikidata item for which this image is the primary image
* `isInCommonsCategory` can store the wikidata item whose associated commons category this image belongs to
* `isLeadImageFor` can store the wikidata item of the wiki page this image is a lead image on
So, for example, [[ https://commons.wikimedia.org/wiki/File:Mona_Lisa,_by_Leonardo_da_Vinci,_from_C2RMF_retouched.jpg | this image of the Mona Lisa ]] would gain the following elements in its `statement_keywords` array:
```
isImageForItem=Q12418
isInCommonsCategory=Q13369744
isLeadImageFor=Q12418
```
We can ask the community to create new wikidata properties `isImageForItem`, `isInCommonsCategory` and `isLeadImageFor`, each taking a wikidata Q-item as a value, **OR** we can simply use our own set of ids - as these values are generated rather than user-editable, they don't actually need to be stored in wikidata.

We'll store these in the commons index as key-value pairs in `statement_keywords`. Injection can happen initially via a data pipeline, which ultimately could be replaced by events being fired when the wikidata item is edited. The SD team (or whoever is writing mediasearch query code at that stage) can then incorporate the new data into `Search/ASTQueryBuilder/WordsQueryNodeHandler.php` or its successor.
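As a rough sketch of how these pairs might be generated, the following Python builds `statement_keywords` entries from one image's suggestion data. The input shape and field names here are hypothetical illustrations, not an agreed schema:

```python
def build_statement_keywords(suggestion):
    """Build `statement_keywords` key=value strings for one image.

    `suggestion` is assumed (hypothetically) to be a dict like:
      {
        "image_for_item": "Q12418",            # item this image is the primary image for
        "commons_category_item": "Q13369744",  # item whose commons category holds the image
        "lead_image_for": ["Q12418"],          # items whose articles use this as lead image
      }
    Any of the keys may be absent.
    """
    keywords = []
    if suggestion.get("image_for_item"):
        keywords.append(f"isImageForItem={suggestion['image_for_item']}")
    if suggestion.get("commons_category_item"):
        keywords.append(f"isInCommonsCategory={suggestion['commons_category_item']}")
    for qid in suggestion.get("lead_image_for", []):
        keywords.append(f"isLeadImageFor={qid}")
    return keywords


# The Mona Lisa example above:
print(build_statement_keywords({
    "image_for_item": "Q12418",
    "commons_category_item": "Q13369744",
    "lead_image_for": ["Q12418"],
}))
# → ['isImageForItem=Q12418', 'isInCommonsCategory=Q13369744', 'isLeadImageFor=Q12418']
```

Whether the keys come from new community-created wikidata properties or our own generated ids only changes the strings on the left-hand side of the `=`.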
When it's running we can tune the search scores based on the training data we have collected, and we ought to run another set of user testing for the image-suggestion-api **and** media-search itself, to make sure we haven't made either worse.

Proof of concept:
---
To do this fairly quickly and see if it works, I'd suggest the following approach:
1. Inject the data into the commons index initially via a data pipeline - simply query the snapshots and write the data directly to the search index
2. Incorporate searching for the new data into `Search/ASTQueryBuilder/WordsQueryNodeHandler.php` or its successor, using a new search profile
3. Tune the new search profile based on some tests on the labeled image data we originally gathered for `media-search-signal-test`
4. Compare the precision/recall/accuracy of searches over the labeled data we gathered for `image-recommendation-test` with the data returned from the image-recommendation-api
5. If the results are better, or at least no worse, consider what steps we might take to make this sustainable in the longer term (including updating the image-recommendation-api to use mediasearch as its single data source, possibly creating new wikidata properties for `isImageForItem` etc., and replacing the data pipeline with an event-driven model); otherwise, clean the new data out of the index
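Step 1 above could be sketched as below: read rows derived from the snapshots (faked here as a plain list) and build an Elasticsearch `_bulk` payload that appends the new key=value pairs to each document's `statement_keywords`. The index name, document-id scheme and row shape are all assumptions for illustration; the real pipeline would need to follow whatever conventions the commons search cluster actually uses:

```python
import json

def bulk_update_lines(rows, index="commonswiki_file"):
    """Yield NDJSON lines for an Elasticsearch _bulk request.

    Each row is assumed (hypothetically) to look like:
      {"page_id": 1234, "keywords": ["isImageForItem=Q12418", ...]}
    """
    for row in rows:
        # Action line: partial update of an existing document by page id.
        yield json.dumps({"update": {"_index": index, "_id": row["page_id"]}})
        # Use a Painless script so any existing statement_keywords are kept.
        yield json.dumps({
            "script": {
                "source": (
                    "if (ctx._source.statement_keywords == null) "
                    "{ ctx._source.statement_keywords = params.kw } "
                    "else { ctx._source.statement_keywords.addAll(params.kw) }"
                ),
                "params": {"kw": row["keywords"]},
            }
        })


rows = [{"page_id": 1234, "keywords": ["isImageForItem=Q12418"]}]
payload = "\n".join(bulk_update_lines(rows)) + "\n"
# POST `payload` to the search cluster's /_bulk endpoint
```

Because the updates are scripted appends rather than full-document writes, re-running the pipeline would duplicate entries; a real version would want to clear or de-duplicate the injected keys first.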