The PET would like to have one data source for the image-suggestion-api, rather than the two we have at the moment. Our options are to incorporate the search signal data into the output from Research's data pipeline, or to incorporate Research's data into the search indices on commons.
This is a proposal, just for the purposes of discussion, for an experiment to investigate whether doing the latter will give us acceptable results for image recommendations.
At the moment the Image Matching Algorithm returns the following as suggested images for an article:
1. the image for the wikidata item corresponding to the article
2. images from the commons category associated with the wikidata item corresponding to the article
3. lead images from articles in other wikis that correspond to the same wikidata item
I propose that we inject the above three pieces of data as property-value pairs into the `weighted_fields` field:
* `isImageForItem` can store the wikidata item for which this image is the primary image
* `isInCommonsCategory` can store the wikidata item for the commons category that the image belongs to
* `isLeadImageFor` can store the wikidata item of the wiki page this image is a lead image on
So, for example, [[ https://commons.wikimedia.org/wiki/File:Mona_Lisa,_by_Leonardo_da_Vinci,_from_C2RMF_retouched.jpg | this image of the Mona Lisa ]] will gain the following elements in its `weighted_fields` array:
```
wikidata.isImageForItem=/Q12418
wikidata.isInCommonsCategory=/Q13369744
wikidata.isLeadImageFor=/Q12418
```
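Concretely, these pairs could be derived from a wikidata snapshot. Below is a minimal sketch in Python, assuming the live SPARQL endpoint stands in for the snapshot, and showing only the `isImageForItem` case (derived from wikidata's P18 "image" property); the function name is hypothetical.
```
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

# P18 is wikidata's "image" property; each row links an item to its image file.
QUERY = "SELECT ?item ?image WHERE { ?item wdt:P18 ?image . } LIMIT 100"

def image_for_item_pairs():
    """Yield (file_name, weighted_fields_value) pairs for isImageForItem."""
    resp = requests.get(
        SPARQL_ENDPOINT,
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "image-suggestion-experiment/0.1"},
    )
    resp.raise_for_status()
    for row in resp.json()["results"]["bindings"]:
        # Values are URIs; keep only the trailing segment.
        qid = row["item"]["value"].rsplit("/", 1)[-1]         # e.g. "Q12418"
        file_name = row["image"]["value"].rsplit("/", 1)[-1]  # URL-encoded file name
        yield file_name, f"wikidata.isImageForItem=/{qid}"
```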
The experiment:
---
To do this fairly quickly and see if it works, I'd suggest the following approach:
1. Create a new search index (on relforge) based on a commons dump, enhanced with the data described above - e.g. query a wikidata snapshot and write the data directly to the new index (see the sketches after this list)
2. Incorporate searching for the new data into `Search/ASTQueryBuilder/WordsQueryNodeHandler.php` or its successor, using a new search profile
3. Tune the new search profile based on a training dataset extracted from our labeled data
4. Calculate a balanced accuracy score for the labeled data we already have from the existing image-recommendations API via [[ https://image-recommendation-test.toolforge.org/ | image-recommendation-test ]]
5. Run equivalent searches on the new search index, and calculate the balanced accuracy score for that (scored as in the sketch after this list)
6. Iterate on tuning if necessary
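For steps 1 and 2, here is a minimal sketch of what injecting the derived pairs into the relforge index and querying them back might look like, using elasticsearch-py. The host, index name, and the painless append script are assumptions for illustration; a real implementation would go through the CirrusSearch/relforge tooling rather than raw elasticsearch calls.
```
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://relforge.example:9200")  # hypothetical host
INDEX = "commonswiki_file_experiment"               # hypothetical index name

def inject(pairs):
    """Append each derived value to a file page's weighted_fields array.

    Assumes weighted_fields already exists as an array on each document.
    """
    actions = (
        {
            "_op_type": "update",
            "_index": INDEX,
            "_id": doc_id,
            "script": {
                "source": "ctx._source.weighted_fields.add(params.value)",
                "params": {"value": value},
            },
        }
        for doc_id, value in pairs
    )
    helpers.bulk(es, actions)

def search(qid):
    """Find candidate images for an article's wikidata item via the new data."""
    return es.search(
        index=INDEX,
        query={
            "dis_max": {
                "queries": [
                    {"match": {"weighted_fields": f"wikidata.isImageForItem=/{qid}"}},
                    {"match": {"weighted_fields": f"wikidata.isInCommonsCategory=/{qid}"}},
                    {"match": {"weighted_fields": f"wikidata.isLeadImageFor=/{qid}"}},
                ]
            }
        },
    )
```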
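For steps 4 and 5, balanced accuracy is the mean of sensitivity (recall on good matches) and specificity (recall on bad matches), so a source can't score well just by suggesting almost everything. A minimal sketch, assuming the labeled data is a list of (article qid, file name, judgement) triples and that `suggest` is a hypothetical callable wrapping either the existing API or the new index:
```
from sklearn.metrics import balanced_accuracy_score

def balanced_accuracy(labeled, suggest):
    """Score a suggestion source against human judgements.

    labeled: iterable of (qid, file_name, label), label True if reviewers
             judged the image a good match for the article.
    suggest: callable returning the set of file names a source proposes
             for a given qid.
    """
    y_true, y_pred = [], []
    for qid, file_name, label in labeled:
        y_true.append(label)
        y_pred.append(file_name in suggest(qid))
    return balanced_accuracy_score(y_true, y_pred)

# Usage: compute one score per source and compare.
# api_score = balanced_accuracy(labeled, suggest=query_existing_api)
# index_score = balanced_accuracy(labeled, suggest=query_new_index)
```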
Expected outcome:
---
If the balanced accuracy score using the new search index is better than, or at least no worse than, the score from the existing image-recommendations API, then we can make a case for injecting the new data into the production commons indices (probably using events), and for using the new search profile in production.