The PET would like to have one data source for the image-suggestion-api, rather than the two we have at the moment. Our options are to incorporate the search signal data into the output from Research's data pipeline, or to incorporate Research's data into the search indices on commons.
This is a proposal, just for the purposes of discussion, for an experiment to investigate whether doing the latter will give us acceptable results for image recommendations.
At the moment the Image Matching Algorithm returns the following as suggested images for an article:
1. the image for the wikidata item corresponding to the article
2. images from the commons category associated with the wikidata item corresponding to the article
3. lead images from articles in other wikis that correspond to the same wikidata item
I propose that we inject the above three pieces of data as property-value pairs into the `weighted_fields` field:
* `isImageForItem` can store the wikidata item for which this image is the primary image
* `isInCommonsCategory` can store the wikidata item for the commons category that the image belongs to
* `isLeadImageFor` can store the wikidata item of the wiki page this image is a lead image on
So, for example, [[ https://commons.wikimedia.org/wiki/File:Mona_Lisa,_by_Leonardo_da_Vinci,_from_C2RMF_retouched.jpg | this image of the Mona Lisa ]] will gain the following elements in its `weighted_fields` array:
```
wikidata.isImageForItem=/Q12418
wikidata.isInCommonsCategory=/Q13369744
wikidata.isLeadImageFor=/Q12418
```
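Concretely, these pairs could be derived from a wikidata snapshot. Below is a minimal sketch in Python, assuming the live SPARQL endpoint stands in for the snapshot, and showing only the `isImageForItem` case (derived from wikidata's P18 "image" property); the function name is hypothetical.
```
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

# P18 is wikidata's "image" property; each row links an item to its image file.
QUERY = "SELECT ?item ?image WHERE { ?item wdt:P18 ?image . } LIMIT 100"

def image_for_item_pairs():
    """Yield (file_name, weighted_fields_value) pairs for isImageForItem."""
    resp = requests.get(
        SPARQL_ENDPOINT,
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "image-suggestion-experiment/0.1"},
    )
    resp.raise_for_status()
    for row in resp.json()["results"]["bindings"]:
        # Values are URIs; keep only the trailing segment.
        qid = row["item"]["value"].rsplit("/", 1)[-1]         # e.g. "Q12418"
        file_name = row["image"]["value"].rsplit("/", 1)[-1]  # URL-encoded file name
        yield file_name, f"wikidata.isImageForItem=/{qid}"
```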
The experiment:
---
To do this fairly quickly and see if it works, I'd suggest the following approach:
1. Create a new search index (on relforge) based on a commons dump, enhanced with the data described above - e.g. query a wikidata snapshot and write the data directly to the new index (see the sketches after this list)
2. Incorporate searching for the new data into `Search/ASTQueryBuilder/WordsQueryNodeHandler.php` or its successor, using a new search profile
3. Tune the new search profile based on a training dataset extracted from our labeled data
4. Calculate a balanced accuracy score for the labeled data we already have from the existing image-recommendations API via [[ https://image-recommendation-test.toolforge.org/ | image-recommendation-test ]]
5. Run equivalent searches on the new search index, and calculate the balanced accuracy score for that (scored as in the sketch after this list)
6. Iterate on tuning if necessary
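For steps 1 and 2, here is a minimal sketch of what injecting the derived pairs into the relforge index and querying them back might look like, using elasticsearch-py. The host, index name, and the painless append script are assumptions for illustration; a real implementation would go through the CirrusSearch/relforge tooling rather than raw elasticsearch calls.
```
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://relforge.example:9200")  # hypothetical host
INDEX = "commonswiki_file_experiment"               # hypothetical index name

def inject(pairs):
    """Append each derived value to a file page's weighted_fields array.

    Assumes weighted_fields already exists as an array on each document.
    """
    actions = (
        {
            "_op_type": "update",
            "_index": INDEX,
            "_id": doc_id,
            "script": {
                "source": "ctx._source.weighted_fields.add(params.value)",
                "params": {"value": value},
            },
        }
        for doc_id, value in pairs
    )
    helpers.bulk(es, actions)

def search(qid):
    """Find candidate images for an article's wikidata item via the new data."""
    return es.search(
        index=INDEX,
        query={
            "dis_max": {
                "queries": [
                    {"match": {"weighted_fields": f"wikidata.isImageForItem=/{qid}"}},
                    {"match": {"weighted_fields": f"wikidata.isInCommonsCategory=/{qid}"}},
                    {"match": {"weighted_fields": f"wikidata.isLeadImageFor=/{qid}"}},
                ]
            }
        },
    )
```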
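For steps 4 and 5, balanced accuracy is the mean of sensitivity (recall on good matches) and specificity (recall on bad matches), so a source can't score well just by suggesting almost everything. A minimal sketch, assuming the labeled data is a list of (article qid, file name, judgement) triples and that `suggest` is a hypothetical callable wrapping either the existing API or the new index:
```
from sklearn.metrics import balanced_accuracy_score

def balanced_accuracy(labeled, suggest):
    """Score a suggestion source against human judgements.

    labeled: iterable of (qid, file_name, label), label True if reviewers
             judged the image a good match for the article.
    suggest: callable returning the set of file names a source proposes
             for a given qid.
    """
    y_true, y_pred = [], []
    for qid, file_name, label in labeled:
        y_true.append(label)
        y_pred.append(file_name in suggest(qid))
    return balanced_accuracy_score(y_true, y_pred)

# Usage: compute one score per source and compare.
# api_score = balanced_accuracy(labeled, suggest=query_existing_api)
# index_score = balanced_accuracy(labeled, suggest=query_new_index)
```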
Expected outcome:
---
If the balanced accuracy score using the new search index is better than, or at least no worse than, the score from the existing image-recommendations API, then we can make a case for injecting the new data into the production commons indices (probably using events), and for using the new search profile in production.