Change Details

The PET would like to have one data source for the image-suggestion-api, rather than the two we have at the moment. Our options are to incorporate the search signal data into the output from research's data pipeline, or to incorporate research's data into the search indices on Commons.

This is a proposal for an experiment to investigate whether doing the latter will give us acceptable results for image-recommendations.

At the moment, the Image Matching Algorithm returns the following as suggested images for an article:
1. the image for the Wikidata item corresponding to the article
2. images from the Commons category associated with the Wikidata item corresponding to the article
3. lead images from articles in other wikis that correspond to the same Wikidata item

I propose that we inject the above three pieces of data as property-value pairs into the [[https://wikitech.wikimedia.org/wiki/Search/WeightedTags|weighted tags]] field:
* `isImageForItem` can store the Wikidata item for which this image is the primary image
* `isInCommonsCategory` can store the Wikidata item for the Commons category that the image belongs to
* `isLeadImageFor` can store the Wikidata item URL of the wiki page this image is a lead image on

So for example, [[ https://commons.wikimedia.org/wiki/File:Mona_Lisa,_by_Leonardo_da_Vinci,_from_C2RMF_retouched.jpg | this image of the Mona Lisa ]] will gain the following elements in its `weighted_tags` array:
```
wikidata.isImageForItem/Q12418
wikidata.isInCommonsCategory/Q13369744
wikidata.isLeadImageFor/Q12418
```

The experiment:
---
1. Create a new search index (on relforge) based on a Commons dump and enhanced with the data described above (T286562)
2. Incorporate searching for the new data into `Search/ASTQueryBuiler/WordsQueryNodeHandler.php` or its successor, using a new search profile (T286563)
3. Tune the new search profile based on a training dataset extracted from our labeled data
4. Calculate a balanced accuracy score for the labeled data we already have from the existing image-recommendations API via [[ https://image-recommendation-test.toolforge.org/ | image-recommendation-test ]]
5. Run equivalent searches on the new search index, and calculate the balanced accuracy score for that
6. Iterate on tuning if necessary

Expected outcome:
---
If the balanced accuracy score using the new search index is better than, or at least no worse than, the score from the existing image-recommendations API, then we can make a case for injecting the new data into the production Commons indices (probably using events) and using the new search profile in production.
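The `prefix/item` shape of the proposed tags can be illustrated with a short sketch. This is only an illustration under stated assumptions, not the indexing code: the helper names `weighted_tag` and `tags_for_image` and the input dictionary layout are hypothetical, and the sketch only reproduces the string format shown in the example for this version of the description.

```python
from typing import Dict, List


def weighted_tag(prefix: str, item_id: str) -> str:
    """Join a tag prefix and a Wikidata item ID as 'prefix/Qid' (hypothetical helper)."""
    # Accept only IDs of the form Q<digits>, as in the example in the description.
    if not (item_id.startswith("Q") and item_id[1:].isdigit()):
        raise ValueError(f"not a Wikidata item ID: {item_id!r}")
    return f"{prefix}/{item_id}"


def tags_for_image(links: Dict[str, List[str]]) -> List[str]:
    """Flatten a prefix -> item IDs mapping into a list of tag strings."""
    return [
        weighted_tag(prefix, item_id)
        for prefix, item_ids in links.items()
        for item_id in item_ids
    ]


# Values taken from the Mona Lisa example in the description.
mona_lisa_links = {
    "wikidata.isImageForItem": ["Q12418"],
    "wikidata.isInCommonsCategory": ["Q13369744"],
    "wikidata.isLeadImageFor": ["Q12418"],
}
mona_lisa_tags = tags_for_image(mona_lisa_links)
```

Each prefix maps to a list so that an image can carry several items under the same prefix.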

The PET would like to have one data source for the image-suggestion-api, rather than the two we have at the moment. Our options are to incorporate the search signal data into the output from research's data pipeline, or to incorporate research's data into the search indices on Commons.

This is a proposal for an experiment to investigate whether doing the latter will give us acceptable results for image-recommendations.

At the moment, the Image Matching Algorithm returns the following as suggested images for an article:
1. the image for the Wikidata item corresponding to the article
2. images from the Commons category associated with the Wikidata item corresponding to the article
3. lead images from articles in other wikis that correspond to the same Wikidata item

I propose that we inject the above three pieces of data as property-value pairs into the [[https://wikitech.wikimedia.org/wiki/Search/WeightedTags|weighted tags]] field:
* `image.linked.from.wikidata.P18` can store any Wikidata item for which this image is the primary image
* `image.linked.from.wikidata.P373` can store the Wikidata item for any Commons category that the image belongs to
* `image.linked.from.wikidata.sitelink` can store Wikidata item IDs for any wiki page this image is included on

So for example, [[ https://commons.wikimedia.org/wiki/File:Mona_Lisa,_by_Leonardo_da_Vinci,_from_C2RMF_retouched.jpg | this image of the Mona Lisa ]] will gain the following elements in its `weighted_tags` array:
```
image.linked.from.wikidata.P18/Q12418
image.linked.from.wikidata.P373/Q13369744
image.linked.from.wikidata.sitelink/Q12418
```

The experiment:
---
1. Create a new search index (on relforge) based on a Commons dump and enhanced with the data described above (T286562)
2. Incorporate searching for the new data into `Search/ASTQueryBuiler/WordsQueryNodeHandler.php` or its successor, using a new search profile (T286563)
3. Tune the new search profile based on a training dataset extracted from our labeled data
4. Calculate a balanced accuracy score for the labeled data we already have from the existing image-recommendations API via [[ https://image-recommendation-test.toolforge.org/ | image-recommendation-test ]]
5. Run equivalent searches on the new search index, and calculate the balanced accuracy score for that
6. Iterate on tuning if necessary

Expected outcome:
---
If the balanced accuracy score using the new search index is better than, or at least no worse than, the score from the existing image-recommendations API, then we can make a case for injecting the new data into the production Commons indices (probably using events) and using the new search profile in production.
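The comparison in the expected outcome can be sketched as follows. Balanced accuracy is the mean of the true positive rate and the true negative rate. Everything else here is an assumption made for illustration: the function name, the idea that each labeled (article, image) pair yields a boolean "suggested or not" per system, and the toy labels and predictions, which are not real experiment data.

```python
from typing import Sequence


def balanced_accuracy(labels: Sequence[bool], predictions: Sequence[bool]) -> float:
    """Return (true positive rate + true negative rate) / 2."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    tn = sum(1 for y, p in zip(labels, predictions) if not y and not p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative label")
    return (tp / (tp + fn) + tn / (tn + fp)) / 2


# Toy data (made up): True means the image was labeled as a good match.
labels = [True, True, True, True, False, False, False, False]
# Hypothetical per-pair outcomes: did each system suggest the image?
api_predictions = [True, True, True, False, True, True, False, False]
search_predictions = [True, True, True, False, True, False, False, False]

api_score = balanced_accuracy(labels, api_predictions)        # 0.625
search_score = balanced_accuracy(labels, search_predictions)  # 0.75

# The proposal's success criterion: better than, or at least no worse than, the API.
search_is_acceptable = search_score >= api_score
```

Balanced accuracy weights good and bad matches equally even when the labeled set contains many more of one class than the other.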
