Change Details


To experiment with integrating Image Matching Algorithm data into the commons search index, we need to create a new index on relforge.

For an example of copying an index from production to relforge, [[ https://phabricator.wikimedia.org/P16419 | see here ]].

For an example of augmenting a wiki dump with extra data and writing the whole lot to elastic, [[ https://gerrit.wikimedia.org/r/plugins/gitiles/wikimedia/discovery/relevanceForge/+/refs/heads/master/other_tools/augmentdump.py | see here ]].

The new data that we want in the dump is three new sets of property-value pairs, plus a score, in the `weighted_tags` field:

* `image.linked.from.wikidata.P18` will store the wikidata item ids from which the image is linked via the P18 (image) property
** e.g. if the value of the P18 (image) property for wikidata items Q144 and Q38280 is set to **Image_X**
** then for **Image_X** we'll set the fields `image.linked.from.wikidata.P18/Q144` and `image.linked.from.wikidata.P18/Q38280`
* `image.linked.from.wikidata.P373` will store the ids of any wikidata item that is linked via P373 (commons category) to any commons category that the image belongs to
** e.g. if wikidata item Q144 has its property P373 (commons category) set to `Dogs`
** AND **Image_X** is in the commons category `Dogs`
** then for **Image_X** we'll set the field `image.linked.from.wikidata.P373/Q144|<score>`
** `<score>` will be an integer between 0 and 1000, proportional to the **inverse** of the number of images in the category (because a category with fewer images is more specific, and therefore a better signal)
* `image.linked.from.wikidata.sitelink` will store the wikidata items of any wiki article the image is used in
** e.g. if **Image_X** is used on `https://ga.wikipedia.org/Page_Y`
** AND `https://ga.wikipedia.org/Page_Y` has a corresponding wikidata id `Q12345`
** then for **Image_X** we'll set the field `image.linked.from.wikidata.sitelink/Q12345|<score>`
** `<score>` will be an integer between 0 and 1000, proportional to the importance of all pages with Q12345 across all wikis

The extra search data should **not** be added to any image that is excluded by the current Image Suggestions Algorithm, namely:

* images in any of the "placeholder images" categories (or their subcategories) on commons
* images that are already used on a large number of pages on any wiki (as they are likely to be placeholders)
* images whose titles contain strings that indicate they are likely to be placeholders

For more exact definitions of the above, see [[ https://github.com/mirrys/ImageMatching/blob/main/algorithm.ipynb | the Image Suggestions Algorithm code ]].
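
To make the tag format concrete, here is a minimal Python sketch of how the `weighted_tags` entries for one image might be assembled. It is an illustration under assumptions, not the relforge implementation: the function names (`category_score`, `weighted_tag`, `build_weighted_tags`), the input shapes, and the exact scoring formula (`1000 / number of images`, rounded) are all hypothetical. The sitelink score is taken as a precomputed input because "importance" is not defined here, and the sitelink prefix follows the tag name given in the list, `image.linked.from.wikidata.sitelink`.

```python
from typing import Dict, Iterable, List, Optional

MAX_SCORE = 1000  # upper bound for <score> given in the description

P18_PREFIX = "image.linked.from.wikidata.P18"
P373_PREFIX = "image.linked.from.wikidata.P373"
SITELINK_PREFIX = "image.linked.from.wikidata.sitelink"


def category_score(images_in_category: int) -> int:
    """Assumed formula: inverse of the category size, scaled to 0..1000."""
    if images_in_category < 1:
        raise ValueError("a category holding the image has at least one image")
    return round(MAX_SCORE / images_in_category)


def weighted_tag(prefix: str, item_id: str, score: Optional[int] = None) -> str:
    """Format one entry as `prefix/Qid`, or `prefix/Qid|score` when scored."""
    if score is None:
        return f"{prefix}/{item_id}"
    if not 0 <= score <= MAX_SCORE:
        raise ValueError(f"score {score} is outside 0..{MAX_SCORE}")
    return f"{prefix}/{item_id}|{score}"


def build_weighted_tags(
    p18_items: Iterable[str],
    p373_category_sizes: Dict[str, int],
    sitelink_scores: Dict[str, int],
    is_placeholder: bool = False,
) -> List[str]:
    """Collect all new weighted_tags entries for a single image.

    p18_items: wikidata ids whose P18 points at the image.
    p373_category_sizes: wikidata id -> number of images in the linked category.
    sitelink_scores: wikidata id -> precomputed importance score (0..1000).
    is_placeholder: True if the image is excluded as a likely placeholder.
    """
    if is_placeholder:
        # Excluded images get none of the extra search data.
        return []
    tags = [weighted_tag(P18_PREFIX, qid) for qid in p18_items]
    tags += [
        weighted_tag(P373_PREFIX, qid, category_score(size))
        for qid, size in p373_category_sizes.items()
    ]
    tags += [
        weighted_tag(SITELINK_PREFIX, qid, score)
        for qid, score in sitelink_scores.items()
    ]
    return tags


if __name__ == "__main__":
    # Image_X from the examples above; the category size and the sitelink
    # score are made-up illustration values.
    for tag in build_weighted_tags(
        p18_items=["Q144", "Q38280"],
        p373_category_sizes={"Q144": 4},
        sitelink_scores={"Q12345": 250},
    ):
        print(tag)
```

Keeping the placeholder check as a single boolean leaves the three exclusion rules (placeholder categories, heavy reuse, title strings) to whatever code already implements them in the Image Suggestions Algorithm.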
