To experiment with integrating Image Matching Algorithm data into the commons search index, we need to create a new index on relforge.
For an example of copying an index from production to relforge [[ https://phabricator.wikimedia.org/P16419 | see here ]]
For an example of augmenting a wiki dump with extra data and writing the whole lot to elastic [[ https://gerrit.wikimedia.org/r/plugins/gitiles/wikimedia/discovery/relevanceForge/+/refs/heads/master/other_tools/augmentdump.py | see here ]]
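As a rough illustration of the augmentation step, the sketch below appends extra `weighted_tags` values to documents in a dump. It is not taken from `augmentdump.py`: it assumes the dump is newline-delimited JSON in Elasticsearch bulk format (an action line followed by a document line), and the function name `augment_dump` and the `extra_tags` mapping are hypothetical.

```python
import json
from typing import Dict, Iterable, Iterator, List


def augment_dump(lines: Iterable[str], extra_tags: Dict[str, List[str]]) -> Iterator[str]:
    """Yield dump lines, adding extra weighted_tags to matching documents.

    Assumes bulk format: even lines are {"index": {"_id": ...}} actions,
    odd lines are the document sources they refer to.
    """
    page_id = None
    non_blank = (line for line in lines if line.strip())
    for i, line in enumerate(non_blank):
        record = json.loads(line)
        if i % 2 == 0:
            # Action line: remember which document the next line describes.
            page_id = str(record["index"]["_id"])
        elif page_id in extra_tags:
            # Document line: append new tags without duplicating existing ones.
            tags = record.setdefault("weighted_tags", [])
            for tag in extra_tags[page_id]:
                if tag not in tags:
                    tags.append(tag)
        yield json.dumps(record)
```

The output lines could then be written back to a file and loaded into the relforge index with whatever bulk-import tooling is already in use.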
The new data that we want in the dump is three new sets of property-value pairs, plus a score, in the weighted_tags field:
* `image.linked.from.wikidata.P18` will store wikidata item ids from which the image is linked via the P18 (image) property
** e.g. if the value of the P18 (image) property for wikidata items Q144 and Q38280 is set to **Image_X**
** then for **Image_X** we'll set the fields `image.linked.from.wikidata.P18/Q144` and `image.linked.from.wikidata.P18/Q38280`
* `image.linked.from.wikidata.P373` will store ids for any wikidata item that is linked via P373 (commons category) to any commons category that the image belongs to
** e.g. if wikidata item Q144 has its property P373 (commons category) set to `Dogs`
** AND **Image_X** is in the commons category `Dogs`
** then for **Image_X** we'll set the field `image.linked.from.wikidata.P373/Q144|<score>`
** <score> will be an integer between 0 and 1000, proportional to the **inverse** of the number of images in the category (because a category with fewer images is more specific, and therefore a better signal)
* `image.linked.from.wikidata.sitelink` will store the wikidata items of any wiki article the image is used in
** e.g. if **Image_X** is used on `https://ga.wikipedia.org/Page_Y`
** AND `https://ga.wikipedia.org/Page_Y` has a corresponding wikidata id `Q12345`
** then for **Image_X** we'll set the field `image.linked.from.wikidata.sitelink/Q12345|<score>`
** <score> will be an integer between 0 and 1000, set proportional to the importance of all pages with wikidata id Q12345 across all wikis
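The tag formats above can be sketched as follows. This is a minimal illustration, not the actual implementation: the helper names are hypothetical, the sitelink prefix follows the name given in the bullet heading, and the category scaling (1000 divided by the number of images, rounded and clamped) is only one assumed way of making the score proportional to the inverse of the category size. The sitelink score is taken as a ready-made input, since the importance calculation is not defined here.

```python
from typing import Iterable, List, Optional, Tuple

MAX_SCORE = 1000  # upper bound of the score range


def category_score(images_in_category: int) -> int:
    """Assumed scaling: 1000 / category size, rounded and clamped to 0..1000."""
    if images_in_category < 1:
        return 0
    return min(MAX_SCORE, round(MAX_SCORE / images_in_category))


def weighted_tag(prefix: str, item_id: str, score: Optional[int] = None) -> str:
    """Format '<prefix>/<item id>' with an optional '|<score>' suffix."""
    tag = f"{prefix}/{item_id}"
    return tag if score is None else f"{tag}|{score}"


def tags_for_image(
    p18_items: Iterable[str],
    p373_items: Iterable[Tuple[str, int]],      # (item id, images in category)
    sitelink_items: Iterable[Tuple[str, int]],  # (item id, precomputed score)
) -> List[str]:
    """Build the weighted_tags values for a single image."""
    tags = [weighted_tag("image.linked.from.wikidata.P18", q) for q in p18_items]
    tags += [
        weighted_tag("image.linked.from.wikidata.P373", q, category_score(n))
        for q, n in p373_items
    ]
    tags += [
        weighted_tag("image.linked.from.wikidata.sitelink", q, s)
        for q, s in sitelink_items
    ]
    return tags
```

For **Image_X** in the examples above, `tags_for_image(["Q144", "Q38280"], [("Q144", 4)], [("Q12345", 720)])` would produce the two P18 tags, `image.linked.from.wikidata.P373/Q144|250`, and `image.linked.from.wikidata.sitelink/Q12345|720` (the category size of 4 and the sitelink score of 720 are made-up inputs).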
The extra search data should **not** be added to any image that is excluded by the current Image Suggestions Algorithm, namely:
* images in any of the "placeholder images" categories (or their subcategories) on commons
* images that are already used on a large number of pages on any wiki (as they are likely to be placeholders)
* images whose titles contain strings that indicate they are likely to be placeholders
For more exact definitions of the above see [[ https://github.com/mirrys/ImageMatching/blob/main/algorithm.ipynb | the Image Suggestions Algorithm code ]]
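The three exclusion rules could be combined into a single check along these lines. This is only a sketch: the category names, the usage threshold, and the title substrings below are placeholder assumptions, and the real values should come from the Image Suggestions Algorithm code linked above. The `ImageInfo` structure is also hypothetical.

```python
from dataclasses import dataclass, field
from typing import Set

# All three values are illustrative assumptions, not the algorithm's real settings.
PLACEHOLDER_CATEGORIES = {"Image placeholders"}
MAX_PAGES_ON_ANY_WIKI = 10
PLACEHOLDER_TITLE_SUBSTRINGS = ("placeholder", "no_image")


@dataclass
class ImageInfo:
    title: str
    # Categories the image belongs to, with ancestor categories already
    # resolved so that subcategories of placeholder categories match too.
    categories: Set[str] = field(default_factory=set)
    # Highest number of pages using the image on any single wiki.
    max_usage_on_any_wiki: int = 0


def is_excluded(image: ImageInfo) -> bool:
    """Return True if the image should not receive the extra search data."""
    if image.categories & PLACEHOLDER_CATEGORIES:
        return True
    if image.max_usage_on_any_wiki > MAX_PAGES_ON_ANY_WIKI:
        return True
    lowered = image.title.lower()
    return any(s in lowered for s in PLACEHOLDER_TITLE_SUBSTRINGS)
```

Images for which `is_excluded` returns `True` would simply be skipped when the extra `weighted_tags` values are generated.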