To experiment with integrating Image Matching Algorithm data into the commons search index, we need to create a new index on relforge.
For an example of copying an index from production to relforge see here
For an example of augmenting a wiki dump with extra data and writing the whole lot to elastic see here
The new data that we want in the dump is three new sets of property-value pairs, plus a score, in the weighted_tags field:
- image.linked.from.wikidata.p18 will store wikidata item ids from which the image is linked via the P18 (image) property
  - e.g. if the value of the P18 (image) property for wikidata items Q144 and Q38280 is set to Image_X
  - then for Image_X we'll set the fields image.linked.from.wikidata.p18/Q144 and image.linked.from.wikidata.p18/Q38280
- image.linked.from.wikidata.p373 will store ids for any wikidata item that is linked via P373 (commons category) to any commons category that the image belongs to
  - e.g. if wikidata item Q144 has its property P373 (commons category) set to Dogs
  - AND Image_X is in the commons category Dogs
  - then for Image_X we'll set the field image.linked.from.wikidata.p373/Q144|<score>
  - <score> will be an integer between 0 and 1000, proportional to the inverse of the number of images in the category (because a category with fewer images is more specific, and therefore a better signal)
- image.linked.from.wikidata.sitelink will store the wikidata item ids of any wiki article the image is used in
  - e.g. if Image_X is used on https://ga.wikipedia.org/Page_Y
  - AND https://ga.wikipedia.org/Page_Y has a corresponding wikidata id Q12345
  - then for Image_X we'll set the field image.linked.from.wikidata.sitelink/Q12345|<score>
  - <score> will be an integer between 0 and 1000, proportional to the importance of all pages with wikidata id Q12345 across all wikis (using incoming links via the pagelinks table to give a measure of "importance")
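As a rough illustration of the tag format described above, here is a minimal Python sketch that builds the weighted_tags strings for the Image_X examples. The helper names (`weighted_tag`, `category_score`, `sitelink_score`) are hypothetical. The scaling formulas are also assumptions: the description only says the scores are integers between 0 and 1000 that are proportional to the inverse category size and to page importance, so the constants and the `max_incoming_links` normaliser are placeholders, not the actual pipeline logic.

```python
from typing import Optional

P18 = "image.linked.from.wikidata.p18"
P373 = "image.linked.from.wikidata.p373"
SITELINK = "image.linked.from.wikidata.sitelink"
MAX_SCORE = 1000


def weighted_tag(prefix: str, item_id: str, score: Optional[int] = None) -> str:
    """Format one entry as prefix/item_id, or prefix/item_id|score when scored."""
    if score is None:
        return f"{prefix}/{item_id}"
    if not 0 <= score <= MAX_SCORE:
        raise ValueError(f"score must be in [0, {MAX_SCORE}], got {score}")
    return f"{prefix}/{item_id}|{score}"


def category_score(images_in_category: int) -> int:
    """Assumed scaling: 1000 / category size, so smaller categories score higher."""
    if images_in_category < 1:
        raise ValueError("a category containing the image has at least one image")
    return round(MAX_SCORE / images_in_category)


def sitelink_score(incoming_links: int, max_incoming_links: int) -> int:
    """Assumed scaling: incoming links as a fraction of a chosen maximum."""
    if max_incoming_links < 1:
        return 0
    clamped = min(max(incoming_links, 0), max_incoming_links)
    return round(MAX_SCORE * clamped / max_incoming_links)


# Tags for Image_X, using made-up counts for the category and the incoming links.
tags = [
    weighted_tag(P18, "Q144"),
    weighted_tag(P18, "Q38280"),
    weighted_tag(P373, "Q144", category_score(250)),
    weighted_tag(SITELINK, "Q12345", sitelink_score(5000, 20000)),
]
print("\n".join(tags))
```

With these made-up inputs, a category of 250 images yields a p373 score of 4, and 5000 incoming links out of an assumed maximum of 20000 yields a sitelink score of 250.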
The extra search data should not be added to any image that is excluded by the current Image Suggestions Algorithm, namely:
- images in any of the "placeholder images" categories (or their subcategories) on commons
- images that are already used on a large number of pages on any wiki (as they are likely to be placeholders)
- images whose titles contain strings that indicate they are likely to be placeholders
For more exact definitions of the above, see the Image Suggestions Algorithm code.
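The three exclusion rules could be sketched as a single predicate, shown below in Python. Everything concrete here is an assumption for illustration: the category name, the usage threshold, the title substring, and the `ImageInfo` structure are all hypothetical, and the sketch assumes the image's category set has already been expanded to include parent categories. The real definitions live in the Image Suggestions Algorithm code.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

# Placeholder values for illustration only; not taken from the real algorithm.
PLACEHOLDER_CATEGORIES = {"Example placeholder category"}
MAX_PAGES_PER_WIKI = 10
PLACEHOLDER_TITLE_SUBSTRINGS = ("placeholder",)


@dataclass
class ImageInfo:
    title: str
    # Assumed to already include ancestor categories, so subcategories are covered.
    categories: Set[str] = field(default_factory=set)
    # Number of pages using the image, keyed by wiki (e.g. "gawiki").
    usage_by_wiki: Dict[str, int] = field(default_factory=dict)


def is_excluded(image: ImageInfo) -> bool:
    """Return True if the image should not receive the extra search data."""
    # Rule 1: the image is in a placeholder category or one of its subcategories.
    if image.categories & PLACEHOLDER_CATEGORIES:
        return True
    # Rule 2: the image is used on many pages on at least one wiki.
    if any(count > MAX_PAGES_PER_WIKI for count in image.usage_by_wiki.values()):
        return True
    # Rule 3: the title contains a string that suggests a placeholder.
    title = image.title.lower()
    return any(s in title for s in PLACEHOLDER_TITLE_SUBSTRINGS)
```

An image would then get its new weighted_tags entries only when `is_excluded` returns False.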