[XL] Create new index on relforge incorporating Image Matching Algorithm data
Closed, Resolved, Public

Description

In order to experiment with integrating Image Matching Algorithm data into the commons search index, we need to create a new index on relforge.

For an example of copying an index from production to relforge see here

For an example of augmenting a wiki dump with extra data and writing the whole lot to elastic see here

The new data that we want in the dump is three new sets of property-value pairs (two of them with a score attached) in the weighted_tags field:

  • image.linked.from.wikidata.p18 will store wikidata item ids from which the image is linked via the P18 (image) property
    • e.g. if the value of the P18 (image) property for wikidata items Q144 and Q38280 is set to Image_X
    • then for Image_X we'll set the fields image.linked.from.wikidata.p18/Q144 and image.linked.from.wikidata.p18/Q38280
  • image.linked.from.wikidata.p373 will store ids for any wikidata item that is linked via P373 (commons category) to any commons category that the image belongs to
    • e.g. if wikidata item Q144 has its property P373 (commons category) set to Dogs
    • AND Image_X is in the commons category Dogs
    • then for Image_X we'll set the field image.linked.from.wikidata.p373/Q144|<score>
    • <score> will be an integer between 0 and 1000, proportional to the inverse of the number of images in the category (because a category with fewer images is more specific, and therefore a better signal)
  • image.linked.from.wikidata.sitelink will store the wikidata items of any wiki article the image is used in
    • e.g. if Image_X is used on https://ga.wikipedia.org/Page_Y
    • AND https://ga.wikipedia.org/Page_Y has a corresponding wikidata id Q12345
    • then for Image_X we'll set the field image.linked.from.wikidata.sitelink/Q12345|<score>
    • <score> will be an integer between 0 and 1000, proportional to the importance of all pages with wikidata id Q12345 across all wikis (using incoming links via the pagelinks table to give a measure of "importance")
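The tag format described above can be sketched in Python as follows. This is only an illustration, not the code used for this task: the helper names format_weighted_tag and category_score are hypothetical, and the exact scaling (1000 divided by the number of images, rounded and capped) is an assumption, since the description only says the score is an integer between 0 and 1000 proportional to the inverse of the category size.

```python
from typing import Optional


def format_weighted_tag(prefix: str, item_id: str, score: Optional[int] = None) -> str:
    """Build one weighted_tags entry: 'prefix/Qid' or 'prefix/Qid|score'."""
    if score is None:
        # Unscored form, as used for the p18 prefix.
        return f"{prefix}/{item_id}"
    if not 0 <= score <= 1000:
        raise ValueError("score must be an integer between 0 and 1000")
    return f"{prefix}/{item_id}|{score}"


def category_score(images_in_category: int) -> int:
    """Hypothetical scaling: inverse of the category size, mapped onto 0..1000."""
    if images_in_category < 1:
        raise ValueError("a category containing the image has at least one image")
    # Assumption: a one-image category gets the maximum score of 1000.
    return min(1000, round(1000 / images_in_category))


p18_tag = format_weighted_tag("image.linked.from.wikidata.p18", "Q144")
p373_tag = format_weighted_tag(
    "image.linked.from.wikidata.p373", "Q144", category_score(4)
)
```

With these assumptions a category of 4 images yields a score of 250, and very large categories round down to 0, which matches the intent that smaller categories give a stronger signal.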

The extra search data should not be added to any image that is excluded by the current Image Suggestions Algorithm, namely:

  • images in any of the "placeholder images" categories (or their subcategories) on commons
  • images that are already used on a large number of pages on any wiki (as they are likely to be placeholders)
  • images whose titles contain strings that indicate they are likely to be placeholders

For more exact definitions of the above see the Image Suggestions Algorithm code
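As a rough illustration of the three exclusion rules, a filter could look like the sketch below. Every concrete value in it is a hypothetical placeholder (the category name, the usage threshold and the title substrings); the real definitions live in the Image Suggestions Algorithm code. It also assumes the list of categories passed in has already been expanded to include parent categories, so that subcategories of a placeholder category are caught.

```python
from typing import Iterable, Mapping

# Hypothetical placeholder values, NOT the ones used by the real algorithm.
PLACEHOLDER_CATEGORIES = frozenset({"Image placeholders"})
MAX_PAGES_PER_WIKI = 200
PLACEHOLDER_TITLE_SUBSTRINGS = ("placeholder", "no_image")


def is_excluded(
    title: str,
    categories: Iterable[str],
    usage_per_wiki: Mapping[str, int],
) -> bool:
    """Return True if the image should not receive the extra search data."""
    # Rule 1: the image is in a placeholder category (ancestors pre-expanded).
    if PLACEHOLDER_CATEGORIES.intersection(categories):
        return True
    # Rule 2: the image is used on a large number of pages on any single wiki.
    if any(count > MAX_PAGES_PER_WIKI for count in usage_per_wiki.values()):
        return True
    # Rule 3: the title contains a string that suggests a placeholder.
    lowered = title.lower()
    return any(fragment in lowered for fragment in PLACEHOLDER_TITLE_SUBSTRINGS)
```

For example, is_excluded("Dog.jpg", ["Dogs"], {"enwiki": 3}) is False under these placeholder values, while an image used on thousands of pages of one wiki would be excluded.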

Event Timeline

Cparle updated the task description.
CBogen renamed this task from Create new index on relforge incorporating Image Matching Algorithm data to [XL] Create new index on relforge incorporating Image Matching Algorithm data. Jul 14 2021, 4:42 PM

Update - I have a jupyter notebook that's successfully gathering the data, and have created a new index on relforge to hold it, but haven't yet succeeded in pushing the live commons index data into the new index (having some access issues with the relforge server)

Looking at the new data it seems like we'll be able to add new search data to ~32M out of the ~77M files on commons, so I'm optimistic that this will be a significant improvement to media search

Ok, new index is up and running with the new data at https://relforge1003.eqiad.wmnet:9243/commonswiki_file_t286562/

Here's how I did it:

1. Created the new index in relforge, and populated it from the most recent commons elasticsearch dump (of the index commonswiki_file) by following the instructions in this file

2. Gathered all the new data that we need using this code in a jupyter notebook

3. Pushed all the new data to the new index over HTTP, by executing the Python script below on stat1007 like so:

    spark2-submit --driver-memory 2G --executor-memory 4G --master yarn --files hdfs:///user/cparle/commons_files_related_wikidata_items --py-files /srv/deployment/wikimedia/discovery/analytics/spark/wmf_spark.py /home/cparle/T286562.py