Now that we're reasonably confident that pushing Wikidata information into the `weighted_tags` field of an experimental Commons index improves image search, we need to do the same for the production `commonswiki_file` index.
**At the same time** we also want to populate the `hasrecommendation` flag in the search indices of various Wikipedias (you'll need to consult the Growth team to find out which ones). The two parts need to be done at the same time because the data is related and we want it to stay consistent.
The easiest way to do this is via an Airflow job that runs every X days/weeks. **Note that each run will need to delete old data as well as add new data.**
Part 1
--
The existing notebook code for how the `weighted_tags` data was gathered is attached to T286562
The new data that we want is three new sets of property-value pairs, plus a score, in the weighted_tags field:
* `image.linked.from.wikidata.p18` will store wikidata item ids from which the image is linked via the P18 (image) property
** e.g. if the value of the P18 (image) property for wikidata items Q144 and Q38280 is set to **Image_X**
** then for **Image_X** we'll set the fields `image.linked.from.wikidata.p18/Q144` and `image.linked.from.wikidata.p18/Q38280`
* `image.linked.from.wikidata.p373` will store ids for any wikidata item that is linked via P373 (commons category) to any commons category that the image belongs to
** e.g. if wikidata item Q144 has its property P373 (commons category) set to `Dogs`
** AND **Image_X** is in the commons category `Dogs`
** then for **Image_X** we'll set the field `image.linked.from.wikidata.p373/Q144|<score>`
** <score> will be an integer between 0 and 1000, proportional to the **inverse** of the number of images in the category (because a category with fewer images is more specific, and therefore a better signal)
** note that we need an upper bound on category size - categories containing many thousands of images would increase the size of the index without providing a useful signal. In the original Jupyter notebook for the experimental index, category size was limited to 100k images, but we could probably make that a lot smaller
* `image.linked.from.wikidata.sitelink` will store the wikidata items of any wiki article the image is used in
** e.g. if **Image_X** is used on `https://ga.wikipedia.org/Page_Y`
** AND `https://ga.wikipedia.org/Page_Y` has a corresponding wikidata id `Q12345`
** then for **Image_X** we'll set the field `image.linked.from.wikidata.sitelink/Q12345|<score>`
** <score> will be an integer between 0 and 1000, proportional to the importance of all pages with wikidata id `Q12345` across all wikis (using incoming links via the `pagelinks` table to give a measure of "importance")
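Both kinds of score have to be mapped onto integers in the 0-1000 range. A minimal sketch of the two scalings in plain Python (the function names and the linear link-count normalisation are assumptions for illustration, not the notebook's actual code):

```python
def category_score(category_size: int, max_size: int = 100_000) -> int:
    """Score proportional to the inverse of the category size, scaled to 0-1000.

    Categories above max_size (100k in the original notebook) give no score.
    Fewer images -> more specific category -> higher score.
    """
    if category_size <= 0 or category_size > max_size:
        return 0
    return round(1000 / category_size)


def importance_score(incoming_links: int, max_links: int) -> int:
    """Score proportional to the page's incoming-link count across wikis,
    scaled to 0-1000 against the largest observed count (assumed normalisation)."""
    if max_links <= 0:
        return 0
    return round(1000 * min(incoming_links, max_links) / max_links)
```
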
**Also** investigate adding a fourth property-value pair with a score to the weighted_tags field (this is not done currently, so it is not in the notebook code)
* `image.linked.from.wikidata.lead_image` will store the wikidata items of any wiki article the image is the lead image for
** e.g. if **Image_X** is **the lead image** on `https://ga.wikipedia.org/Page_Y` - i.e. Image_X's title is the value of the `page_image_free` page prop for `https://ga.wikipedia.org/Page_Y`
** AND `https://ga.wikipedia.org/Page_Y` has a corresponding wikidata id `Q12345`
** then for **Image_X** we'll set the field `image.linked.from.wikidata.lead_image/Q12345|<score>`
** <score> will be an integer between 0 and 1000, proportional to the importance of all pages with wikidata id `Q12345` across all wikis (using incoming links via the `pagelinks` table to give a measure of "importance")
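All four properties follow the same `<prefix>/<Qid>` or `<prefix>/<Qid>|<score>` string format in weighted_tags. A hypothetical helper (the name is illustrative) showing how the entries described above would be assembled:

```python
from typing import Optional


def weighted_tag(prefix: str, qid: str, score: Optional[int] = None) -> str:
    """Format a weighted_tags entry, e.g.
    'image.linked.from.wikidata.p373/Q144|500'; score is optional."""
    if score is None:
        return f"{prefix}/{qid}"
    if not 0 <= score <= 1000:
        raise ValueError("score must be in 0..1000")
    return f"{prefix}/{qid}|{score}"
```
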
The extra search data should **not** be added to any image that is excluded by the current Image Suggestions Algorithm, namely:
* images in any of the "placeholder images" categories (or their subcategories) on commons
* images that are already used on a large number of pages on any wiki (as they are likely to be placeholders)
* images whose titles contain strings that indicate they are likely to be placeholders
* probably some more too - we need to consult with the research team
For more exact definitions of the above see [[ https://github.com/mirrys/ImageMatching/blob/main/algorithm.ipynb | the Image Suggestions Algorithm code ]]
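The shape of the exclusion check might look something like this sketch (the category list, title patterns, and usage threshold here are placeholders - the real definitions are in the Image Suggestions Algorithm code linked above):

```python
import re

# Illustrative values only; see the Image Suggestions Algorithm code for the
# actual placeholder categories, title substrings, and usage threshold.
PLACEHOLDER_TITLE_RE = re.compile(r"placeholder|image_needed|no_photo", re.IGNORECASE)
MAX_USAGE_COUNT = 15  # hypothetical "used on too many pages" threshold


def is_excluded(title: str, usage_count: int, in_placeholder_category: bool) -> bool:
    """True if an image should NOT receive the extra search data."""
    return (
        in_placeholder_category
        or usage_count > MAX_USAGE_COUNT
        or PLACEHOLDER_TITLE_RE.search(title) is not None
    )
```
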
Part 2
--
* gather all wikidata-ids written to the commonswiki_file index above
* gather all wikidata-ids from all commons `depicts` and `is digital representation of` statements
* merge the two sets into one collection of wikidata ids on commons
* then for each relevant wiki find all unillustrated articles ([[ https://github.com/mirrys/ImageMatching/blob/main/algorithm.ipynb | see the Image Suggestions Algorithm code for how ]]) with their wikidata-ids
* if the wikidata-id of an article is in the collection of wikidata ids on commons, write `hasrecommendation:true` into its search index
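The matching logic of the steps above, sketched as plain set operations (the real job would run as Spark over the relevant tables, and all names here are hypothetical):

```python
def articles_to_flag(
    index_qids: set,       # wikidata ids written to commonswiki_file in part 1
    statement_qids: set,   # ids from depicts / digital-representation statements
    unillustrated: dict,   # article title -> wikidata id, for one wiki
) -> set:
    """Return titles of unillustrated articles that should get
    hasrecommendation:true in their search index."""
    # Merge the two sets into one collection of wikidata ids on commons.
    commons_qids = index_qids | statement_qids
    # Flag any unillustrated article whose wikidata id appears in that collection.
    return {title for title, qid in unillustrated.items() if qid in commons_qids}
```
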