Calculate image suggestions confidence score without using elasticsearch
Closed, Resolved, Public

Description

For the current iteration of image suggestions we have a tuned search profile where the Elasticsearch score returned reflects the likelihood that an image is a good match, and we had anticipated using this as a confidence score.

While generating the image suggestions data, we gather data from Wikidata and save it in HDFS so that the search pipeline can pick it up and import it into the commonswiki search index. This data is essential for calculating the confidence score. However, we can't actually get a confidence score until the data is in the index, so we're unable to finish generating the suggestions data until we're sure the data has been imported.

To work around that, this ticket is to calculate the confidence score before the data is available in Elasticsearch. Only 1 of the 4 signals used to calculate the score is BM25-based, so it should be possible.

Event Timeline

After running queries on the labeled data, it turns out the most reliable confidence score is simply based on the source of the match:

| source of match | proportion of good images |
| --- | --- |
| P18 | 0.9787234043 |
| lead image | 0.8839907193 |
| commons category | 0.8734693878 |
| depicts | 0.7577433628 |
| no match | 0.3863076923 |
| overall | 0.4854083314 |

So, rounding down to the nearest 10% for safety, we're going to say:

If we match an image based on P18, confidence score is 90%
If we match an image based on lead image or commons category, confidence score is 80%
If we match an image based on depicts, confidence score is 70%
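
The mapping above can be sketched in a few lines of Python. This is only an illustration, not the code used in the pipeline: the function name `confidence_score`, the dictionary `PROPORTION_GOOD`, and the source labels used as keys are all made up for this example, and it assumes "rounding down" means flooring to the nearest 10%.

```python
import math

# Proportion of good images per match source, copied from the measured
# results above. The key strings are hypothetical labels for this sketch.
PROPORTION_GOOD = {
    "P18": 0.9787234043,
    "lead image": 0.8839907193,
    "commons category": 0.8734693878,
    "depicts": 0.7577433628,
}


def confidence_score(source):
    """Return the confidence score as a percentage, rounded down to the
    nearest 10, or None if the source is not one we assign a score to."""
    proportion = PROPORTION_GOOD.get(source)
    if proportion is None:
        return None
    # Floor to a whole number of tenths, then express as a percentage.
    return math.floor(proportion * 10) * 10


if __name__ == "__main__":
    for source in PROPORTION_GOOD:
        print(source, confidence_score(source))
```

Because the score depends only on the source of the match, it can be computed while generating the suggestions data, without waiting for anything to be imported into the search index.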

See here for the full results/analysis https://docs.google.com/spreadsheets/d/1ZByYvEnwJyK4GwQ7fgreiJhobtueNqb8UgNxRU1t0Y4/edit#gid=1248021027