
Investigate whether the probability-of-an-image-being-good score is useful as a confidence score
Closed, ResolvedPublic

Description

Once T271799 is implemented, the score returned from elasticsearch that's used to rank search results should, we think, be a number between 0 and 1 that indicates the probability of an image being good

We'd like to use this as a confidence score for image matching, so we need to check whether the estimated probability we have is well calibrated

The simplest way to do this is to

  1. gather N new ratings from https://media-search-signal-test.toolforge.org/
  2. run searches with the new search profile for all search terms associated with the newly-labeled images
  3. record the elasticsearch scores for all the newly labelled images in the search results
  4. sort the labeled images from the search results into buckets according to their elasticsearch scores
  5. count the good/bad images in each bucket
  6. we'd expect (number good)/(number good + number bad) in each bucket to approximately equal the mid-point of the bucket. If it does, we can use the elasticsearch score as a confidence score; if it doesn't, we need to rethink
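Steps 3-6 above can be sketched in a few lines. This is a hypothetical illustration (the function and variable names are invented here), assuming each rated result is a (score, is_good) pair with the elasticsearch score normalised to [0, 1]:

```python
# Hypothetical sketch of steps 3-6 above. Assumes each rated result is a
# (score, is_good) pair, with the elasticsearch score normalised to [0, 1].
from collections import defaultdict

def calibration_table(rated_results, n_buckets=10):
    """Bucket labelled results by score; report each bucket's mid-point,
    observed good fraction, and sample count."""
    counts = defaultdict(lambda: [0, 0])  # bucket index -> [n_good, n_total]
    for score, is_good in rated_results:
        idx = min(int(score * n_buckets), n_buckets - 1)
        if is_good:
            counts[idx][0] += 1
        counts[idx][1] += 1
    return [((idx + 0.5) / n_buckets, good / total, total)
            for idx, (good, total) in sorted(counts.items())]

# If the score is well calibrated, the observed good fraction in each
# bucket should roughly match the bucket mid-point
sample = [(0.05, False), (0.05, False), (0.45, False),
          (0.55, True), (0.92, True), (0.95, True)]
for midpoint, observed, n in calibration_table(sample):
    print(f"bucket mid {midpoint:.2f}: good fraction {observed:.2f} (n={n})")
```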

Question: how big should N be?
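One common way to size N is to treat each bucket's good-fraction as a binomial proportion and ask how many samples give a usable 95% confidence interval per bucket. The even per-bucket allocation below is an assumption - real ratings will cluster in some score ranges, so the total would need to be padded accordingly:

```python
# Back-of-envelope sizing for N: samples needed to estimate a proportion
# to within +/- margin at 95% confidence (worst case p = 0.5).
import math

def samples_per_bucket(margin=0.05, p=0.5, z=1.96):
    """Binomial sample size for a proportion estimate of +/- margin."""
    return math.ceil(z * z * p * (1 - p) / (margin * margin))

n_buckets = 10
per_bucket = samples_per_bucket(margin=0.05)
print(per_bucket)              # 385 ratings per bucket at +/- 5%
print(per_bucket * n_buckets)  # 3850 total, if ratings spread evenly
```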

Event Timeline

Reedy renamed this task from Investigate whather the probability-of-an-image-being-good score is useful as a confidence score to Investigate whether the probability-of-an-image-being-good score is useful as a confidence score.Jan 22 2021, 3:37 PM
Cparle updated the task description.

Note: Once we have a confidence score, we'll want to decide what the cutoff confidence score should be for image recommendations (and whether it differs by use case). We'll also want to measure how much each confidence-score cutoff decreases the coverage of image recommendation matches returned by MediaSearch.

This will probably need to be covered in a new ticket, but I'm noting it here so we don't forget.

We have 2485 rated recommendations using mediasearch, from 984 search terms

I re-ran the searches in 3 different ways so that I could record the elasticsearch score for each of the rated results

  • using commons directly
  • on local using the elasticsearch replica indices
    • with the standard rescore profile 'classic_noboostlinks_max_boost_template'
    • with an empty rescore profile

Detailed results and graphs are in a google sheet here

We get a significantly better (R-squared = 0.983 vs R-squared = 0.856) match between elasticsearch score and the % of images that are good when we don't have a rescore section in the elastic query. This actually makes sense for image recommendations - the standard rescore boosts "quality" images, and for recommendations we care more about the accuracy of a match than the quality of the image.
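For reference, an R-squared comparison like the one above can be reproduced with a plain least-squares linear fit. This helper is a generic sketch (the actual figures came from the google sheet's trendlines, presumably via the spreadsheet's own fitting):

```python
# Generic R-squared for a least-squares linear fit of ys against xs.
def r_squared(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    ss_res = sum((y - (my + slope * (x - mx))) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# A perfectly linear relationship gives R-squared = 1.0
print(r_squared([1, 2, 3], [2, 4, 6]))
```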

This suggests that we ought to have a way of choosing the rescore profile via the url, and to get the PET to set the url when fetching mediasearch results so as to turn off rescoring (see T283837)


If we use the no-rescore profile, we get the following relationship between elasticsearch score and the likelihood that an image is a good match:

score range   likelihood of image being good
>0            0.2523948355
>10           0.2523948355
>20           0.2523948355
>30           0.3467741935
>40           0.3612774451
>50           0.4249797242
>60           0.5148063781
>70           0.6673427992
>80           0.7588652482
>90           0.88
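The table above can be turned into a simple lookup. This is just a sketch, with the threshold/likelihood pairs copied from the bucketed results and the function name invented here:

```python
# Lookup from no-rescore elasticsearch score to the observed likelihood
# that the image is a good match. Pairs copied from the bucketed results;
# the >0, >10 and >20 rows collapse into one threshold.
LIKELIHOOD_BY_THRESHOLD = [
    (90, 0.88),
    (80, 0.7588652482),
    (70, 0.6673427992),
    (60, 0.5148063781),
    (50, 0.4249797242),
    (40, 0.3612774451),
    (30, 0.3467741935),
    (0, 0.2523948355),
]

def likelihood_good(score):
    """Return the observed good-match likelihood for a given score."""
    for threshold, likelihood in LIKELIHOOD_BY_THRESHOLD:
        if score > threshold:
            return likelihood
    return LIKELIHOOD_BY_THRESHOLD[-1][1]

print(likelihood_good(75))  # 0.6673427992
```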

So:
If the score is 60 or greater, there's an approx 50:50 chance that the image is good
If the score is 70 or greater, there's an approx 2:1 chance that the image is good
If the score is 80 or greater, there's an approx 3:1 chance that the image is good

For reference, here's the test results from research's image-recommendation-algorithm:

confidence_score   likelihood of image being good
high               85%
medium             58%

... so from the trendline of the graph

  • an ms result with an elasticsearch score of >64 gives approx the same likelihood of a good result as an ima result with a medium confidence_score
  • an ms result with an elasticsearch score of >89 gives approx the same likelihood of a good result as an ima result with a high confidence_score
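The equivalence above could be sketched as a mapping from MS elasticsearch score to the same labels the IMA returns, using the trendline cut-offs (>89 for high, >64 for medium). Treating everything else as "low" is an assumption, and the function name is hypothetical:

```python
# Hypothetical mapping from a no-rescore MediaSearch elasticsearch score
# to IMA-comparable confidence labels, using the trendline cut-offs.
def ms_confidence(score):
    if score > 89:
        return "high"    # comparable to IMA high (~85% good)
    if score > 64:
        return "medium"  # comparable to IMA medium (~58% good)
    return "low"         # assumption: everything below medium

print(ms_confidence(91))  # high
print(ms_confidence(70))  # medium
```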

Thanks for this great write up, @Cparle! This seems really promising.

Can we get a comparison of the coverage we get on unillustrated articles for the wikis we rated with cut offs at each of the following confidence levels?

60 or greater
70 or greater
80 or greater
90 or greater
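A sketch of what that coverage comparison might look like, assuming we have the best MediaSearch score per unillustrated article (the input data below is purely illustrative, and the function name is invented):

```python
# Hypothetical coverage check: for each cut-off, the fraction of
# unillustrated articles that still get at least one recommendation
# whose best score clears it.
def coverage_at_cutoffs(best_scores, cutoffs=(60, 70, 80, 90)):
    total = len(best_scores)
    return {c: sum(s >= c for s in best_scores) / total for c in cutoffs}

# Illustrative input: best elasticsearch score per unillustrated article
best_scores = [95, 82, 71, 66, 55, 40, 91, 63]
for cutoff, cov in coverage_at_cutoffs(best_scores).items():
    print(f"{cutoff} or greater: {cov:.0%} coverage")
```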

We can, but it means going through all unillustrated articles and grabbing their scores, so it's going to need a ticket in itself. I'll make one


Okay, thanks!

Is there anything else we should test before implementing this as a score? Do we need to rerun some manual testing again? Seems like we already know what the results would be based on the above.

I think we could do another cycle of tuning search results incorporating the data from the image-recommendation test, and then graph the data again and see where we are

... but also I think it'd probably be worth talking to the PET about using what we have right now in the next iteration: rather than always returning a confidence_score of high or medium for IMA and low for MS, they can use the elastic score from MS to give high, medium and low values for MS that are comparable to those values for IMA

Maybe I'll make a ticket for that too and bring it to the attention of @BPirkle and @sdkim


Great, I'll bring this up in the image recs steering committee meeting. FYI @sdkim

@Cparle one other thing I'd like to explore - can we compare the images we return at each confidence level with the images the image algo returns, to see if there's a lot of overlap?

We could do. Should I make a ticket?


@Cparle Yes please! Thanks!