Investigate whether the probability-of-an-image-being-good score is useful as a confidence score
Closed, Resolved (Public)

Description

Once T271799 is implemented, the score returned from elasticsearch that's used to rank search results should, we think, be a number between 0 and 1 that indicates the probability of an image being good.

We'd like to use this as a confidence score for image matching, so we need to check whether the estimated probability is well calibrated, i.e. whether it matches how often images really are good.

The simplest way to do this is to

gather N new ratings from https://media-search-signal-test.toolforge.org/
run searches with the new profile for all search terms for the newly labeled images
record the elasticsearch scores for all the newly labelled images in the search results
sort the labeled images from the search results into buckets according to their elasticsearch scores
count the good/bad images in each bucket
we'd expect (number good)/(number good + number bad) in each bucket to approximately equal the mid-point of the bucket. If it does, we can use the elasticsearch score as a confidence score; if it does not, we need to rethink
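
The bucketing steps above can be sketched roughly as follows. This is only an illustration, assuming the scores really are probabilities in the range 0 to 1 and that each rated image has a good/bad label; the function name `calibration_table` and its parameters are hypothetical and not part of any existing tool.

```python
from typing import List, Sequence, Tuple


def calibration_table(
    scores: Sequence[float],
    labels: Sequence[bool],
    n_buckets: int = 10,
) -> List[Tuple[float, int, float]]:
    """Return (bucket mid-point, number of images, fraction good) per bucket.

    Assumes every score lies in [0, 1]; labels are True for "good" images.
    Empty buckets are skipped because their fraction is undefined.
    """
    good = [0] * n_buckets
    total = [0] * n_buckets
    for score, is_good in zip(scores, labels):
        # A score of exactly 1.0 goes into the last bucket.
        index = min(int(score * n_buckets), n_buckets - 1)
        total[index] += 1
        good[index] += int(is_good)

    rows = []
    for index in range(n_buckets):
        if total[index] == 0:
            continue
        midpoint = (index + 0.5) / n_buckets
        rows.append((midpoint, total[index], good[index] / total[index]))
    return rows


if __name__ == "__main__":
    # Made-up ratings, purely to show the shape of the output.
    example_scores = [0.05, 0.05, 0.95, 0.95]
    example_labels = [False, False, True, True]
    for midpoint, count, fraction in calibration_table(example_scores, example_labels):
        print(f"mid-point {midpoint:.2f}: {count} images, {fraction:.2f} good")
```

If the score is well calibrated, the fraction good in each row should sit close to that row's mid-point.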

Question: how big should N be?
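
One rough way to approach this question (an assumption on my part, not something decided in this task) is the usual normal approximation to the binomial: to estimate the fraction of good images in a bucket to within a given margin of error, each bucket needs about z² · p(1 − p) / margin² ratings. The helper name `samples_per_bucket` is hypothetical.

```python
import math


def samples_per_bucket(margin: float, p: float = 0.5, z: float = 1.96) -> int:
    """Ratings needed in one bucket for the given margin of error.

    Uses the normal approximation to the binomial distribution.
    p = 0.5 is the worst case; z = 1.96 corresponds to ~95% confidence.
    """
    return math.ceil(z * z * p * (1 - p) / (margin * margin))


if __name__ == "__main__":
    for margin in (0.10, 0.05):
        print(f"margin {margin:.2f}: {samples_per_bucket(margin)} ratings per bucket")
```

Ratings are unlikely to be spread evenly across buckets, so the overall N would have to be large enough that the sparsest bucket of interest still reaches this count.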

Related Objects

Status	Assigned	Task
Resolved	CBogen	T267674 [Epic] Build Media Matching API for bots/scripts
Resolved	Cparle	T269852 [Epic] Interpret image search signal results
Resolved	Cparle	T271799 [L] Implement new search profile(s) based on image search signal results
Resolved	CBogen	T299781 [EPIC] Image suggestions backend
Resolved	CBogen	T281582 [EPIC] Develop a confidence score for MediaSearch results
Resolved	Cparle	T272710 Investigate whether the probability-of-an-image-being-good score is useful as a confidence score

Event Timeline

Cparle created this task. Jan 22 2021, 1:51 PM

Reedy renamed this task from Investigate whather the probability-of-an-image-being-good score is useful as a confidence score to Investigate whether the probability-of-an-image-being-good score is useful as a confidence score. Jan 22 2021, 3:37 PM

CBogen moved this task from Backlog to MediaSearch-ImageRecs on the SDAW-MediaSearch board. Jan 27 2021, 2:12 PM

CBogen edited projects, added SDAW-MediaSearch (MediaSearch-ImageRecs); removed SDAW-MediaSearch.

CBogen added a parent task: T267674: [Epic] Build Media Matching API for bots/scripts. Jan 28 2021, 9:03 PM

CBogen removed a parent task: T267674: [Epic] Build Media Matching API for bots/scripts.

Blocked by T271799

Cparle updated the task description. Feb 1 2021, 5:46 PM

Cparle updated the task description. Feb 1 2021, 6:25 PM

Cparle updated the task description.

CBogen mentioned this in T273882: [M] Estimate how many unillustrated articles on Cebuano and Arabic wikis would have matches in MediaSearch. Feb 4 2021, 2:11 PM

Now blocked by T273092

Note: Once we have a confidence score, we'll want to decide what the cutoff confidence score should be for image recommendations (and does it differ by use case?) Also, we'll want to measure how much cutting off by each confidence score decreases the coverage of image recommendations matches returned by MediaSearch.

This will probably need to be covered in a new ticket, but I'm noting it here so we don't forget.

CBogen mentioned this in T281582: [EPIC] Develop a confidence score for MediaSearch results. Apr 30 2021, 2:45 PM

CBogen added a parent task: T281582: [EPIC] Develop a confidence score for MediaSearch results.

CBogen moved this task from Blocked to Ready for Development on the Structured-Data-Backlog (Current Work) board. May 11 2021, 3:34 PM

Cparle moved this task from Ready for Development to Doing on the Structured-Data-Backlog (Current Work) board. May 26 2021, 10:21 AM

We have 2485 rated recommendations using mediasearch, from 984 search terms

I re-ran the searches in 3 different ways so that I could record the elasticsearch score for each of the rated results

using commons directly
on local using the elasticsearch replica indices
- with the standard rescore profile 'classic_noboostlinks_max_boost_template'
- with an empty rescore profile

Detailed results and graphs are in a google sheet here

We get a significantly better (R-squared = 0.983 vs R-squared=0.856) match between elasticsearch score and the % of images that are good when we don't have a rescore section in the elastic query. This actually makes sense for image recommendations - the standard rescore boosts "quality" images, and for recommendations we care more about the accuracy of a match than the quality of the image.

This suggests that we ought to have a way of choosing the rescore profile via the URL, and of getting the PET to set it when fetching mediasearch results so as to turn off rescoring (see T283837).

If we use the no-rescore profile, we get the following relationship between elasticsearch score and the likelihood that an image is a good match:

score range	likelihood of image being good
>0	0.2523948355
>10	0.2523948355
>20	0.2523948355
>30	0.3467741935
>40	0.3612774451
>50	0.4249797242
>60	0.5148063781
>70	0.6673427992
>80	0.7588652482
>90	0.88

So:
If the score is 60 or greater, there's an approx 50:50 chance that the image is good
If the score is 70 or greater, the odds that the image is good are approx 2:1
If the score is 80 or greater, the odds that the image is good are approx 3:1
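
As a quick check of the arithmetic behind these statements, the likelihoods from the table can be converted into odds with p / (1 − p). This is a minimal sketch; the names `LIKELIHOOD_ABOVE` and `odds_in_favour` are made up for illustration, and the numbers are copied from the table above.

```python
# Likelihood of an image being good, keyed by score cutoff
# (values copied from the no-rescore table above).
LIKELIHOOD_ABOVE = {
    0: 0.2523948355,
    10: 0.2523948355,
    20: 0.2523948355,
    30: 0.3467741935,
    40: 0.3612774451,
    50: 0.4249797242,
    60: 0.5148063781,
    70: 0.6673427992,
    80: 0.7588652482,
    90: 0.88,
}


def odds_in_favour(probability: float) -> float:
    """Convert a probability into odds in favour, e.g. 0.75 -> 3.0 (3:1)."""
    if not 0 <= probability < 1:
        raise ValueError("probability must be in [0, 1)")
    return probability / (1 - probability)


if __name__ == "__main__":
    for cutoff in (60, 70, 80, 90):
        p = LIKELIHOOD_ABOVE[cutoff]
        print(f"score > {cutoff}: p = {p:.3f}, odds = {odds_in_favour(p):.2f}:1")
```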

For reference, here's the test results from research's image-recommendation-algorithm:

confidence_score	likelihood of image being good
high	85%
medium	58%

... so from the trendline of the graph

an MS (MediaSearch) result with an elasticsearch score of >64 gives approx the same likelihood of a good result as an IMA (image-recommendation-algorithm) result with a medium confidence_score
an MS result with an elasticsearch score of >89 gives approx the same likelihood of a good result as an IMA result with a high confidence_score
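
If those two cutoffs were used to label MediaSearch results, the mapping might look something like the sketch below. The function name `ms_confidence` and the constants are hypothetical; only the cutoff values 64 and 89 come from the trendline above, and treating everything at or below 64 as "low" is an assumption.

```python
# Cutoffs read off the trendline above; everything else here is illustrative.
MEDIUM_CUTOFF = 64
HIGH_CUTOFF = 89


def ms_confidence(elasticsearch_score: float) -> str:
    """Map a no-rescore elasticsearch score to a coarse confidence label."""
    if elasticsearch_score > HIGH_CUTOFF:
        return "high"
    if elasticsearch_score > MEDIUM_CUTOFF:
        return "medium"
    return "low"


if __name__ == "__main__":
    for score in (50, 70, 95):
        print(score, ms_confidence(score))
```

These cutoffs only make sense for scores produced without the rescore section, since that is the profile the relationship was measured on.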

Thanks for this great write up, @Cparle! This seems really promising.

Can we get a comparison of the coverage we get on unillustrated articles for the wikis we rated with cut offs at each of the following confidence levels?

60 or greater
70 or greater
80 or greater
90 or greater

We can, but it means going through all unillustrated articles and grabbing their scores, so it's going to need a ticket in itself. I'll make one

In T272710#7119710, @Cparle wrote:

We can, but it means going through all unillustrated articles and grabbing their scores, so it's going to need a ticket in itself. I'll make one

Okay, thanks!

Is there anything else we should test before implementing this as a score? Do we need to rerun some manual testing again? Seems like we already know what the results would be based on the above.

I think we could do another cycle of tuning search results incorporating the data from the image-recommendation test, and then graph the data again and see where we are

... but also I think it'd probably be worth talking to the PET about using what we have right now in the next iteration, so rather than always returning a confidence_score of high or medium for IMA and low for MS, they can use the elastic score from MS to give high, medium and low values for MS that are comparable to those values for IMA

Maybe I'll make a ticket for that too and bring it to the attention of @BPirkle and @sdkim

Cparle added subscribers: sdkim, BPirkle. May 27 2021, 3:57 PM

In T272710#7119852, @Cparle wrote:

I think we could do another cycle of tuning search results incorporating the data from the image-recommendation test, and then graph the data again and see where we are

... but also I think it'd probably be worth talking to the PET about using what we have right now in the next iteration, so rather than always returning a confidence_score of high or medium for IMA and low for MS, they can use the elastic score from MS to give high, medium and low values for MS that are comparable to those values for IMA

Great, I'll bring this up in the image recs steering committee meeting. FYI @sdkim

Cparle mentioned this in T283837: Provide a way to set mediasearch rescore profile via the url. May 27 2021, 4:36 PM

@Cparle one other thing I'd like to explore - can we compare the images we return at each confidence level with the images the image algo returns to see if there's a lot of overlap?

Cparle mentioned this in T283863: Incorporate image-recommendation-test results into the image recs API confidence score. May 27 2021, 7:24 PM

Cparle mentioned this in T283865: [XL] Estimate coverage of image suggestions at different confidence levels. May 27 2021, 7:34 PM

We could do. Should I make a ticket?

In T272710#7120259, @CBogen wrote:

@Cparle one other thing i'd like to explore - can we compare the images we return at each confidence level with the images the image algo returns to see if there's a lot of overlap?

In T272710#7142422, @Cparle wrote:

We could do. Should I make a ticket?

@Cparle Yes please! Thanks!

Cparle closed this task as Resolved. Jun 14 2021, 4:21 PM
