Once T271799 is implemented, the elasticsearch score that's used to rank search results should, we think, be a number between 0 and 1 that indicates the probability of an image being good.
We'd like to use this as a confidence score for image matching, so we need to check whether the estimated probability is realistic, i.e. whether the score is well calibrated.
The simplest way to do this is to:
- gather N new ratings from https://media-search-signal-test.toolforge.org/
- run searches with the new search profile for all search terms associated with the newly labeled images
- record the elasticsearch scores for all the newly labeled images that appear in the search results
- sort the labeled images from the search results into buckets according to their elasticsearch scores
- count the good/bad images in each bucket
- compare (number good)/(number good + number bad) in each bucket with the mid-point of the bucket: we'd expect the two to be approximately equal. If they are, we can use the elasticsearch score as a confidence score; if they are not, we need to rethink
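The bucketing and counting steps above could be sketched roughly as follows. This is only an illustration: the function name `calibration_table`, the `(score, is_good)` input format and the choice of 10 equal-width buckets are assumptions, not something already decided for this task.

```python
from collections import defaultdict


def calibration_table(scored_labels, n_buckets=10):
    """Bucket labeled images by score and compare observed vs expected.

    scored_labels: iterable of (score, is_good) pairs, where score is the
    elasticsearch score (assumed to lie in [0, 1]) and is_good is the
    human rating as a bool. Both the input shape and the equal-width
    buckets are assumptions made for this sketch.

    Returns a list of (bucket_midpoint, good, bad, observed_fraction)
    tuples, one per non-empty bucket, ordered by bucket.
    """
    counts = defaultdict(lambda: [0, 0])  # bucket index -> [good, bad]
    for score, is_good in scored_labels:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score outside [0, 1]: {score}")
        # A score of exactly 1.0 goes into the last bucket.
        idx = min(int(score * n_buckets), n_buckets - 1)
        counts[idx][0 if is_good else 1] += 1

    rows = []
    for idx in sorted(counts):
        good, bad = counts[idx]
        midpoint = (idx + 0.5) / n_buckets
        observed = good / (good + bad)
        rows.append((midpoint, good, bad, observed))
    return rows


if __name__ == "__main__":
    # Made-up ratings, purely to show the output shape.
    sample = [(0.05, False), (0.95, True), (0.91, True), (0.99, False)]
    for midpoint, good, bad, observed in calibration_table(sample):
        print(f"mid={midpoint:.2f} good={good} bad={bad} observed={observed:.2f}")
```

If the score is well calibrated, the observed fraction in each row should be close to the bucket mid-point. Buckets with only a handful of ratings will be noisy, so the raw counts are worth keeping next to the fraction.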
Question: how big should N be?
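One rough way to approach this is to treat each bucket as a binomial sample and ask how many ratings it needs for the observed fraction to be trustworthy to within some margin. The sketch below uses the standard normal-approximation formula; the helper name `ratings_per_bucket`, the 95% confidence level and the margins tried are illustrative assumptions, not requirements from this task.

```python
import math


def ratings_per_bucket(half_width, p=0.5, z=1.96):
    """Ratings needed in one bucket so that the observed good-fraction has
    a confidence interval of roughly +/- half_width.

    Uses the normal approximation to the binomial: n >= z^2 * p * (1 - p) / w^2.
    p=0.5 is the worst case (largest variance); z=1.96 corresponds to 95%.
    """
    return math.ceil(z * z * p * (1 - p) / half_width ** 2)


if __name__ == "__main__":
    n_buckets = 10  # assumed number of equal-width buckets
    for margin in (0.1, 0.05):
        per_bucket = ratings_per_bucket(margin)
        print(f"+/-{margin}: {per_bucket} per bucket, "
              f"{per_bucket * n_buckets} in total if scores are spread evenly")
```

With ten buckets, a margin of ±0.1 works out to 97 ratings per bucket, or about 970 in total. That total assumes the scores are spread evenly across buckets, which is unlikely in practice, so N may need to be larger for the sparsest bucket to be usable.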