Once T271799 is implemented, the score returned from Elasticsearch that's used to rank search results should, we think, be a number between 0 and 1 that indicates the probability of an image being good.
We'd like to use this as a confidence score for image matching, so we need to check whether that estimated probability is well calibrated, i.e. whether it matches the proportion of good images we actually observe.
The simplest way to do this is to:
1. gather N new ratings from https://media-search-signal-test.toolforge.org/
2. run searches with the new search profile for all search terms associated with the newly labeled images
3. record the Elasticsearch scores for all the newly labeled images that appear in the search results
4. sort the labeled images from the search results into buckets according to their Elasticsearch scores
5. count the good/bad images in each bucket
6. compare the fraction of good images, (number good)/(number good + number bad), in each bucket with the mid-point of that bucket's score range. If the two are approximately equal, we can use the Elasticsearch score as a confidence score; if they are not, we need to rethink.
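
Steps 4 to 6 could be sketched roughly as below. This is only an illustration, not existing code: the function name `calibration_table`, the `(score, is_good)` input format, and the choice of 10 equal-width buckets are all assumptions, and the scores are assumed to lie in the range 0 to 1.

```python
from collections import defaultdict


def calibration_table(scored_labels, num_buckets=10):
    """Bucket labeled images by score and compare each bucket to its mid-point.

    scored_labels: iterable of (score, is_good) pairs, with score assumed
    to lie in [0, 1] (hypothetical input format, not an existing API).
    Returns a list of (midpoint, good, bad, observed_fraction_good) rows
    for every non-empty bucket, ordered by score.
    """
    counts = defaultdict(lambda: [0, 0])  # bucket index -> [good, bad]
    for score, is_good in scored_labels:
        # Clamp so that a score of exactly 1.0 lands in the last bucket.
        index = min(int(score * num_buckets), num_buckets - 1)
        counts[index][0 if is_good else 1] += 1

    rows = []
    for index in sorted(counts):
        good, bad = counts[index]
        midpoint = (index + 0.5) / num_buckets
        rows.append((midpoint, good, bad, good / (good + bad)))
    return rows


if __name__ == "__main__":
    # Made-up ratings purely to show the output shape.
    sample = [(0.05, False), (0.08, False), (0.55, True),
              (0.52, False), (0.95, True), (1.0, True)]
    for midpoint, good, bad, observed in calibration_table(sample):
        print(f"mid-point {midpoint:.2f}: good={good} bad={bad} observed={observed:.2f}")
```

If the score is well calibrated, the observed fraction in each row should be close to that row's mid-point, within the noise expected for the number of images in the bucket.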
Question: how big should N be?
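
One rough way to approach this, offered as a sketch rather than an answer: treat the good/bad labels in a bucket as independent draws and use the normal approximation to the binomial, under which the 95% margin of error on an observed fraction p from n labels is about 1.96 * sqrt(p(1 - p)/n). The helper name `ratings_per_bucket` and the worst-case choice p = 0.5 are assumptions made for illustration.

```python
import math


def ratings_per_bucket(margin, z=1.96):
    """Labels needed in one bucket for a given margin of error.

    Uses the normal approximation to the binomial with the worst case
    p = 0.5, which maximises p * (1 - p). z = 1.96 corresponds to a
    95% confidence level. Hypothetical helper, for illustration only.
    """
    return math.ceil((z ** 2) * 0.25 / margin ** 2)


if __name__ == "__main__":
    print(ratings_per_bucket(0.10))  # 97
    print(ratings_per_bucket(0.05))  # 385
```

The total N then depends on the number of buckets and on how the scores are distributed: with 10 buckets and scores spread evenly, a margin of 0.1 would need roughly 970 labeled images that appear in the search results, and more if some buckets turn out to be sparsely populated.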