Now that we have a labelled dataset, we have a way of testing whether one search profile performs better or worse than another.
We need a script that re-runs all the searches that were used to gather the labelled data, and calculates various metrics based on the data we have.
Suggested metrics:
- [[ https://en.wikipedia.org/wiki/F-score | F1 score ]]
- [[ http://mvpa.blogspot.com/2015/12/balanced-accuracy-what-and-why.html | Balanced accuracy ]]
We’ll probably need to spend a little while investigating and deciding which metrics to use. Ultimately we can only optimise for a single metric. At the moment the F1 score seems like a good choice, in that it combines both precision and recall, but let’s see what kind of results we get.
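To make that concrete, here is a minimal sketch (not the actual implementation) of how the script could compute both metrics with scikit-learn, assuming the labelled data can be flattened into per-result records of (model score, human relevant/irrelevant judgement); the `THRESHOLD` value and the toy data are placeholder assumptions:

```
# A hedged sketch of the metrics calculation, not the real script.
from sklearn.metrics import balanced_accuracy_score, f1_score

THRESHOLD = 0.5  # placeholder cut-off turning a score into a binary prediction


def evaluate(results):
    """results: iterable of (score, relevant) pairs from the labelled data."""
    y_true = [int(relevant) for _, relevant in results]
    y_pred = [1 if score >= THRESHOLD else 0 for score, _ in results]
    return {
        "f1": f1_score(y_true, y_pred),
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
    }


# Toy data standing in for real (score, human label) pairs per search result.
print(evaluate([(0.91, 1), (0.72, 1), (0.55, 0), (0.30, 0), (0.65, 1)]))
```

Balanced accuracy may matter here precisely because the labelled data is imbalanced (see the note at the end of this ticket); comparing both numbers should tell us whether that imbalance is distorting the picture.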
For example, if the new scoring model gives image X a score of 0.7 for the search term Y, we need to run tests to make sure that around 70% of the images scored 0.7 are actually good matches.
Testing and calibration here will help determine whether we can use the score as a confidence score.
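One way to check this is a reliability table: bin the results by model score and compare each bin's mean predicted score against the observed fraction of good matches. A minimal sketch, again over hypothetical per-result scores and human labels:

```
# A sketch of a calibration check: for a well-calibrated model, results
# scored ~0.7 should be good matches ~70% of the time.
import numpy as np


def calibration_table(scores, labels, n_bins=10):
    """Print mean predicted score vs. observed match rate per score bin."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (scores >= lo) & (scores < hi)
        if mask.sum() == 0:
            continue  # skip empty bins
        print(f"score {lo:.1f}-{hi:.1f}: "
              f"mean predicted {scores[mask].mean():.2f}, "
              f"observed good-match rate {labels[mask].mean():.2f} "
              f"(n={mask.sum()})")


# Hypothetical toy data: model scores and human relevant/irrelevant labels.
calibration_table(
    scores=[0.72, 0.68, 0.71, 0.12, 0.15, 0.93, 0.88],
    labels=[1, 1, 0, 0, 0, 1, 1],
)
```

If we end up wanting a plot rather than a table, scikit-learn's `sklearn.calibration.calibration_curve` performs essentially the same binning.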
Based on the results of testing the model in T271799 (this ticket covers that testing), as well as the testing of the ML model in T271803, we can determine which approach is better, or use one to refine the results of the other.
Note: if we can't get interpretable results this way because the data we already have from the ratings tool is too imbalanced, then we will need to consider creating another manual rating tool, similar to https://media-search-signal-test.toolforge.org/, targeted at getting more balanced data so that we can manually rate search results as needed.