Currently we're collecting ratings for image search results using https://media-search-signal-test.toolforge.org/.
We're scoring with single elasticsearch fields: for each image the tool records its position in the search results and its elasticsearch score, and asks users to rate it as good, bad, or indifferent. The tool therefore gathers results along with the score elasticsearch gave to each individual search ranking signal.
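For concreteness, each recorded rating can be thought of as a record along these lines (a hypothetical shape for illustration; the tool's actual storage format may differ):

```python
# One hypothetical rating record (field names and values are
# illustrative, not the tool's actual schema):
rating = {
    "query": "lighthouse",   # the search term
    "field": "statements",   # elasticsearch field being tested
    "position": 3,           # rank in the search results
    "score": 42.7,           # elasticsearch score for that field
    "rating": "good",        # user rating: good / bad / indifferent
}
```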
The goal is to find a relationship between the elasticsearch score for a ranking component and the likelihood that an image is a good match for the search term. Once we've collected over 1k ratings per elasticsearch field (except for file_text, which only had ~100 non-zero scores for approx 1500 queries), we need to interpret the ratings and see how accurately we can predict whether or not an image is a good result using the position and score for each elasticsearch field. For example, if the score for statements is 70, does that mean the image is probably good? If the score for title is 30, does that mean the image is probably bad?
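One way to answer questions like these is to fit, per field, a simple model of the probability that an image is rated good given its score and position. Below is a minimal sketch assuming the ratings are exported as a ratings.csv of records shaped like the hypothetical one above; logistic regression is just one candidate model here, not a decided approach:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical input: one row per rating, with columns
# query, field, position, score, rating (good/bad/indifferent).
ratings = pd.read_csv("ratings.csv")

for field, group in ratings.groupby("field"):
    # Treat "good" as the positive class; drop "indifferent"
    # ratings (they could alternatively be counted as negatives).
    labelled = group[group["rating"] != "indifferent"]
    X = labelled[["score", "position"]]
    y = (labelled["rating"] == "good").astype(int)
    if y.nunique() < 2:
        continue  # can't fit a classifier on a single class

    model = LogisticRegression().fit(X, y)

    # E.g. estimated P(good) for a score of 70 at position 1.
    p_good = model.predict_proba(
        pd.DataFrame({"score": [70.0], "position": [1]})
    )[0][1]
    print(f"{field}: P(good | score=70, position=1) = {p_good:.2f}")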
See this spreadsheet for a previous attempt to do this with a combination of elasticsearch fields: https://docs.google.com/spreadsheets/d/1vTuMyO7UZZ_r1XexXUN05OfBSw4NlSG6VcetuRmdcjA/edit#gid=1435951734
This ticket should give us a better ranking of images as well as a predictable range of scores, so that if we want to add other search signals (like whether X is the image for a wikidata item) we can combine scores predictably. That range of scores can then serve as a stepping stone towards a confidence score for the accuracy of a search result (though the scores will need testing and calibration before they can be used as a confidence score - see T271801).
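To illustrate why a predictable range makes signals combinable: if each field's raw score is first mapped to a calibrated probability in [0, 1] (e.g. via per-field curves fitted as in the sketch above), then signals can be combined by simple weighting and the result stays in [0, 1], so adding a new signal later just means adding another weighted term. A sketch with placeholder curve parameters and weights (nothing below is fitted or decided):

```python
import math

def calibrated_p_good(score: float, a: float, b: float) -> float:
    """Map a raw elasticsearch score to P(good) via a logistic
    curve. a and b would come from per-field fits like the one
    above; the values used below are placeholders, not results."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))

def combined_score(statements_score: float, title_score: float) -> float:
    # Placeholder per-field curve parameters.
    p_statements = calibrated_p_good(statements_score, a=0.05, b=-2.0)
    p_title = calibrated_p_good(title_score, a=0.08, b=-1.5)
    # A weighted average of probabilities stays in [0, 1], which is
    # what makes the combined score's range predictable.
    return 0.6 * p_statements + 0.4 * p_title

print(combined_score(70.0, 30.0))
```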