[Epic] Interpret image search signal results
Closed, Resolved (Public)

Assigned To

Authored By

	Cparle
	Dec 10 2020, 1:59 PM

Description

Currently we're collecting ratings for image search results using https://media-search-signal-test.toolforge.org/.

We're using single elasticsearch fields for scoring. For each image we have recorded its position in the search results and its elasticsearch score, and have asked users to rate it as good, bad or indifferent. The tool therefore gathers ratings alongside the score that elasticsearch gave to each individual search ranking signal.

The goal is to find a relationship between the elasticsearch score for a ranking component and the likelihood that an image is a good match for the search term. Once we've gotten over 1k ratings per elasticsearch field (except for file_text, which only had ~100 non-zero scores for approx. 1500 queries), we need to interpret the ratings and see how accurately we can predict whether or not an image is a good result using the position and score for each elasticsearch field. For example, if the score for statements is 70, does that mean the image is probably good? If the score for title is 30, does that mean the image is probably bad?
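One way to explore the relationship described above is a single-variable logistic regression per elasticsearch field. The following is only a minimal sketch under stated assumptions, not the analysis that was actually run: the function names (`fit_logistic`, `probability_good`) are hypothetical, the ratings are made-up illustration data rather than values from the tool, and "good" is coded as 1 with "bad" and "indifferent" both coded as 0, which is an assumption.

```python
import math

def fit_logistic(scores, labels, lr=0.1, epochs=5000):
    """Fit P(good | score) = sigmoid(intercept + slope * score) by gradient descent."""
    # Standardise the scores so a single learning rate works for any score range.
    mean = sum(scores) / len(scores)
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores)) or 1.0
    xs = [(s - mean) / std for s in scores]
    a = b = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, labels):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += p - y          # gradient of the log-loss w.r.t. a
            grad_b += (p - y) * x    # gradient of the log-loss w.r.t. b
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    # Convert the coefficients back to the raw elasticsearch score scale.
    slope = b / std
    intercept = a - b * mean / std
    return intercept, slope

def probability_good(score, intercept, slope):
    """Estimated probability that an image with this score is a good match."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * score)))

# Made-up (score, rating) pairs for one field; 1 = good, 0 = bad or indifferent.
ratings = [(5, 0), (12, 0), (18, 0), (25, 1), (30, 0),
           (41, 1), (55, 1), (70, 1), (82, 1), (90, 1)]
scores = [s for s, _ in ratings]
labels = [r for _, r in ratings]

intercept, slope = fit_logistic(scores, labels)
for score in (30, 70):
    print(score, round(probability_good(score, intercept, slope), 2))
```

With real data, the same fit could be repeated per field (and with position as the input instead of score) to answer questions like "is a statements score of 70 probably good?".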

See here for a previous attempt to do this with a combination of elasticsearch fields: https://docs.google.com/spreadsheets/d/1vTuMyO7UZZ_r1XexXUN05OfBSw4NlSG6VcetuRmdcjA/edit#gid=1435951734

This ticket should give us a better ranking of images as well as a predictable range of scores, so that if we want to add other search signals (like whether X is the image for a wikidata item), we can combine scores predictably. This range of scores can then be used as a stepping stone towards developing a confidence score for the accuracy of a search result (though the scores will need testing and calibration before they can be used as a confidence score - see T271801).
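To illustrate what "combining scores predictably" could look like, here is a hedged sketch in which each signal contributes to the log-odds of an image being good and the total is squashed into the range 0 to 1. Everything specific here is an assumption for illustration: the weights, the intercept, the `is_wikidata_image` feature name and the function name `combined_probability` are hypothetical, not fitted or agreed values.

```python
import math

# Hypothetical coefficients; real values would come from fitting on the rated data.
INTERCEPT = -2.0
WEIGHTS = {
    "statement": 0.04,
    "title": 0.02,
    "caption": 0.03,
    "is_wikidata_image": 1.5,  # hypothetical 0/1 signal added later
}

def combined_probability(signal_scores):
    """Combine per-signal scores into one value that always lies between 0 and 1."""
    log_odds = INTERCEPT + sum(
        WEIGHTS[name] * score for name, score in signal_scores.items()
    )
    return 1.0 / (1.0 + math.exp(-log_odds))

base = combined_probability({"statement": 70, "title": 30})
with_extra = combined_probability(
    {"statement": 70, "title": 30, "is_wikidata_image": 1}
)
print(round(base, 3), round(with_extra, 3))
```

Because every signal is added in log-odds space, a new signal shifts the result without ever pushing it outside the 0 to 1 range, which is what makes the output a candidate for later calibration as a confidence score.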

Related Objects

Status	Assigned	Task
Resolved	CBogen	T267674 [Epic] Build Media Matching API for bots/scripts
Resolved	Cparle	T269852 [Epic] Interpret image search signal results
Resolved	Cparle	T271799 [L] Implement new search profile(s) based on image search signal results
Resolved	Cparle	T272710 Investigate whether the probability-of-an-image-being-good score is useful as a confidence score
Resolved	AikoChou	T274225 Multivariate logistic regression on search scores
Resolved	Cparle	T271801 Create mechanism for comparing search profiles using labelled data
Resolved	Cparle	T271803 [Epic] Improve mediasearch by using labelled data to create a model using elasticsearch learning-to-rank
Resolved	Cparle	T271806 Create elasticsearch featureset to get scores for each component in a mediasearch

Event Timeline

Cparle created this task. Dec 10 2020, 1:59 PM

Restricted Application added a subscriber: Aklapper. Dec 10 2020, 1:59 PM

Here's the current number of ratings we have for each search signal:

MariaDB [s54568__fulltextSearchResults]> select component,count(*) from results_by_component where rating is not null group by component;
+----------------+----------+
| component      | count(*) |
+----------------+----------+
| auxiliary_text |     1203 |
| caption        |      650 |
| category       |      800 |
| file_text      |        1 |
| heading        |       82 |
| redirect.title |      820 |
| statement      |      453 |
| suggest        |     1059 |
| text           |      686 |
| title          |     1043 |
+----------------+----------+
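As a quick reading of the counts above against the "over 1k ratings per field" target from the description, the sketch below lists which components have passed the threshold and how far the others are from 1000. The counts are copied from the query output; the variable names and the choice to treat "over 1k" as strictly more than 1000 are assumptions.

```python
# Rating counts per component, copied from the query output above.
counts = {
    "auxiliary_text": 1203,
    "caption": 650,
    "category": 800,
    "file_text": 1,
    "heading": 82,
    "redirect.title": 820,
    "statement": 453,
    "suggest": 1059,
    "text": 686,
    "title": 1043,
}

THRESHOLD = 1000  # "over 1k ratings" from the task description

# Components that already have more than the threshold number of ratings.
ready = sorted(name for name, n in counts.items() if n > THRESHOLD)

# For the rest, how many ratings they are short of 1000.
shortfall = {name: THRESHOLD - n for name, n in counts.items() if n <= THRESHOLD}

print("ready:", ready)
for name, missing in sorted(shortfall.items(), key=lambda item: item[1]):
    print(f"{name}: {missing} short of {THRESHOLD}")
```

Note that the description exempts file_text from the target, so its shortfall is expected.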

CBogen edited projects, added SDAW-MediaSearch (MediaSearch-ReleaseCandidate2), Image-Suggestions, Structured-Data-Backlog (Current Work); removed Structured-Data-Backlog. Dec 10 2020, 3:39 PM

CBogen moved this task from To Do to MediaSearch on the Image-Suggestions board. Dec 10 2020, 3:41 PM

CBogen moved this task from Incoming to Ready for Estimation on the Structured-Data-Backlog (Current Work) board. Dec 16 2020, 5:16 PM

CBogen renamed this task from Interpret image search signal results to [M] Interpret image search signal results. Dec 16 2020, 5:30 PM

CBogen moved this task from Ready for Estimation to Doing on the Structured-Data-Backlog (Current Work) board.

CBogen moved this task from MediaSearch to To Do on the Image-Suggestions board. Dec 16 2020, 7:30 PM

Zbyszko subscribed. Jan 5 2021, 7:42 PM

Cparle claimed this task. Jan 11 2021, 4:35 PM

CBogen updated the task description. Jan 12 2021, 6:34 PM

CBogen moved this task from MediaSearch-ReleaseCandidate2 to MediaSearch-ImageRecs on the SDAW-MediaSearch board. Jan 13 2021, 3:58 PM

CBogen edited projects, added SDAW-MediaSearch; removed SDAW-MediaSearch (MediaSearch-ReleaseCandidate2).

CBogen moved this task from MediaSearch-ImageRecs to MediaSearch-ImageRecs on the SDAW-MediaSearch board. Jan 27 2021, 2:04 PM