Create mechanism for comparing search profiles using labelled data
Closed, Resolved · Public

Description

Now that we have a labelled dataset, we have a way of testing whether one search profile is better or worse than another.

We need a script that runs all the searches that were used to gather the labelled data, and calculates various metrics based on the data that we have.

Suggested metrics:
F1 score
Balanced accuracy

We’ll probably need to spend a little while investigating and deciding which metrics to use. Ultimately we can only optimise for a single metric. At the moment the F1 score seems like a good choice, in that it measures both precision and recall, but let’s see what kind of results we get.
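For illustration only (this is not the ticket's code; the function and variable names are hypothetical), the two suggested metrics could be computed from binary good/bad labels and predictions along these lines:

```
# Hypothetical sketch: F1 score and balanced accuracy from binary relevance
# labels (1 = good match, 0 = bad match) and binary predictions
# (1 = image retrieved / predicted relevant, 0 = not).
def f1_and_balanced_accuracy(labels, predictions):
    tp = sum(1 for l, p in zip(labels, predictions) if l == 1 and p == 1)
    fp = sum(1 for l, p in zip(labels, predictions) if l == 0 and p == 1)
    fn = sum(1 for l, p in zip(labels, predictions) if l == 1 and p == 0)
    tn = sum(1 for l, p in zip(labels, predictions) if l == 0 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # true negative rate

    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    balanced_accuracy = (recall + specificity) / 2
    return f1, balanced_accuracy
```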

For example, if the new scoring model gives us 0.7 for image X when we search for Y, we need to run tests to make sure that around 70% of the images with a score of 0.7 are good matches.

Testing and calibration here will help determine whether we can use the score as a confidence score.
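A rough sketch of what that calibration check could look like, assuming each labelled result comes with the model's score and a human good/bad judgement (the bucketing below is illustrative, not part of the actual tooling):

```
from collections import defaultdict

# Illustrative calibration check: bucket results by model score and report the
# fraction of images humans labelled "good" in each bucket. If the score is
# usable as a confidence score, ~70% of images scored around 0.7 should be good.
def calibration_table(scored_results, bucket_width=0.1):
    buckets = defaultdict(lambda: [0, 0])        # bucket index -> [good, total]
    n_buckets = round(1 / bucket_width)
    for score, is_good in scored_results:        # e.g. (0.73, True)
        bucket = min(int(score / bucket_width), n_buckets - 1)
        buckets[bucket][1] += 1
        buckets[bucket][0] += int(is_good)
    for bucket in sorted(buckets):
        good, total = buckets[bucket]
        lo, hi = bucket * bucket_width, (bucket + 1) * bucket_width
        print(f"score {lo:.1f}-{hi:.1f}: {good}/{total} good ({good / total:.0%})")
```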

Based on the results of testing the model in T271799 (the testing is this ticket), as well as the testing of the ML model in T271803, we can determine which approach is better, or use one to refine the results of the other.

Note: if we can't get interpretable results this way, because the data we already have from the ratings tool is too imbalanced, then we will need to consider creating another manual rating tool similar to https://media-search-signal-test.toolforge.org/, targeted at getting more balanced results, so that we can manually rate search results as needed.

Event Timeline

> Note: if we can't get interpretable results this way then we will need to consider creating a manual rating tool similar to https://media-search-signal-test.toolforge.org/ so we can manually rate search results

@Cparle I'm a bit confused. Isn't this task to rate the results that were determined using the https://media-search-signal-test.toolforge.org/ tool? Why would we need to create another manual rating tool?

> @Cparle I'm a bit confused. Isn't this task to rate the results that were determined using the https://media-search-signal-test.toolforge.org/ tool? Why would we need to create another manual rating tool?

I talked with @Cparle on Slack to answer this, and he said the following. I've updated the description of the ticket as well.

> Say the new scoring model gives us 0.7 for image X when we search for Y. I feel like we'd need to run tests to make sure that around 70% of the images with a score of 0.7 are good. We might be able to do that automatically with the data we have, or we might find the data isn't comprehensive enough. I don't really know at this stage, but it's conceivable that the data we have might be too imbalanced to allow us to draw good conclusions. For example, if all the search results for Martin Luther King that we currently have are classified as "bad", then it's hard to know if a new way of searching has improved things. The data might be fine; we won't really know until we start trying to do automatic tests.

I moved this into "doing" because @matthiasmullie and I have been talking about it, I've been doing some prototypes to figure out what works, and I wanted to make the discussion more public.

The code I've been using is here https://github.com/cormacparle/media-search-signal-test/pull/7

What I've done is run a search for every search term we have labelled data for, then gone through the labelled data we have for that search and worked out a score for each based on how many of the good/bad images appear in the results.

The scores I've worked out are (a rough sketch of the calculations follows this list):
F1 score - the harmonic mean of precision and recall
precision of the top 30 results
a custom score, which is the sum of the reciprocals of the positions of all good results minus the sum of the reciprocals of the positions of all bad results
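For reference only, here is a minimal sketch of those three per-search scores; the real implementation is in the PR linked above, and the names and the handling of unlabelled results below are my own assumptions:

```
# Hypothetical sketch of the three per-search scores.
# `results` is the ranked list of image titles returned by the search;
# `labels` maps title -> True (good match) / False (bad match) from the labelled data.
def score_search(results, labels, top_n=100):
    results = results[:top_n]
    # Only results we have a label for can be scored.
    labelled = [(pos, labels[r]) for pos, r in enumerate(results, start=1) if r in labels]

    retrieved_good = sum(1 for _, good in labelled if good)
    total_good = sum(1 for good in labels.values() if good)

    precision = retrieved_good / len(labelled) if labelled else 0.0
    recall = retrieved_good / total_good if total_good else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

    top30 = [good for pos, good in labelled if pos <= 30]
    precision_at_30 = sum(top30) / len(top30) if top30 else 0.0

    # Custom score: 1/position for every good result minus 1/position for every bad one.
    custom = sum(1 / pos if good else -1 / pos for pos, good in labelled)

    return f1, precision_at_30, custom
```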

Here are some results

| | MediaSearch | Standard search | MediaSearch with statements, captions, suggest boost by 1, all other search signals ignored |
| mean F1 score, first 100 results | 0.392 | 0.391 | 0.416 |
| mean precision, first 30 results | 0.557 | 0.535 | 0.567 |
| custom score, first 100 results | 146 | 144 | 226 |

Thanks for this @Cparle! To make sure I understand - a higher score is better, in all three versions, right?

Can we run a 4th category, which would be "MediaSearch with statements, captions, suggest boost by 1", but not actually ignore all other search signals?

Also, what is the "suggest" input?

> Also, what is the "suggest" input?

"suggest" is a derived field, it's one of the other fields (can't remember which one) split into an array of 2-word "n-grams"

> Can we run a 4th category, which would be "MediaSearch with statements, captions, suggest boost by 1", but not actually ignore all other search signals?

We could, but I'm not sure what it would tell us. Is there a specific question you're trying to answer?

(edit: oh and yes, higher is better)

> (edit: oh and yes, higher is better)

Great - seems clear that MediaSearch is better than Special:Search, but also that there's a clear path to even more improvement.

> Can we run a 4th category, which would be "MediaSearch with statements, captions, suggest boost by 1", but not actually ignore all other search signals?

> We could, but I'm not sure what it would tell us. Is there a specific question you're trying to answer?

I'd like to see what the scores are if we don't actually ignore the worse search inputs but still boost the best inputs. Does the score go down? I'm asking because I don't want to actually remove any fields from what's being searched unless we have a good reason to do so. The community expects data they add to a file to be searchable (even if it's not high in the rankings). If we have a good reason not to make it searchable, we can, but we'd need to communicate that.

Funnily enough, Matthias and I have been talking a lot about this already this afternoon, and the "MediaSearch with statements ...." column will still find things in those fields, it just won't use the fields for ranking the results. So maybe we already have what you want?

> Funnily enough, Matthias and I have been talking a lot about this already this afternoon, and the "MediaSearch with statements ...." column will still find things in those fields, it just won't use the fields for ranking the results. So maybe we already have what you want?

Hmm. So if it finds the things in those fields, but the item it finds doesn't have any other info that *is* used in ranking, where would it show up in the search results? Automatically at the end or...?

Yeah, automatically at the end.

> Yeah, automatically at the end.

Cool, sounds like that's already what I'm looking for then.

> Hmm. So if it finds the things in those fields, but the item it finds doesn't have any other info that *is* used in ranking, where would it show up in the search results? Automatically at the end or...?

That "MediaSearch with ... signals ignored" profile is not something we'd ever consider implementing (it would come close to the original POC, where depicts results always come first: they're quite reliable, but monotonous, etc.).
That column was more of a "let's figure out whether these score aggregates work, what exactly they tell us, and whether they can inform us in a helpful way".

We've further refined the way these metrics are calculated, working around all the ways in which we found the partial data or limited search results were having an undue influence on the scores.
This is now done & ready to compare different search profiles/implementations once we have them (though we could always do with more data if anyone wants to classify more at https://media-search-signal-test.toolforge.org/)
The code lives here: https://github.com/cormacparle/media-search-signal-test
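For orientation only (the authoritative code is in the repo above; the helper names here are hypothetical), metrics like Precision@K and Recall over partially labelled search results can be computed roughly like this, counting only results that have a human label so unlabelled results don't skew the numbers:

```
# Illustrative Precision@K / Recall over partially-labelled search results.
# `results` is the ranked result list; `labels` maps result -> True/False.
def precision_at_k(results, labels, k):
    judged = [labels[r] for r in results[:k] if r in labels]
    # No judged results in the top k -> undefined, shown as "/" in the breakdown below.
    return sum(judged) / len(judged) if judged else None

def recall_at_k(results, labels, k=100):
    total_good = sum(1 for good in labels.values() if good)
    found_good = sum(1 for r in results[:k] if labels.get(r) is True)
    return found_good / total_good if total_good else None
```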


FYI: here's a complete breakdown of what we have so far (numbers are still inexact because of limitations we can't work around, but they should be representative enough for rough comparison purposes)

plain search

F1 Score: 0.67945701357466
Precision@10: 0.77680798004988
Precision@25: 0.73318551367332
Precision@50: 0.70391644908616
Precision@100: 0.67066099729416
Recall: 0.80213675213675

mediasearch (query builder only)

F1 Score: 0.69368421052632
Precision@10: 0.76686390532544
Precision@25: 0.74383164005806
Precision@50: 0.70824847250509
Precision@100: 0.68235730170497
Recall: 0.84487179487179

mediasearch (query builder + rescore)

F1 Score: 0.7004473503097
Precision@10: 0.77661795407098
Precision@25: 0.74642392717815
Precision@50: 0.71475409836066
Precision@100: 0.6880829015544
Recall: 0.86987179487179

statement/descriptions/suggest

F1 Score: 0.67109173766594
Precision@10: 0.80924855491329
Precision@25: 0.75811209439528
Precision@50: 0.72195640616693
Precision@100: 0.68354927365528
Recall: 0.72371794871795

statement

F1 Score: 0.30386266094421
Precision@10: 0.80633802816901
Precision@25: 0.81220657276995
Precision@50: 0.80919931856899
Precision@100: 0.77407407407407
Recall: 0.18910256410256

descriptions

F1 Score: 0.27216225057246
Precision@10: 0.6463700234192
Precision@25: 0.63875205254516
Precision@50: 0.60849598163031
Precision@100: 0.57773851590106
Recall: 0.17777777777778

title

F1 Score: 0.66893704850361
Precision@10: 0.8078335373317
Precision@25: 0.75580464371497
Precision@50: 0.72262367982212
Precision@100: 0.68483883804218
Recall: 0.69252136752137

category

F1 Score: 0.6050321825629
Precision@10: 0.76377952755906
Precision@25: 0.73492286115007
Precision@50: 0.68975069252078
Precision@100: 0.65282083075015
Recall: 0.55235042735043

heading

F1 Score: 0.013391922996443
Precision@10: 0.40740740740741
Precision@25: 0.41666666666667
Precision@50: 0.36619718309859
Precision@100: 0.32631578947368
Recall: 0.0068376068376068

auxiliary_text

F1 Score: 0.63614951356887
Precision@10: 0.72740315638451
Precision@25: 0.71015843429637
Precision@50: 0.6686121919585
Precision@100: 0.63267189663129
Recall: 0.66367521367521

file_text

F1 Score: /
Precision@10: /
Precision@25: /
Precision@50: /
Precision@100: /
Recall: 0

redirect.title

F1 Score: 0.24549810011565
Precision@10: 0.62626262626263
Precision@25: 0.60582010582011
Precision@50: 0.59042553191489
Precision@100: 0.54961832061069
Recall: 0.15876068376068

suggest

F1 Score: 0.63204396078009
Precision@10: 0.77563329312425
Precision@25: 0.72406015037594
Precision@50: 0.69848975188781
Precision@100: 0.65804140127389
Recall: 0.6267094017094

text

F1 Score: 0.34211279702262
Precision@10: 0.64285714285714
Precision@25: 0.61440677966102
Precision@50: 0.5751953125
Precision@100: 0.52798894263994
Recall: 0.25534188034188