[L] Implement new search profile(s) based on image search signal results
Open, Needs TriagePublic

Description

Note: this ticket has been rewritten to reflect new analysis done in March 2021

We need to translate the results of the image search signal analysis into a new Elasticsearch search profile (creating a new profile rather than changing the existing one, for now). Its query builder should compute the probability of an image being good from the results of the logistic regressions, and return that probability as the score.

Implementation

The probability of an image being good, given the Elasticsearch score for each search field, is

1 / ( 1 + exp( -1 * ( ( coefficient_for_field_A * score_for_field_A ) + ( coefficient_for_field_B * score_for_field_B ) + ... + intercept ) ) )

field | coefficient
descriptions | 0.019320230186222098
title | 0.0702949038300864
category | 0.05158078808882278
redirect.title | 0.01060150471482338
statements | 0.11098311564161133

Intercept is -1.1975600089068401
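
As a sanity check on the formula, here is a minimal Python sketch that plugs in the coefficients and intercept above. The function name `probability_good` and the shape of its input (a dict of per-field scores) are illustrative assumptions, not part of any existing code.

```python
import math

# Coefficients and intercept copied from the task description.
COEFFICIENTS = {
    "descriptions": 0.019320230186222098,
    "title": 0.0702949038300864,
    "category": 0.05158078808882278,
    "redirect.title": 0.01060150471482338,
    "statements": 0.11098311564161133,
}
INTERCEPT = -1.1975600089068401


def probability_good(field_scores):
    """Return the logistic-regression probability for one image.

    field_scores maps a field name to its Elasticsearch score
    (hypothetical input shape); fields that did not match can be omitted
    and are treated as a score of 0.
    """
    linear = INTERCEPT + sum(
        coefficient * field_scores.get(field, 0.0)
        for field, coefficient in COEFFICIENTS.items()
    )
    return 1.0 / (1.0 + math.exp(-linear))
```

With no matching fields the result depends only on the intercept and comes out at roughly 0.23; any positive field score pushes it towards 1.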

It's also probably a good idea to give title and auxiliary_text a small non-zero coefficient, just to preserve ordering if those are the only fields that match.

This will need to be implemented using function_score or similar queries in Elasticsearch.
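
One possible shape for such a query, as a rough sketch only: each field gets a match clause boosted by its coefficient, a bool/should sums the boosted scores, and a function_score script applies the logistic function to that sum. The helper name `build_query`, the mapping from signal names to index field names, and the use of plain match clauses are all assumptions for illustration; the real profile may need different clause types per field.

```python
# Coefficients and intercept copied from the task description.
COEFFICIENTS = {
    "descriptions": 0.019320230186222098,
    "title": 0.0702949038300864,
    "category": 0.05158078808882278,
    "redirect.title": 0.01060150471482338,
    "statements": 0.11098311564161133,
}
INTERCEPT = -1.1975600089068401

# Painless script: logistic function over (sum of boosted scores + intercept).
LOGISTIC_SCRIPT = "1 / (1 + Math.exp(-1 * (_score + params.intercept)))"


def build_query(search_term, index_fields):
    """Build a function_score request body as a plain dict.

    index_fields maps a signal name from the coefficient table to the
    index field to search (hypothetical mapping supplied by the caller).
    """
    should = [
        {"match": {index_field: {"query": search_term, "boost": COEFFICIENTS[signal]}}}
        for signal, index_field in index_fields.items()
    ]
    return {
        "query": {
            "function_score": {
                # bool/should adds up the boosted score of every matching clause.
                "query": {"bool": {"should": should, "minimum_should_match": 1}},
                "script_score": {
                    "script": {
                        "source": LOGISTIC_SCRIPT,
                        "params": {"intercept": INTERCEPT},
                    }
                },
                # Use the script result as the final score instead of multiplying.
                "boost_mode": "replace",
            }
        }
    }
```

Putting the coefficients into the clause boosts keeps the script itself trivial, at the cost of relying on the bool query's score summation to form the linear combination.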

Testing

See T271801 for how to test each profile that we construct in this way and decide if it's better or worse

Event Timeline

Note: the formula in the description will generate results between 0 and 1.
IIRC, some rescore functions (wsum_inclinks_pv?) don't multiply the original score but add to it, in which case a score between 0 and 1 would be so low that it'd likely have almost no impact on the final score.
If that is an issue (I don't know whether it would be; it will need to be checked as part of this ticket), we might want to make sure larger scores are returned (e.g. simply multiplying by 100 might be enough? idk, we'd probably have to check the expected range of the rescore functions)
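
To make the concern concrete, here is a tiny sketch with made-up numbers: the probabilities, the additive rescore values and the helper `combine_additive` are all hypothetical, and the real rescore range still needs checking.

```python
def combine_additive(probability, rescore_value, scale=1.0):
    """Final score when a rescore function adds to the (scaled) query score."""
    return probability * scale + rescore_value


# Hypothetical images: A is much more likely to be good, B gets a bigger rescore.
image_a = {"probability": 0.9, "rescore_value": 3.0}
image_b = {"probability": 0.2, "rescore_value": 5.0}

# Unscaled: the additive rescore dominates and B outranks A.
unscaled_a = combine_additive(image_a["probability"], image_a["rescore_value"])
unscaled_b = combine_additive(image_b["probability"], image_b["rescore_value"])

# Scaled by 100: the probability dominates again and A outranks B.
scaled_a = combine_additive(image_a["probability"], image_a["rescore_value"], scale=100)
scaled_b = combine_additive(image_b["probability"], image_b["rescore_value"], scale=100)
```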

CBogen renamed this task from Implement new search profile(s) based on image search signal results to [L] Implement new search profile(s) based on image search signal results. Jan 13 2021, 5:30 PM

Moving this into blocked for the minute, as we now have a more complete dataset that's being analysed atm, and would like that to complete before this is implemented

Thanks @Cparle. Can you link to the blocking ticket?

@CBogen as discussed earlier today with @Cparle, @AikoChou and I are working on analyzing the data. @AikoChou will create a phab task with the description and results of the analysis she is doing (we just talked about it a couple of hours ago, and it's now late evening for her, so we will likely have it tomorrow morning or later today). I hope that works!

@CBogen blocking ticket is T274225

Cparle updated the task description.

Change 666936 had a related patch set uploaded (by Matthias Mullie; owner: Matthias Mullie):
[mediawiki/extensions/WikibaseMediaInfo@master] Move stem/plain boosts into weights config

https://gerrit.wikimedia.org/r/666936

Change 666936 merged by jenkins-bot:
[mediawiki/extensions/WikibaseMediaInfo@master] Move stem/plain boosts into weights config

https://gerrit.wikimedia.org/r/666936

Change 668115 had a related patch set uploaded (by Cparle; owner: Cparle):
[mediawiki/extensions/WikibaseMediaInfo@master] New search profile based on labeled data

https://gerrit.wikimedia.org/r/668115

@matthiasmullie run the scripts in https://github.com/cormacparle/media-search-signal-test with the current profile and the new one and here are the results

control (current profile):
F1 Score | 0.63937201785439
Precision@10 | 0.70054644808743
Precision@25 | 0.66403681788297
Precision@50 | 0.63254956201014
Precision@100 | 0.60326449033977
Recall | 0.89007928005142

regressions:
F1 Score | 0.645911360799
Precision@10 | 0.75289169295478
Precision@25 | 0.70538057742782
Precision@50 | 0.66197843413033
Precision@100 | 0.62099223759703
Recall | 0.88440170940171

F1 Score and Recall can probably be ignored, as we haven't made any changes that add/remove files from the results set - but this demonstrates positive changes in precision at every level
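
For reference, a minimal sketch of how Precision@k can be computed, assuming it means the fraction of images labeled good among the first k returned results. The function name and input shape are illustrative; the scripts in the linked repository may handle edge cases (such as fewer than k results) differently.

```python
def precision_at_k(ranked_labels, k):
    """Fraction of good images among the first k results.

    ranked_labels is a list of booleans in result order, True meaning the
    image was labeled good (hypothetical input shape).
    """
    top = ranked_labels[:k]
    if not top:
        return 0.0
    return sum(1 for is_good in top if is_good) / len(top)
```

Because this only looks at the order of the top k results, it can change even when the overall results set stays the same, which is why it is the metric of interest here.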

Here's the analysis run on March 25 comparing the new profile (with the updated coefficients above) with the existing profile

42b1f48f3587a0a0c4209387fdcfc924f3e9e4f7 with mediasearch_logistic_regression ON

F1 Score      | 0.58886327793279
Precision@10  | 0.82741116751269
Precision@25  | 0.78717201166181
Precision@50  | 0.76232275489534
Precision@100 | 0.72828723920427
Recall        | 0.52396373056995

42b1f48f3587a0a0c4209387fdcfc924f3e9e4f7 with mediasearch_logistic_regression OFF

F1 Score      | 0.5951526032316
Precision@10  | 0.7774343122102
Precision@25  | 0.74845542806708
Precision@50  | 0.72547846889952
Precision@100 | 0.69389473684211
Recall        | 0.57253886010363

Change 668115 merged by jenkins-bot:
[mediawiki/extensions/WikibaseMediaInfo@master] New search profile based on labeled data

https://gerrit.wikimedia.org/r/668115

This search profile has been merged, but is not yet the default. It needs an explicit flag to be triggered, which will allow us to manually verify that nothing unexpected happens once it's in prod.
Moving back to "ready for dev" - all it needs is another patch that replaces the current default config with this one.

Change 681004 had a related patch set uploaded (by Cparle; author: Cparle):

[mediawiki/extensions/WikibaseMediaInfo@master] Make the logistic regression image search default

https://gerrit.wikimedia.org/r/681004

Change 681709 had a related patch set uploaded (by Cparle; author: Cparle):

[mediawiki/extensions/WikibaseMediaInfo@wmf/1.37.0-wmf.1] Make the logistic regression image search default

https://gerrit.wikimedia.org/r/681709

Change 681709 merged by jenkins-bot:

[mediawiki/extensions/WikibaseMediaInfo@wmf/1.37.0-wmf.1] Make the logistic regression image search default

https://gerrit.wikimedia.org/r/681709

Mentioned in SAL (#wikimedia-operations) [2021-04-21T18:42:01Z] <urbanecm@deploy1002> Synchronized php-1.37.0-wmf.1/extensions/WikibaseMediaInfo/: f831d16e42e712832d683233a5b21ad59f7c73b3: Make the logistic regression image search default (T271799) (duration: 00m 58s)