
[XL] Create a new profile for mediasearch that uses the trained LTR model in elasticsearch
Closed, Resolved (Public)

Description

On the cloudelastic elasticsearch replica we now have a learning-to-rank (LTR) model trained using labeled image data we've collected (see https://phabricator.wikimedia.org/T271803#6823874).

The next step is to write a new search profile in WikibaseMediaInfo, activated by a URL parameter, that uses the model. See https://elasticsearch-learning-to-rank.readthedocs.io/en/latest/searching-with-your-model.html for how to search using a model.
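As a rough illustration of what the linked documentation describes, the sketch below builds a search body that rescores the top hits of a base query with an `sltr` query. It is not the WikibaseMediaInfo profile itself: the helper name `build_ltr_rescore`, the field name `text`, the model name `my_model`, and the `keywords` parameter are all placeholders, and the parameter names a real model expects depend on how its feature set was defined.

```python
import json


def build_ltr_rescore(base_query, model_name, query_text, window_size=100):
    """Wrap a base query with a rescore phase that applies a stored LTR model.

    Hypothetical helper for illustration only. The "keywords" parameter name
    is an assumption; it must match the parameters of the model's feature set.
    """
    return {
        "query": base_query,
        "rescore": {
            # Only the top `window_size` hits per shard are rescored.
            "window_size": window_size,
            "query": {
                "rescore_query": {
                    "sltr": {
                        "params": {"keywords": query_text},
                        "model": model_name,
                    }
                }
            },
        },
    }


if __name__ == "__main__":
    # Placeholder base query and model name, not the real mediasearch ones.
    body = build_ltr_rescore({"match": {"text": "cat"}}, "my_model", "cat")
    print(json.dumps(body, indent=2))
```

Rescoring only a window of top hits keeps the model evaluation cheap, at the cost of never promoting documents that the base query ranked below the window.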

We probably ought not to merge the profile for now (because the model is not on production), but we can test it by pointing a local dev environment's search URL at cloudelastic* and running the AnalyzeResults script in https://github.com/cormacparle/media-search-signal-test against the local search API.

Acceptance criteria:

  • a new search profile that uses the trained model
  • a set of results from the AnalyzeResults script (see T271801)

To set up your local environment to search using cloudelastic:

outside vagrant:
ssh -n -L0.0.0.0:9243:cloudelastic1001.wikimedia.org:9243 mwdebug1002.eqiad.wmnet "sleep 36000"

inside vagrant:
sudo sh -c 'echo "10.0.2.2    cloudelastic1001.wikimedia.org" >> /etc/hosts'

in LocalSettings.php:

<?php

// Point CirrusSearch at cloudelastic through the local ssh tunnel
$wgCirrusSearchClusters = [
    'default' => [
        [
            'host' => 'cloudelastic1001.wikimedia.org',
            'port' => 9243,
            'transport' => 'Https',
        ],
    ],
];

// Activate devel options useful for relforge
$wgCirrusSearchDevelOptions = [ 
    'morelike_collect_titles_from_elastic' => true, 
    'ignore_missing_rev' => true, 
];

$wgCirrusSearchIndexBaseName = 'commonswiki';

$wgCirrusSearchNamespaceMappings[ NS_FILE ] = 'file';
// Undo global config that includes commons files in other wikis search results
unset( $wgCirrusSearchExtraIndexes[ NS_FILE ] );

WARNING: do not forget to close the ssh tunnel afterwards … it can cause problems if you provision vagrant.

Event Timeline

CBogen renamed this task from Create a new profile for mediasearch that uses the trained model in elasticsearch to [XL] Create a new profile for mediasearch that uses the trained model in elasticsearch. Feb 24 2021, 5:31 PM

Results from AnalyzeResults.php:

Rescore using the model MediaSearch_20210826_xgboost_v2_34t_4d:
F1 Score | 0.31732441471572
Precision@10 | 0.71940298507463
Precision@25 | 0.65682656826568
Precision@50 | 0.62242268041237
Precision@100 | 0.57727272727273
Recall | 0.22263938426882
Average precision | 0.14204922421619

Empty rescore:
F1 Score | 0.6547619047619
Precision@10 | 0.85442574981712
Precision@25 | 0.83052688756111
Precision@50 | 0.80042372881356
Precision@100 | 0.76150487067518
Recall | 0.60916087854327
Average precision | 0.49789216683898

Surprisingly, the LTR model performs significantly worse than plain old logistic regression on every metric, which I suppose means we'll stick with what we have.
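For readers comparing the two result sets, here is a minimal sketch of how these metrics are conventionally defined over a ranked list of binary relevance labels. It uses the textbook definitions and may not match AnalyzeResults.php exactly (for example in how results are aggregated across queries); the function names and sample labels are made up for illustration.

```python
def precision_at_k(ranked_labels, k):
    """Fraction of the top-k results that are relevant (label 1)."""
    return sum(ranked_labels[:k]) / k


def recall(ranked_labels, total_relevant):
    """Fraction of all known relevant items that were returned."""
    return sum(ranked_labels) / total_relevant if total_relevant else 0.0


def f1_score(precision, recall_value):
    """Harmonic mean of precision and recall."""
    denom = precision + recall_value
    return 2 * precision * recall_value / denom if denom else 0.0


def average_precision(ranked_labels, total_relevant):
    """Mean of precision at each rank where a relevant item appears,
    normalised by the total number of relevant items."""
    hits = 0
    running = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            running += hits / rank
    return running / total_relevant if total_relevant else 0.0


if __name__ == "__main__":
    # Made-up example: 5 results returned, 4 relevant images exist in total.
    labels = [1, 0, 1, 1, 0]
    p = sum(labels) / len(labels)
    r = recall(labels, 4)
    print(precision_at_k(labels, 3), r, f1_score(p, r), average_precision(labels, 4))
```

Note that the F1 score depends on precision over the whole result set, which is not one of the Precision@k rows listed above.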

@CBogen @matthiasmullie @SWakiyama just pinging you to make sure you're aware of this before I close the ticket

Change 716238 had a related patch set uploaded (by Cparle; author: Cparle):

[mediawiki/extensions/CirrusSearch@master] Allow LTR model params to be set publicly

https://gerrit.wikimedia.org/r/716238

Change 716238 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Allow LTR model params to be set publicly

https://gerrit.wikimedia.org/r/716238

Change 716434 had a related patch set uploaded (by Cparle; author: Cparle):

[mediawiki/extensions/WikibaseMediaInfo@master] Add profile to rescore query based on an LTR model

https://gerrit.wikimedia.org/r/716434

Cparle renamed this task from [XL] Create a new profile for mediasearch that uses the trained model in elasticsearch to [XL] Create a new profile for mediasearch that uses the trained LTR model in elasticsearch. Sep 9 2021, 11:18 AM

Change 716434 merged by jenkins-bot:

[mediawiki/extensions/WikibaseMediaInfo@master] Add profile to rescore query based on an LTR model

https://gerrit.wikimedia.org/r/716434

Not moving forward with this for now. The code has been merged and reverted, so it's in the commit log.