
[Epic] Improve mediasearch by using labelled data to create a model using elasticsearch learning-to-rank
Closed, ResolvedPublic

Description

See https://elasticsearch-learning-to-rank.readthedocs.io/en/latest/core-concepts.html

The basic steps here are

  1. create an elasticsearch featureset that calculates scores for the different search signals in a mediasearch query (T271806)
  2. run every query for which we have "good" data, and gather the scores into a format that can be used to train a model
  3. train a model using the gathered scores (see https://elasticsearch-learning-to-rank.readthedocs.io/en/latest/training-models.html) - probably this will need to be done by the search team
  4. add a new profile to mediasearch that uses the trained model in a rescore query (see T274670), and test it (see T271801)
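For step 4, a rescore request using the LTR plugin's `sltr` query could look roughly like the sketch below. The model name is the one mentioned later in this task; the base query, the `text` field, the `query_string` parameter name and the window size are assumptions made for illustration, not the actual MediaSearch profile.

```python
import json


def build_rescore_body(search_terms, model_name, window_size=100):
    """Build a search body that rescores the top hits with a stored LTR model.

    The base query and the parameter name `query_string` are hypothetical;
    the real featureset may expect different parameter names.
    """
    return {
        # Hypothetical first-pass query that retrieves candidates.
        "query": {"match": {"text": search_terms}},
        "rescore": {
            # Only the top `window_size` hits per shard are rescored.
            "window_size": window_size,
            "query": {
                "rescore_query": {
                    "sltr": {
                        # Parameters passed to the featureset templates.
                        "params": {"query_string": search_terms},
                        "model": model_name,
                    }
                }
            },
        },
    }


body = build_rescore_body("red panda", "MediaSearch_20210127_xgboost_v1_20t_4d")
print(json.dumps(body, indent=2))
```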

Event Timeline

CBogen renamed this task from Improve mediasearch by using labelled data to create a model using elasticsearch learning-to-rank to [Epic] Improve mediasearch by using labelled data to create a model using elasticsearch learning-to-rank. Jan 13 2021, 6:07 PM
CBogen added a project: Epic.

I trained a model using the dataset provided by @Cparle. It has 8k observations and 15 features from the LTR featureset named MediaSearch_20210127, which is uploaded to the cloudelastic server.

Labels are: -1 (bad), 0 (meh), 1 (OK).
The idea is to train a binary classifier and use its prediction probability as the score of the model.
I used xgboost as its format is supported by the LTR plugin.
I transformed the labels to binary classes [0, 1], treating meh labels as 0.
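As a minimal sketch of that label transformation: the mapping below assumes that bad (-1) is grouped with meh (0) in class 0 and only OK (1) goes to class 1. The function name is made up for illustration and is not taken from the linked repository.

```python
def binarize_label(label):
    """Map a three-way relevance label to a binary class.

    Assumption: -1 (bad) and 0 (meh) both become 0, and 1 (OK) becomes 1.
    """
    if label not in (-1, 0, 1):
        raise ValueError(f"unexpected label: {label!r}")
    return 1 if label == 1 else 0


labels = [-1, 0, 1, 1, 0]
binary_labels = [binarize_label(label) for label in labels]
print(binary_labels)
```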
Best params

{
    'booster': 'gbtree',
    'eta': 0.1,
    'eval_metric': 'logloss',
    'max_depth': 4,
    'num_boost_round': 20,
    'objective': 'binary:logistic'
}
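One detail when reusing these values with xgboost's native Python API: `num_boost_round` is an argument of `xgb.train` rather than a booster parameter, so it has to be separated from the rest. The sketch below only does that split; the commented training call assumes xgboost is installed and that a `DMatrix` named `dtrain` (hypothetical) has been built from the gathered scores. It is not necessarily how the linked repository does it.

```python
best_params = {
    "booster": "gbtree",
    "eta": 0.1,
    "eval_metric": "logloss",
    "max_depth": 4,
    "num_boost_round": 20,
    "objective": "binary:logistic",
}


def split_train_args(all_params):
    """Separate booster parameters from the number of boosting rounds."""
    params = dict(all_params)  # copy so the input dict is left untouched
    num_boost_round = params.pop("num_boost_round")
    return params, num_boost_round


params, num_boost_round = split_train_args(best_params)

# With xgboost available, training and dumping the trees might look like:
#   import xgboost as xgb
#   booster = xgb.train(params, dtrain, num_boost_round=num_boost_round)
#   trees = booster.get_dump(dump_format="json")
```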

Feature importance:

myplot.png (480×640 px, 32 KB)

  • weight is how many times the feature is used as a split
  • gain is how much the model improved by adding a split on this feature

2 features out of 15 are completely useless:

  • match_category_plain: this feature queries the same field (category) as the match_category feature.
  • match_suggest_plain: the suggest.plain field does not exist

and should be removed or fixed.

The generated model has been uploaded to the ltr plugin on cloudelastic with the name MediaSearch_20210127_xgboost_v1_20t_4d.
Code: https://github.com/nomoa/mediasearch_tuning.

8k observations sounds like too few, even if it's unclear to me how to prove it. But if this binary prediction approach is proven to work, it might be interesting to investigate using click data to get more observations.

CBogen added a subscriber: CBogen.

Resolving because all subtasks are complete.

@dcausse it turns out this model can't be used with languages for which we don't have stemming, so I prepared another ranklib file that uses the .plain fields (where available)

The featureset is on the cloudelastic server and it's called MediaSearch_20210826. New ranklib file attached - could you generate another model for us please?


Sure,
here is the model: https://raw.githubusercontent.com/nomoa/mediasearch_tuning/main/MediaSearch_20210826_xgboost_v1_34t_4d.json
and the feature importance graph: https://raw.githubusercontent.com/nomoa/mediasearch_tuning/main/feature_importance_2021826_xgboost_map_v1_34t_4d.png