experiment with different label scales for MLR
Closed, Resolved, Public

Description

Currently we use a scale of 0-9 for our MLR labels. The NDCG calculation is essentially 2^r / log2(i+1), where r = relevance label and i = result position. Because the gain is exponential in r, high labels significantly outweigh lower labels. We should experiment with different scales, perhaps 0-4 or 0-3, which are much more typical in the literature. We could also experiment with fractional labels (3.25, etc.), although these are not directly supported by xgboost.
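To make the effect of the label scale concrete, here is a minimal sketch of NDCG@k. It assumes the common exponential gain 2^r - 1 (the description above gives it as "essentially" 2^r) and a hypothetical linear `rescale` helper that maps 0-9 labels onto 0-3; neither is taken from the actual MLR pipeline.

```python
import math


def dcg_at_k(labels, k=10):
    """DCG with exponential gain: sum of (2^r - 1) / log2(i + 1), positions i from 1."""
    return sum(
        (2 ** r - 1) / math.log2(i + 1)
        for i, r in enumerate(labels[:k], start=1)
    )


def ndcg_at_k(labels, k=10):
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0


def rescale(labels, old_max=9, new_max=3):
    """Hypothetical helper: linearly map integer labels from 0..old_max to 0..new_max."""
    return [round(r * new_max / old_max) for r in labels]


# The same result ordering, scored under both label scales.
ranking_0_9 = [3, 9, 0, 6]
ranking_0_3 = rescale(ranking_0_9)

print("ndcg on 0-9 scale:", ndcg_at_k(ranking_0_9))
print("ndcg on 0-3 scale:", ndcg_at_k(ranking_0_3))
```

With this gain, a label of 9 contributes 511 before the position discount, while a label of 3 contributes 7, so on the 0-9 scale a single misplaced top-labeled result dominates the score.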

Event Timeline

debt triaged this task as Medium priority. Jan 4 2018, 6:04 PM
debt moved this task from needs triage to Up Next on the Discovery-Search board.
debt subscribed.

We'll add this to the tests and experiment with it (just a line or two).

I realized the two posts above had serious leakage between the test and train sets, so here we go again. This time we split the training data into 3 folds. For each fold we train one model with labels on a 0-9 scale (ls10) and one model with labels on a 0-3 scale (ls4), then we evaluate both models against both scales.

train scale    test scale    cv-test-ndcg@10
ls10           ls4           0.8480
ls10           ls10          0.8436
ls4            ls4           0.8458
ls4            ls10          0.8500
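The fold setup behind these numbers could be sketched roughly as follows. The comment does not say how the earlier leakage arose; one common cause in learning to rank is rows from the same query landing in both train and test, so this sketch assumes splitting by query. The row layout `(query_id, features, label)` and the function names are hypothetical, and model training and scoring are left out.

```python
from itertools import product


def assign_folds(query_ids, n_folds=3):
    """Map each distinct query id to one fold so a query never spans folds."""
    unique = sorted(set(query_ids))
    return {q: i % n_folds for i, q in enumerate(unique)}


def fold_splits(rows, n_folds=3):
    """Yield (train_rows, test_rows) per fold; rows are (query_id, features, label)."""
    fold_of = assign_folds([r[0] for r in rows], n_folds)
    for held_out in range(n_folds):
        train = [r for r in rows if fold_of[r[0]] != held_out]
        test = [r for r in rows if fold_of[r[0]] == held_out]
        yield train, test


# Label scales named as in the results table: ls10 is 0-9, ls4 is 0-3.
SCALES = ("ls10", "ls4")

# Every (train scale, test scale) pairing that gets evaluated on each fold.
GRID = list(product(SCALES, SCALES))
```

Each of the four rows in the table corresponds to one pairing in `GRID`, with the score taken over the held-out folds.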

The overall premise doesn't seem to hold: changing the scale from 0-9 to 0-3 doesn't do anything to explain why we are so high in the NDCG range. The differences here are small (all four scores fall between 0.8436 and 0.8500). I'll pull the models over to relforge and run a quick eval to see if things change much, but I'm not expecting a large change.

EBjune subscribed.

Looks like this experiment has run its course, nothing further to do.