Currently we use a scale of 0-9 for our mlr labels. The NDCG calculation is essentially (2^r - 1) / log2(i + 1), where r = relevance label and i = result position. Because the gain is exponential in r, high labels significantly outweigh lower ones. We should experiment with different scales, perhaps 0-4 or 0-3, which are much more typical in the literature. We could also experiment with fractional labels (3.25, etc.), although these are not directly supported by xgboost.
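To make the gain/discount tradeoff concrete, here is a minimal sketch of the DCG/NDCG@k computation described above (a hypothetical helper, not our production code), using the standard (2^r - 1) gain with 1-based positions:

```python
import math

def dcg_at_k(labels, k=10):
    """DCG@k for relevance labels in ranked order (1-based positions)."""
    return sum((2 ** r - 1) / math.log2(i + 1)
               for i, r in enumerate(labels[:k], start=1))

def ndcg_at_k(labels, k=10):
    """Normalize by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0
```

On the 0-9 scale a single label-9 result contributes a gain of 2^9 - 1 = 511, versus 7 for a label-3 result, which is why the top labels dominate the metric.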
Description
Event Timeline
This comment was removed by EBernhardson.
This comment was removed by EBernhardson.
I realized the two posts above had serious leakage between the test and train sets, so here we go again. This time we split the training data into 3 folds. For each fold we train one model with labels on a 0-9 scale and one with labels on a 0-3 scale, then evaluate both models against both scales:
| train scale | test scale | cv-test-ndcg@10 |
|-------------|------------|-----------------|
| ls10        | ls4        | 0.8480          |
| ls10        | ls10       | 0.8436          |
| ls4         | ls4        | 0.8458          |
| ls4         | ls10       | 0.8500          |
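For reference, the ls10 → ls4 remapping behind the table above can be sketched as follows. The exact binning isn't recorded here; proportional rounding is an assumption for illustration:

```python
def rescale_label(r, src_max=9, dst_max=3):
    """Collapse a label on a 0..src_max scale onto 0..dst_max
    by proportional rounding (assumed binning, for illustration)."""
    return round(r * dst_max / src_max)
```

For example, labels 0-9 map to 0, 1, 2, 3 roughly in proportion, so the four ls4 levels each absorb two or three of the original ten levels.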
The overall premise doesn't seem to hold: changing the scale from 0-9 to 0-3 does nothing to explain why our NDCG numbers are so high. The variances here are small. I'll pull the models over to relforge and run a quick eval to see if things change much, but I'm not expecting a large difference.