Currently we use a scale of 0-9 for our mlr labels. The NDCG calculation is essentially (2^r - 1) / log2(i + 1), where r = relevance label and i = result position. Because the gain is exponential in r, high labels significantly outweigh lower ones. We should experiment with different scales, perhaps 0-4 or 0-3, which are much more typical in the literature. We could also experiment with fractional labels (3.25, etc.), although these are not directly supported by xgboost.
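To make the gain/discount tradeoff concrete, here is a minimal sketch of the DCG/NDCG@k computation described above (a hypothetical helper, not our production code), using the standard (2^r - 1) gain with 1-based positions:

```python
import math

def dcg_at_k(labels, k=10):
    """DCG@k for relevance labels in ranked order (1-based positions)."""
    return sum((2 ** r - 1) / math.log2(i + 1)
               for i, r in enumerate(labels[:k], start=1))

def ndcg_at_k(labels, k=10):
    """Normalize by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0
```

On the 0-9 scale a single label-9 result contributes a gain of 2^9 - 1 = 511, versus 7 for a label-3 result, which is why the top labels dominate the metric.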
Description
Event Timeline
This comment was removed by EBernhardson.
This comment was removed by EBernhardson.
I realized the two posts above had serious leakage between the test and train sets, so here we go again. This time we split the training data into 3 folds. For each fold we train one model with labels on a 0-9 scale and one with labels on a 0-3 scale, then evaluate both models against both scales:
| train scale | test scale | cv-test-ndcg@10 |
|-------------|------------|-----------------|
| ls10        | ls4        | 0.8480          |
| ls10        | ls10       | 0.8436          |
| ls4         | ls4        | 0.8458          |
| ls4         | ls10       | 0.8500          |
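For reference, the ls10 → ls4 remapping behind the table above can be sketched as follows. The exact binning isn't recorded here; proportional rounding is an assumption for illustration:

```python
def rescale_label(r, src_max=9, dst_max=3):
    """Collapse a label on a 0..src_max scale onto 0..dst_max
    by proportional rounding (assumed binning, for illustration)."""
    return round(r * dst_max / src_max)
```

For example, labels 0-9 map to 0, 1, 2, 3 roughly in proportion, so the four ls4 levels each absorb two or three of the original ten levels.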
The overall premise doesn't seem to hold: changing the scale from 0-9 to 0-3 does nothing to explain why our NDCG numbers are so high. The variances here are small. I'll pull the models over to relforge and run a quick eval to see if things change much, but I'm not expecting a large difference.