
Investigate applying PDGD to our datasets
Closed, Resolved · Public

Description

PDGD looks to be a weighting method applied to standard pairwise LTR. It limits which pairs are compared and weights them based on user clickthroughs. We won't be able to put this in front of real users, but we can evaluate how it works when clicks are provided by a click model, similar to how the linked paper evaluates things.

https://arxiv.org/pdf/1901.10262.pdf
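For reference, a minimal sketch of the update as I read it from the paper: rankings are sampled from a Plackett-Luce distribution over the model scores, preference pairs are inferred from clicks, and each pair's pairwise gradient is weighted by a debiasing ratio based on the probability of the swapped ranking. The pair-inference rule and function names here are my paraphrase of the paper, not code we run anywhere.

```
import numpy as np

def plackett_luce_prob(scores, ranking):
    """Probability of displaying `ranking` under a Plackett-Luce model of `scores`."""
    p = 1.0
    remaining = list(ranking)
    for d in ranking:
        logits = scores[remaining]
        logits = logits - logits.max()  # numerical stability
        probs = np.exp(logits) / np.exp(logits).sum()
        p *= probs[remaining.index(d)]
        remaining.remove(d)
    return p

def pdgd_update(theta, X, ranking, clicks, lr=0.01):
    """One PDGD step for a linear model `theta` over feature matrix `X`.

    ranking: displayed order (row indices into X); clicks: same-length bools.
    """
    scores = X @ theta
    grad = np.zeros_like(theta)
    for i, d_k in enumerate(ranking):
        if not clicks[i]:
            continue
        # Inferred preferences (per the paper): a clicked doc is preferred
        # over every unclicked doc above it and the unclicked doc directly below it.
        losers = [ranking[j] for j in range(i) if not clicks[j]]
        if i + 1 < len(ranking) and not clicks[i + 1]:
            losers.append(ranking[i + 1])
        for d_l in losers:
            # Debiasing weight: probability ratio of the ranking with d_k/d_l swapped.
            swapped = list(ranking)
            a, b = swapped.index(d_k), swapped.index(d_l)
            swapped[a], swapped[b] = swapped[b], swapped[a]
            p_r = plackett_luce_prob(scores, list(ranking))
            p_s = plackett_luce_prob(scores, swapped)
            rho = p_s / (p_r + p_s)
            # Gradient of the pairwise preference probability
            # P(d_k > d_l) = sigmoid(s_k - s_l), for a linear scorer.
            sig = 1.0 / (1.0 + np.exp(scores[d_l] - scores[d_k]))
            grad += rho * sig * (1.0 - sig) * (X[d_k] - X[d_l])
    return theta + lr * grad  # gradient ascent on inferred preferences
```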

Event Timeline

Should have created this task before; I did most of the investigation last week. I applied this technique, with both a linear model and a neural model, to one of our frwiki folds. This contains ~140k queries to train against and another ~40k for evaluation. Our standard xgboost models trained over this dataset achieve ndcg@10 of 0.888. For reference, the labels used here were estimated with a DBN click model and required seeing the same query at least 5 times within the prior 12 weeks.
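The query-frequency filter amounts to something like the following. The schema, column names, and input path here are illustrative, not our actual tables:

```
import pandas as pd

# Hypothetical schema: one row per (query, session, day) observation.
obs = pd.read_parquet("frwiki_click_sessions.parquet")  # path is illustrative

cutoff = obs["day"].max() - pd.Timedelta(weeks=12)
recent = obs[obs["day"] >= cutoff]

# Keep only queries observed in at least 5 sessions, matching the
# DBN labeling requirement described above.
counts = recent.groupby("query")["session_id"].nunique()
eligible = counts[counts >= 5].index
training = recent[recent["query"].isin(eligible)]
```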

I should caveat that PDGD is intended as an online LTR solution, where it is trained not from labels but from users interacting with results produced by the algorithm. In this case we simulate clicks based on the DBN relevance predictions: results with relevance below 0.3 are never clicked, and the remaining labels generate clicks at a rate equal to their relevance. All simulated users see the complete result list; no early stopping was implemented.
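Concretely, the simulation amounts to something like this (a sketch; `simulate_clicks` is illustrative, not the actual harness):

```
import numpy as np

rng = np.random.default_rng(0)

def simulate_clicks(relevance):
    """Simulate one user over a full result list (no early stopping).

    relevance: DBN-estimated relevance in [0, 1] per displayed result.
    Results under 0.3 are never clicked; the rest are clicked with
    probability equal to their relevance.
    """
    rel = np.asarray(relevance)
    p = np.where(rel < 0.3, 0.0, rel)
    return rng.random(len(rel)) < p
```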

The end result is a linear model trained to ndcg@10 of 0.81. This is plausible, but the results [1] are not particularly good. I also tried training a neural model, both with a single 64-node hidden layer and with hidden layers of 64, 32, and 16 nodes, testing both sigmoid and relu activations. The different versions train to approximately the same ndcg@10, although this is hard to judge because the score is not particularly stable, showing deviations of ±0.05 over 1k queries presented. At the end of the day I didn't get much better than 0.82 out of the neural models; ndcg@1 and ndcg@3 show similar relationships. For comparison, training the same shape of neural model on the same input data in tensorflow_ranking with its approximate-ndcg loss achieves ndcg@10 of ~0.86, and as noted above, xgboost achieves 0.888 with 100 trees.
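For concreteness, the network shapes tested look roughly like this in Keras (the feature count and function name are illustrative, not exactly what was run):

```
import tensorflow as tf

def scorer(hidden=(64, 32, 16), activation="relu", n_features=50):
    """Pointwise scoring network of the shapes described above.

    n_features is illustrative; ours comes from the same feature set
    used for the xgboost models.
    """
    inp = tf.keras.layers.Input(shape=(n_features,))
    x = inp
    for units in hidden:
        x = tf.keras.layers.Dense(units, activation=activation)(x)
    score = tf.keras.layers.Dense(1)(x)  # unbounded relevance score
    return tf.keras.Model(inp, score)

single = scorer(hidden=(64,), activation="sigmoid")
deep = scorer(hidden=(64, 32, 16), activation="relu")
```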

At the end of the day, this is interesting but I don't think it's something we can move forward with yet. In quick tests it didn't seem to do very well on busy wikis, and a quick test with ~1k queries (a size at which we've deployed reasonable xgboost ranking models) produced much worse end results. We could likely get better performance out of PDGD by using a click model better suited to this task and driving it directly, rather than the current approach of simulating clicks from the relevance output of another click model. On wikis with enough data I expect spending more time on data sampling, deciding what to train off of, could also help, but I don't think it would move things a significant amount. Our overall goal is to find ways to deploy improved result ranking to wikis with less data available, and the current approach to PDGD isn't looking like it will need less data.

[1] https://fr.wikipedia.org/wiki/?search=~kennedy&cirrusMLRModel=frwiki_20191029_pdgd_v4&ns0=1