
Investigate applying PDGD to our datasets
Closed, Resolved · Public

Description

PDGD looks to be a weighting method applied to standard pairwise LTR. It limits which pairs are compared and weights them based on user clickthroughs. We won't be able to put this in front of real users, but we can evaluate how it works when clicks are provided by a click model, similar to how the linked paper evaluates things.

https://arxiv.org/pdf/1901.10262.pdf
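For reference, a minimal sketch of the update as I read it from the paper: rankings are sampled from a Plackett-Luce distribution over the model scores, preference pairs are inferred from clicks, and each pair's pairwise gradient is weighted by a debiasing ratio based on the probability of the swapped ranking. The pair-inference rule and function names here are my paraphrase of the paper, not code we run anywhere.

```
import numpy as np

def plackett_luce_prob(scores, ranking):
    """Probability of displaying `ranking` under a Plackett-Luce model of `scores`."""
    p = 1.0
    remaining = list(ranking)
    for d in ranking:
        logits = scores[remaining]
        logits = logits - logits.max()  # numerical stability
        probs = np.exp(logits) / np.exp(logits).sum()
        p *= probs[remaining.index(d)]
        remaining.remove(d)
    return p

def pdgd_update(theta, X, ranking, clicks, lr=0.01):
    """One PDGD step for a linear model `theta` over feature matrix `X`.

    ranking: displayed order (row indices into X); clicks: same-length bools.
    """
    scores = X @ theta
    grad = np.zeros_like(theta)
    for i, d_k in enumerate(ranking):
        if not clicks[i]:
            continue
        # Inferred preferences (per the paper): a clicked doc is preferred
        # over every unclicked doc above it and the unclicked doc directly below it.
        losers = [ranking[j] for j in range(i) if not clicks[j]]
        if i + 1 < len(ranking) and not clicks[i + 1]:
            losers.append(ranking[i + 1])
        for d_l in losers:
            # Debiasing weight: probability ratio of the ranking with d_k/d_l swapped.
            swapped = list(ranking)
            a, b = swapped.index(d_k), swapped.index(d_l)
            swapped[a], swapped[b] = swapped[b], swapped[a]
            p_r = plackett_luce_prob(scores, list(ranking))
            p_s = plackett_luce_prob(scores, swapped)
            rho = p_s / (p_r + p_s)
            # Gradient of the pairwise preference probability
            # P(d_k > d_l) = sigmoid(s_k - s_l), for a linear scorer.
            sig = 1.0 / (1.0 + np.exp(scores[d_l] - scores[d_k]))
            grad += rho * sig * (1.0 - sig) * (X[d_k] - X[d_l])
    return theta + lr * grad  # gradient ascent on inferred preferences
```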

Event Timeline

Should have created this task before; I did most of the investigation last week. I applied this technique, with both a linear model and a neural model, to one of our frwiki folds. This contains ~140k queries to train against and another ~40k for evaluation. Our standard xgboost models trained over this dataset achieve ndcg@10 of 0.888. For reference, the labels used here were estimated with a DBN click model and required seeing the same query at least 5 times within the prior 12 weeks.
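The query-frequency filter amounts to something like the following. The schema, column names, and input path here are illustrative, not our actual tables:

```
import pandas as pd

# Hypothetical schema: one row per (query, session, day) observation.
obs = pd.read_parquet("frwiki_click_sessions.parquet")  # path is illustrative

cutoff = obs["day"].max() - pd.Timedelta(weeks=12)
recent = obs[obs["day"] >= cutoff]

# Keep only queries observed in at least 5 sessions, matching the
# DBN labeling requirement described above.
counts = recent.groupby("query")["session_id"].nunique()
eligible = counts[counts >= 5].index
training = recent[recent["query"].isin(eligible)]
```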

I should caveat that PDGD is intended as an online LTR solution, where it is trained not from labels but from users interacting with results produced by the algorithm. In this case we simulate clicks based on the DBN relevance predictions: results with relevance below 0.3 are never clicked, and the remaining labels generate clicks at a rate equal to their relevance. All simulated users see the complete result list; no early stopping was implemented.
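Concretely, the simulation amounts to something like this (a sketch; `simulate_clicks` is illustrative, not the actual harness):

```
import numpy as np

rng = np.random.default_rng(0)

def simulate_clicks(relevance):
    """Simulate one user over a full result list (no early stopping).

    relevance: DBN-estimated relevance in [0, 1] per displayed result.
    Results under 0.3 are never clicked; the rest are clicked with
    probability equal to their relevance.
    """
    rel = np.asarray(relevance)
    p = np.where(rel < 0.3, 0.0, rel)
    return rng.random(len(rel)) < p
```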

The end result is a linear model trained to ndcg@10 of 0.81. This is plausible, but the results [1] are not particularly good. I also tried training a neural model, both with a single 64-node hidden layer and with hidden layers of 64, 32, and 16 nodes, testing both sigmoid and relu activations. The different versions train to approximately the same ndcg@10, although this is hard to judge because the score is not particularly stable, showing deviations of ±0.05 over 1k queries presented. At the end of the day I didn't get much better than 0.82 out of the neural models; ndcg@1 and ndcg@3 show similar relationships. For comparison, training the same shape of neural model on the same input data in tensorflow_ranking with its approximate-ndcg loss achieves ndcg@10 of ~0.86, and as noted above, xgboost achieves 0.888 with 100 trees.
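For concreteness, the network shapes tested look roughly like this in Keras (the feature count and function name are illustrative, not exactly what was run):

```
import tensorflow as tf

def scorer(hidden=(64, 32, 16), activation="relu", n_features=50):
    """Pointwise scoring network of the shapes described above.

    n_features is illustrative; ours comes from the same feature set
    used for the xgboost models.
    """
    inp = tf.keras.layers.Input(shape=(n_features,))
    x = inp
    for units in hidden:
        x = tf.keras.layers.Dense(units, activation=activation)(x)
    score = tf.keras.layers.Dense(1)(x)  # unbounded relevance score
    return tf.keras.Model(inp, score)

single = scorer(hidden=(64,), activation="sigmoid")
deep = scorer(hidden=(64, 32, 16), activation="relu")
```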

At the end of the day, this is interesting but I don't think it's something we can move forward with yet. In quick tests it didn't seem to do very well on busy wikis, and a quick test with ~1k queries (a size at which we've deployed reasonable xgboost ranking models) produced much worse end results. We could likely get better performance out of PDGD by using a click model better suited to this task and driving it directly, rather than the current approach of simulating clicks from the relevance output of another click model. On wikis with enough data I expect spending more time on data sampling, deciding what to train off of, could also help, but I don't think it would move things a significant amount. Our overall goal is to find ways to deploy improved result ranking to wikis with less data available, and the current approach to PDGD isn't looking like it will need less data.

[1] https://fr.wikipedia.org/wiki/?search=~kennedy&cirrusMLRModel=frwiki_20191029_pdgd_v4&ns0=1