Maniphest T205494

Add autocomplete evaluation via MRR to relforge
Closed, ResolvedPublic
Authored By

	EBernhardson
	Sep 25 2018, 9:40 PM

Description

Relforge already supports running queries and calculating metrics over the results; augment it to support autocomplete as well.

Requirements:

Support autocomplete type for runSearch.php
Source queries for calculating MRR from eventlogging
Pre-process queries to determine all the prefixes we need to search, and search each one only once. This might get a bit ugly: everything so far assumes the queries are a flat list and don't share any information.
Implement MRR metric against autocomplete results
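
The prefix pre-processing requirement can be sketched roughly as follows. This is a minimal illustration, not the relforge implementation: it assumes each eventlogging row has already been reduced to a `(prefix, clicked_id)` pair, and `group_by_prefix`, `search_each_prefix_once` and the `search` callable are hypothetical names.

```python
from collections import defaultdict


def group_by_prefix(click_log):
    """Collect the clicked ids seen for each distinct prefix.

    click_log is assumed to be an iterable of (prefix, clicked_id) pairs.
    """
    clicks_by_prefix = defaultdict(list)
    for prefix, clicked_id in click_log:
        clicks_by_prefix[prefix].append(clicked_id)
    return dict(clicks_by_prefix)


def search_each_prefix_once(clicks_by_prefix, search):
    """Run the (hypothetical) search callable exactly once per distinct prefix."""
    return {prefix: search(prefix) for prefix in clicks_by_prefix}
```

The result sets can then be shared by every click that used the same prefix, instead of treating the click log as a flat list of independent queries.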

Details

	Subject	Repo	Branch	Lines +/-
	Use a test/train split for MPC MRR	wikimedia/discovery/relevanceForge	master	+40 -11
	Add autocomplete to engineScore.py	wikimedia/discovery/relevanceForge	master	+250 -14

Related Objects

Status	Assigned	Task
Resolved	debt	T193701 Explore using user clicks data to tune Wikidata search parameters
Resolved	debt	T205111 [EPIC] Transform wikidata autocomplete click logs into a useful dataset
Resolved	EBernhardson	T205494 Add autocomplete evaluation via MRR to relforge

Event Timeline

EBernhardson created this task.Sep 25 2018, 9:40 PM

I have a working example of this now, will need some cleanup and test cases written before uploading to gerrit.

Initial results, data sourced from Sept 20 through 26.

entity type	language	# clicks	MRR	MPC MRR
item	en	5k	0.19	0.41
item	en	(all) 26k	0.18	0.37
item	de	(all) 4k	0.19	0.39
property	en	5k	0.26	0.43
property	de	(all) 1.1k	0.32	0.46

I'm not sure these numbers are particularly meaningful on their own; the intended use is to measure the relative difference between two search engines. The current data suggests MPC would be significantly better than prod, but this is a completely unfair comparison. MPC here is massively overfit because it only considers results that were clicked. So while prod is sifting through tens of millions of items, MPC is considering at most a few thousand that received clickthroughs.

Thinking about this, MPC MRR might be considered the optimal ordering if we don't have the ability to use additional context per query. If my line of thinking is correct, MPC should give the maximum possible MRR on this click dataset if we have to return the same result set for the same prefix every time. MPC could be significantly improved on if short prefixes could vary their results based on some sort of context clues.
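
A minimal sketch of the ordering being described, assuming MPC means "most popular completion" and that the click log is a list of `(prefix, clicked_id)` pairs; `build_mpc_rankings` is a hypothetical name, not a function from relforge:

```python
from collections import Counter, defaultdict


def build_mpc_rankings(click_log):
    """For each prefix, rank clicked ids by how often they were clicked.

    Only ids that were actually clicked for a prefix can appear in its
    ranking, which is why this is overfit when scored on the same log.
    """
    counts = defaultdict(Counter)
    for prefix, clicked_id in click_log:
        counts[prefix][clicked_id] += 1
    return {
        prefix: [item for item, _ in counter.most_common()]
        for prefix, counter in counts.items()
    }
```

Because the ranking for a prefix is fixed, sorting by click count puts the most frequently clicked id at rank 1, the next at rank 2, and so on, which is the arrangement that maximizes the sum of reciprocal ranks for that prefix.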

Generally, prefix results can change because they depend on entity weights, which can change over time. But if we fix the entity weights, then the same prefix should always return the same results, at least in the current setup.

After re-reviewing my code this morning I found a bug where all of the positions were off by one (so the first position scored 1/2 instead of 1/1). Re-running gives much higher numbers, but in relative terms things are pretty similar:

entity type	language	# clicks	MRR	MPC MRR
item	en	5k	0.35	0.77
item	de	(all) 4k	0.35	0.73
property	en	5k	0.46	0.81
property	de	(all) 1.1k	0.58	0.89

It makes sense that the MPC MRR is quite high, particularly for small datasets like property/de: it's significantly overfit, and the result sets for most prefixes are probably quite small.
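
For reference, a small sketch of MRR with 1-based positions, which is the convention the corrected numbers rely on. The function names and the `(prefix, clicked_id)` log shape are assumptions for illustration, not the code in the patch:

```python
def reciprocal_rank(results, clicked_id):
    """Return 1/position of clicked_id in results, or 0.0 if it is absent.

    Positions start at 1, so a click on the first result scores 1/1.
    Starting the count at 2 would reproduce the off-by-one scores (1/2).
    """
    for position, result in enumerate(results, start=1):
        if result == clicked_id:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(click_log, results_by_prefix):
    """Average the reciprocal rank over every (prefix, clicked_id) pair."""
    if not click_log:
        return 0.0
    total = sum(
        reciprocal_rank(results_by_prefix.get(prefix, []), clicked_id)
        for prefix, clicked_id in click_log
    )
    return total / len(click_log)
```

With a single prefix whose results are `["Q1", "Q2"]` and one click on each, the scores are 1/1 and 1/2, giving an MRR of 0.75.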

Change 463800 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[wikimedia/discovery/relevanceForge@master] Add autocomplete to engineScore.py

https://gerrit.wikimedia.org/r/463800

gerritbot added a project: Patch-For-Review.Oct 1 2018, 5:19 PM

EBernhardson moved this task from Incoming to Needs review on the Discovery-Search (Current work) board.Oct 1 2018, 5:21 PM

Are you evaluating MPC MRR based on re-ordering results "optimally" and scoring, which is indeed overfitted, or are you sorting results based on some data and evaluating on other data? My guess is that it would still do very well, because the most popular thing is going to be popular, but it could also be strongly overfitting on a longer tail that boosts the score a little here and a little there. It could also give a big boost to unique queries, which would always score perfectly, since there is no room for disagreement—and that long tail could make a big difference.

Change 463800 merged by jenkins-bot:
[wikimedia/discovery/relevanceForge@master] Add autocomplete to engineScore.py

https://gerrit.wikimedia.org/r/463800

EBernhardson moved this task from Needs review to Needs Reporting on the Discovery-Search (Current work) board.Oct 10 2018, 10:03 PM

In T205494#4646547, @TJones wrote:

Are you evaluating MPC MRR based on re-ordering results "optimally" and scoring, which is indeed overfitted, or are you sorting results based on some data and evaluating on other data? My guess is that it would still do very well, because the most popular thing is going to be popular, but it could also be strongly overfitting on a longer tail that boosts the score a little here and a little there. It could also give a big boost to unique queries, which would always score perfectly, since there is no room for disagreement—and that long tail could make a big difference.

MPC MRR is based on the optimal sort over the whole set, rather than splitting into a set to fit on and a set to evaluate over. I suppose nothing prevents that, but my assumption was that our current datasets for wikidata aren't large/diverse enough to support splitting, as many results would likely only be on one side. It should be pretty easy to split though, I'll parametrize the metric so we can set a random test/train split.
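
One way such a split could look, as a rough sketch rather than the merged patch: shuffle the click log with a fixed seed, fit the MPC ordering on the train portion, and score only the held-out clicks. All names here are hypothetical, and the split is by click rows, which may differ from how relforge partitions the data.

```python
import random
from collections import Counter, defaultdict


def split_clicks(click_log, train_fraction, seed=0):
    """Shuffle a copy of the log and cut it into (train, test) lists."""
    shuffled = list(click_log)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def fit_mpc(train):
    """Rank clicked ids per prefix by click count, using train rows only."""
    counts = defaultdict(Counter)
    for prefix, clicked_id in train:
        counts[prefix][clicked_id] += 1
    return {p: [item for item, _ in c.most_common()] for p, c in counts.items()}


def mrr(test, rankings):
    """Mean reciprocal rank (1-based); unseen prefixes or ids score 0."""
    total = 0.0
    for prefix, clicked_id in test:
        ranked = rankings.get(prefix, [])
        if clicked_id in ranked:
            total += 1.0 / (ranked.index(clicked_id) + 1)
    return total / len(test)


def mpc_mrr_with_split(click_log, train_fraction, seed=0):
    """Fit MPC on the train split and report MRR on the test split."""
    train, test = split_clicks(click_log, train_fraction, seed)
    if not test:
        raise ValueError("train_fraction leaves no test clicks to score")
    return mrr(test, fit_mpc(train))
```

Under this scheme a prefix or result that appears only in the test split scores 0, so unique queries no longer get a perfect score by construction.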

Change 465787 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[wikimedia/discovery/relevanceForge@master] Use a test/train split for MPC MRR

https://gerrit.wikimedia.org/r/465787

In a quick test:

entity type	language	% train	test queries	MPC MRR
item	de	20%	1714	0.33
item	de	50%	1227	0.41
item	de	80%	598	0.44
item	de	99%	38	0.51
item	de	no split	2008	0.73

Thanks for the data!

The 99% split might be a bit unfair—it's always possible that with only 38 test queries, you got a particularly hard set or a particularly easy set. The 80/20 split seems the most likely to be predictive at this volume—though a 90/10 or 95/5 split could be, too, if the number of queries being tested was large enough.

Moving this back to 'needs review' as it looks like a bit more work needs to be done? Maybe? :)

Change 465787 merged by jenkins-bot:
[wikimedia/discovery/relevanceForge@master] Use a test/train split for MPC MRR

https://gerrit.wikimedia.org/r/465787

EBernhardson moved this task from Needs review to Needs Reporting on the Discovery-Search (Current work) board.Oct 29 2018, 4:48 PM

debt closed this task as Resolved.Nov 2 2018, 10:08 PM
