Add autocomplete evaluation via MRR to relforge
Closed, Resolved, Public

Description

Relforge has various support for running queries and calculating metrics over the results. Augment it to support autocomplete as well.

Requirements:

  • Support autocomplete type for runSearch.php
  • Source queries for calculating MRR from eventlogging
  • Pre-process queries to determine all the prefixes we need to search, and only search each one once. This might get a bit ugly: everything prior assumes the queries are a flat list and don't share any information.
  • Implement MRR metric against autocomplete results
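
A minimal sketch of the prefix pre-processing step, in Python. This is not the relforge code: `expand_prefixes`, `search_each_prefix_once`, and the `search` callable are hypothetical names, and it assumes each sourced query should be expanded into every leading substring.

```python
from collections import defaultdict


def expand_prefixes(queries):
    """Map each distinct prefix to the indices of the queries that need it."""
    prefix_to_queries = defaultdict(list)
    for idx, query in enumerate(queries):
        # Assumption: every leading substring of a query counts as a prefix.
        for end in range(1, len(query) + 1):
            prefix_to_queries[query[:end]].append(idx)
    return dict(prefix_to_queries)


def search_each_prefix_once(queries, search):
    """Call `search` (hypothetical: prefix -> ranked results) once per prefix."""
    return {prefix: search(prefix) for prefix in expand_prefixes(queries)}
```

With the queries "ber", "berlin", and "be", this issues six searches (one per distinct prefix) instead of eleven, and the index lists record which queries share each prefix.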

I have a working example of this now; it will need some cleanup and test cases written before uploading to Gerrit.

Initial results, data sourced from Sept 20 through 26.

entity type  language  # clicks    MRR   MPC MRR
item         en        5k          0.19  0.41
item         en        (all) 26k   0.18  0.37
item         de        (all) 4k    0.19  0.39
property     en        5k          0.26  0.43
property     de        (all) 1.1k  0.32  0.46

I'm not sure these numbers are particularly meaningful on their own; the intended use is to measure the relative difference between two search engines. The current data suggests MPC would be significantly better than prod, but that is a completely unfair comparison: MPC here is massively overfit because it only considers results that were clicked. So while prod is sifting through tens of millions of items, MPC is considering at most a few thousand that received clickthroughs.

Thinking about this, MPC MRR might be considered the optimal ordering if we don't have the ability to use additional context per query. If my line of thinking is correct, MPC should be the maximum possible MRR on this click dataset if we have to return the same result set for the same prefix every time. MPC could be significantly improved on if short prefixes could vary their results based on some sort of context clues.
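
To make the reasoning concrete, here is a rough sketch of MRR and of an MPC ordering built from click counts. It assumes the click data is a list of `(prefix, clicked_id)` pairs; the function names and the `Q1`/`Q2` ids are made up for illustration and are not taken from the relforge patch.

```python
from collections import Counter, defaultdict


def reciprocal_rank(results, clicked):
    """1/position of the clicked result (positions start at 1), 0 if absent."""
    for position, result in enumerate(results, start=1):
        if result == clicked:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(clicks, results_for_prefix):
    """Average reciprocal rank over (prefix, clicked_id) pairs."""
    if not clicks:
        return 0.0
    total = sum(reciprocal_rank(results_for_prefix(p), c) for p, c in clicks)
    return total / len(clicks)


def build_mpc(clicks):
    """Per prefix, order the clicked ids by how often they were clicked."""
    counts = defaultdict(Counter)
    for prefix, clicked in clicks:
        counts[prefix][clicked] += 1
    return {p: [item for item, _ in c.most_common()] for p, c in counts.items()}


# Made-up click log: "be" was clicked twice on Q1 and once on Q2.
clicks = [("be", "Q1"), ("be", "Q1"), ("be", "Q2"), ("b", "Q1")]
mpc = build_mpc(clicks)
mpc_mrr = mean_reciprocal_rank(clicks, lambda p: mpc.get(p, []))
```

Putting the most-clicked id first gives the largest share of clicks the largest reciprocal rank, which is why no other fixed per-prefix ordering can score higher on the same clicks. Swapping Q1 and Q2 for "be" in this toy log lowers the score.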

Generally, the results returned for a prefix can change because they depend on entity weights, which can change over time. But if we fix the entity weights, then the same prefix should always return the same result, at least in the current setup.

EBernhardson added a comment. Edited Oct 1 2018, 5:09 PM

After re-reviewing my code this morning, I found a bug where all of the scores were off by one (so the first position scored 1/2 instead of 1/1). Re-running gives much higher numbers, but in relative terms things are pretty similar.

entity type  language  # clicks    MRR   MPC MRR
item         en        5k          0.35  0.77
item         de        (all) 4k    0.35  0.73
property     en        5k          0.46  0.81
property     de        (all) 1.1k  0.58  0.89
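
The effect of that kind of off-by-one can be sketched as follows. The actual buggy line is not shown in this task, so `reciprocal_rank_off_by_one` is only a hypothetical reconstruction of a bug that scores the first position as 1/2.

```python
def reciprocal_rank_off_by_one(index):
    """Hypothetical buggy form: `index` is 0-based, but 2 is added, not 1."""
    return 1.0 / (index + 2)


def reciprocal_rank(index):
    """Corrected form: a 0-based index of 0 scores 1/1."""
    return 1.0 / (index + 1)


# Compare both forms for the first three result positions.
for index in range(3):
    buggy = reciprocal_rank_off_by_one(index)
    fixed = reciprocal_rank(index)
    print(f"position {index + 1}: buggy={buggy:.3f} fixed={fixed:.3f}")
```

For a click at 1-based position r the corrected score is (r + 1)/r times the buggy one, so a corrected MRR lands between one and two times the buggy value. That matches the jump from 0.19 to 0.35 and from 0.41 to 0.77 for item/en.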

It makes sense that, particularly for small datasets like property/de, the MPC MRR should be quite high: it's significantly overfit and the result sets for most prefixes are probably quite small.

Change 463800 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[wikimedia/discovery/relevanceForge@master] Add autocomplete to engineScore.py

https://gerrit.wikimedia.org/r/463800

TJones added a comment. Oct 5 2018, 8:04 PM

Are you evaluating MPC MRR based on re-ordering results "optimally" and scoring, which is indeed overfitted, or are you sorting results based on some data and evaluating on other data? My guess is that it would still do very well, because the most popular thing is going to be popular, but it could also be strongly overfitting on a longer tail that boosts the score a little here and a little there. It could also give a big boost to unique queries, which would always score perfectly, since there is no room for disagreement—and that long tail could make a big difference.

Change 463800 merged by jenkins-bot:
[wikimedia/discovery/relevanceForge@master] Add autocomplete to engineScore.py

https://gerrit.wikimedia.org/r/463800

MPC MRR is based on the optimal sort over the whole set, rather than splitting into a set to fit on and a set to evaluate over. I suppose nothing prevents that, but my assumption was that our current datasets for wikidata aren't large/diverse enough to support splitting, as many results would likely only be on one side. It should be pretty easy to split though, I'll parametrize the metric so we can set a random test/train split.
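
A sketch of what a parametrized split could look like, assuming the split is made at random over individual `(prefix, clicked_id)` click records. `mpc_mrr_with_split`, `train_fraction`, and `seed` are illustrative names, not the ones used in the patch.

```python
import random
from collections import Counter, defaultdict


def mpc_mrr_with_split(clicks, train_fraction, seed=0):
    """Fit MPC on a random share of the clicks and score MRR on the rest."""
    shuffled = list(clicks)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    train, test = shuffled[:cut], shuffled[cut:]

    # Count clicks per prefix on the training side only.
    counts = defaultdict(Counter)
    for prefix, clicked in train:
        counts[prefix][clicked] += 1

    if not test:
        return 0.0
    total = 0.0
    for prefix, clicked in test:
        ranking = [item for item, _ in counts.get(prefix, Counter()).most_common()]
        # A prefix or result seen only on the test side scores 0.
        if clicked in ranking:
            total += 1.0 / (ranking.index(clicked) + 1)
    return total / len(test)
```

In this sketch a query that occurs only once can never score: it lands on one side of the split, so the perfect score it gets without a split disappears.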

Change 465787 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[wikimedia/discovery/relevanceForge@master] Use a test/train split for MPC MRR

https://gerrit.wikimedia.org/r/465787

In a quick test:

entity type  language  % train   test queries  MPC MRR
item         de        20%       1714          0.33
item         de        50%       1227          0.41
item         de        80%       598           0.44
item         de        99%       38            0.51
item         de        no split  2008          0.73

Thanks for the data!

The 99% split might be a bit unfair—it's always possible that with only 38 test queries, you got a particularly hard set or a particularly easy set. The 80/20 split seems the most likely to be predictive at this volume—though a 90/10 or 95/5 split could be, too, if the number of queries being tested was large enough.
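
One way to gauge how much a small test set can swing is a bootstrap over the per-query reciprocal ranks. This is only a sketch: `bootstrap_mrr_interval` is a hypothetical helper, and the scores passed to it would have to come from an actual run.

```python
import random


def bootstrap_mrr_interval(reciprocal_ranks, resamples=1000, seed=0):
    """Approximate 95% percentile interval for the mean of the given scores."""
    rng = random.Random(seed)
    n = len(reciprocal_ranks)
    # Resample the scores with replacement and record each resampled mean.
    means = sorted(
        sum(rng.choices(reciprocal_ranks, k=n)) / n for _ in range(resamples)
    )
    return means[int(0.025 * resamples)], means[int(0.975 * resamples) - 1]
```

For scores with the same spread, the interval from 38 test queries should come out several times wider than the one from 1714, which is the concern with the 99% split.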

debt added a subscriber: debt.

Moving this back to 'needs review', as it looks like a bit more work needs to be done? Maybe? :)

Change 465787 merged by jenkins-bot:
[wikimedia/discovery/relevanceForge@master] Use a test/train split for MPC MRR

https://gerrit.wikimedia.org/r/465787

debt closed this task as Resolved. Nov 2 2018, 10:08 PM