
Quick RelForge analysis of impact of initial LTR model
Closed, Resolved, Public

Description

After setting up a new LTR enwiki snapshot ( http://en-wp-ltr-0617-relforge.wmflabs.org ), I thought it would be interesting to see how many queries have their results significantly changed by LTR.

Event Timeline

TL;DR: Most queries will get different results from LTR; lots of queries are crappy and don't have any good results. LTR tends to give those negative scores.

Using RelForge, I ran a set of 7K queries from April 2017 against the 10-feature LTR-enabled enwiki instance in Labs ( http://en-wp-ltr-0617-relforge.wmflabs.org, built from a June 2017 enwiki snapshot), and the same 7K queries with the enwiki production config against the same instance.
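(For anyone curious about the mechanics: RelForge drives the query replay itself, but the comparison is roughly the shape of the sketch below. The endpoint URLs and helper names are placeholders, not actual RelForge code; it just assumes both configs are reachable through the standard MediaWiki search API.)

```python
import requests

# Placeholder endpoints; the real runs hit en-wp-ltr-0617-relforge.wmflabs.org
# with two different search configs, driven by RelForge itself.
PROD_CONFIG_API = "https://prod-config.example.wmflabs.org/w/api.php"
LTR_CONFIG_API = "https://ltr-config.example.wmflabs.org/w/api.php"


def top_results(api_url, query, limit=20):
    """Fetch the top `limit` search result titles via the MediaWiki search API."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "format": "json",
        "formatversion": 2,
    }
    resp = requests.get(api_url, params=params, timeout=30)
    resp.raise_for_status()
    return [hit["title"] for hit in resp.json()["query"]["search"]]


def compare_query(query, n=20):
    """Return (prod_top_n, ltr_top_n) result lists for one query."""
    return (top_results(PROD_CONFIG_API, query, n),
            top_results(LTR_CONFIG_API, query, n))
```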

One very long query with several asterisks was interpreted as a regular expression and failed.

As expected, the zero results rate (22.0%) and poorly performing (< 3 results) percentage (27.9%) stayed the same between the two runs.

The overall impact of the LTR model is huge. Roughly 30% of top results differ, and 65-70% of top 3, top 5, and top 20 results differ*, with 8-10 percentage points fewer changes if you ignore re-ordering:

Top 1 Result Differs: 30.0%

Top 3 Sorted Results Differ: 65.7%
Top 3 Unsorted Results Differ: 57.0%

Top 5 Sorted Results Differ: 71.0%
Top 5 Unsorted Results Differ: 63.1%

Top 20 Sorted Results Differ: 71.6%
Top 20 Unsorted Results Differ: 61.2%

*For those not familiar with RelForge, "Top 5 Sorted Results Differ" is the percentage of queries where the top 5 results are not identical. It could be two results swapped, or all the results are completely different. I have more detailed stats on how many results differ per query if anyone wants to dig into them.
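For anyone who wants to recompute the sorted/unsorted distinction from raw result lists, here's a minimal sketch of the metric as I understand it (RelForge computes this itself; the function names below are just for illustration):

```python
def top_n_differs(prod, ltr, n, order_sensitive=True):
    """True if the top-n results of the two rankings differ.

    order_sensitive=True is "Sorted Results Differ" (any reordering counts);
    order_sensitive=False is "Unsorted" (only membership changes count).
    """
    a, b = prod[:n], ltr[:n]
    return a != b if order_sensitive else set(a) != set(b)


def differ_rate(results_by_query, n, order_sensitive=True):
    """Percentage of queries whose top-n results differ between two runs.

    results_by_query maps each query to a (prod_results, ltr_results) pair
    of ranked lists of page titles or IDs.
    """
    differing = sum(top_n_differs(prod, ltr, n, order_sensitive)
                    for prod, ltr in results_by_query.values())
    return 100.0 * differing / len(results_by_query)
```

Under that definition, "Top 5 Unsorted Results Differ" only counts a query when the two top-5 sets have different members, which is why it comes out lower than the sorted number.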

Note that 22% ZRR and 71% differing top-5 results account for 93% of queries, so only 7% of queries get the exact same top 5 results. I'm curious how many of those queries have a very small number of results, but RelForge doesn't give up that info readily.

I reviewed RelForge’s random selection of 20 queries where the top results differ. My subjective groupings:

  • 10 queries didn’t have a good result in either the prod or LTR config. These all had negative scores for the top result in LTR.
  • 1 query similarly had no good result, and got a very low (rather than negative) score in LTR.
  • 2 queries got better results from LTR. These also had negative scores.
  • 2 queries had results that only differed by shuffling the top 3 to 5 results.
  • 3 queries got worse results from LTR, and LTR had low scores for the top result (< 1.5).
  • 1 query had worse results from LTR.
  • 1 query whose intent I couldn't figure out, so I gave up on it.

I wonder if we can get some value out of the score of the top LTR result. If it’s really negative, maybe we go ahead and show the Did You Mean results, if there are any. Maybe we don’t show results that score below a certain threshold if they aren’t in the top n for some value of n.
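To make that concrete, here's a rough sketch of what such a heuristic could look like. The thresholds, names, and structure are all hypothetical; nothing like this exists in CirrusSearch, it's just the idea above in code form:

```python
# Hypothetical thresholds; real values would have to be tuned against
# labeled data as the LTR models improve.
SUGGEST_THRESHOLD = 0.0   # top LTR score below this => lean on "Did you mean"
DROP_THRESHOLD = 1.5      # hide results scoring below this, outside the top n
TOP_N_ALWAYS_SHOWN = 3


def filter_results(scored_results, dym_suggestion=None):
    """scored_results: list of (page, ltr_score) pairs, already ranked by LTR.

    Returns (results_to_show, show_did_you_mean).
    """
    if not scored_results:
        return [], dym_suggestion is not None

    top_score = scored_results[0][1]
    # If even the best result scores very low, surface the "Did you mean"
    # suggestion (if there is one) instead of pretending the results are good.
    show_dym = dym_suggestion is not None and top_score < SUGGEST_THRESHOLD

    kept = [(page, score)
            for rank, (page, score) in enumerate(scored_results)
            if rank < TOP_N_ALWAYS_SHOWN or score >= DROP_THRESHOLD]
    return kept, show_dym
```

The "always keep the top n" clause is just so a blunt threshold never blanks out a page of results entirely; picking the actual numbers would need labeled data.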

Of course, crappy query ⇒ neg score doesn't mean that neg score ⇒ crappy query. We’d have to see if that's true, and if these patterns hold up as we improve the LTR models, but it's interesting to think about.

Not sure what the next steps are. I discussed the results with Erik, David, and Daniel earlier today.

It's not really unexpected: prod and LTR rank with completely different algorithms, so it's no surprise that they disagree a lot on the specifics. A/B tests and letting users try it out in Labs, once we have a more mature model, are definitely necessary.

Sounds like some upcoming A/B tests are needed! :)