
Quick RelForge analysis of impact of initial LTR model
Closed, Resolved, Public

Description

After setting up a new LTR enwiki snapshot ( http://en-wp-ltr-0617-relforge.wmflabs.org ), I thought it would be interesting to see how many queries have their results significantly changed by LTR.

Event Timeline

TL;DR: Most queries will get different results from LTR; lots of queries are crappy and don't have any good results. LTR tends to give those negative scores.

Using RelForge, I ran a set of 7K queries from April 2017 against the 10-feature LTR-enabled enwiki instance in Labs ( http://en-wp-ltr-0617-relforge.wmflabs.org, built from a June 2017 enwiki snapshot), and the same 7K queries with the enwiki production config against the same instance.
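(For anyone curious about the mechanics: RelForge drives the query replay itself, but the comparison is roughly the shape of the sketch below. The endpoint URLs and helper names are placeholders, not actual RelForge code; it just assumes both configs are reachable through the standard MediaWiki search API.)

```python
import requests

# Placeholder endpoints; the real runs hit en-wp-ltr-0617-relforge.wmflabs.org
# with two different search configs, driven by RelForge itself.
PROD_CONFIG_API = "https://prod-config.example.wmflabs.org/w/api.php"
LTR_CONFIG_API = "https://ltr-config.example.wmflabs.org/w/api.php"


def top_results(api_url, query, limit=20):
    """Fetch the top `limit` search result titles via the MediaWiki search API."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "format": "json",
        "formatversion": 2,
    }
    resp = requests.get(api_url, params=params, timeout=30)
    resp.raise_for_status()
    return [hit["title"] for hit in resp.json()["query"]["search"]]


def compare_query(query, n=20):
    """Return (prod_top_n, ltr_top_n) result lists for one query."""
    return (top_results(PROD_CONFIG_API, query, n),
            top_results(LTR_CONFIG_API, query, n))
```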

One very long query with several asterisks was interpreted as a regular expression and failed.

As expected, the zero results rate (22.0%) and poorly performing (< 3 results) percentage (27.9%) stayed the same between the two runs.

The overall impact of the LTR model is huge. Roughly 30% of top results differ, and 65-70% of top 3, top 5, and top 20 results differ*, with 8-10 percentage points fewer changes if you ignore re-ordering:

Top 1 Result Differs: 30.0%

Top 3 Sorted Results Differ: 65.7%
Top 3 Unsorted Results Differ: 57.0%

Top 5 Sorted Results Differ: 71.0%
Top 5 Unsorted Results Differ: 63.1%

Top 20 Sorted Results Differ: 71.6%
Top 20 Unsorted Results Differ: 61.2%

*For those not familiar with RelForge, "Top 5 Sorted Results Differ" is the percentage of queries where the top 5 results are not identical. It could be two results swapped, or all the results are completely different. I have more detailed stats on how many results differ per query if anyone wants to dig into them.
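For anyone who wants to recompute the sorted/unsorted distinction from raw result lists, here's a minimal sketch of the metric as I understand it (RelForge computes this itself; the function names below are just for illustration):

```python
def top_n_differs(prod, ltr, n, order_sensitive=True):
    """True if the top-n results of the two rankings differ.

    order_sensitive=True is "Sorted Results Differ" (any reordering counts);
    order_sensitive=False is "Unsorted" (only membership changes count).
    """
    a, b = prod[:n], ltr[:n]
    return a != b if order_sensitive else set(a) != set(b)


def differ_rate(results_by_query, n, order_sensitive=True):
    """Percentage of queries whose top-n results differ between two runs.

    results_by_query maps each query to a (prod_results, ltr_results) pair
    of ranked lists of page titles or IDs.
    """
    differing = sum(top_n_differs(prod, ltr, n, order_sensitive)
                    for prod, ltr in results_by_query.values())
    return 100.0 * differing / len(results_by_query)
```

Under that definition, "Top 5 Unsorted Results Differ" only counts a query when the two top-5 sets have different members, which is why it comes out lower than the sorted number.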

Note that 22% ZRR and 71% differing top-5 results account for 93% of queries, so only 7% of queries get the exact same top 5 results. I'm curious how many of those queries have a very small number of results, but RelForge doesn't give up that info readily.

I reviewed RelForge’s random selection of 20 queries where the top results differ. My subjective groupings:

  • 10 queries didn’t have a good result in either the prod or LTR config. These all had negative scores for the top result in LTR.
  • 1 query similarly had no good result, and got a very low (rather than negative) score in LTR.
  • 2 queries got better results from LTR. These also had negative scores.
  • 2 queries had results that only differed by shuffling the top 3 to 5 results.
  • 3 queries got worse results from LTR, and LTR had low scores for the top result (< 1.5).
  • 1 query had worse results from LTR.
  • 1 query whose intent I couldn't figure out, so I gave up on it.

I wonder if we can get some value out of the score of the top LTR result. If it’s really negative, maybe we go ahead and show the Did You Mean results, if there are any. Maybe we don’t show results that score below a certain threshold if they aren’t in the top n for some value of n.
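To make that concrete, here's a rough sketch of what such a heuristic could look like. The thresholds, names, and structure are all hypothetical; nothing like this exists in CirrusSearch, it's just the idea above in code form:

```python
# Hypothetical thresholds; real values would have to be tuned against
# labeled data as the LTR models improve.
SUGGEST_THRESHOLD = 0.0   # top LTR score below this => lean on "Did you mean"
DROP_THRESHOLD = 1.5      # hide results scoring below this, outside the top n
TOP_N_ALWAYS_SHOWN = 3


def filter_results(scored_results, dym_suggestion=None):
    """scored_results: list of (page, ltr_score) pairs, already ranked by LTR.

    Returns (results_to_show, show_did_you_mean).
    """
    if not scored_results:
        return [], dym_suggestion is not None

    top_score = scored_results[0][1]
    # If even the best result scores very low, surface the "Did you mean"
    # suggestion (if there is one) instead of pretending the results are good.
    show_dym = dym_suggestion is not None and top_score < SUGGEST_THRESHOLD

    kept = [(page, score)
            for rank, (page, score) in enumerate(scored_results)
            if rank < TOP_N_ALWAYS_SHOWN or score >= DROP_THRESHOLD]
    return kept, show_dym
```

The "always keep the top n" clause is just so a blunt threshold never blanks out a page of results entirely; picking the actual numbers would need labeled data.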

Of course, crappy query ⇒ neg score doesn't mean that neg score ⇒ crappy query. We’d have to see if that's true, and if these patterns hold up as we improve the LTR models, but it's interesting to think about.

Not sure what the next steps are. I discussed the results with Erik, David, and Daniel earlier today.

It's not really unexpected: prod and LTR rank with completely different algorithms, so it's no surprise that they disagree a lot on the specifics. A/B tests and letting users try it out in Labs, once we have a more mature model, are definitely necessary.

Sounds like some upcoming A/B tests are needed! :)