
Evaluate rescore windows for learning to rank
Closed, Resolved · Public

Description

We aren't sure yet what size of rescore window will be appropriate for the machine-learned ranking work. We will primarily be training against labeled data for the first 10 or 20 results that are displayed to users. We should train some models and evaluate their performance with different rescore window sizes, from 20 up to perhaps a few hundred or a thousand results. In addition to the effect this has on the results, we should evaluate the performance impact of these larger rescore windows.
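For context, the rescore window bounds how many of the top hits from the base query get re-ranked by the model. A minimal sketch of such a query, assuming the Elasticsearch learning-to-rank plugin's `sltr` rescore query (the model name and parameters here are hypothetical):

```python
# Sketch of an Elasticsearch rescore clause. window_size controls how many
# of the base query's top hits are re-scored by the ranking model; documents
# outside the window keep their original scores and positions.
def build_rescore_query(base_query, window_size=20):
    """Wrap a base query with an LTR rescore of the given window size."""
    return {
        "query": base_query,
        "rescore": {
            "window_size": window_size,  # e.g. 20, 512, 1024, 4096
            "query": {
                "rescore_query": {
                    # "sltr" is the LTR plugin's query type; the model name
                    # and params below are illustrative assumptions.
                    "sltr": {
                        "model": "example_ltr_model",
                        "params": {"query_string": "example"},
                    }
                },
                # Use only the model's score for the rescored window.
                "query_weight": 0.0,
                "rescore_query_weight": 1.0,
            },
        },
    }
```

A larger `window_size` lets more documents from deeper in the base ranking compete for the top positions, at the cost of running the model over more hits per query.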

Event Timeline

This is partially being done as part of the AB test that just finished, by having buckets for rescore windows of 20 and 1024. A more analytical approach may be worthwhile as well though.

AB test results for 20 and 1024 are roughly similar. Should we run tests on other sizes? We have load tested up to 4096 with the capacity to run 200% of current traffic at 4096. Latency does increase as we increase the rescore window size.

Hi @mpopov and @chelsyx - do you think that running more tests with different rescore sizes would get us useful information?

It would depend on how often things below the top 20 move into the top 20 in practice, not just in theory. We can use the search logs to find this out, no?

I'll run some numbers on Relevance Forge to see what the practical effect is, using some sampling of user queries (1k? 10k? I dunno).

Using 3304 queries from 2000 sessions in TestSearchSatisfaction between 9/15 and 9/30. Sorted and unsorted percentages are the same, because individual documents get the exact same score at the new rescore window sizes; the only difference is whether new documents outside the base rescore window get pulled into the top N results.

ZRR (zero results rate): 10.7%
Poorly performing: 16.2%

| base rescore | increased rescore | top 1 | top 3 | top 5 | top 20 |
| 20   | 512  | 2.1% | 10.6% | 18.3% | 50.1% |
| 20   | 1024 | 2.1% | 11.0% | 18.7% | 50.1% |
| 20   | 2048 | 2.2% | 11.1% | 18.9% | 50.2% |
| 20   | 4096 | 2.2% | 11.2% | 18.9% | 50.2% |
| 512  | 1024 | 0.2% | 1.6%  | 3.3%  | 13.4% |
| 512  | 2048 | 0.5% | 2.3%  | 4.5%  | 14.9% |
| 512  | 4096 | 0.5% | 2.7%  | 5.0%  | 15.1% |
| 1024 | 2048 | 0.3% | 1.5%  | 2.8%  | 9.4%  |
| 1024 | 4096 | 0.3% | 2.0%  | 3.6%  | 10.0% |
| 2048 | 4096 | 0.1% | 1.1%  | 2.0%  | 6.2%  |
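The comparison behind these numbers can be sketched roughly as follows (a hypothetical reconstruction, assuming we have the ranked result lists per query for each window size; function and variable names are illustrative):

```python
# For each query, compare the ranked results produced with a base rescore
# window against those produced with an increased window, and report the
# percentage of queries whose top-n changed. Set comparison gives the
# "unsorted" variant: a change in ordering alone does not count, only
# documents entering or leaving the top n.
def pct_changed(results_base, results_increased, n):
    """% of queries whose top-n document set differs between two windows.

    results_base / results_increased: dicts mapping query -> ranked doc ids.
    """
    changed = sum(
        1
        for q in results_base
        if set(results_base[q][:n]) != set(results_increased[q][:n])
    )
    return 100.0 * changed / len(results_base)
```

Run over each (base, increased) window pair and each n in {1, 3, 5, 20}, this yields a table like the one above.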

I manually reviewed 20 queries that had a difference in the top 5 results for 1024 vs 4096. In my (somewhat arbitrary) opinion, only one of those queries improved. The others mostly seemed to pull up somewhat popular pages that weren't any more relevant to the query. Based on this, I think we should continue with the current rescore window rather than expanding it. I also reviewed the 20 queries with changes to the top 5 between 512 and 1024. This is a little murkier: some queries seem like they might be improving while others are not. I'm not sure the effect size is large enough that we could run a good AB test, though.