Description

We aren't sure yet what size of rescore window will be appropriate for the machine learned ranking work. We will primarily be training against labeled data for the first 10 or 20 results that are displayed to users. We should train some models and evaluate their performance with different rescore window sizes, from 20 up to perhaps a few hundred or a thousand results. In addition to the effect this has on the results, we should evaluate the performance impact of these larger rescore windows.
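For context, a minimal sketch of how a rescore window is applied: a cheap base query fetches candidates and a learned model re-ranks only the top `window_size` of them. This assumes the Elasticsearch learning-to-rank plugin's `sltr` rescore query; the endpoint, index, field, and model names below are placeholders, not the production CirrusSearch configuration.

```python
# Minimal sketch of an LTR rescore request; assumes the Elasticsearch
# learning-to-rank plugin ("sltr" query) is installed. Endpoint, index,
# field, and model names are placeholders.
import requests

query_string = "some user query"
body = {
    "query": {"match": {"text": query_string}},  # cheap base query retrieves candidates
    "rescore": {
        "window_size": 20,  # how many top candidates the learned model re-ranks
        "query": {
            "rescore_query": {
                "sltr": {
                    "model": "ltr_model",  # placeholder model name
                    "params": {"query_string": query_string},
                }
            }
        },
    },
}
hits = requests.post("http://localhost:9200/enwiki_content/_search", json=body).json()
```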
Status | Assigned | Task
---|---|---
Invalid | None | T174064 [FY 2017-18 Objective] Implement advanced search methodologies
Resolved | EBernhardson | T161632 [Epic] Improve search by researching and deploying machine learning to re-rank search results
Resolved | EBernhardson | T162369 Evaluate rescore windows for learning to rank
Resolved | EBernhardson | T150032 Add support for interleaved results in 2-way A/B test
Resolved | debt | T171212 Interleaved results A/B test: turn on
Resolved | EBernhardson | T171213 Interleaved results A/B test: check that data is flowing the way we expect
Resolved | debt | T171214 Interleaved results A/B test: turn off test
Resolved | mpopov | T171215 Interleaved results A/B test: analysis of data
Declined | EBernhardson | T171984 Turn on test of LTR with standard AB buckets and an interleaved bucket.
Event Timeline
This is partially being done as part of the AB test that just finished, by having buckets for rescore windows of 20 and 1024. A more analytical approach may be worthwhile as well though.
AB test results for 20 and 1024 are roughly similar. Should we run tests on other sizes? We have load tested up to 4096 with the capacity to run 200% of current traffic at 4096. Latency does increase as we increase the rescore window size.
It would depend on how often things below the top 20 move into the top 20 in practice, not just in theory. We can use the search logs to find this out, no?
I'll run some numbers on relevance forge to see what the practical effect is, using a sample of user queries (1k? 10k? I'm not sure yet).
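Not the actual relevance forge code, but a rough sketch of the measurement: for each sampled query, fetch the top 20 document ids at the base and at the increased rescore window and count how often the top 1/3/5/20 differ. Endpoint, index, field, and model names are placeholders, as above.

```python
# Rough sketch of the rescore-window comparison; not the actual relevance
# forge code. Endpoint, index, field, and model names are placeholders.
import requests

SEARCH_URL = "http://localhost:9200/enwiki_content/_search"  # placeholder endpoint

def top_ids(query_string, window_size, n=20):
    """Return the doc ids of the top n hits when rescoring window_size candidates."""
    body = {
        "size": n,
        "_source": False,
        "query": {"match": {"text": query_string}},
        "rescore": {
            "window_size": window_size,
            "query": {
                "rescore_query": {
                    "sltr": {
                        "model": "ltr_model",  # placeholder model name
                        "params": {"query_string": query_string},
                    }
                }
            },
        },
    }
    hits = requests.post(SEARCH_URL, json=body).json()["hits"]["hits"]
    return [hit["_id"] for hit in hits]

def pct_changed(queries, base_window, increased_window, cutoffs=(1, 3, 5, 20)):
    """Fraction of queries whose top-k results differ between the two window sizes."""
    changed = {k: 0 for k in cutoffs}
    for q in queries:
        base = top_ids(q, base_window)
        bigger = top_ids(q, increased_window)
        for k in cutoffs:
            if base[:k] != bigger[:k]:
                changed[k] += 1
    return {k: changed[k] / len(queries) for k in cutoffs}

# Example: pct_changed(sampled_queries, 20, 1024) would correspond to one row
# of the table below.
```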
Using 3304 queries from 2000 sessions in TestSearchSatisfaction between 9/15 and 9/30. Sorted and unsorted percentages are the same, because individual documents get exactly the same score at the new rescore window sizes; the only difference is whether new documents from outside the base rescore window get pulled into the top N results. The percentages below are the share of sampled queries whose top 1/3/5/20 results change when the rescore window is increased from the base size.
ZRR (zero results rate): 10.7%
Poorly performing: 16.2%
base rescore window | increased rescore window | top 1 | top 3 | top 5 | top 20
---|---|---|---|---|---
20 | 512 | 2.1% | 10.6% | 18.3% | 50.1%
20 | 1024 | 2.1% | 11.0% | 18.7% | 50.1%
20 | 2048 | 2.2% | 11.1% | 18.9% | 50.2%
20 | 4096 | 2.2% | 11.2% | 18.9% | 50.2%
512 | 1024 | 0.2% | 1.6% | 3.3% | 13.4%
512 | 2048 | 0.5% | 2.3% | 4.5% | 14.9%
512 | 4096 | 0.5% | 2.7% | 5.0% | 15.1%
1024 | 2048 | 0.3% | 1.5% | 2.8% | 9.4%
1024 | 4096 | 0.3% | 2.0% | 3.6% | 10.0%
2048 | 4096 | 0.1% | 1.1% | 2.0% | 6.2%
I manually reviewed 20 queries that had a difference in the top 5 results for 1024 vs 4096. In my (somewhat arbitrary) opinion only one of those queries improved; the others mostly pulled up somewhat popular pages that weren't any more relevant to the query. Based on this I think we should continue with the current rescore window rather than expanding it. I also reviewed the 20 queries with changes to the top 5 between 512 and 1024. This is a little murkier: some queries seem like they might be improving while others are not. I'm not sure the effect size is large enough that we could run a good AB test, though.