Description

We aren't sure yet what size of rescore window will be appropriate for the machine learned ranking work. We will primarily be training against labeled data for the first 10 or 20 results that are displayed to users. We should train some models and evaluate their performance with different rescore window sizes, from 20 up to perhaps a few hundred or a thousand results. In addition to the effect this has on the results, we should evaluate the performance impact of these larger rescore windows.
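For context, a minimal sketch of how a rescore window is applied: a cheap base query fetches candidates and a learned model re-ranks only the top `window_size` of them. This assumes the Elasticsearch learning-to-rank plugin's `sltr` rescore query; the endpoint, index, field, and model names below are placeholders, not the production CirrusSearch configuration.

```python
# Minimal sketch of an LTR rescore request; assumes the Elasticsearch
# learning-to-rank plugin ("sltr" query) is installed. Endpoint, index,
# field, and model names are placeholders.
import requests

query_string = "some user query"
body = {
    "query": {"match": {"text": query_string}},  # cheap base query retrieves candidates
    "rescore": {
        "window_size": 20,  # how many top candidates the learned model re-ranks
        "query": {
            "rescore_query": {
                "sltr": {
                    "model": "ltr_model",  # placeholder model name
                    "params": {"query_string": query_string},
                }
            }
        },
    },
}
hits = requests.post("http://localhost:9200/enwiki_content/_search", json=body).json()
```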
Status | Assigned | Task
---|---|---
Invalid | None | T174064 [FY 2017-18 Objective] Implement advanced search methodologies
Resolved | EBernhardson | T161632 [Epic] Improve search by researching and deploying machine learning to re-rank search results
Resolved | EBernhardson | T162369 Evaluate rescore windows for learning to rank
Resolved | EBernhardson | T150032 Add support for interleaved results in 2-way A/B test
Resolved | debt | T171212 Interleaved results A/B test: turn on
Resolved | EBernhardson | T171213 Interleaved results A/B test: check that data is flowing the way we expect
Resolved | debt | T171214 Interleaved results A/B test: turn off test
Resolved | mpopov | T171215 Interleaved results A/B test: analysis of data
Declined | EBernhardson | T171984 Turn on test of LTR with standard AB buckets and an interleaved bucket.
Event Timeline
This is partially being done as part of the AB test that just finished, by having buckets for rescore windows of 20 and 1024. A more analytical approach may be worthwhile as well though.
AB test results for 20 and 1024 are roughly similar. Should we run tests on other sizes? We have load tested up to 4096 with the capacity to run 200% of current traffic at 4096. Latency does increase as we increase the rescore window size.
It would depend on how often things below the top 20 move into the top 20 in practice, not just in theory. We can use the search logs to find this out, no?
I'll run some numbers on relevance forge to see what the practical effect is, using a sample of user queries (1k? 10k? I'm not sure yet).
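Not the actual relevance forge code, but a rough sketch of the measurement: for each sampled query, fetch the top 20 document ids at the base and at the increased rescore window and count how often the top 1/3/5/20 differ. Endpoint, index, field, and model names are placeholders, as above.

```python
# Rough sketch of the rescore-window comparison; not the actual relevance
# forge code. Endpoint, index, field, and model names are placeholders.
import requests

SEARCH_URL = "http://localhost:9200/enwiki_content/_search"  # placeholder endpoint

def top_ids(query_string, window_size, n=20):
    """Return the doc ids of the top n hits when rescoring window_size candidates."""
    body = {
        "size": n,
        "_source": False,
        "query": {"match": {"text": query_string}},
        "rescore": {
            "window_size": window_size,
            "query": {
                "rescore_query": {
                    "sltr": {
                        "model": "ltr_model",  # placeholder model name
                        "params": {"query_string": query_string},
                    }
                }
            },
        },
    }
    hits = requests.post(SEARCH_URL, json=body).json()["hits"]["hits"]
    return [hit["_id"] for hit in hits]

def pct_changed(queries, base_window, increased_window, cutoffs=(1, 3, 5, 20)):
    """Fraction of queries whose top-k results differ between the two window sizes."""
    changed = {k: 0 for k in cutoffs}
    for q in queries:
        base = top_ids(q, base_window)
        bigger = top_ids(q, increased_window)
        for k in cutoffs:
            if base[:k] != bigger[:k]:
                changed[k] += 1
    return {k: changed[k] / len(queries) for k in cutoffs}

# Example: pct_changed(sampled_queries, 20, 1024) would correspond to one row
# of the table below.
```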
Using 3304 queries from 2000 sessions in TestSearchSatisfaction between 9/15 and 9/30. Sorted and unsorted percentages are the same, because individual documents get exactly the same score at the new rescore window sizes; the only difference is whether new documents from outside the base rescore window get pulled into the top N results. The percentages below are the share of sampled queries whose top 1/3/5/20 results change when the rescore window is increased from the base size.
ZRR (zero results rate): 10.7%
Poorly performing: 16.2%
base rescore window | increased rescore window | top 1 | top 3 | top 5 | top 20
---|---|---|---|---|---
20 | 512 | 2.1% | 10.6% | 18.3% | 50.1%
20 | 1024 | 2.1% | 11.0% | 18.7% | 50.1%
20 | 2048 | 2.2% | 11.1% | 18.9% | 50.2%
20 | 4096 | 2.2% | 11.2% | 18.9% | 50.2%
512 | 1024 | 0.2% | 1.6% | 3.3% | 13.4%
512 | 2048 | 0.5% | 2.3% | 4.5% | 14.9%
512 | 4096 | 0.5% | 2.7% | 5.0% | 15.1%
1024 | 2048 | 0.3% | 1.5% | 2.8% | 9.4%
1024 | 4096 | 0.3% | 2.0% | 3.6% | 10.0%
2048 | 4096 | 0.1% | 1.1% | 2.0% | 6.2%
I manually reviewed 20 queries that had a difference in the top 5 results for 1024 vs 4096. In my (somewhat arbitrary) opinion only one of those queries improved; the others mostly pulled up somewhat popular pages that weren't any more relevant to the query. Based on this I think we should continue with the current rescore window rather than expanding it. I also reviewed the 20 queries with changes to the top 5 between 512 and 1024. This is a little murkier: some queries seem like they might be improving while others are not. I'm not sure the effect size is large enough that we could run a good AB test, though.