
[epic] Show query-frequency-stratified results in A/B test results
Closed, Declined · Public

Description

Today we were talking about how best to use human-graded survey data for the ML/LTR (machine learning / learning-to-rank) models, as part of our longer-term goal of improving search. (See T171740: [Epic] Search Relevance: graded by humans.)

Some of the open questions about how to best use the survey ratings in the training data include:

  • Should long-tail survey data be weighted in training?
    • If so, weighted so that each unique query counts equally, or weighted by frequency count? (See the sketch after this list.)
  • Are long tail queries fundamentally different in some way from queries in the fat head or chunky middle?
    • Does trying to improve the long tail decrease performance for the fat head and chunky middle?
    • Do we get long-tail improvements when we work with just fat head and chunky middle training data?
    • Are "fat head" and "chunky middle" objectively funny, or is it just me?
  • Does long-tail sampling and weighted representation in training properly capture the general features of long-tail queries, or will we end up over-training on specific queries if they are weighted 100x or more?
  • Are small net improvements in the long tail (and possibly any accompanying small net deteriorations in the head) incremental, or do they come with a lot of churn?
    • That is, is a 1% net improvement a change in about 1% of results, or did 50% get a bit better and 49% get a bit worse?
    • How evenly are any gains and losses spread over the distribution?
  • How do we figure this stuff out?
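For concreteness, here is a minimal sketch of the two weighting schemes from the first bullet, assuming a hypothetical pandas training frame (all column names, queries, and frequency counts are made up, and the real LTR pipeline's weighting mechanism will differ):

```
import pandas as pd

# Hypothetical survey-graded training rows (one per query/page pair) and
# made-up log-frequency counts; real data and column names will differ.
train = pd.DataFrame({
    "query": ["q1", "q1", "q2", "q3"],
    "page":  ["p1", "p2", "p3", "p4"],
    "label": [3, 1, 2, 0],
})
log_freq = {"q1": 250, "q2": 12, "q3": 1}

# Scheme 1: each unique query counts once, regardless of traffic.
train["w_unique"] = 1.0

# Scheme 2: weight each row by how often users actually issue the query,
# so head queries dominate in proportion to traffic (and a long-tail
# query up-weighted 100x raises the over-training concern noted above).
train["w_freq"] = train["query"].map(log_freq).astype(float)

print(train)
```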

As we were pondering all this, it became clear that query-frequency-stratified results in A/B test analyses would help a lot!

A straw-man proposal (with a rough code sketch after the list) would be:

  • Take the queries in the sampled A/B test data, and normalize them in the same way queries are normalized for LTR training.
  • Take all recorded queries over some time period (possibly larger than the test period, but ending at the same time) and normalize them. Also filter out outliers in the same way as is done for LTR training.
  • Calculate the frequency of each query in the A/B test, using the counts from that larger corpus.
  • Stratify the queries by frequency. Options include:
    • Simple quintiles or maybe deciles
    • Something more complex that normalizes to standard thresholds, including the training data threshold of 10 queries in 90 days.
  • Report the standard stats (ZRR, first click, max click, Paul Score, other SERP pages, dwell time, scroll, abandonment) per stratum.
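As a rough sketch of the proposal in pandas (`normalize_query` is a hypothetical stand-in for whatever normalization the LTR training pipeline actually applies, and all event column names are made up):

```
import pandas as pd

def normalize_query(q: str) -> str:
    # Hypothetical stand-in for the normalization used in LTR training;
    # the real rules live in that pipeline.
    return " ".join(q.lower().split())

def add_frequency_strata(ab_events: pd.DataFrame,
                         background_queries: pd.Series,
                         n_strata: int = 5) -> pd.DataFrame:
    """Tag each A/B-test event with a query-frequency stratum
    (quintiles by default, deciles with n_strata=10)."""
    events = ab_events.copy()
    events["query"] = events["query"].map(normalize_query)

    # Frequency of each normalized query over the background window
    # (outlier filtering, as done for LTR training, is omitted here).
    freq = background_queries.map(normalize_query).value_counts()
    events["freq"] = events["query"].map(freq).fillna(0)

    # Rank first so the heavy ties in the long tail (mostly counts of 1)
    # still split into n_strata groups of roughly equal size.
    events["stratum"] = pd.qcut(events["freq"].rank(method="first"),
                                q=n_strata,
                                labels=list(range(1, n_strata + 1)))
    return events

# Toy example; real inputs would be the sampled A/B-test events and all
# recorded queries over the background period.
ab_events = pd.DataFrame({
    "query": ["Apple ", "apple", "rare query xyz", "Banana"],
    "is_zero_result": [False, False, True, False],
})
background_queries = pd.Series(["apple"] * 40 + ["banana"] * 5 + ["cherry"])
tagged = add_frequency_strata(ab_events, background_queries, n_strata=2)

# Per-stratum report; the real version would aggregate ZRR, clickthrough,
# PaulScore, dwell time, etc. rather than this single toy metric.
print(tagged.groupby("stratum", observed=True)["is_zero_result"].mean())
```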

This may be too much for A/B tests that cover 18 wikis at once (18 sets of results is already a lot; another 90 broken out by quintile would be insane), but for smaller A/B tests it would be very interesting to see how changes are distributed across the frequency strata, and it would answer most of the questions above (measuring total churn might require using RelForge, since intra-stratum churn won't be detectable).

As a first test, re-doing the analysis for any recent A/B test where we can still get a few weeks' worth of total query data near the time of the original A/B test would be great. Actually, even using non-overlapping query data for a first test would be fine—just smooth any frequencies of 0 to 1, since the query obviously happened at least once.
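Assuming the `events` and `freq` names from the earlier sketch, that smoothing is a one-line change:

```
# With a non-overlapping background window, a test query can legitimately
# come back with frequency 0; clip it up to 1, since the query evidently
# ran at least once during the test.
events["freq"] = events["query"].map(freq).fillna(0).clip(lower=1)
```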

Event Timeline

debt renamed this task from Show query-frequency-stratified results in A/B test results to [epic] Show query-frequency-stratified results in A/B test results. (Dec 15 2017, 4:46 PM)
debt triaged this task as Medium priority.
debt edited projects, added Discovery-Search (Current work); removed Discovery-Search.
debt moved this task from Incoming to Tests & Analysis on the Discovery-Search (Current work) board.
debt added a project: Epic.

Let's chat about the straw-man proposal during our sprint planning meeting on Dec 19, 2017 :)

@chelsyx would you have time to look into this next?

kzimmerman subscribed.

@debt declining as the subtask was declined; please let us know if this is something to revisit when a PM for Search comes onboard.