Today we were talking about how best to use human-graded survey data for the ML/LTR (machine learning / learning-to-rank) models, since one of our longer-term goals is to improve search. (See T171740: [Epic] Search Relevance: graded by humans.)
Some of the open questions about how to best use the survey ratings in the training data include:
- Should long-tail survey data be weighted in training?
- If so, weighted to represent unique queries, or queries by frequency count?
- Are long-tail queries fundamentally different in some way from queries in the fat head or chunky middle?
- Does trying to improve the long tail decrease performance for the fat head and chunky middle?
- Do we get long-tail improvements when we work with just fat head and chunky middle training data?
- Are "fat head" and "chunky middle" objectively funny, or is it just me?
- Do long-tail sampling and weighted representation in training properly capture the general features of long-tail queries, or will we end up over-training on specific queries if they are weighted 100x or more?
- Are small net improvements in the long tail (and possibly any accompanying small net deteriorations in the head) incremental, or do they come with a lot of churn?
- That is, is a 1% net improvement a change in about 1% of results, or did 50% get a bit better and 49% get a bit worse?
- How evenly are any gains and losses spread over the distribution?
- How do we figure this stuff out?
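To make the churn question concrete, here's a toy sketch (the numbers and function name are made up for illustration) showing how two very different outcomes can both look like the same ~1% net improvement:

```python
def net_and_churn(deltas):
    """Per-query metric deltas -> (net improvement, fraction of queries changed)."""
    net = sum(deltas) / len(deltas)
    churn = sum(1 for d in deltas if d != 0) / len(deltas)
    return net, churn

# Scenario A: 1% of queries improve a lot, everything else is untouched.
low_churn = [1.0] * 10 + [0.0] * 990
# Scenario B: half get a bit better, half get a bit worse.
high_churn = [0.118] * 500 + [-0.098] * 500

print(net_and_churn(low_churn))   # ~ (0.01, 0.01)
print(net_and_churn(high_churn))  # ~ (0.01, 1.0)
```

Both scenarios net out to roughly +1%, but the second touches every query, which per-stratum aggregate stats alone won't reveal.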
As we were pondering all this, it became clear that query-frequency-stratified results in A/B test analyses would help a lot!
A straw-man proposal would be:
- Take the queries in the sampled A/B test data, and normalize them in the same way queries are normalized for LTR training.
- Take all recorded queries over some time period (possibly larger than the test period, but ending at the same time) and normalize them. Also filter out outliers in the same way as is done for LTR training.
- Calculate the frequency of each A/B test query within the larger recorded set.
- Stratify the queries by frequency. Options include:
- Simple quintiles or maybe deciles
- Something more complex that normalizes to standard thresholds, including the training data threshold of 10 queries in 90 days.
- Report the standard stats (ZRR, first click, max click, Paul Score, other SERP pages, dwell time, scroll, abandonment) per stratum.
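The steps above could be sketched roughly like this. This is a hedged toy implementation, not the real pipeline: the `normalize` placeholder stands in for whatever normalization LTR training actually uses, and simple quantile boundaries stand in for whichever stratification option we pick.

```python
from collections import Counter
import statistics

def normalize(query):
    # Placeholder: lowercasing + whitespace collapsing stands in for the
    # real LTR training normalization (which would be reused here).
    return " ".join(query.lower().split())

def stratify_by_frequency(ab_queries, all_queries, n_strata=5):
    """Assign each A/B test query to a frequency stratum (0 = rarest)."""
    freqs = Counter(normalize(q) for q in all_queries)
    # Floor counts at 1 so queries missing from the frequency sample
    # (which evidently happened at least once) don't break the strata.
    counts = [max(freqs.get(normalize(q), 0), 1) for q in ab_queries]
    # Simple quantile boundaries over the observed counts (quintiles by default).
    sorted_counts = sorted(counts)
    bounds = [sorted_counts[int(len(sorted_counts) * i / n_strata)]
              for i in range(1, n_strata)]
    return [sum(c > b for b in bounds) for c in counts]

def per_stratum_mean(metric_values, strata, n_strata=5):
    """Mean of one metric (e.g., ZRR or dwell time) within each stratum."""
    out = {}
    for s in range(n_strata):
        vals = [v for v, st in zip(metric_values, strata) if st == s]
        out[s] = statistics.mean(vals) if vals else None
    return out
```

With heavily tied counts (a long tail of frequency-1 queries), plain quantiles will collapse some strata, which is one argument for the "standard thresholds" option over simple quintiles.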
This may be too much for looking at A/B tests on 18 wikis at once (18 is already a lot; another 90 sets of results by quintile would be insane), but for smaller A/B tests it would be very interesting to see how changes are distributed across the frequency strata, and it would answer most of the questions above (measuring total churn might require using RelForge, since intra-stratum churn won't be detectable).
As a first test, re-doing the analysis for any recent A/B test where we can still get a few weeks' worth of total query data near the time of the original A/B test would be great. Actually, even using non-overlapping query data for a first test would be fine—just smooth any frequencies of 0 to 1, since the query obviously happened at least once.
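The zero-smoothing step is trivial, but for clarity here's a minimal sketch (names and sample data are made up): any query sampled in the A/B test but absent from the non-overlapping frequency data gets a count of 1, since it obviously occurred at least once.

```python
def smoothed_count(query, freq_table):
    # Floor at 1: a query present in the A/B test sample happened at
    # least once, even if the frequency data recorded it 0 times.
    return max(freq_table.get(query, 0), 1)

freq_table = {"san francisco": 128, "obscure misspeling": 0}
print(smoothed_count("san francisco", freq_table))       # 128
print(smoothed_count("never seen before", freq_table))   # 1
print(smoothed_count("obscure misspeling", freq_table))  # 1
```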