
Add support for interleaved results in 2-way A/B test
Closed, Resolved · Public

Description

Random differences in the queries pushed into the buckets of A/B tests can introduce significant noise into the various metrics we use to assess them. @mpopov did a quick analysis and found that 1% random differences are not unlikely and that 3-4% differences are possible.

One way to combat that is to have every query involved in both the A and B buckets, by interleaving the results. (Obviously this only works for changes that affect results ordering, not UI elements, etc.)

On the first page of results only, both options are run (it makes sense to limit it to two, at least for the first iteration), and the results are interleaved from the top down. At each position, the two results are randomly ordered (sometimes A first, sometimes B first). Any result already shown higher up (say, because A's 1st result is also B's 3rd result) is skipped. In theory, the first page of results could be twice as long as normal. In practice, it should be much shorter than that, but still longer than normal.

On subsequent pages, only results from the control group would be presented. This does allow for the likely possibility that a result from the test group made it to the first page of results (e.g., as the test group's 3rd result) and is repeated on the second or a subsequent page (e.g., as the control group's 28th result). This seems acceptable.

A quick worked example:

  • Control results: X, Y, Z, W, Q
  • Test results: W, Y, Q, R, S, T

Displayed results:

  • X (control 1st result, randomly chosen to go before W)
  • W (test 1st result)
  • Y (control and test 2nd result)
  • Q (test 3rd result, randomly chosen to go before Z)
  • Z (control 3rd result)
  • R (test 4th result; control 4th result, W, is already present above)
  • S (test 5th result; control 5th result, Q, is already present above)
  • T (test 6th result, no control 6th result)
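
For concreteness, a minimal sketch of the per-rank interleaving described above (Python, hypothetical names; assumes results can be compared by identity). With the right coin flips it reproduces the worked example:

```
import random

def interleave_first_page(control, test, rng=None):
    # At each rank, show the control and test results in random order,
    # skipping anything already displayed higher up.
    rng = rng or random.Random()
    shown, seen = [], set()
    for rank in range(max(len(control), len(test))):
        pair = [results[rank] for results in (control, test) if rank < len(results)]
        rng.shuffle(pair)  # randomly decide which ranker goes first at this rank
        for doc in pair:
            if doc not in seen:
                seen.add(doc)
                shown.append(doc)
    return shown

# One possible outcome (depends on the coin flips):
# interleave_first_page(["X", "Y", "Z", "W", "Q"], ["W", "Y", "Q", "R", "S", "T"])
# -> ["X", "W", "Y", "Q", "Z", "R", "S", "T"]
```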

The re-ordering of shared results makes it unclear how heavily metrics like PaulScore will be affected (e.g., a click on W above should probably be counted as a click in 1st place for the test, and a click in 4th place for the control), so the first (and maybe second and third) test using this scheme should have 3 buckets: the test bucket, the control bucket, and the interleaved test/control bucket. This will allow us to test the test and see how metrics from the interleaved bucket compare to those from the traditional separated buckets.

Another alternative would be to only consider the control group's top 5 or top 10 results, since most clicks come from the top 5 and the vast majority from the top 10.


Event Timeline

How hard do we think this would be to implement? This seems like a lot of work, and it's unclear whether it will be useful. Thoughts?

The plan was to limit interleaving to the first page of results, which lowers the difficulty from impossible to merely challenging—though someone else will have to comment on whether it's really impossible --> challenging or impossible^3 --> impossible^2.

I do think it would be useful because it could eliminate the effects of random assignment to buckets in the A/B tests, which are a significant source of noise. Bots and other outliers would be less of a problem because they'd be identically represented in the A and B buckets. The 1st vs 2nd position ordering would still be an issue, hence the idea of having an initial test bucket, control bucket, and interleaved bucket test.

Deskana moved this task from needs triage to search-icebox on the Discovery-Search board.

Some reference papers:

Large-scale validation and analysis of interleaved search evaluation - Evaluation of the initial interleaved search evaluation proposals (balanced and team-draft interleaving) on Bing, Yahoo, and arXiv.org. These evaluations vary in size from a few thousand queries to tens of millions of queries. Section 7 shows that with classic metrics (session abandonment, clicks per query, clicks@1, etc.), between 10^5 and 10^6 user queries, and sometimes more, are required to determine with high probability which of two ranking systems is better in a classical (how we run them) A/B test. They go on to show that, for the specific metric of clicks@1, interleaving needs 20 to 400 times less data (depending on the size of the difference between the two ranking functions) to determine with high probability which ranking function is better.

Optimized Interleaving for Online Retrieval Evaluation - An optimization framework for resolving issues with the initial interleaving approaches. The first paper above provides examples where interleaving comes up with the wrong answer; this one tries to address that. I can't quite figure out what they are proposing as the actual algorithm, although it seems to be summarized as:

The optimized interleaving algorithm thus reduces to maximizing the expected sensitivity (Equation 13) subject to the constraints presented in Equations 5 through 8

Predicting Search Satisfaction Metrics with Interleaved Comparisons - Shows the predictive power of TDI (team-draft interleaving) vs. A/B testing, and again that TDI requires significantly fewer samples (10^3 to 10^6) than classical testing (10^6 to 10^8).

In terms of actual implementation, from my review I think we can implement team-draft interleaving relatively easily. And while it has some known problems, I think it is still an improvement over our current A/B testing methodology.
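
For reference, a minimal sketch of team-draft interleaving in Python (hypothetical function and variable names; not the actual CirrusSearch implementation):

```
import random

def team_draft_interleave(ranking_a, ranking_b, page_size=20, rng=None):
    # Team-draft interleaving: the rankers take turns picking their
    # highest-ranked result not already shown; a coin flip breaks ties
    # when both teams have the same number of picks. The team lists are
    # what click attribution uses later.
    rng = rng or random.Random()
    a, b = list(ranking_a), list(ranking_b)
    interleaved, team_a, team_b, shown = [], [], [], set()
    while len(interleaved) < page_size and (a or b):
        a_turn = bool(a) and (not b
                              or len(team_a) < len(team_b)
                              or (len(team_a) == len(team_b) and rng.random() < 0.5))
        source, team = (a, team_a) if a_turn else (b, team_b)
        # the chosen team contributes its best result that isn't shown yet
        while source and source[0] in shown:
            source.pop(0)
        if source:
            doc = source.pop(0)
            shown.add(doc)
            interleaved.append(doc)
            team.append(doc)
    return interleaved, team_a, team_b
```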

debt subscribed.

Moving to the search sprint board as we're actively working on this.

Nice job, @EBernhardson for finding all these technical documents! :)

The optimized interleaving paper (2nd in Erik's list) turns every presentation of results into an optimization problem that requires you to choose from all valid interleavings. Section 3.4.3 says they have empirically shown that for particular credit functions there are solutions for rankings up to length 10, but there are no theoretical proofs that solutions exist for other credit functions or longer rankings. In section 5.3.1, they say "Team Draft and Balanced interleaving have been shown to both be reliable in previous online evaluations", meaning that while there are breaking cases for those interleaving schemes, in practice they work fine. Let's use one of those! TDI seems to be more popular. I think Balanced seems (naively) to be more "fair", but I'm very much okay with TDI, too.

Tangentially, the comments in section 5.3.2 make my Discernatron-related angst increase. TL;DR: expert judges and real-life users often disagree significantly. Sigh.

Also tangentially, section 3.3.2 brings up a nice point about sometimes not wanting to take too great a risk in showing results from your unproven "experimental" ranker. They don't give any concrete suggestions, though. An obvious hack would be to always let the proven ranker go first and take two turns for every one turn from the experimental ranker. When they agree, nothing changes, but when the experimental ranker gives outrageous results (or good ones), it can never get anything higher than position 3. That's a disadvantage, but an appropriate one for a risky ranker. This would apply to @dcausse's idea for a high-recall scorer, not Erik's trained LTR models (which seem much lower risk, because they at least appear to be doing well in training and testing).
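
A sketch of that 2:1 turn schedule (purely illustrative; hypothetical helper):

```
def proven_ranker_picks_next(picks_so_far):
    # 2:1 schedule from the comment above: the proven ranker always goes
    # first and takes two picks for every one experimental pick
    # (P, P, E, P, P, E, ...), so the experimental ranker can never land
    # above position 3.
    return picks_so_far % 3 != 2
```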

We'll have to pick a credit function too. Based on the third paper, the TDI metric—number of clicks on documents ranked higher by a given ranker than by the other—is easy, and seems reasonably robust. Of course, if we gather the right data, we could calculate any of those credit functions during evaluation of the test. Their user satisfaction model (section 3.1.3) does a lot better (Table 4) but is an extra project to train. @mpopov, could we apply our current engagement metric here?
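
For instance, the per-click credit described above could be computed after the fact from the two original rankings; a sketch with hypothetical names:

```
def rank_preference_credit(clicked_docs, ranking_a, ranking_b):
    # Count clicks on documents that one ranker placed strictly higher
    # than the other did; a document missing from a ranking counts as
    # ranked at infinity.
    def rank(doc, ranking):
        return ranking.index(doc) if doc in ranking else float("inf")
    a_credit = sum(1 for doc in clicked_docs if rank(doc, ranking_a) < rank(doc, ranking_b))
    b_credit = sum(1 for doc in clicked_docs if rank(doc, ranking_b) < rank(doc, ranking_a))
    return a_credit, b_credit
```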

Okay, I'm done for now... ;)

Change 366190 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/core@master] Support custom offsets from SearchResultSet

https://gerrit.wikimedia.org/r/366190

Change 366194 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] Team Draft interleaved AB testing

https://gerrit.wikimedia.org/r/366194

Change 366190 merged by jenkins-bot:
[mediawiki/core@master] Support custom offsets from SearchResultSet

https://gerrit.wikimedia.org/r/366190

Might want to evaluate how many sessions we need to record to have a definitive answer, at some level of confidence, about which search algorithm being tested is better. I'm not actually sure how many sessions we typically collected for these purposes in previous tests; we simply went with a sampling rate that seemed like "enough". Can we be more precise about what is enough, and what that amount means for the power of the test?
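
As a back-of-the-envelope sketch (assuming the interleaved comparison reduces to a per-session preference with ties dropped; the 52% figure below is an assumed effect size, not a measurement):

```
from scipy.stats import norm

def sessions_needed(p_prefer, alpha=0.05, power=0.8):
    # Normal-approximation sample size for a two-sided test of the
    # per-session preference proportion against 0.5.
    z_alpha, z_beta = norm.ppf(1 - alpha / 2), norm.ppf(power)
    p0, p1 = 0.5, p_prefer
    n = ((z_alpha * (p0 * (1 - p0)) ** 0.5 + z_beta * (p1 * (1 - p1)) ** 0.5)
         / abs(p1 - p0)) ** 2
    return int(n) + 1

# e.g. detecting a 52/48 preference split needs on the order of
# 5,000 sessions that expressed a preference:
print(sessions_needed(0.52))  # ~4904
```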

Change 366194 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Team Draft interleaved AB testing

https://gerrit.wikimedia.org/r/366194

Change 368470 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/WikimediaEvents@master] Record interleaved search teams if available

https://gerrit.wikimedia.org/r/368470

Change 368470 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Record interleaved search teams if available

https://gerrit.wikimedia.org/r/368470

Change 369644 had a related patch set uploaded (by DCausse; owner: EBernhardson):
[mediawiki/extensions/WikimediaEvents@wmf/1.30.0-wmf.12] Record interleaved search teams if available

https://gerrit.wikimedia.org/r/369644

Change 369644 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@wmf/1.30.0-wmf.12] Record interleaved search teams if available

https://gerrit.wikimedia.org/r/369644

@EBernhardson: I went through that large-scale article. Right now I'm writing tools for calculating sample size in interleaved experiments and analyzing results.

Change 370303 had a related patch set uploaded (by Bearloga; owner: Bearloga):
[wikimedia/discovery/wmf@master] [WIP] Add functions for working with interleaved experiments

https://gerrit.wikimedia.org/r/370303

Resolving this ticket, as the analysis is nearly complete: T171215

Change 370303 merged by Chelsyx:
[wikimedia/discovery/wmf@master] Add functions for working with interleaved experiments

https://gerrit.wikimedia.org/r/370303