
Add support for interleaved results in 2-way A/B test
Closed, Resolved · Public


Random differences in the queries pushed into the buckets of A/B tests can introduce significant noise into the various metrics we use to assess A/B tests. @mpopov did a quick analysis and found 1% random differences are not unlikely and 3-4% differences are possible.

One way to combat that is to have every query involved in both the A and B buckets, by interleaving the results. (Obviously this only works for changes that affect results ordering, not UI elements, etc.)

On the first page of results only, both options (it makes sense to limit it to two, at least for the first iteration) are run, and the results are interleaved, from the top down. At each position, elements are randomly ordered (sometimes A first, sometimes B first). Any result already shown higher up (say, because A's 1st result is also B's 3rd result) is skipped. In theory, the first page of results could be twice as long as normal. In practice, it should be much less than that, but still longer than normal.

On subsequent pages, only results from the control group would be presented. This does allow for the likely possibility that a result from the test group made it to the first page of results (e.g., as the test group's 3rd result) and is repeated on the second or a subsequent page (e.g., as the control group's 28th result). This seems acceptable.

A quick worked example:

  • Control results: X, Y, Z, W, Q
  • Test results: W, Y, Q, R, S, T

Displayed results:

  • X (control 1st result, randomly chosen to go before W)
  • W (test 1st result)
  • Y (control and test 2nd result)
  • Q (test 3rd result, randomly chosen to go before Z)
  • Z (control 3rd result)
  • R (test 4th result; control 4th result, W, is already present above)
  • S (test 5th result; control 5th result, Q, is already present above)
  • T (test 6th result; no control 6th result)
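The first-page scheme described above could be sketched in Python along these lines. This is an illustrative sketch, not the actual implementation; the `control_first` hook is a hypothetical parameter of mine so the per-position coin flips can be pinned down for testing.

```python
import random
from itertools import zip_longest

def interleave_first_page(control, test, control_first=None):
    """Walk both rankings position by position, flip a coin for which
    side goes first at each rank, and skip any result already shown
    higher up. control_first is a hypothetical hook (rank -> bool);
    it defaults to a fair coin.
    """
    if control_first is None:
        control_first = lambda rank: random.random() < 0.5
    shown, seen = [], set()
    for rank, (c, t) in enumerate(zip_longest(control, test)):
        pair = (c, t) if control_first(rank) else (t, c)
        for result in pair:
            if result is not None and result not in seen:
                seen.add(result)
                shown.append(result)
    return shown
```

Replaying the worked example with the coin fixed to put X before W and Q before Z reproduces the displayed list X, W, Y, Q, Z, R, S, T.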

The re-ordering of shared results makes it unclear how heavily metrics like PaulScore will be affected (e.g., a click on W above should probably be counted as a click in 1st place for the test, and a click in 4th place for the control), so the first (and maybe second and third) test using this scheme should have 3 buckets: the test bucket, the control bucket, and the interleaved test/control bucket. This will allow us to test the test and see how metrics from the interleaved bucket compare to those from traditional separated buckets.

Another alternative would be to only consider the control group's top 5 or top 10 results, since most clicks come from the top 5 and the vast majority from the top 10.


Event Timeline

How hard do we think this would be to implement? This seems like a lot of work, and it's unclear whether it will be useful. Thoughts?

The plan was to limit interleaving to the first page of results, which lowers the difficulty from impossible to merely challenging—though someone else will have to comment on whether it's really impossible --> challenging or impossible^3 --> impossible^2.

I do think it would be useful because it could eliminate the effects of random assignment to buckets in the A/B tests, which are a significant source of noise. Bots and other outliers would be less of a problem because they'd be identically represented in the A and B buckets. The 1st vs 2nd position ordering would still be an issue, hence the idea of having an initial test bucket, control bucket, and interleaved bucket test.

Deskana moved this task from needs triage to search-icebox on the Discovery-Search board.

Some reference papers:

Large-scale validation and analysis of interleaved search evaluation - Evaluation of the initial interleaved search evaluation proposals (balanced and team-draft interleaving) on Bing, Yahoo, and others. These evaluations vary in size from a few thousand queries to tens of millions of queries. Section 7 shows that with classic metrics (session abandonment, clicks per query, clicks@1, etc.), between 10^5 and 10^6 user queries, and sometimes more, are required to determine with high probability which of two ranking systems is better in a classical (how we run them) A/B test. They go on to show that, compared to the specific metric of clicks@1, interleaving needs 20 to 400 times less data (depending on the size of the difference between the two ranking functions) to determine with high probability which ranking function is better.

Optimized Interleaving for Online Retrieval Evaluation - An optimization framework for resolving issues with the initial interleaving approaches. The first paper above provides examples where interleaving comes up with the wrong answer; this paper tries to address those. I can't quite figure out what they are proposing as the actual algorithm, although it seems to be given as:

The optimized interleaving algorithm thus reduces to maximizing the expected sensitivity (Equation 13) subject to the constraints presented in Equations 5 through 8

Predicting Search Satisfaction Metrics with Interleaved Comparisons - Shows the predictive power of TDI (team draft interleaving) vs. A/B testing, and again that TDI requires significantly fewer samples (10^3 to 10^6) than classical testing (10^6 to 10^8).

In terms of actual implementation, from my review I think we can implement team draft interleaving relatively easily. And while it has some known problems, I think it is still an improvement over our current A/B testing methodology.
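To make "relatively easily" concrete, team draft interleaving is only a few lines. A hedged sketch (not the CirrusSearch patch; the `rng` parameter is an addition of mine so the coin flips can be fixed in tests):

```python
import random

def team_draft_interleave(ranking_a, ranking_b, rng=random):
    """Team draft interleaving: like captains picking sports teams, A and
    B alternately draft their best not-yet-shown result; when both teams
    have drafted equally many, a coin flip decides who picks next.
    Returns (interleaved list, A's team, B's team)."""
    interleaved, team_a, team_b = [], [], []
    seen = set()
    total = len(set(ranking_a) | set(ranking_b))
    while len(seen) < total:
        a_turn = len(team_a) < len(team_b) or (
            len(team_a) == len(team_b) and rng.random() < 0.5)
        ranking, team = (ranking_a, team_a) if a_turn else (ranking_b, team_b)
        pick = next((d for d in ranking if d not in seen), None)
        if pick is None:
            # This side's ranking is exhausted; the other side finishes.
            ranking, team = (ranking_b, team_b) if a_turn else (ranking_a, team_a)
            pick = next(d for d in ranking if d not in seen)
        seen.add(pick)
        team.append(pick)
        interleaved.append(pick)
    return interleaved, team_a, team_b
```

Remembering which team each displayed result belongs to is what makes the later credit assignment possible.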

debt subscribed.

Moving to the search sprint board as we're actively working on this.

Nice job, @EBernhardson for finding all these technical documents! :)

The optimized interleaving paper (2nd in Erik's list) turns every presentation of results into an optimization problem that requires you to choose from all valid interleavings. Section 3.4.3 says they have empirically proven that for particular credit functions there are solutions for rankings up to length 10, but there are no theoretical proofs that solutions exist for other credit functions or longer rankings. In section 5.3.1, they say "Team Draft and Balanced interleaving have been shown to both be reliable in previous online evaluations"—meaning that while there are breaking cases for those interleaving schemes, in practice they work fine. Let's use one of those! TDI seems to be more popular. I think Balanced seems (naively) to be more "fair", but I'm very much okay with TDI, too.

Tangentially, the comments in section 5.3.2 make my Discernatron-related angst increase. TL;DR: expert judges and real-life users often disagree significantly. Sigh.

Also tangentially, section 3.3.2 brings up a nice point about sometimes not wanting to take too great a risk in showing results from your unproven "experimental" ranker. They don't give any concrete suggestions, though. An obvious hack would be to always let the proven ranker go first, and let it take two turns to every one turn from the experimental ranker. When they agree, nothing changes, but when the experimental ranker gives outrageous results (or good results), it can't ever get anything higher than position 3. It's a disadvantage, but that's appropriate with a risky ranker. This would apply to @dcausse's idea for a high-recall scorer, not Erik's trained LTR models (which seem much lower risk, because they at least appear to be doing well in training and testing).
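The 2:1 hack could look something like the following. This is a sketch of my suggestion above, hypothetical and not from any of the papers:

```python
def handicapped_interleave(proven, experimental):
    """Risk-averse 2:1 interleave: the proven ranker always goes first
    and takes two picks for each experimental pick, so an experimental
    result can never land above position 3."""
    out, seen = [], set()

    def take(ranking):
        # Append this ranking's best not-yet-shown result, if any.
        for doc in ranking:
            if doc not in seen:
                seen.add(doc)
                out.append(doc)
                return

    turn = 0
    while any(d not in seen for d in proven + experimental):
        take([proven, proven, experimental][turn % 3])
        turn += 1
    return out
```

The obvious downside is that it also caps the upside: a genuinely better experimental result is held back just as much as an outrageous one.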

We'll have to pick a credit function too. Based on the third paper, the TDI metric—number of clicks on documents ranked higher by a given ranker than by the other—is easy, and seems reasonably robust. Of course, if we gather the right data, we could calculate any of those credit functions during evaluation of the test. Their user satisfaction model (section 3.1.3) does a lot better (Table 4) but is an extra project to train. @mpopov, could we apply our current engagement metric here?
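Under the simple per-team reading of the TDI credit function, scoring reduces to counting, per session, the clicks that land on each team's drafted results. A sketch (function names here are mine, not from any existing tool):

```python
def tdi_session_winner(clicked, team_a, team_b):
    """Score one interleaved session: whichever ranker's drafted ("team")
    results drew more clicks wins. Returns 1 (A wins), -1 (B wins), 0 (tie)."""
    a_clicks = sum(1 for doc in clicked if doc in team_a)
    b_clicks = sum(1 for doc in clicked if doc in team_b)
    return (a_clicks > b_clicks) - (a_clicks < b_clicks)

def preference_for_a(session_outcomes):
    """Fraction of non-tied sessions won by ranker A; 0.5 means no preference."""
    decided = [s for s in session_outcomes if s != 0]
    if not decided:
        return 0.5
    return sum(1 for s in decided if s > 0) / len(decided)
```

As noted above, if we log clicks plus team assignments, other credit functions can be computed after the fact from the same data.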

Okay, I'm done for now... ;)

Change 366190 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/core@master] Support custom offsets from SearchResultSet

Change 366194 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] Team Draft interleaved AB testing

Change 366190 merged by jenkins-bot:
[mediawiki/core@master] Support custom offsets from SearchResultSet

Might want to evaluate how many sessions we need to record to have a definitive answer, at some level of confidence, about which search algo being tested is better. I'm not actually sure how many sessions we typically collect for these purposes; in previous tests we simply went with a sampling rate that seemed like "enough". Can we be more precise about what is enough, and what that amount means for the power of the test?
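As a rough starting point (not the R tooling being written for this; the effect size `p_win` is an assumption you'd have to supply), treating each non-tied session as a Bernoulli win/loss lets us use the standard one-proportion sample-size formula for a two-sided sign test against 50/50:

```python
import math
from statistics import NormalDist

def sessions_needed(p_win, alpha=0.05, power=0.8):
    """Approximate count of non-tied interleaved sessions needed to
    detect that the better ranker wins a session with probability
    p_win (> 0.5), at the given significance level and power.
    Normal approximation to the binomial."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    delta = p_win - 0.5
    n = ((z_alpha * 0.5 + z_power * math.sqrt(p_win * (1 - p_win))) / delta) ** 2
    return math.ceil(n)
```

For example, p_win = 0.55 needs on the order of 800 sessions, and p_win = 0.6 around 200, tiny compared to the 10^5 to 10^6 queries the first paper reports for classic A/B metrics.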

Change 366194 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Team Draft interleaved AB testing

Change 368470 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/WikimediaEvents@master] Record interleaved search teams if available

Change 368470 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Record interleaved search teams if available

Change 369644 had a related patch set uploaded (by DCausse; owner: EBernhardson):
[mediawiki/extensions/WikimediaEvents@wmf/1.30.0-wmf.12] Record interleaved search teams if available

Change 369644 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@wmf/1.30.0-wmf.12] Record interleaved search teams if available

@EBernhardson: I went through that large-scale article. Right now I'm writing tools for calculating sample size in interleaved experiments and analyzing results.

Change 370303 had a related patch set uploaded (by Bearloga; owner: Bearloga):
[wikimedia/discovery/wmf@master] [WIP] Add functions for working with interleaved experiments

Resolving this ticket, as the analysis is nearly complete: T171215

Change 370303 merged by Chelsyx:
[wikimedia/discovery/wmf@master] Add functions for working with interleaved experiments