Random differences in the queries pushed into the buckets of A/B tests can introduce significant noise into the metrics we use to assess those tests. @mpopov did a quick analysis and found that 1% random differences between buckets are not unlikely, and 3-4% differences are possible.
One way to combat that is to have every query involved in both the A and B buckets, by interleaving the results. (Obviously this only works for changes that affect results ordering, not UI elements, etc.)
On the first page of results only, both rankings are run (it makes sense to limit it to two, at least for the first iteration), and the results are interleaved from the top down. At each position, the two results are randomly ordered (sometimes A first, sometimes B first), and any result already shown higher up (say, because A's 1st result is also B's 3rd result) is skipped. In theory, the first page of results could be twice as long as normal; in practice, it should be much shorter than that, but still longer than normal.
On subsequent pages, only results from the control group would be presented. This does allow for the likely possibility that a result from the test group made it to the first page of results (e.g., as the test group's 3rd result) and is repeated on the second or a subsequent page (e.g., as the control group's 28th result). This seems acceptable.
A quick worked example:
- Control results: X, Y, Z, W, Q
- Test results: W, Y, Q, R, S, T
Displayed results:
- X (control 1st result, randomly chosen to go before W)
- W (test 1st result)
- Y (control and test 2nd result)
- Q (test 3rd result, randomly chosen to go before Z)
- Z (control 3rd result)
- R (test 4th result; control 4th result, W, is already present above)
- S (test 5th result; control 5th result, Q, is already present above)
- T (test 6th result, no control 6th result)
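The interleaving rule above (random order within each rank, skip anything already shown) can be sketched as follows. This is an illustrative sketch, not existing code; the function name and the scripted `flip` parameter are made up for the example:

```python
import random

def interleave_first_page(control, test, flip=None):
    """Interleave two ranked result lists for the first page of results.

    At each rank, the control and test results are emitted in a random
    order (sometimes control first, sometimes test first), and any result
    already shown higher up is skipped.
    """
    flip = flip or (lambda: random.random() < 0.5)  # True => test result first
    shown, seen = [], set()
    for i in range(max(len(control), len(test))):
        pair = [lst[i] for lst in (control, test) if i < len(lst)]
        if len(pair) == 2 and flip():
            pair.reverse()  # put the test result at this rank first
        for result in pair:
            if result not in seen:
                seen.add(result)
                shown.append(result)
    return shown

# Reproducing the worked example with scripted coin flips
# (False = control first, True = test first at that rank):
flips = iter([False, False, True, False, False])
page = interleave_first_page(
    ["X", "Y", "Z", "W", "Q"],
    ["W", "Y", "Q", "R", "S", "T"],
    flip=lambda: next(flips),
)
# page == ["X", "W", "Y", "Q", "Z", "R", "S", "T"]
```

Note that the interleaved page always contains the union of the two result sets, which is why it can run longer than a normal first page.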
The re-ordering of shared results makes it unclear how heavily metrics like PaulScore will be affected (e.g., a click on W above should probably be counted as a click in 1st place for the test and a click in 4th place for the control), so the first (and maybe second and third) test using this scheme should have three buckets: the test bucket, the control bucket, and the interleaved test/control bucket. This will let us test the testing scheme itself and see how metrics from the interleaved bucket compare to those from traditional separated buckets.
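For the position-attribution question above, a click on the interleaved page can be mapped back to the rank the clicked result holds in each underlying ranking. This hypothetical helper (for illustration only) returns that pair of ranks, which a PaulScore-style metric could then discount per bucket however we decide:

```python
def attributed_ranks(control, test, clicked):
    """Return the 1-based rank of a clicked result in the control and test
    rankings, or None where the result does not appear at all."""
    def rank(results):
        return results.index(clicked) + 1 if clicked in results else None
    return rank(control), rank(test)

# For the worked example, a click on W counts as a click in 4th place
# for the control and 1st place for the test:
ranks = attributed_ranks(
    ["X", "Y", "Z", "W", "Q"],
    ["W", "Y", "Q", "R", "S", "T"],
    "W",
)
# ranks == (4, 1)
```

A result unique to one bucket (e.g., T) would get `None` for the other bucket, which the metric code would need to handle explicitly.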
Another alternative would be to consider only the control group's top 5 or top 10 results, since most clicks come from the top 5 and the vast majority from the top 10.