
Analysis of hewiki's A/B test (> 1% of search traffic with a new model)
Closed, Resolved (Public)

Description

We want to re-visit the analysis that we did in T175771 but with a new model used on hewiki.

Event Timeline

debt triaged this task as Medium priority. Dec 15 2017, 7:22 PM
debt created this task.

This test is ending when it rides the train and deploys to hewiki today. Please let @EBernhardson know if you have any questions or concerns about the data quality!

Thanks, @chelsyx and @mpopov, in advance — you two always do great work and we're looking forward to this test's analysis. :)

Thanks, Chelsy!

Compared to the previous report on 18 wikis, one thing I noticed is that the ZRR gap between control and LTR is not as big: 14.07% vs 16.51% in the last test; 13.47% vs 13.45% in the current test. ZRR isn't affected by LTR (it's a random property of the queries people choose to search), so now I'm wondering if some of the diffs in the previous test could be based on the ZRR diff. Question for @chelsyx: are all the metrics that require interaction with results only calculated on searches that actually got results? If so, then ZRR doesn't matter much, other than making the samples ~13-17% smaller. In particular, does abandonment rate include searches that got no results?

LTR for Hebrew just can't quite get the best result into top place as often as the control—but engagement is still similar and dwell time and scroll numbers for LTR results are better than for control (and they were in the previous test as well). [I always worry that increased dwell time and scroll numbers reflect difficulty in finding what you are looking for.]

Overall, LTR now looks pretty much comparable to the control results—close enough to me to warrant switching to LTR since it offers an easier pathway for future improvement.

Question for @chelsyx: are all the metrics that require interaction with results only calculated on searches that actually got results? If so, then ZRR doesn't matter much, other than making the samples ~13-17% smaller. In particular, does abandonment rate include searches that got no results?

@TJones Yes, all the metrics that require interaction with results are only calculated on searches that actually got results, including CTR, first clicked position, max clicked position, and search abandonment rate.
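To illustrate the point above, here is a minimal sketch (not the actual analysis code) of how ZRR can be computed over all searches while CTR and abandonment rate are restricted to searches that returned results. The `Search` record, its field names, and the definition of abandonment as "got results but no click" are assumptions made for this example only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Search:
    """Hypothetical per-search record; field names are illustrative only."""
    n_results: int  # number of results returned for the query
    n_clicks: int   # number of result clicks in this search


def zero_results_rate(searches: List[Search]) -> float:
    """Share of ALL searches that returned no results (ZRR)."""
    if not searches:
        return 0.0
    return sum(s.n_results == 0 for s in searches) / len(searches)


def clickthrough_rate(searches: List[Search]) -> float:
    """Share of searches WITH results that got at least one click."""
    with_results = [s for s in searches if s.n_results > 0]
    if not with_results:
        return 0.0
    return sum(s.n_clicks > 0 for s in with_results) / len(with_results)


def abandonment_rate(searches: List[Search]) -> float:
    """Share of searches WITH results that got no click (assumed definition)."""
    with_results = [s for s in searches if s.n_results > 0]
    if not with_results:
        return 0.0
    return sum(s.n_clicks == 0 for s in with_results) / len(with_results)


# Toy data: two zero-result searches, three searches with results.
sample = [
    Search(n_results=0, n_clicks=0),
    Search(n_results=10, n_clicks=1),
    Search(n_results=3, n_clicks=0),
    Search(n_results=0, n_clicks=0),
    Search(n_results=7, n_clicks=2),
]

zrr = zero_results_rate(sample)
ctr = clickthrough_rate(sample)
abandon = abandonment_rate(sample)
```

With this conditioning, a different ZRR between groups shrinks the denominator for the engagement metrics but does not otherwise feed into them, which matches the "ZRR doesn't matter much" reading in the question.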

LTR for Hebrew just can't quite get the best result into top place as often as the control—but engagement is still similar and dwell time and scroll numbers for LTR results are better than for control (and they were in the previous test as well). [I always worry that increased dwell time and scroll numbers reflect difficulty in finding what you are looking for.]

Yeah, @EBernhardson and I briefly talked about it on IRC a few weeks ago -- dwell time and scroll can go up for good reasons and bad. Without further study, it's hard to interpret them... :(

Shall we move this into the "Done" column as the report looks to have been reviewed?

debt moved this task from Needs review to Done on the Discovery-Analysis (Current work) board.

Thanks for all the info and Q&A! :)