We want to revisit the analysis that we did in T175771, but with a new model used on hewiki.
Description
Status | Subtype | Assigned | Task
---|---|---|---
Invalid | None | | T174064 [FY 2017-18 Objective] Implement advanced search methodologies
Resolved | | EBernhardson | T161632 [Epic] Improve search by researching and deploying machine learning to re-rank search results
Resolved | | EBernhardson | T182616 Re-run AB test for Hebrew Wikipedia (has > 1% of search traffic) with new model
Resolved | | • chelsyx | T183024 Analysis of hewiki's A/B test (> 1% of search traffic with a new model)
Event Timeline
This test ends when it rides the train and deploys to hewiki today. Please let @EBernhardson know if you have any questions or concerns about the data quality!
Thanks, @chelsyx and @mpopov, in advance — you two always do great work and we're looking forward to this test's analysis. :)
Thanks, Chelsy!
Compared to the previous report on 18 wikis, one thing I noticed is that the ZRR gap between control and LTR is not as big: 14.07% vs 16.51% in the last test; 13.47% vs 13.45% in the current test. ZRR isn't affected by LTR—it's a random property of the queries people choose to search—so now I'm wondering if some of the diffs in the previous test could be based on the ZRR diff. Question for @chelsyx: are all the metrics that require interaction with results calculated only on searches that actually got results? If so, then ZRR doesn't matter much, other than making the samples ~13-17% smaller. In particular, does the abandonment rate include searches that got no results?
LTR for Hebrew just can't quite get the best result into top place as often as the control—but engagement is still similar and dwell time and scroll numbers for LTR results are better than for control (and they were in the previous test as well). [I always worry that increased dwell time and scroll numbers reflect difficulty in finding what you are looking for.]
Overall, LTR now looks pretty much comparable to the control results—close enough to me to warrant switching to LTR since it offers an easier pathway for future improvement.
@TJones Yes, all the metrics that require interaction with results are calculated only on searches that actually got results, including CTR, first clicked position, max clicked position, and search abandonment rate.
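To make the distinction concrete, here is a minimal sketch (with hypothetical field names and toy data, not the actual analysis code) of how ZRR is computed over all searches while engagement metrics like CTR are restricted to searches that returned results:

```python
# Toy search log; "n_results" and "clicked" are assumed field names.
searches = [
    {"group": "control", "n_results": 0, "clicked": False},
    {"group": "control", "n_results": 5, "clicked": True},
    {"group": "ltr",     "n_results": 0, "clicked": False},
    {"group": "ltr",     "n_results": 8, "clicked": True},
    {"group": "ltr",     "n_results": 3, "clicked": False},
]

def zrr(rows):
    """Zero results rate: share of ALL searches that got no results."""
    return sum(r["n_results"] == 0 for r in rows) / len(rows)

def ctr(rows):
    """Clickthrough rate, computed only over searches with results."""
    with_results = [r for r in rows if r["n_results"] > 0]
    return sum(r["clicked"] for r in with_results) / len(with_results)

ltr = [r for r in searches if r["group"] == "ltr"]
print(zrr(ltr))  # 1 of 3 LTR searches got no results
print(ctr(ltr))  # CTR over only the 2 LTR searches with results
```

The point is that a ZRR difference between buckets shrinks the engagement-metric samples but doesn't otherwise bias them, since zero-result searches are filtered out before CTR and the click-position metrics are computed.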
> LTR for Hebrew just can't quite get the best result into top place as often as the control—but engagement is still similar and dwell time and scroll numbers for LTR results are better than for control (and they were in the previous test as well). [I always worry that increased dwell time and scroll numbers reflect difficulty in finding what you are looking for.]
Yeah, @EBernhardson and I briefly talked about it on IRC a few weeks ago -- dwell time and scroll can go up for good reasons and bad. Without further study, it's hard to interpret them... :(