We want to revisit the analysis that we did in T175771, but with a new model used on hewiki.
Description
Status | Subtype | Assigned | Task
---|---|---|---
Invalid | None | | T174064 [FY 2017-18 Objective] Implement advanced search methodologies
Resolved | | EBernhardson | T161632 [Epic] Improve search by researching and deploying machine learning to re-rank search results
Resolved | | EBernhardson | T182616 Re-run AB test for Hebrew Wikipedia (has > 1% of search traffic) with new model
Resolved | | • chelsyx | T183024 Analysis of hewiki's A/B test (> 1% of search traffic with a new model)
Event Timeline
This test ends when it rides the train and deploys to hewiki today. Please let @EBernhardson know if you have any questions or concerns about the data quality!
Thanks, @chelsyx and @mpopov, in advance — you two always do great work and we're looking forward to this test's analysis. :)
Thanks, Chelsy!
Compared to the previous report on 18 wikis, one thing I noticed is that the ZRR gap between control and LTR is not as big: 14.07% vs 16.51% in the last test; 13.47% vs 13.45% in the current test. ZRR isn't affected by LTR—it's a random property of the queries people choose to search—so now I'm wondering if some of the diffs in the previous test could be based on the ZRR diff. Question for @chelsyx: are all the metrics that require interaction with results calculated only on searches that actually got results? If so, then ZRR doesn't matter much, other than making the samples ~13-17% smaller. In particular, does the abandonment rate include searches that got no results?
LTR for Hebrew just can't quite get the best result into top place as often as the control—but engagement is still similar and dwell time and scroll numbers for LTR results are better than for control (and they were in the previous test as well). [I always worry that increased dwell time and scroll numbers reflect difficulty in finding what you are looking for.]
Overall, LTR now looks pretty much comparable to the control results—close enough to me to warrant switching to LTR since it offers an easier pathway for future improvement.
@TJones Yes, all the metrics that require interaction with results are calculated only on searches that actually got results, including CTR, first clicked position, max clicked position, and search abandonment rate.
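To make the distinction concrete, here is a minimal sketch (with hypothetical field names and toy data, not the actual analysis code) of how ZRR is computed over all searches while engagement metrics like CTR are restricted to searches that returned results:

```python
# Toy search log; "n_results" and "clicked" are assumed field names.
searches = [
    {"group": "control", "n_results": 0, "clicked": False},
    {"group": "control", "n_results": 5, "clicked": True},
    {"group": "ltr",     "n_results": 0, "clicked": False},
    {"group": "ltr",     "n_results": 8, "clicked": True},
    {"group": "ltr",     "n_results": 3, "clicked": False},
]

def zrr(rows):
    """Zero results rate: share of ALL searches that got no results."""
    return sum(r["n_results"] == 0 for r in rows) / len(rows)

def ctr(rows):
    """Clickthrough rate, computed only over searches with results."""
    with_results = [r for r in rows if r["n_results"] > 0]
    return sum(r["clicked"] for r in with_results) / len(with_results)

ltr = [r for r in searches if r["group"] == "ltr"]
print(zrr(ltr))  # 1 of 3 LTR searches got no results
print(ctr(ltr))  # CTR over only the 2 LTR searches with results
```

The point is that a ZRR difference between buckets shrinks the engagement-metric samples but doesn't otherwise bias them, since zero-result searches are filtered out before CTR and the click-position metrics are computed.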
> LTR for Hebrew just can't quite get the best result into top place as often as the control—but engagement is still similar and dwell time and scroll numbers for LTR results are better than for control (and they were in the previous test as well). [I always worry that increased dwell time and scroll numbers reflect difficulty in finding what you are looking for.]
Yeah, @EBernhardson and I briefly talked about it on IRC a few weeks ago -- dwell time and scroll can go up for good reasons and bad. Without further study, it's hard to interpret them... :(