
Analyse results of query explorer AB test
Closed, Resolved · Public

Description

An AB test was recently run on enwiki for T187148 from 2018-03-02 through 2018-03-16. The data is stored in HDFS using the SearchSatisfaction schema (hdfs://analytics-hadoop/wmf/data/raw/eventlogging/eventlogging_SearchSatisfaction/hourly/2018/03/*)

We ran 3 standard buckets:
control - The currently deployed ML ranker
classic - The classic non-ML ranker
explorer - The new ranker under test

We also ran 2 interleaved buckets:
control-explorer-i: control in A, explorer in B
classic-explorer-i: classic in A, explorer in B
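An interleaved bucket is typically scored per session: whichever side (A or B) contributed more of the clicked results wins that session's vote, and ties abstain. A minimal sketch of that credit assignment, using hypothetical synthetic click data rather than actual SearchSatisfaction events:

```python
from collections import Counter

def interleaved_preference(sessions):
    """Score an interleaved test: each session votes for the ranker
    (A or B) that contributed more of its clicked results; ties abstain.
    Returns (wins_a, wins_b, ties)."""
    tally = Counter()
    for clicks in sessions:
        a = sum(1 for side in clicks if side == "A")
        b = sum(1 for side in clicks if side == "B")
        if a > b:
            tally["A"] += 1
        elif b > a:
            tally["B"] += 1
        else:
            tally["tie"] += 1
    return tally["A"], tally["B"], tally["tie"]

# Hypothetical sessions: each is the list of sides whose results were clicked.
sessions = [["A", "B", "B"], ["B"], ["A", "A"], ["B", "A"]]
wins_a, wins_b, ties = interleaved_preference(sessions)
```

A sign test or binomial test over the per-session votes then gives the significance of the preference.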

The main question is whether explorer is better than control. Offline testing suggests a strong improvement. Classic was also included to measure how much we have improved over the current FY by working on ML ranking.

Event Timeline

This data needs to be extracted from sequence files on HDFS to TSV for analysis. I think we should still be able to use the method from T176493#3742836


Thanks @EBernhardson ! Don't worry about it. I can extract the data from searchsatisfaction on HDFS using R interface to Spark.

chelsyx triaged this task as Medium priority.

Report: https://analytics.wikimedia.org/datasets/discovery/reports/Evaluate_features_provided_by_query_explorer_functionality_of_ltr_plugin.html

I excluded data from 3/16 since there were very few records compared to other days (only 6,357 events).

Let me know if you have any questions! :D

Thanks @chelsyx!
Looks like it's a big win across the board for explorer; my takeaways are:

  • despite having a higher ZRR (which was not affected by the test), explorer wins. ZRR directly impacts CTR, so it was bad luck for explorer, which could have had an even better CTR with a similar ZRR.
  • CTR is significantly higher
  • explorer wins in all interleaved tests (I think it's the first time we see a group winning every day)
  • there's a big win in CTR for explorer on day 11 which seems unnatural. Perhaps it was due to a big session that day.
  • every other graph shows a preference for explorer except "Return to make a different search". But that one is hard to interpret and everything is within the error bounds.

I'd be curious to know what happened on day 11 but I don't think it'll affect my conclusions on this test.

I'm for enabling this new model in production.

Indeed this looks like the strongest improvement we've been able to test so far. I'll push this out to prod this week, and the models for the rest of the wikis with the new feature sets are processing now and should be ready this week as well.

One thought on the analysis: I think the PaulScore graphs used to have larger confidence intervals, which made it obvious they were overlapping. With the larger test sizes we've been running recently, the confidence intervals have narrowed and I can't tell from the graph anymore whether it beats the CI or not.
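The narrowing described here is the usual 1/sqrt(n) behaviour: a confidence interval for a proportion shrinks as the sample grows, so larger tests produce visibly tighter bands. A quick illustration using a normal-approximation CI and hypothetical proportions (not the test's actual numbers):

```python
import math

def ci_halfwidth(p, n, z=1.96):
    """Half-width of a 95% normal-approximation confidence interval
    for a proportion p observed over n samples."""
    return z * math.sqrt(p * (1 - p) / n)

small = ci_halfwidth(0.3, 1_000)    # older, smaller tests
large = ci_halfwidth(0.3, 100_000)  # recent, larger tests
# A 100x larger sample narrows the interval by a factor of exactly 10.
```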

Oh, also the load time graph: it's not super clear what the scale is; it looks like log-scale ms? If it is, then the right edge of the graph that says 1e15 represents 31k years. We could probably chop some of the right side of this graph off :)
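For reference, the arithmetic behind the 1e15 remark checks out: converting 1e15 milliseconds to years lands on the order of 30 thousand years, so that tail of the distribution can only be junk data.

```python
# Milliseconds in one (average Gregorian) year.
MS_PER_YEAR = 1000 * 60 * 60 * 24 * 365.25

# The right edge of the load-time graph, if the axis really is ms.
years = 1e15 / MS_PER_YEAR
# Roughly 31,689 years -- far beyond any plausible page load time.
```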

  • despite having a higher ZRR (which was not affected by the test), explorer wins. ZRR directly impacts CTR, so it was bad luck for explorer, which could have had an even better CTR with a similar ZRR.

@dcausse CTR was not impacted by ZRR. We removed all the zero-result searches when computing CTR.
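The point above can be made concrete: if CTR is computed only over searches that returned at least one result, the zero-result rate cannot mechanically depress it. A minimal sketch with hypothetical (num_results, was_clicked) pairs standing in for the real event data:

```python
def ctr_excluding_zero_results(searches):
    """CTR over searches that returned at least one result.
    Each search is a (num_results, was_clicked) pair; zero-result
    searches are excluded from both numerator and denominator."""
    eligible = [clicked for n_results, clicked in searches if n_results > 0]
    return sum(eligible) / len(eligible) if eligible else 0.0

searches = [(10, True), (0, False), (5, False), (3, True), (0, False)]
# Zero-result searches are dropped: CTR = 2 clicks / 3 eligible searches.
ctr = ctr_excluding_zero_results(searches)
```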

  • there's a big win in CTR for explorer on day 11 which seems unnatural. Perhaps it was due to a big session that day.

Yeah, I don't know what happened on March 11. I will look into that when I have time.

One thought on the analysis: I think the PaulScore graphs used to have larger confidence intervals, which made it obvious they were overlapping. With the larger test sizes we've been running recently, the confidence intervals have narrowed and I can't tell from the graph anymore whether it beats the CI or not.

@EBernhardson I will split the PaulScore plot into 3 separate graphs and change the y-scale.
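PaulScore rewards clicks that land near the top of the result list: each query contributes the sum of F^k over its clicked positions k (0-indexed) for a chosen factor F, averaged over queries. A sketch of that definition (my paraphrase of the metric, not the report's exact code):

```python
def paulscore(queries, f=0.7):
    """Average over queries of sum(f**pos) for each clicked result
    position (0-indexed); higher means clicks landed nearer the top."""
    if not queries:
        return 0.0
    return sum(sum(f ** pos for pos in clicks) for clicks in queries) / len(queries)

# Hypothetical click positions per query: a click on the top result
# scores f**0 = 1.0; a click at position 2 scores 0.7**2 = 0.49.
queries = [[0], [2], [0, 1]]
score = paulscore(queries, f=0.7)
```

Varying F (commonly something like 0.5 to 0.9) controls how steeply lower-ranked clicks are discounted, which is why the report shows several PaulScore curves.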

Oh, also the load time graph: it's not super clear what the scale is; it looks like log-scale ms? If it is, then the right edge of the graph that says 1e15 represents 31k years. We could probably chop some of the right side of this graph off :)

Will do!

The report is updated: https://analytics.wikimedia.org/datasets/discovery/reports/Evaluate_features_provided_by_query_explorer_functionality_of_ltr_plugin.html

  • Changed the layout and the y-axis scale of the PaulScore graph. Explorer's PaulScore is statistically significantly higher!
  • Deleted records with extremely large load times