
Analyse results of query explorer AB test
Closed, Resolved · Public

Description

An AB test was recently run on enwiki for T187148 from 2018-03-02 through 2018-03-16. The data is stored in HDFS using the SearchSatisfaction schema (hdfs://analytics-hadoop/wmf/data/raw/eventlogging/eventlogging_SearchSatisfaction/hourly/2018/03/*)

We ran 3 standard buckets:
control - The currently deployed ML ranker
classic - The classic non-ML ranker
explorer - The new ranker under test

We also ran 2 interleaved buckets:
control-explorer-i: control in A, explorer in B
classic-explorer-i: classic in A, explorer in B
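An interleaved bucket is typically scored per session: whichever side (A or B) contributed more of the clicked results wins that session's vote, and ties abstain. A minimal sketch of that credit assignment, using hypothetical synthetic click data rather than actual SearchSatisfaction events:

```python
from collections import Counter

def interleaved_preference(sessions):
    """Score an interleaved test: each session votes for the ranker
    (A or B) that contributed more of its clicked results; ties abstain.
    Returns (wins_a, wins_b, ties)."""
    tally = Counter()
    for clicks in sessions:
        a = sum(1 for side in clicks if side == "A")
        b = sum(1 for side in clicks if side == "B")
        if a > b:
            tally["A"] += 1
        elif b > a:
            tally["B"] += 1
        else:
            tally["tie"] += 1
    return tally["A"], tally["B"], tally["tie"]

# Hypothetical sessions: each is the list of sides whose results were clicked.
sessions = [["A", "B", "B"], ["B"], ["A", "A"], ["B", "A"]]
wins_a, wins_b, ties = interleaved_preference(sessions)
```

A sign test or binomial test over the per-session votes then gives the significance of the preference.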

The main question is whether explorer is better than control. Offline testing suggests a strong improvement. Classic was also included to measure how much we have improved over the current FY by working on ML ranking.

Event Timeline

This data needs to be extracted from sequence files on HDFS to TSV for analysis. I think we should still be able to use the method from T176493#3742836


Thanks @EBernhardson ! Don't worry about it. I can extract the data from searchsatisfaction on HDFS using R interface to Spark.

chelsyx triaged this task as Medium priority.

Report: https://analytics.wikimedia.org/datasets/discovery/reports/Evaluate_features_provided_by_query_explorer_functionality_of_ltr_plugin.html

I excluded data from 3/16 since there were very few records compared to other days (only 6,357 events).

Let me know if you have any questions! :D

Thanks @chelsyx!
Looks like it's a big win across the board for explorer; my takeaways are:

  • despite having a higher ZRR (which was not affected by the test), explorer wins. ZRR directly impacts CTR, so it was bad luck for explorer, which could have had an even better CTR with a similar ZRR.
  • CTR is significantly higher
  • explorer wins in all interleaved tests (I think it's the first time we see a group winning every day)
  • there's a big win in CTR for explorer on day 11 which seems unnatural. Perhaps it was due to a big session that day.
  • every other graph shows a preference for explorer except "Return to make a different search". But that one is hard to interpret and everything is within the error bounds.

I'd be curious to know what happened on day 11 but I don't think it'll affect my conclusions on this test.

I'm for enabling this new model in production.

Indeed this looks like the strongest improvement we've been able to test so far. I'll push this out to prod this week, and the models for the rest of the wikis with the new feature sets are processing now and should be ready this week as well.

One thought on the analysis: I think the PaulScore graphs used to have larger confidence intervals, which made it obvious they were overlapping. With the larger test sizes we've been running recently, the confidence intervals have narrowed and I can't tell from the graph anymore whether it beats the CI or not.
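The narrowing described here is the usual 1/sqrt(n) behaviour: a confidence interval for a proportion shrinks as the sample grows, so larger tests produce visibly tighter bands. A quick illustration using a normal-approximation CI and hypothetical proportions (not the test's actual numbers):

```python
import math

def ci_halfwidth(p, n, z=1.96):
    """Half-width of a 95% normal-approximation confidence interval
    for a proportion p observed over n samples."""
    return z * math.sqrt(p * (1 - p) / n)

small = ci_halfwidth(0.3, 1_000)    # older, smaller tests
large = ci_halfwidth(0.3, 100_000)  # recent, larger tests
# A 100x larger sample narrows the interval by a factor of exactly 10.
```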

Oh, also the load time graph: it's not super clear what the scale is; it looks like log-scale ms? If it is, then the right edge of the graph that says 1e15 represents 31k years. We could probably chop some of the right side of this graph off :)
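For reference, the arithmetic behind the 1e15 remark checks out: converting 1e15 milliseconds to years lands on the order of 30 thousand years, so that tail of the distribution can only be junk data.

```python
# Milliseconds in one (average Gregorian) year.
MS_PER_YEAR = 1000 * 60 * 60 * 24 * 365.25

# The right edge of the load-time graph, if the axis really is ms.
years = 1e15 / MS_PER_YEAR
# Roughly 31,689 years -- far beyond any plausible page load time.
```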

  • despite having a higher ZRR (which was not affected by the test), explorer wins. ZRR directly impacts CTR, so it was bad luck for explorer, which could have had an even better CTR with a similar ZRR.

@dcausse CTR was not impacted by ZRR. We removed all the zero-result searches when computing CTR.
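The point above can be made concrete: if CTR is computed only over searches that returned at least one result, the zero-result rate cannot mechanically depress it. A minimal sketch with hypothetical (num_results, was_clicked) pairs standing in for the real event data:

```python
def ctr_excluding_zero_results(searches):
    """CTR over searches that returned at least one result.
    Each search is a (num_results, was_clicked) pair; zero-result
    searches are excluded from both numerator and denominator."""
    eligible = [clicked for n_results, clicked in searches if n_results > 0]
    return sum(eligible) / len(eligible) if eligible else 0.0

searches = [(10, True), (0, False), (5, False), (3, True), (0, False)]
# Zero-result searches are dropped: CTR = 2 clicks / 3 eligible searches.
ctr = ctr_excluding_zero_results(searches)
```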

  • there's a big win in CTR for explorer on day 11 which seems unnatural. Perhaps it was due to a big session that day.

Yeah, I don't know what happened on March 11. I will look into that when I have time.

One thought on the analysis: I think the PaulScore graphs used to have larger confidence intervals, which made it obvious they were overlapping. With the larger test sizes we've been running recently, the confidence intervals have narrowed and I can't tell from the graph anymore whether it beats the CI or not.

@EBernhardson I will split the PaulScore plot into 3 separate graphs and change the y-scale.
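PaulScore rewards clicks that land near the top of the result list: each query contributes the sum of F^k over its clicked positions k (0-indexed) for a chosen factor F, averaged over queries. A sketch of that definition (my paraphrase of the metric, not the report's exact code):

```python
def paulscore(queries, f=0.7):
    """Average over queries of sum(f**pos) for each clicked result
    position (0-indexed); higher means clicks landed nearer the top."""
    if not queries:
        return 0.0
    return sum(sum(f ** pos for pos in clicks) for clicks in queries) / len(queries)

# Hypothetical click positions per query: a click on the top result
# scores f**0 = 1.0; a click at position 2 scores 0.7**2 = 0.49.
queries = [[0], [2], [0, 1]]
score = paulscore(queries, f=0.7)
```

Varying F (commonly something like 0.5 to 0.9) controls how steeply lower-ranked clicks are discounted, which is why the report shows several PaulScore curves.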

Oh, also the load time graph: it's not super clear what the scale is; it looks like log-scale ms? If it is, then the right edge of the graph that says 1e15 represents 31k years. We could probably chop some of the right side of this graph off :)

Will do!

The report is updated: https://analytics.wikimedia.org/datasets/discovery/reports/Evaluate_features_provided_by_query_explorer_functionality_of_ltr_plugin.html

  • Changed the layout and the y-axis scale of the PaulScore graph. Explorer's PaulScore is statistically significantly higher!
  • Deleted records with extremely large load times