
Analyze Media Search A/B test
Closed, Resolved · Public

Description

Now that the A/B test in T254388 is complete, we need to analyze the results to determine whether we can move forward with using the new MediaSearch results.

The plan is to use the simple analysis of preference from interleaved A/B tests described here: Estimating Preference For Ranking Functions With Clicks On Interleaved Search Results.
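As a rough illustration of that method, here is a minimal sketch in R. The column names (`session_id`, `ranker`) and the `clicks` data frame are illustrative assumptions, not the SearchSatisfaction schema; the real analysis lives in the notebook linked further down. Each session "votes" for the ranker whose results it clicked more often, ties are dropped, and the share of sessions preferring Media Search is tested against a 50/50 no-preference null.

```
library(dplyr)
library(tidyr)

# Hypothetical input: `clicks` has one row per click, with columns
# `session_id` and `ranker` ("control" or "media_search").
session_votes <- clicks %>%
  count(session_id, ranker) %>%
  pivot_wider(names_from = ranker, values_from = n, values_fill = 0) %>%
  mutate(vote = case_when(
    media_search > control ~ "media_search",
    control > media_search ~ "control",
    TRUE ~ NA_character_          # ties carry no preference signal
  )) %>%
  filter(!is.na(vote))

# Two-sided binomial test of the share of sessions preferring Media Search
# against the no-preference null of 50%.
binom.test(sum(session_votes$vote == "media_search"),
           nrow(session_votes),
           p = 0.5)
```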

Event Timeline

Note that this analysis was originally part of T254388 and was already on @nettrom_WMF's radar to finish up when he returns from vacation on September 8, but we decided to break the analysis into a separate ticket since the A/B test itself is complete.

LGoto triaged this task as High priority.Sep 8 2020, 5:07 PM
LGoto edited projects, added Product-Analytics (Kanban); removed Product-Analytics.

Moving this out of "Doing": we've discovered a bug in the data gathering that makes it impossible to determine which algorithm produced a clicked result when interleaving occurred. We'll pick up the analysis again once the second iteration of the test has been completed. And yes, we'll QA the data after relaunch to make sure it's working correctly.

nettrom_WMF added subscribers: Ramsey-WMF, mpopov.

The analysis has been done and can be found in this Jupyter/R notebook. We find a slight preference for the control condition (legacy search) over Media Search.

I'd like to extend a thank you to @mpopov for reviewing this work! :)

@CBogen & @Ramsey-WMF: let me know what questions you might have about this.

We're unsure if the finding is trustworthy. I'm moving this back to "Doing" to dig further into this.

A huge thanks to @mpopov for doing a lot of work on this, improving the data processing code and figuring out ways to massage the data from SearchSatisfaction to pull out the insights!

I've updated the notebook on GitHub with the improved analysis. We've extensively QAed this notebook as well as the old processing code in order to understand where things work and where they break. As far as I can tell, this is as good as we can get it for now: extracting more data would require a lot more time. Instead, I think we should call this good. If we run additional tests, I recommend changing the instrumentation code to explicitly store which team/algorithm produced a clicked/visited result, which removes the challenge of mapping each click/visit back to a SERP in post-processing (see the sketch below).
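To make that recommendation concrete, here is a hypothetical sketch of why it helps. The table and column names (`serp_results`, `search_id`, `position`, `ranker`) are made up for illustration and do not reflect the actual SearchSatisfaction events.

```
library(dplyr)

# Current post-processing: attribute each click by joining it back to the
# interleaved SERP and looking up which ranker supplied the clicked position.
# (`serp_results`: hypothetical table, one row per search_id/position/ranker.)
attributed <- clicks %>%
  inner_join(serp_results, by = c("search_id", "position")) %>%
  select(session_id, ranker)

# With the recommended instrumentation change, the click event itself records
# the ranker, so attribution is just reading a column and no join is required:
# attributed <- clicks %>% select(session_id, ranker)
```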

The conclusion changes in the new notebook: we find a strong preference for the new Media Search algorithm.

Now that the subtask is resolved and the notebook is accessible, I'm closing this task as well.