
Run A/B test on the search suggester to measure zero results rate, starting on 2015-09-08
Closed, Resolved · Public


The Discovery Department's Q1 goal is to reduce the zero results rate. We've built out a new API (T105746) as an experiment. Initial tests of the API (T109729) were promising. Now let's test it more thoroughly.

Area: Search bar at the top right on desktop on English Wikipedia
Bucketing: 0.01% control, 0.01% experimental group
Measuring: zero results rate, hoping it will decrease with the experimental group
Start date: 2015-09-08
Duration: two weeks

Event Timeline

Deskana raised the priority of this task to High.
Deskana updated the task description. (Show Details)
Deskana added a subscriber: Deskana.
Deskana renamed this task from "Run A/B test on the search suggester to measure zero results rate" to "Run A/B test on the search suggester to measure zero results rate, starting on 2015-09-08". Sep 1 2015, 5:20 PM
Deskana set Security to None.

Science details for transparency: we agreed we want to be able to detect when the experimental group has at least 1.5 times the odds of getting results compared to the control group, with 99% power to detect this effect and 95% confidence (meaning that a p-value less than 0.05 will be called significant). We guessed that 65% of the control group is going to get some results, which means that our sample size should be...

R> wmf::sample_size_odds(odds_ratio = 1.5, p_control = 0.65, power = 0.99, conf_level = 0.95, sample_ratio = 1)

2133 observations, although that's based on a guessed 65% prevalence of getting results in the control group (i.e., a 35% zero results rate). The final number will change as we figure out a more accurate, enwiki-specific prevalence.
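For readers without the wmf R package, the same calculation can be approximated in Python. This sketch uses the standard two-proportion z-test sample-size formula; the internals of wmf::sample_size_odds may differ slightly, so the result lands near, but not necessarily exactly at, the 2133 figure above. The function name and signature here mirror the R call but are otherwise assumptions.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_odds(odds_ratio, p_control, power, conf_level):
    """Per-group sample size for detecting an odds ratio between two
    proportions, via the two-proportion z-test approximation."""
    # Turn the control proportion and target odds ratio into the
    # expected proportion of the experimental group getting results.
    odds_control = p_control / (1 - p_control)
    odds_exp = odds_ratio * odds_control
    p_exp = odds_exp / (1 + odds_exp)

    z_alpha = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)  # two-sided
    z_beta = NormalDist().inv_cdf(power)

    p_bar = (p_control + p_exp) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p_control * (1 - p_control)
                                 + p_exp * (1 - p_exp))) ** 2
    return ceil(numerator / (p_exp - p_control) ** 2)

n_per_group = sample_size_odds(1.5, 0.65, power=0.99, conf_level=0.95)
print(n_per_group)  # per-group size; total needed is roughly double this
```

With these inputs the per-group figure comes out a little over a thousand, i.e. a total in the same ballpark as the 2133 observations quoted above.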

For now, we are going with a sampling rate of 0.01%, and we will sub-sample down to the actual sample size after we collect the data.
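The sub-sampling step itself is simple. A minimal sketch, assuming the collected events arrive as a list of records (the event shape and the fixed seed are illustrative assumptions):

```python
import random

def subsample(events, target_n, seed=42):
    """Randomly down-sample collected events to the planned sample size.
    If fewer events were collected than needed, keep them all."""
    if len(events) <= target_n:
        return list(events)
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    return rng.sample(events, target_n)

# Hypothetical collected data: many more events than the analysis needs.
collected = [{"event_id": i} for i in range(10_000)]
analysis_set = subsample(collected, target_n=2133)
print(len(analysis_set))  # 2133
```

Sampling with a fixed seed means the analysis can be re-run later on exactly the same subset.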

Change 235345 had a related patch set uploaded (by EBernhardson):
[WIP] A/B test for experimental suggestions api

Typically when opting users into a user test, we do it either on a per-page basis or on a longer per-session basis, to provide a consistent experience for the user. I think in this case getting different suggestions on different pages is acceptable, but within a single page load the user will see either the new suggestions or the old ones, never a mix. Additionally, since this is a search-as-you-type feature, there will be multiple events per user opted into the test. Will that work, or do I need to adjust things?
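The per-page-load bucketing described above can be sketched as follows. This is a hypothetical Python illustration (the real assignment happens in the frontend JavaScript of the search bar); the rate constant and bucket names are assumptions based on the 0.01% / 0.01% split in the task description.

```python
import random
from collections import Counter

SAMPLE_RATE = 0.0001  # 0.01% of page loads into each bucket

def assign_bucket(rng=random):
    """Assign one page load to a bucket. Returns 'control', 'experimental',
    or None when the page load is not sampled into the test at all.
    Within one page load the assignment is made once, so the user sees
    either the old or the new suggestions, never a mix."""
    r = rng.random()
    if r < SAMPLE_RATE:
        return "control"
    if r < 2 * SAMPLE_RATE:
        return "experimental"
    return None  # not in the test; no events logged for this page load

# Over many simulated page loads, each bucket gets roughly 0.01% of traffic.
rng = random.Random(0)
counts = Counter(assign_bucket(rng) for _ in range(1_000_000))
```

Because the draw happens once per page load, all search-as-you-type events within that page load inherit the same bucket.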

I've got the following sketched out as the schema:

There is something else that may or may not affect the test. At a minimum, it's something we don't consider when measuring our zero results rate:

There isn't any throttling of the suggestion requests in MediaWiki. Each time you type a new letter, the old request is canceled and a new request is issued. When we log these on the backend, we don't know that a request was canceled; PHP always runs the request to completion even if the client disconnects.

If we logged naively in the frontend instead, we would only log the responses that are actually shown to the user. This would be consistent across both arms of the test, just not with our other measurements. That seems sane to me; I just wanted to make it known.

Change 235391 had a related patch set uploaded (by EBernhardson):
Enable experiment with experimental completion suggester

It looks like opensearch responses are cached by Varnish, but I'm not sure whether the new API's responses will be cached. Will this affect the test?

Change 235391 merged by jenkins-bot:
Enable experiment with experimental completion suggester

Test is currently running, so marking as resolved.