
Analyse results of the swap2and3 search test
Closed, Resolved (Public)

Description

Search ran a test swapping the second and third results to see how much our users care about the position of a result versus its actual content. This should be relatively simple to analyse: compare click-through rates by result position in the control bucket (no subtest) and the swapped bucket (swap2and3) to see whether they changed. Make sure the analysis only looks at full-text search; autocomplete was not affected by this test.

mysql:research@analytics-store.eqiad.wmnet [(none)]> select min(timestamp), max(timestamp) from log.TestSearchSatisfaction2_15357244 where event_subTest = 'swap2and3';
+----------------+----------------+
| min(timestamp) | max(timestamp) |
+----------------+----------------+
| 20160406064000 | 20160523073641 |
+----------------+----------------+

Event Timeline

EBernhardson updated the task description. May 23 2016, 5:00 PM
EBernhardson added a comment (edited). May 23 2016, 5:35 PM

Not sure why a few swap2and3 events are still coming in; that test was reverted April 26 :S The best bet is probably to look at Apr 7 - Apr 25.

I did some very basic looking at the data we have. The swap2and3 bucket should be the same size as the null bucket, but it is not. In the future we should probably create explicit control buckets rather than using null as the control, since some users may not be getting the new JS.

Clickthrough to positions 2 and 3 via visitPage events

select event_subTest, event_position, count(1) as total_clicks
  from log.TestSearchSatisfaction2_15357244
 where event_action = 'visitPage'
   and timestamp between 20160407000000 and 20160426000000
   and event_position is not null
   and event_position between 2 and 3
   and event_source = 'fulltext'
 group by event_subTest, event_position;

+---------------+----------------+--------------+
| event_subTest | event_position | total_clicks |
+---------------+----------------+--------------+
| NULL          |              2 |         8001 |
| NULL          |              3 |         4195 |
| swap2and3     |              2 |         4913 |
| swap2and3     |              3 |         3829 |
+---------------+----------------+--------------+

Clickthrough to positions 2 and 3 via click events

select event_subTest, event_position, count(1) as total_clicks
  from log.TestSearchSatisfaction2_15357244
 where event_action = 'click'
   and timestamp between 20160407000000 and 20160426000000
   and event_position is not null
   and event_position between 2 and 3
   and event_source = 'fulltext'
 group by event_subTest, event_position;

+---------------+----------------+--------------+
| event_subTest | event_position | total_clicks |
+---------------+----------------+--------------+
| NULL          |              2 |         7873 |
| NULL          |              3 |         4122 |
| swap2and3     |              2 |         4821 |
| swap2and3     |              3 |         3766 |
+---------------+----------------+--------------+

Number of distinct sessions in each bucket

select event_subTest, count(distinct event_searchSessionId)
  from log.TestSearchSatisfaction2_15357244
 where timestamp between 20160407000000 and 20160426000000
   and event_source = 'fulltext'
 group by event_subTest;

+----------------+---------------------------------------+
| event_subTest  | count(distinct event_searchSessionId) |
+----------------+---------------------------------------+
| NULL           |                                164359 |
| swap2and3      |                                143943 |
+----------------+---------------------------------------+
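
The bucket-size mismatch noted above can be checked with a sample ratio mismatch test. This is a minimal sketch, assuming the intended split was 50/50 (the comment above says the buckets "should be the same size" but does not describe the assignment mechanism). The function name `srm_chi_square` is illustrative, not from any report on this task.

```python
import math

# Distinct full-text search sessions per bucket, copied from the table above.
sessions = {"control (NULL)": 164359, "swap2and3": 143943}

def srm_chi_square(counts, expected_share):
    """Chi-square goodness-of-fit statistic for observed bucket sizes."""
    total = sum(counts.values())
    stat = 0.0
    for bucket, observed in counts.items():
        expected = total * expected_share[bucket]
        stat += (observed - expected) ** 2 / expected
    return stat

# Assumption: sessions were meant to be split evenly between the two buckets.
expected_share = {"control (NULL)": 0.5, "swap2and3": 0.5}
chi2 = srm_chi_square(sessions, expected_share)

# With two buckets there is 1 degree of freedom, so the chi-square
# survival function reduces to erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))
print(f"chi2 = {chi2:.1f}, p = {p_value:.3g}")
```

A very small p-value here would say the imbalance is not sampling noise, which fits the suspicion that users without the new JS fall into the null bucket. It does not say which direction any resulting bias takes.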

A very naive analysis:

We need to figure out what's going on with visitPage vs click events before deciding much ... but it looks like (maybe) clicks to position 2 drop, but clicks to position 3 don't rise to compensate. People are perhaps not even looking past position 2?
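
Because the two buckets differ in size, raw click counts are not directly comparable. A rough sketch of the normalisation, dividing visitPage clicks by distinct sessions in each bucket. The session count covers every full-text session in the bucket, not only sessions that displayed results, so treat this as a proxy for click-through rather than a proper rate. The names `control` (for the NULL bucket) and `clicks_per_session` are illustrative.

```python
# Counts copied from the visitPage table and the session table above.
sessions = {"control": 164359, "swap2and3": 143943}
visit_page_clicks = {
    ("control", 2): 8001,
    ("control", 3): 4195,
    ("swap2and3", 2): 4913,
    ("swap2and3", 3): 3829,
}

def clicks_per_session(clicks, sessions):
    """Divide each (bucket, position) click count by that bucket's session count."""
    return {
        (bucket, position): count / sessions[bucket]
        for (bucket, position), count in clicks.items()
    }

rates = clicks_per_session(visit_page_clicks, sessions)
for (bucket, position), rate in sorted(rates.items()):
    print(f"{bucket:10s} position {position}: {rate:.3%}")
```

On these numbers, clicks per session at position 2 fall from about 4.87% to about 3.41%, while position 3 moves only from about 2.55% to about 2.66%, which matches the naive reading above.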

debt triaged this task as Normal priority. May 31 2016, 8:27 PM
debt moved this task from Needs triage to Up Next on the Discovery-Analysis board.
debt added a subscriber: debt.

Let's go ahead and investigate this...

First draft: http://wikimedia-research.github.io/Discovery-Research-Portal/swap2and3/
Still trying to figure out how to interpret the result...

debt added a comment. Mar 3 2017, 10:32 PM

so odd that this happens...

We can see that test group users are less likely to click on the second result first than the control group, while they are more likely to click on the third result first.

mpopov added a subscriber: mpopov. Mar 10 2017, 12:45 AM

Second draft: http://wikimedia-research.github.io/Discovery-Research-Portal/swap2and3/

@debt, instead of "people care more about position than content", my guess is that users tend to look at only the first two results. When they find that neither is relevant, some of them go on to look at the third result and pay more attention to its actual content. That would explain the odd results we saw... However, this test cannot prove my guess... :(

debt added a subscriber: TJones. Mar 16 2017, 1:05 PM

Looks very interesting - and based on this comment:

Additionally, instead of making explicit control buckets, we simply treat those users who didn’t get assigned to test group as control group users. We suspect that this behavior results in putting all users who don’t have the new javascript into control group automatically, so some metrics may be biased.

it sounds like we should run another test to see whether our visitors are more likely to click based on the position of a result or on its actual content. Do you agree?

@TJones - would you have some time to review @chelsyx's findings and would you agree that doing another test of the 2nd and 3rd positions would be of interest and useful?

I will take a look!

Comments sent to Chelsy directly.

I've got a follow up idea I want to work out with Chelsy, Erik, and Mikhail that might be more interesting than just a 2nd vs 3rd swap...

debt added a comment. Mar 16 2017, 9:25 PM

Cool, @TJones - let me know where I can help!

@TJones is that follow-up idea related to Propensity SVM-Rank? In the paper Unbiased Learning-to-Rank with Biased Feedback they suggest a simple method: swap the first result with results at various other positions, then compare the click-through rate of the first position with those of the other positions to measure users' propensity to click by position.

@EBernhardson, my follow-up idea wasn't particularly tied to anything, just something that came to me while reviewing an earlier draft of the report. @chelsyx captured it nicely in T167824.

Sounds similar, though: just move some stuff around, see how much it changes click percentages and extrapolate weights for your exponential decay equation. If you think the paper makes a good case for only moving the top item, that would be helpful info to include on the other ticket (T167824), where Chelsy's asked for experimental design ideas.
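
The "move stuff around and compare click percentages" idea can be illustrated with this task's own counts. This is a sketch only, under three assumptions that the thread does not confirm: click probability factors into a position propensity times result relevance, `event_position` records the displayed position, and the two buckets are otherwise comparable (the bucket-size mismatch discussed earlier casts doubt on that). Under those assumptions, the same result shown at position 2 in one bucket and at position 3 in the other gives an estimate of the relative propensity of position 3 to position 2. All names below are illustrative.

```python
# visitPage clicks and distinct sessions, copied from the tables earlier in this task.
sessions = {"control": 164359, "swap2and3": 143943}
clicks = {
    ("control", 2): 8001,
    ("control", 3): 4195,
    ("swap2and3", 2): 4913,
    ("swap2and3", 3): 3829,
}

def rate(bucket, position):
    """Clicks per session at a displayed position (a rough click-through proxy)."""
    return clicks[(bucket, position)] / sessions[bucket]

# Result originally ranked 2: displayed at 2 in control, at 3 in swap2and3.
ratio_from_rank2_result = rate("swap2and3", 3) / rate("control", 2)

# Result originally ranked 3: displayed at 3 in control, at 2 in swap2and3.
ratio_from_rank3_result = rate("control", 3) / rate("swap2and3", 2)

print(f"position 3 vs 2, from the rank-2 result: {ratio_from_rank2_result:.3f}")
print(f"position 3 vs 2, from the rank-3 result: {ratio_from_rank3_result:.3f}")
```

If position alone scaled click probability, the two ratios would roughly agree. A gap between them points to bucket bias, or to users weighing content differently once they read past the second result, and these counts cannot separate the two.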

Good work, @chelsyx! Minor changes here and there: https://github.com/wikimedia-research/Discovery-Search-Test-Swap2and3/pull/1

And then I think you're good to go!