
Re-analyse data from phrase rescore boost of 1 A/B test with single word queries excluded to see if/how it changes the result
Closed, Resolved · Public · 2 Estimated Story Points

Description

Per T129608#2187453, we should re-analyse the data from the phrase rescore boost test with single word queries excluded to see if this affects the results of our analysis.

If the results are the same:

  • Publish the draft in T129608#2166790, with an added comment that a re-analysis was done and that it didn't change the result

If the results are different:

  • Write a new report instead, highlighting the new results
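
As a rough sketch of the exclusion step (not the actual analysis pipeline): assuming the per-event search string is available in a query column, single-word queries could be dropped before recomputing the metrics. Both the file name and the column name below are placeholders, not the real EventLogging schema.

import pandas as pd

# Placeholder input: one row per full-text search event from the test period.
events = pd.read_csv("phrase_rescore_events.csv")

# Keep only queries with more than one whitespace-separated token; the "query"
# column name is an assumption, not necessarily what the schema calls it.
multi_word = events[events["query"].str.strip().str.split().str.len() > 1]

print(f"kept {len(multi_word)} of {len(events)} events after excluding single-word queries")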

Event Timeline

Semi-related: I notice the last document says something about ~135k sessions in the test period. Querying EventLogging, I get (slightly chopping off the ends of the test):

mysql:research@analytics-store.eqiad.wmnet [log]> select count(distinct event_searchSessionId), event_subTest from TestSearchSatisfaction2_15357244 where timestamp between 20160315000000 and 20160323120000 group by event_subTest;

+---------------------------------------+----------------+
| count(distinct event_searchSessionId) | event_subTest  |
+---------------------------------------+----------------+
|                                198228 | NULL           |
|                                178843 | phraseBoostEq1 |
+---------------------------------------+----------------+
2 rows in set (40.69 sec)

Doesn't this suggest there are ~377k sessions?

EDIT:

Realised I didn't limit this to sessions that include full-text queries; with that restriction it makes sense:

mysql:research@analytics-store.eqiad.wmnet [log]> select count(distinct event_searchSessionId), event_subTest from TestSearchSatisfaction2_15357244 where timestamp between 20160315000000 and 20160323120000 and event_source = 'fulltext' group by event_subTest;
+---------------------------------------+----------------+
| count(distinct event_searchSessionId) | event_subTest  |
+---------------------------------------+----------------+
|                                 75212 | NULL           |
|                                 65883 | phraseBoostEq1 |
+---------------------------------------+----------------+

Also perhaps interesting: I talked to @JustinOrmont on IRC about evaluating tests. The spreadsheet referenced below applies frequentist methods and is at https://docs.google.com/spreadsheets/d/1sY8iO7XlmauJbZ7iZOd4E5ofqLzACW5S1-cVldlUssw/edit#gid=0:

16:06 < JustinO> When evaluating results like these, I'd put the numbers into the spreadsheet I sent you. Though more generically, my flight score cards report the metrics which have obtained statistical 
                 significance. They show the raw metric scores of A & B, then show the gain/loss if (and only if) the change is statistically significant.
16:15 < ebernhardson> i'm not entirely sure how to define the "positive samples" part of the spreadsheet for this latest test, we have a largish number of sessions to work with (i think the doc says ~135k 
                      for the week) but no single metric for success/failure of a session
16:19 < ebernhardson> or is positive samples just the set of possibly affected queries?
16:22 < JustinO> positive samples depend on how you define it (and it's symmetrical too). eg: positive = # of clicks in pos 1-3. sample size = number of queries.
16:23 < JustinO> the symmetrical part means you'll get the same answer of significance if you choose the other side. eg: positive = # of clicks in position 4+.
16:24 < JustinO> or, you may rather measure: sample size = number of clicks
16:24 < JustinO> not number of queries.
16:29 < JustinO> control: samples = total number of clicks in control group. positive samples = number of clicks in positions 1-3 in the control group.
16:29 < JustinO> variation: samples = total number of clicks in test group. positive samples = number of clicks in positions 1-3 in the test group.
16:30 < ebernhardson> ok that makes sense, thanks!
16:36 < JustinO> I don't see the total number of clicks in the experiment. but it does list "134,952 independent full-text search session". Let's assume the control & variation group both got ((134,952 / 
                 2) * 0.7) clicks. The 0.7 is a fudge factor to account for people not clicking on anything in the session (abandonment rate). From Fig. 3, ~72.9% of clicks in the control, and ~73.7% of 
                 clicks in the variation group clicked in position 1-3. (note this is a
16:36 < ebernhardson> we have a huge abandonment rate, so the total clicks works out to
16:36 < JustinO> so your sheet would contain:
16:37 < ebernhardson> | count(1) | sum(if(event_position between 1 and 3, 1, 0)) | event_subTest  |
16:37 < ebernhardson> +----------+-----------------------------------------------+----------------+
16:37 < ebernhardson> |    30299 |                                         20716 | NULL           |
16:37 < ebernhardson> |    21551 |                                         15273 | phraseBoostEq1 |
16:37 < JustinO> Control 47233 34432.857
16:37 < JustinO> Variation 47233 34810.721
16:37 < JustinO> awesome
16:38 < ebernhardson> grr, my screenshot thingy is acting up
16:39 < ebernhardson> but it puts positive at 68.37% control, 70.87% test with a very tiny p value
16:39 < JustinO> yep
16:40 < ebernhardson> so interestingly, by that metric the phrase boost is a positive improvement. small but positive
16:40 < JustinO> correct
16:41 < JustinO> specifically in metric of: number of clicks in pos 1-3.

The click-throughs there were measured using the query:

select count(1), sum(if(event_position between 1 and 3, 1, 0)), event_subTest from log.TestSearchSatisfaction2_15357244 where timestamp between 20160315000000 and 20160323120000 and event_action = 'visitPage' group by event_subTest;
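
For reference, the 68.37% / 70.87% split and the "very tiny p value" quoted in the IRC log can be reproduced with a standard two-proportion z-test; this is only a sketch of that check (using statsmodels), not the spreadsheet itself:

from statsmodels.stats.proportion import proportions_ztest

# From the query above (autocomplete still included at this point):
# control (NULL):        30299 clicks, 20716 in positions 1-3  -> ~68.4%
# test (phraseBoostEq1): 21551 clicks, 15273 in positions 1-3  -> ~70.9%
stat, pval = proportions_ztest(count=[20716, 15273], nobs=[30299, 21551])
print(f"z = {stat:.2f}, p = {pval:.2e}")  # p is far below 0.05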

EDIT:
I mistakenly included autocomplete sessions above; updated query:

mysql:research@analytics-store.eqiad.wmnet [log]> select count(1), sum(if(event_position between 1 and 3, 1, 0)), event_subTest from log.TestSearchSatisfaction2_15357244 where timestamp between 20160315000000 and 20160323120000 and event_action = 'visitPage' and event_source = 'fulltext' group by event_subTest;
+----------+-----------------------------------------------+----------------+
| count(1) | sum(if(event_position between 1 and 3, 1, 0)) | event_subTest  |
+----------+-----------------------------------------------+----------------+
|    23039 |                                         16944 | NULL           |
|    16250 |                                         12566 | phraseBoostEq1 |
+----------+-----------------------------------------------+----------------+
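
Running the same positions-1-3 comparison on these corrected, full-text-only counts (a sketch with scipy, assuming a plain 2x2 chi-squared test is an acceptable stand-in for the spreadsheet):

from scipy.stats import chi2_contingency

# control (NULL subtest): 23039 clicks, 16944 in positions 1-3
# test (phraseBoostEq1):  16250 clicks, 12566 in positions 1-3
control_top3, control_other = 16944, 23039 - 16944
test_top3, test_other = 12566, 16250 - 12566

chi2, p, dof, _ = chi2_contingency([[control_top3, control_other],
                                    [test_top3, test_other]])

print(f"control top-3 share: {control_top3 / 23039:.4f}")  # ~0.735
print(f"test top-3 share:    {test_top3 / 16250:.4f}")     # ~0.773
print(f"chi-squared p-value: {p:.3g}")                     # still very small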

Also, I'm worried I'm just doing this wrong, but I pulled the session abandonment numbers for the test period:

mysql:research@analytics-store.eqiad.wmnet [log]> select event_subTest, count(1), sum(has_click) from (select event_searchSessionId, event_subTest, sum(if(event_action = 'visitPage', 1, 0)) > 0 as has_click from TestSearchSatisfaction2_15357244 where timestamp between 20160315000000 and 20160323120000 and event_action in ('searchResultPage', 'visitPage') group by event_searchSessionId, event_subTest) x group by event_subTest;
+----------------+----------+----------------+
| event_subTest  | count(1) | sum(has_click) |
+----------------+----------+----------------+
| NULL           |   198038 |          20423 |
| phraseBoostEq1 |   178677 |          17200 |
+----------------+----------+----------------+
2 rows in set (1 min 10.92 sec)

This shows abandonment in the ~90% range, rather than the ~70% shown in the report.

EDIT:
Mistakenly included autocomplete above; the corrected numbers still come in around 80% abandonment though:

mysql:research@analytics-store.eqiad.wmnet [log]> select event_subTest, count(1), sum(has_click) from (select event_searchSessionId, event_subTest, sum(if(event_action = 'visitPage', 1, 0)) > 0 as has_click from TestSearchSatisfaction2_15357244 where timestamp between 20160315000000 and 20160323120000 and event_action in ('searchResultPage', 'visitPage') and event_source = 'fulltext' group by event_searchSessionId, event_subTest) x group by event_subTest;
+----------------+----------+----------------+
| event_subTest  | count(1) | sum(has_click) |
+----------------+----------+----------------+
| NULL           |    75201 |          16311 |
| phraseBoostEq1 |    65879 |          13619 |
+----------------+----------+----------------+
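
The ~80% figure is just arithmetic on these corrected session counts; spelled out (a trivial sketch):

# (total full-text sessions, sessions with at least one click), from the query above
groups = {"control (NULL)": (75201, 16311), "phraseBoostEq1": (65879, 13619)}

for name, (total, with_click) in groups.items():
    print(f"{name}: {1 - with_click / total:.1%} abandonment")
# control (NULL):  ~78.3%
# phraseBoostEq1:  ~79.3%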

Another significance tester - http://www.evanmiller.org/ab-testing/chi-squared.html#!20716/30299;15273/21551@95
Similar Bayesian calculator - http://developers.lyst.com/bayesian-calculator/

I don't know enough about Bayesian techniques to speak coherently about their usefulness for A/B testing. I suspect classical and Bayesian techniques will give similar results for similar questions, e.g. "did the percentage of clicks in positions 1-3 increase in the test group?"
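
As an illustration of that suspicion (a sketch, not part of the actual analysis): putting independent Beta(1, 1) priors on each group's top-3 click share and updating with the corrected full-text counts from above, the posterior probability that the test group's share is higher comes out essentially 1, mirroring the tiny frequentist p-value.

import numpy as np

rng = np.random.default_rng(0)

# Beta(1, 1) prior updated with the corrected full-text click counts:
# successes = clicks in positions 1-3, failures = the remaining clicks.
control = rng.beta(1 + 16944, 1 + (23039 - 16944), size=200_000)
test = rng.beta(1 + 12566, 1 + (16250 - 12566), size=200_000)

print(f"P(test top-3 share > control top-3 share) ~= {np.mean(test > control):.4f}")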

mpopov set the point value for this task to 2.
Deskana raised the priority of this task from Medium to High. Apr 19 2016, 8:12 PM

Raising to high priority; we should get this done soon.