
Write Glent M0 A/B test report
Closed, Resolved · Public · 5 Estimated Story Points

Description

As a member of the search team, I want to turn the raw data from the Glent M0 A/B test into a report so that I can evaluate the impact and quality of Glent M0 and determine whether to enable Glent M0 (session similarity).

Data:

AC:

  • Debug and turn on the SearchSatisfaction directed acyclic graph (DAG) in Airflow and run over at least 30 days of source data in Hadoop
  • Write a summary report demonstrating in what ways the source data indicates that M0 is better and in which ways the numbers aren't reliable

Event Timeline

Restricted Application added a subscriber: Aklapper.

While reviewing the related code we found a bug in the UI of the provided suggestions that essentially invalidates the A/B testing done so far. The UI was misrepresenting which query was run, discouraging users from clicking the suggested query (and generally making it hard to interpret). The fix[1] has been merged and will deploy with the next train. The test should be re-run once the fix is deployed and the suggestions pipeline has been verified.

[1] https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/CirrusSearch/+/554603/

This ended up further delayed because we were unable to differentiate queries suggested by glent from queries suggested in general. Essentially, glent does not suggest queries often enough to make a measurable impact when mixed in with the other suggestions. We expect that by logging which queries were suggested by glent we can compare the metrics on those queries against the metrics on our normal suggester, and have better data. The data collection update has shipped and we should be turning the test back on soon.

CBogen updated the task description.
CBogen set the point value for this task to 5. · Aug 24 2020, 5:14 PM

The relevant metrics we have stored are in superset: https://superset.wikimedia.org/r/334. This is specifically for the time range of March 13 2020 through May 15 2020.

Copied here since superset has limited access:

| bucket | Percentage of Searches Shown a Query Suggestion | Percentage of Impressions Clicking a Suggested Query | Search Requests | Searches Shown a Query Suggestion | Searches Clicking a Suggested Query |
| glent_m0 | 21.0% | 1.5% | 67.7M | 14.2M | 217k |
| control | 20.3% | 1.5% | 72M | 14.6M | 220k |
| bucket | Percentage of Zero Result Searches Automatically Rewritten | Zero Initial Search Results | Searches Automatically Rewritten |
| glent_m0 | 54.8% | 12.6M | 6.88M |
| control | 54.4% | 12.6M | 6.89M |
| bucket | Search Automatically Rewritten | Percentage of Searches Clicking a Search Result | Search Requests | Searches Clicking a Search Result |
| glent_m0 | false | 31.8% | 327k | 104k |
| control | false | 30.8% | 315k | 97.1k |
| glent_m0 | true | 20.3% | 6.88M | 1.39M |
| control | true | 19.8% | 6.89M | 1.36M |
Glent session-similarity based query suggestions A/B test

Between March 13, 2020 and May 15, 2020 (63 days, 9 weeks), 50% of Special:Search traffic to enwiki, dewiki, and frwiki was augmented with glent session-similarity (Method 0) query suggestions. Over this time glent had the opportunity to provide suggestions to 67 million search requests. Across all metrics measured, the inclusion of session-similarity based query suggestions improved results by small but statistically significant amounts. The rest of the report looks more specifically at conversion rates for the various steps in the user flow between issuing a query and satisfying an information need.

Expected values were calculated via Bayesian inference using the control values as the prior.
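The report only names the approach at a high level; as an illustrative sketch (not the actual analysis code), a conjugate Beta-Binomial posterior over a bucket's suggestion rate can be sampled like this, using the counts from the first table. The flat prior here is an assumption for simplicity; the actual analysis seeded the prior from the control values.

```python
import random

def posterior_samples(successes, trials, n=20_000, seed=0):
    """Draw rate samples from a Beta(successes + 1, failures + 1)
    posterior. A flat Beta(1, 1) prior is used here for illustration;
    the actual analysis used the control values as the prior."""
    rng = random.Random(seed)
    a, b = successes + 1, trials - successes + 1
    return [rng.betavariate(a, b) for _ in range(n)]

def credible_interval(samples, level=0.95):
    """Central credible interval read off the sorted posterior samples."""
    s = sorted(samples)
    lo = s[int(len(s) * (1 - level) / 2)]
    hi = s[int(len(s) * (1 + level) / 2)]
    return lo, hi

# Counts from the first table: searches shown a suggestion / search requests.
glent = posterior_samples(14_200_000, 67_700_000)
control = posterior_samples(14_600_000, 72_000_000)
```

With sample sizes in the tens of millions, the posteriors are extremely narrow, which is why the credible intervals reported below are so tight.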

Do we have a suggestion?

The first step of the user's experience with search suggestions is whether we provide a suggestion at all. The search system has the opportunity to either return a suggestion to the user, or directly run the suggested query and return those results instead. In practice we only run the suggested query when the initial query returned no results. Throughout the analysis we will look at automatically rewritten queries separately from user-selected suggestions.

The proportion of all Special:Search requests presented dym ("did you mean") suggestions increased from 20.30% to 20.61%, an uplift of 1.6%.

| bucket | value | 95% CI |
| control | 0.20304 | [0.20298, 0.20311] |
| glent | 0.20622 | [0.20615, 0.20628] |

The proportion of zero result searches that were rewritten increased from 54.43% to 54.62%, an uplift of 0.3%.

| bucket | value | 95% CI |
| control | 0.54449 | [0.54429, 0.54468] |
| glent | 0.54616 | [0.54597, 0.54636] |

Taken in isolation this metric would not be meaningful: an algorithm could attach random garbage suggestions to lots of queries and still improve the rate of query suggestion. An uplift here is only meaningful if there is no coinciding decrease in interaction with suggested queries and suggested query results.

This does not capture what proportion of searches would have been answered by the traditional phrase suggester but were instead answered by glent. It seems very likely glent is serving a higher share of suggestion traffic than the uplift indicates.

The phrase suggester is disallowed from providing query suggestions to queries with more than 15k results, while glent suggestions do not share that limitation. It's possible some of the uplift is for queries that have rejected phrase suggester suggestions, but this is not accounted for in the analysis.

When presented a suggestion, do users interact with it?

When both query results and a suggestion are available, both are presented to the user and we measure the percentage of users that select the suggested query. We expect this metric to correlate with the quality of the suggestions, with better suggestions having higher interaction rates. It also interacts with the share of queries shown a suggestion: holding this rate steady while increasing the share of queries shown a suggestion is still a win.

The proportion of searches presented a suggestion where the user interacted with it increased from 1.50% to 1.51%, an uplift of 0.7%.

| bucket | value | 95% CI |
| control | 0.01507 | [0.01503, 0.01512] |
| glent | 0.0151 | [0.01513, 0.01522] |

While this is the narrowest of our metrics, it still comes back with significant values. Sampling from the distributions, we expect that only 1 in 1000 tests would come back with control as the winner (a p-ish value of 0.001).
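That sampling can be sketched like this, using normal approximations to the two posteriors; the means and standard deviations below are read off the table above (sd ≈ CI width / (2 × 1.96)), not the report's exact posterior parameters.

```python
import random

def prob_control_wins(ctrl_mean, ctrl_sd, glent_mean, glent_sd,
                      trials=100_000, seed=0):
    """Monte Carlo estimate of P(a control draw beats a glent draw),
    drawing each bucket's rate from a normal approximation of its
    posterior. Parameters here are illustrative stand-ins."""
    rng = random.Random(seed)
    wins = sum(rng.gauss(ctrl_mean, ctrl_sd) > rng.gauss(glent_mean, glent_sd)
               for _ in range(trials))
    return wins / trials

# sd approximated from the 95% CI widths in the table above
p = prob_control_wins(0.01507, 2.3e-5, 0.01518, 2.3e-5)
```

With these stand-in parameters the estimate lands in the same regime as the report's figure: on the order of one win for control per thousand draws.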

The set of users this satisfies is a small subset of all queries: only 20.6% of queries see a suggestion, and only 1.5% of those suggestions are selected, giving 0.3% of overall search queries. It would take a dramatic change in this metric to have much impact.
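The arithmetic behind that 0.3% figure is just the product of the two funnel rates:

```python
shown_rate = 0.206   # share of searches shown a suggestion
click_rate = 0.015   # share of those impressions where the suggestion is clicked
overall = shown_rate * click_rate   # share of all searches affected
print(f"{overall:.1%}")  # 0.3%
```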

Do users interact with the search results of suggested queries?

Once we've rewritten the search terms, the user is presented results and we measure the percentage of users interacting with the provided search results. As before, we run separate analyses for queries that were automatically rewritten and queries where the user selected the suggested query, as their behaviour differs greatly.

The proportion of automatically rewritten queries that users interact with increased from 19.78% to 20.00%, an uplift of 1.1%.

| bucket | value | 95% CI |
| control | 0.19796 | [0.19775, 0.19827] |
| glent | 0.20024 | [0.20003, 0.20046] |

The proportion of self-selected suggestions that users interact with results of increased from 30.8% to 31.3%, an uplift of 1.6%.

| bucket | value | 95% CI |
| control | 0.30825 | [0.30711, 0.30939] |
| glent | 0.31308 | [0.31195, 0.31421] |

In terms of impact, the 1.1% uplift in interaction with automatically rewritten zero result queries is the largest in absolute number of queries satisfied. The improvement in interaction rates, combined with the increase in queries we have suggestions for, gives an overall uplift of 1.5% in saving zero result queries.
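The 1.5% combined figure is approximately the product of the two component uplifts; a quick check using the posterior means from the tables above:

```python
coverage = 0.54616 / 0.54449        # uplift in zero-result queries rewritten
interaction = 0.20024 / 0.19796     # uplift in clicks on rewritten results
combined = coverage * interaction - 1
print(f"{combined:.1%}")  # 1.5%
```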

Conclusion

In every metric the combination of the phrase suggester + glent shows a small but statistically significant improvement over the phrase suggester on its own, with a typical uplift of around 1%. Glent is providing suggestions for queries we were not previously able to generate suggestions for, and overall engagement with the suggested query results has improved as well. This data does not capture the exact behaviour of glent in isolation, so we have limited understanding of how glent's specific suggestions are performing, but we can say with confidence that the combination is better than the phrase suggester on its own.

Excellent analysis, @EBernhardson! We talked about writing up a paragraph about the results, but this is much more detailed and in-depth. Thanks for taking care of it!