
Write Glent M0 A/B test report
Closed, Resolved · Public · 5 Estimated Story Points

Description

As a member of the search team, I want to turn the raw data from the Glent M0 A/B test into a report so that I can evaluate the impact and quality of Glent M0 and determine whether to enable Glent M0 (session similarity).

Data:

AC:

  • Debug and turn on the SearchSatisfaction directed acyclic graph (DAG) in Airflow and run over at least 30 days of source data in Hadoop
  • Write a summary report demonstrating in what ways the source data indicates that M0 is better and in which ways the numbers aren't reliable

Event Timeline

Restricted Application added a subscriber: Aklapper.

While reviewing the related code we found a bug in the UI of the provided suggestions that essentially invalidates the A/B testing done so far. The UI was misrepresenting which query was run, discouraging users from clicking the suggested query (and generally making it hard to interpret). The fix[1] has been merged and will deploy with the next train. The test should be re-run once the fix is deployed and the suggestions pipeline has been verified.

[1] https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/CirrusSearch/+/554603/

This ended up further delayed because we were unable to differentiate queries suggested by glent from queries suggested in general. Essentially, glent does not suggest queries often enough to make a measurable impact when mixed in with the other suggestions. We expect that by logging which queries were suggested by glent we can compare the metrics on those queries against the metrics on our normal suggester, and have better data. The data collection update has shipped and we should be turning the test back on soon.

CBogen updated the task description.
CBogen set the point value for this task to 5. · Aug 24 2020, 5:14 PM

The relevant metrics we have stored are in superset: https://superset.wikimedia.org/r/334. This is specifically for the time range of March 13 2020 through May 15 2020.

Copied here since superset has limited access:

| bucket | Percentage of Searches Shown a Query Suggestion | Percentage of Impressions Clicking a Suggested Query | Search Requests | Searches Shown a Query Suggestion | Searches Clicking a Suggested Query |
| glent_m0 | 21.0% | 1.5% | 67.7M | 14.2M | 217k |
| control | 20.3% | 1.5% | 72M | 14.6M | 220k |
| bucket | Percentage of Zero Result Searches Automatically Rewritten | Zero Initial Search Results | Searches Automatically Rewritten |
| glent_m0 | 54.8% | 12.6M | 6.88M |
| control | 54.4% | 12.6M | 6.89M |
| bucket | Search Automatically Rewritten | Percentage of Searches Clicking a Search Result | Search Requests | Searches Clicking a Search Result |
| glent_m0 | false | 31.8% | 327k | 104k |
| control | false | 30.8% | 315k | 97.1k |
| glent_m0 | true | 20.3% | 6.88M | 1.39M |
| control | true | 19.8% | 6.89M | 1.36M |
Glent session-similarity based query suggestions A/B test

Between March 13, 2020 and May 15, 2020 (63 days, 9 weeks), 50% of Special:Search traffic to enwiki, dewiki, and frwiki was augmented with glent session-similarity (Method 0) query suggestions. Over this time glent had the opportunity to provide suggestions to 67 million search requests. Across all metrics measured, the inclusion of session-similarity based query suggestions improved results by small but statistically significant amounts. The rest of the report looks more specifically at conversion rates for the various steps in the user flow between issuing a query and satisfying an information need.

Expected values were calculated via Bayesian inference using the control values as the prior.
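The report only names the approach at a high level; as an illustrative sketch (not the actual analysis code), a conjugate Beta-Binomial posterior over a bucket's suggestion rate can be sampled like this, using the counts from the first table. The flat prior here is an assumption for simplicity; the actual analysis seeded the prior from the control values.

```python
import random

def posterior_samples(successes, trials, n=20_000, seed=0):
    """Draw rate samples from a Beta(successes + 1, failures + 1)
    posterior. A flat Beta(1, 1) prior is used here for illustration;
    the actual analysis used the control values as the prior."""
    rng = random.Random(seed)
    a, b = successes + 1, trials - successes + 1
    return [rng.betavariate(a, b) for _ in range(n)]

def credible_interval(samples, level=0.95):
    """Central credible interval read off the sorted posterior samples."""
    s = sorted(samples)
    lo = s[int(len(s) * (1 - level) / 2)]
    hi = s[int(len(s) * (1 + level) / 2)]
    return lo, hi

# Counts from the first table: searches shown a suggestion / search requests.
glent = posterior_samples(14_200_000, 67_700_000)
control = posterior_samples(14_600_000, 72_000_000)
```

With sample sizes in the tens of millions, the posteriors are extremely narrow, which is why the credible intervals reported below are so tight.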

Do we have a suggestion?

The first step of the user's experience with search suggestions is whether we provide a suggestion at all. The search system has the opportunity to either return a suggestion to the user, or directly run the suggested query and return those results instead. In practice we only run the suggested query when the initial query returned no results. Throughout the analysis we will look at automatically rewritten queries separately from user-selected suggestions.

The proportion of all Special:Search requests presented dym ("did you mean") suggestions increased from 20.30% to 20.61%, an uplift of 1.6%.

| bucket | value | 95% CI |
| control | 0.20304 | [0.20298, 0.20311] |
| glent | 0.20622 | [0.20615, 0.20628] |

The proportion of zero result searches that were rewritten increased from 54.43% to 54.62%, an uplift of 0.3%.

| bucket | value | 95% CI |
| control | 0.54449 | [0.54429, 0.54468] |
| glent | 0.54616 | [0.54597, 0.54636] |

Taken in isolation this metric would not be meaningful: an algorithm could attach random garbage suggestions to lots of queries and still improve the rate of query suggestion. An uplift here is only meaningful if there is no coinciding decrease in interaction with suggested queries and suggested query results.

This does not capture what proportion of searches would have been answered by the traditional phrase suggester but were instead answered by glent. It seems very likely glent is serving a higher share of suggestion traffic than the uplift indicates.

The phrase suggester is disallowed from providing query suggestions to queries with more than 15k results, while glent suggestions do not share that limitation. It's possible some of the uplift is for queries that have rejected phrase suggester suggestions, but this is not accounted for in the analysis.

When presented a suggestion, do users interact with it?

When both query results and a suggestion are available, both are presented to the user and we measure the percentage of users that select the suggested query. We expect this metric to correlate with the quality of the suggestions, with better suggestions having higher interaction rates. It also interacts with the share of queries shown a suggestion: holding this rate steady while increasing the share of queries shown a suggestion is still a win.

The proportion of searches presented a suggestion where the user interacted with it increased from 1.50% to 1.51%, an uplift of 0.7%.

| bucket | value | 95% CI |
| control | 0.01507 | [0.01503, 0.01512] |
| glent | 0.0151 | [0.01513, 0.01522] |

While this is the narrowest of our metrics, it still comes back with significant values. Sampling from the distributions, we expect that only 1 in 1000 tests would come back with control as the winner (a p-ish value of 0.001).
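That sampling can be sketched like this, using normal approximations to the two posteriors; the means and standard deviations below are read off the table above (sd ≈ CI width / (2 × 1.96)), not the report's exact posterior parameters.

```python
import random

def prob_control_wins(ctrl_mean, ctrl_sd, glent_mean, glent_sd,
                      trials=100_000, seed=0):
    """Monte Carlo estimate of P(a control draw beats a glent draw),
    drawing each bucket's rate from a normal approximation of its
    posterior. Parameters here are illustrative stand-ins."""
    rng = random.Random(seed)
    wins = sum(rng.gauss(ctrl_mean, ctrl_sd) > rng.gauss(glent_mean, glent_sd)
               for _ in range(trials))
    return wins / trials

# sd approximated from the 95% CI widths in the table above
p = prob_control_wins(0.01507, 2.3e-5, 0.01518, 2.3e-5)
```

With these stand-in parameters the estimate lands in the same regime as the report's figure: on the order of one win for control per thousand draws.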

The set of users this satisfies is a small subset of all queries: only 20.6% of queries see a suggestion, and only 1.5% of those suggestions are selected, giving 0.3% of overall search queries. It would take a dramatic change in this metric to have much impact.
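The arithmetic behind that 0.3% figure is just the product of the two funnel rates:

```python
shown_rate = 0.206   # share of searches shown a suggestion
click_rate = 0.015   # share of those impressions where the suggestion is clicked
overall = shown_rate * click_rate   # share of all searches affected
print(f"{overall:.1%}")  # 0.3%
```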

Do users interact with the search results of suggested queries?

Once we've rewritten the search terms, the user is presented results and we measure the percentage of users interacting with the provided search results. As before, we run separate analyses for queries that were automatically rewritten and queries where the user selected the suggested query, as their behaviour differs greatly.

The proportion of automatically rewritten queries that users interact with increased from 19.78% to 20.00%, an uplift of 1.1%.

| bucket | value | 95% CI |
| control | 0.19796 | [0.19775, 0.19827] |
| glent | 0.20024 | [0.20003, 0.20046] |

The proportion of self-selected suggestions that users interact with results of increased from 30.8% to 31.3%, an uplift of 1.6%.

| bucket | value | 95% CI |
| control | 0.30825 | [0.30711, 0.30939] |
| glent | 0.31308 | [0.31195, 0.31421] |

In terms of impact, the 1.1% uplift in interaction with automatically rewritten zero result queries is the largest in absolute number of queries satisfied. The improvement in interaction rates, combined with the increase in queries we have suggestions for, gives an overall uplift of 1.5% in saving zero result queries.
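The 1.5% combined figure is approximately the product of the two component uplifts; a quick check using the posterior means from the tables above:

```python
coverage = 0.54616 / 0.54449        # uplift in zero-result queries rewritten
interaction = 0.20024 / 0.19796     # uplift in clicks on rewritten results
combined = coverage * interaction - 1
print(f"{combined:.1%}")  # 1.5%
```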

Conclusion

In every metric the combination of the phrase suggester + glent shows a small but statistically significant improvement over the phrase suggester on its own, with a typical uplift of around 1%. Glent is providing suggestions for queries we were not previously able to generate suggestions for, and overall engagement with the suggested query results has improved as well. This data does not capture the exact behaviour of glent in isolation, so we have limited understanding of how glent's specific suggestions are performing, but we can say with confidence that the combination is better than the phrase suggester on its own.

Excellent analysis, @EBernhardson! We talked about writing up a paragraph about the results, but this is much more detailed and in-depth. Thanks for taking care of it!