
Analyse results of TextCat A/B test
Closed, Declined | Public | 4 Estimated Story Points

Description

After the TextCat A/B test is turned off (see T134319), the data should be analysed to see whether the test had a significant impact.
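One way to check for a "significant impact" is a two-proportion z-test on a rate such as clickthrough, comparing the control and test buckets. The task does not say which test or metric would be used, so the sketch below is only an illustration; the function name and the counts in the usage note are hypothetical.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    Hypothetical helper: bucket A is control, bucket B is the test group.
    Assumes both buckets are non-empty and the pooled rate is not 0 or 1.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under the null hypothesis of no difference.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

For example, `two_proportion_z_test(10, 100, 30, 100)` compares a made-up 10% rate in control with a made-up 30% rate in the test bucket and returns the z statistic with its p-value.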

Event Timeline

A couple of things (although certainly there are more) I think we could look at:

Click through to the alternate wiki
Query reformulation after being shown alternate wiki results

A couple of things (although certainly there are more) I think we should look at:

This topic came up in a discussion today with @EBernhardson and @dcausse.

Click through to the alternate wiki
Query reformulation after being shown alternate wiki results

  • A clarification: tracking query reformulation by the user is interesting and useful in its own right, as a way of getting possible alternative versions of a query (e.g., for automatic correction). In this case, the idea is that a reformulated query from the user without a clickthrough to another wiki after presenting other-language cross-wiki results indicates that the results were not useful.
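The heuristic in the clarification above could be sketched roughly as follows. The `Session` fields and the outcome labels are hypothetical names chosen for illustration, not fields from the actual event data.

```python
from dataclasses import dataclass


@dataclass
class Session:
    # Hypothetical fields; the real logged data may look different.
    shown_crosswiki_results: bool   # other-language results were presented
    clicked_crosswiki_result: bool  # user clicked through to the other wiki
    reformulated_query: bool        # user submitted a modified query afterwards


def classify(session: Session) -> str:
    """Label a session using the reformulation-without-clickthrough heuristic."""
    if not session.shown_crosswiki_results:
        return "not_applicable"
    if session.clicked_crosswiki_result:
        return "useful"
    if session.reformulated_query:
        # Reformulated without clicking through: results likely not useful.
        return "not_useful"
    return "unknown"
```

Sessions with neither a clickthrough nor a reformulation are left as "unknown" here, since the comment above does not say how to interpret them.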

Other ideas that came up:

  • looking at satisfaction metrics for all queries identified as being in another language in one big bucket vs. in by-language buckets (e.g., on enwiki, results in Spanish/from eswiki are good, but results in French/from frwiki are not).
  • looking at satisfaction metrics for queries based on number of cross-wiki results (1 result may be a fluke, 5000 results means the language is probably right).
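The two bucketing ideas above could be compared with something like the sketch below, which uses clickthrough rate as a stand-in for a satisfaction metric. The record keys (`language`, `n_results`, `clicked`) and the result-count cut-offs are assumptions made up for illustration; nothing in this task fixes them.

```python
from collections import defaultdict


def clickthrough_by(records, key):
    """Return {bucket: clickthrough rate}, where key(record) names the bucket."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for record in records:
        bucket = key(record)
        shown[bucket] += 1
        if record["clicked"]:
            clicked[bucket] += 1
    return {bucket: clicked[bucket] / shown[bucket] for bucket in shown}


def result_count_bucket(record):
    """Bucket by number of cross-wiki results; the cut-offs are arbitrary."""
    n = record["n_results"]
    if n <= 1:
        return "0-1"
    if n < 100:
        return "2-99"
    return "100+"
```

Calling `clickthrough_by(records, lambda r: "all")` gives the single pooled figure, `clickthrough_by(records, lambda r: r["language"])` gives the by-language view, and `clickthrough_by(records, result_count_bucket)` gives the by-result-count view.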

I'll also try to get others to take a peek over here and add more.

Perhaps interesting, but maybe not a factor in deciding to keep the feature:

  • % of zero result requests that now get results
  • % of requests that were provided inter-wiki results that click on one
  • looking at satisfaction metrics for queries based on number of cross-wiki results (1 result may be a fluke, 5000 results means the language is probably right).

@TJones Hm… Do you have suggestions for the threshold we can use to determine this on the whole dataset? We won't be able to look at each of 100K+ sessions individually.

Note to future @mpopov: the extra data field in the TSS2 table will have 3 values (actually detected language, wiki queried, and number of results) that will need to be separated into 3 columns.
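A minimal sketch of that separation step, assuming the three values arrive as one delimited string in the order listed above. The actual encoding of the field is not stated here, so the comma separator and all names below are assumptions.

```python
from typing import NamedTuple


class ExtraFields(NamedTuple):
    detected_language: str
    wiki_queried: str
    n_results: int


def split_extra(extra: str, sep: str = ",") -> ExtraFields:
    """Split the combined field into three typed columns.

    The separator is an assumption; pass a different `sep` if the
    real data uses another delimiter.
    """
    parts = [part.strip() for part in extra.split(sep)]
    if len(parts) != 3:
        raise ValueError(f"expected 3 values, got {len(parts)}: {extra!r}")
    language, wiki, n_results = parts
    return ExtraFields(language, wiki, int(n_results))
```

Rows that do not split into exactly three values raise an error rather than being silently dropped, which makes faulty logging easier to spot.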

debt subscribed.

Moving this into the sprint to work on this week.

mpopov moved this task from Backlog to In progress on the Discovery-Analysis (Current work) board.
mpopov set the point value for this task to 4.

Cannot proceed with the analysis because the data is too faulty to be reliable. We will fix the EL (EventLogging) and relaunch the test. See follow-up: T137158