
Update/repair Search A/B Test autoreporter
Open, MediumPublic

Description

autoreporter was a tool that @chelsyx (no longer at the Foundation) wrote – based on our manually created reports (e.g. First assessment of learning-to-rank) – to generate reports of the Search Platform team's A/B tests. An example of such a report is Second MLR Test for hewiki.

Unfortunately, this tool has become profoundly outdated and no longer works. In the time since its creation we switched from storing EventLogging data in MySQL to exclusively in Hadoop/Hive, the TestSearchSatisfaction2 schema was renamed to SearchSatisfaction, and was subsequently migrated to Modern Event Platform (https://schema.wikimedia.org/repositories/secondary/jsonschema/analytics/legacy/searchsatisfaction/latest.json).

I have also continued to make updates to the underlying package wmf (now wmfdata) and factored out the code for analysis of interleaved tests into wmfastr (formerly ortiz). Needless to say, this tool needs a massive overhaul to work with the new data pipeline. It would also be much easier to use and maintain as an R package, so that conversion should be done as part of the overhaul.

CLARIFICATION: This is to update the tool that generates a very comprehensive report and – while not a quarter's worth of work – is still a significant effort, perhaps 1-2 weeks of engineering while juggling other work. For quick & simple analysis of preference from interleaved A/B tests, see these notes on Estimating Preference For Ranking Functions With Clicks On Interleaved Search Results.

Event Timeline

mpopov renamed this task from Update Search A/B Test autoreporter to Update/repair Search A/B Test autoreporter.Aug 18 2020, 4:01 PM
mpopov created this task.
mpopov moved this task from needs triage to Tests & Analysis on the Discovery-Search board.
mpopov moved this task from Needs triage to Tracking on the Discovery-Analysis board.

The Structured Data team would love to have this to analyze the A/B tests we are doing for the new Media Search on Commons (e.g. T254388). We were hoping to run that test from August 24–31, 2020. This will be extremely helpful in determining whether we should pursue making Media Search the default on Commons and whether we should use it in Visual Editor and other file namespace searches on the Wikipedias.

We are working on improving search results on Commons by including structured data (captions, statements) etc.
It would be extremely useful (essential, actually) to be able to tell the impact of changes, to confirm that we're moving in a good direction.

Adding my support to having this tool available! As I've been working with the Structured Data team to determine what metrics we want to use to measure the impact of upcoming tests, it's become more and more clear to me that what we're doing is largely what the Discovery team was doing a couple of years ago. The hewiki report created by the tool that Mikhail linked in the description contains many of the metrics we've been discussing with the SD team (as well as quite a few additional ones). The tool also analyzes interleaved A/B tests, something the team is planning to do. Having all of that readily available, so we can iterate on experiments with streamlined analysis, would take a lot of the work out of it!

Note: the "quick & simple analysis of preference from interleaved A/B tests" as described at the end of this issue's description should be good enough for the test that we'd like to run next week.
Knowing the preferential set of results is good enough to drive our immediate work.

The detailed analysis would be welcome, but is not as urgent, and I'll defer to @CBogen or @Ramsey-WMF if and when we'd need that level of detail!

I agree with Matthias - the detailed analysis would be very welcome but is not urgent. I suspect with the work on search that we are focused on, especially as we consider moving to the Wikipedias, that this will become even more welcome next quarter (in October or so).

The "quick & simple analysis of preference from interleaved A/B tests" is urgent, however. I'm hoping that @nettrom_WMF and/or @mpopov can help us run this after the A/B test ends on Aug 31.

Agreed!


Adding my support for this tool as well! The web team is planning to run two A/B tests this quarter on planned changes to the search widget as part of the desktop improvements project. Many of the metrics provided by this tool would be extremely relevant and helpful when assessing the impact of these changes.

LGoto raised the priority of this task from Low to Medium.Sep 21 2020, 4:24 PM
LGoto moved this task from Upcoming Quarter to Current Quarter on the Product-Analytics board.

This may get pushed to next quarter because we've prioritized a new task - T266714 - ahead of this one. That other task is geared toward supporting our commitments to Accessible Content Data, and has the benefit of being timely and enabling data exploration post US-election.

Throwing this in here so it's not forgotten: https://github.com/bearloga/interleaved-python

There's even a CausalImpact-inspired report writer function:

from interleaved import Experiment

ex = Experiment(
    queries = data[data['event'] == 'click']['search_id'].to_numpy(),
    clicks = data[data['event'] == 'click']['ranking_function'].to_numpy()
)
ex.bootstrap(seed=42)

print(ex.summary(ranker_labels=['New Algorithm', 'Old Algorithm'], rescale=True))

In this interleaved search experiment, 906 searches were used to determine whether the
results from ranker 'New Algorithm' or 'Old Algorithm' were preferred by users (based on
their clicks to the results from those rankers interleaved into a single search result
set).

The preference statistic, as defined by Chapelle et al. (2012), was estimated to be 74.3%
with a 95% (bootstrapped) confidence interval of (70.0%, 77.9%) on a [-100%, 100%] scale,
with -100% indicating total preference for 'Old Algorithm', 100% indicating total
preference for 'New Algorithm', and 0% indicating complete lack of preference between the
two -- indicating that users preferred ranker 'New Algorithm'.
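For anyone who wants the quick & simple version without the package, the preference statistic quoted in the summary above can be sketched in plain Python. Note this is an illustrative sketch only: the per-search winner labels and both helper functions below are my own assumptions, not part of the interleaved package's API.

```python
import random

def preference(winners):
    # Chapelle et al. (2012) preference: fraction of interleaved searches
    # "won" by the new ranker (more clicks on its results), with ties
    # counting half, rescaled from [0, 1] to the [-1, 1] scale used above.
    n = len(winners)
    new_wins = sum(w == 'new' for w in winners)
    ties = sum(w == 'tie' for w in winners)
    return 2.0 * (new_wins + 0.5 * ties) / n - 1.0

def bootstrap_ci(winners, n_boot=1000, seed=42):
    # Percentile bootstrap 95% CI: resample searches with replacement
    # and take the 2.5th and 97.5th percentiles of the statistic.
    rng = random.Random(seed)
    stats = sorted(
        preference(rng.choices(winners, k=len(winners)))
        for _ in range(n_boot)
    )
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot)]

# Toy data: one winner label per interleaved search session.
winners = ['new'] * 60 + ['old'] * 30 + ['tie'] * 10
print(round(preference(winners), 2))  # 0.3
low, high = bootstrap_ci(winners)
```

With real SearchSatisfaction data, the winner labels would come from counting clicks per search_id by ranking function, which is essentially what the notes linked in the description walk through.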