Update/repair Search A/B Test autoreporter
Open, Medium, Public

Description

autoreporter was a tool that @chelsyx (no longer at the Foundation) wrote – based on our manually-created reports (e.g. First assessment of learning-to-rank) – to generate reports of the Search Platform team's A/B tests. An example of such a report is Second MLR Test for hewiki.

Unfortunately, this tool has become profoundly outdated and no longer works. Since its creation, we have switched from storing EventLogging data in MySQL to storing it exclusively in Hadoop/Hive, and the TestSearchSatisfaction2 schema was renamed to SearchSatisfaction and subsequently migrated to Modern Event Platform (https://schema.wikimedia.org/repositories/secondary/jsonschema/analytics/legacy/searchsatisfaction/latest.json).

I have also continued to update the underlying package wmf (now wmfdata) and factored out the code for analyzing interleaved tests into wmfastr (formerly ortiz). Needless to say, this tool needs a massive overhaul to work with the new data pipeline. It would also be much easier to use and maintain as an R package, so that conversion should be part of the overhaul.

CLARIFICATION: This is to update the tool that generates a very comprehensive report and – while not a quarter's worth of work – is still a significant effort, perhaps 1-2 weeks of engineering while juggling other work. For quick & simple analysis of preference from interleaved A/B tests, see these notes on Estimating Preference For Ranking Functions With Clicks On Interleaved Search Results.
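For readers unfamiliar with the "quick & simple" approach, here is a rough sketch of one common way to estimate preference from interleaved results: score each search session by which ranker's results got more clicks, then compute (wins for A + half of ties) / sessions - 0.5, with a bootstrap interval. This is an illustration in Python, not the wmfastr implementation; the function names, the "A"/"B" team labels, and the input layout are all assumptions made for the example.

```python
import random
from collections import Counter


def session_winner(clicked_teams):
    """Return "A", "B", or "tie" for one session.

    clicked_teams is a list of team labels ("A" or "B"), one per clicked
    result (hypothetical input layout). A session with no clicks is a tie.
    """
    counts = Counter(clicked_teams)
    if counts["A"] > counts["B"]:
        return "A"
    if counts["B"] > counts["A"]:
        return "B"
    return "tie"


def preference(winners):
    """(wins_A + ties / 2) / n - 0.5; positive favors A, negative favors B."""
    n = len(winners)
    if n == 0:
        raise ValueError("need at least one session")
    wins_a = sum(w == "A" for w in winners)
    ties = sum(w == "tie" for w in winners)
    return (wins_a + 0.5 * ties) / n - 0.5


def bootstrap_ci(winners, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval, resampling sessions with replacement."""
    rng = random.Random(seed)
    n = len(winners)
    stats = sorted(preference(rng.choices(winners, k=n)) for _ in range(n_boot))
    lower = stats[int((1 - level) / 2 * n_boot)]
    upper = stats[min(n_boot - 1, int((1 + level) / 2 * n_boot))]
    return lower, upper
```

For example, `preference([session_winner(s) for s in sessions])` gives the point estimate for a list of sessions, and an interval from `bootstrap_ci` that excludes 0 would suggest users prefer one ranker over the other. Resampling whole sessions (rather than individual clicks) is a deliberate choice here, since clicks within a session are not independent.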

Event Timeline

mpopov renamed this task from Update Search A/B Test autoreporter to Update/repair Search A/B Test autoreporter. Aug 18 2020, 4:01 PM
mpopov created this task.
mpopov moved this task from needs triage to Tests & Analysis on the Discovery-Search board.
mpopov moved this task from Needs triage to Tracking on the Discovery-Analysis board.
mpopov triaged this task as Low priority. Aug 18 2020, 4:10 PM
CBogen added a subscriber: CBogen. Aug 18 2020, 4:12 PM

The Structured Data team would love to have this to analyze the A/B tests we are doing for the new Media Search on Commons (e.g. T254388). We were hoping to run that test from August 24-31, 2020. This will be extremely helpful in determining whether we should pursue making Media Search the default on Commons and whether we should use it in Visual Editor and other file namespace searches on the Wikipedias.

We are working on improving search results on Commons by including structured data (captions, statements, etc.).
It would be extremely useful (essential, actually) to be able to tell the impact of changes, to confirm that we're moving in a good direction.

mpopov updated the task description. Aug 18 2020, 4:20 PM

Adding my support for having this tool available! As I've been working with the Structured Data team to determine what metrics we want to use to measure the impact of upcoming tests, it's become more and more clear to me that what we're doing is generally what the Discovery team were doing a couple of years ago. The hewiki report created by the tool that Mikhail linked in the description contains a lot of the metrics we've been discussing with the SD team (as well as a lot of additional ones). The tool also analyzes interleaved A/B tests, something the team are planning on doing. Having all of that readily available to enable iterating on experiments with streamlined analysis would take a lot of work out of it!

Note: the "quick & simple analysis of preference from interleaved A/B tests" as described at the end of this issue's description should be good enough for the test that we'd like to run next week.
Knowing the preferential set of results is good enough to drive our immediate work.

The detailed analysis would be welcome, but is not as urgent, and I'll defer to @CBogen or @Ramsey-WMF if and when we'd need that level of detail!

I agree with Matthias - the detailed analysis would be very welcome but is not urgent. I suspect with the work on search that we are focused on, especially as we consider moving to the Wikipedias, that this will become even more welcome next quarter (in October or so).

The "quick & simple analysis of preference from interleaved A/B tests" is urgent, however. I'm hoping that @nettrom_WMF and/or @mpopov can help us run this after the A/B test ends on Aug 31.

Agreed!

Adding my support for this tool as well! The web team is planning to run two A/B tests this quarter on planned changes to the search widget as part of the desktop improvements project. Many of the metrics provided by this tool would be extremely relevant and helpful when assessing the impact of these changes.

LGoto raised the priority of this task from Low to Medium. Sep 21 2020, 4:24 PM
LGoto moved this task from Upcoming Quarter to Current Quarter on the Product-Analytics board.

This may get pushed to next quarter because we've prioritized a new task (T266714) ahead of this one. That other task is geared toward supporting our commitments to Accessible Content Data, and has the benefit of being timely and enabling data exploration after the US election.