
Create a parameterized report template for search team's A/B tests
Closed, Resolved · Public · 10 Estimated Story Points

Description

The idea is to create an A/B test analysis/report template (see http://rmarkdown.rstudio.com/developer_parameterized_reports.html) that allows anyone (mostly Chelsy and Mikhail, but also Erik, for example) to specify a few parameters in a YAML file (e.g. SQL query, metrics, aggregation level) and then generate a report that fetches the appropriate data, cleans it up, and runs through the specified metrics. This will not replace the thorough analysis and report with interpretation provided by analysts; it will only serve as an initial peek for the team.
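For illustration, rendering such a parameterized report from R could look roughly like the sketch below. The template file name, parameter names, and query are hypothetical placeholders (assuming the template declares matching params in its YAML header); none of this is existing code.

```
library(rmarkdown)

# Render the (hypothetical) template with a handful of parameters; the template
# would declare sql_query, metrics, and aggregation under `params:` in its
# YAML header and use them to fetch, clean, and summarize the data.
render(
  "report_template.Rmd",
  params = list(
    sql_query   = "SELECT ... FROM TestSearchSatisfaction2 WHERE ...",
    metrics     = c("ctr", "zrr", "first_clicked"),
    aggregation = "by_wiki"
  ),
  output_format = "html_document",
  output_file = "ab_test_report.html"
)
```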

According to Mikhail and Chelsy's discussion (T131795#3375527), the first prototype will include:

  • Configuration reader
  • Data fetcher + processor
  • Summary stats in a table:
    • CTR
    • ZRR
    • First clicked
  • Generate Rmd/markdown of report

Event Timeline

debt triaged this task as Lowest priority. Jun 7 2016, 8:22 PM
debt moved this task from Needs triage to Later on the Discovery-Analysis board.
debt added a subscriber: debt.

This is a nice-to-have item, but not something we need to do right away.

We can take a look at this when we start (if we start) doing many more tests.

Hi @mpopov and @chelsyx - is this something that needs to be done before we start work on T143589 and T143587?

Every A/B test we've done so far has nuances that require a unique approach that cannot be automated. I no longer think this would be useful for us to do. Maybe we can come back to this question once we're deploying a test every week :)

Declining this ticket - I'm not sure we'll ever have tests that are similar enough to analyze without a unique approach for each one.

mpopov reopened this task as Open (edited). Jun 23 2017, 7:00 PM
mpopov reassigned this task from mpopov to chelsyx.

Chelsy and I talked about this, and since there is a need for faster A/B test report turnaround, she'd like to take the lead on this project. Notes from our discussion:

Premise:

  1. Input parameters into a YAML config file (e.g. TSS2 revision number, names of test groups as they appear in the log DB, SQL query, metrics, aggregation level such as overall vs. by wiki)
  2. Run a script/pipeline that fetches the data, cleans it up, calculates summary stats, fits models and outputs estimates + intervals, and maybe adds some built-in interpretations (a rough sketch of these first two steps follows below)
  3. A PDF/HTML report is generated (depends on T168683)
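A rough sketch of steps 1 and 2, purely to illustrate the intended flow. The config file name and fields (sql_query, groups), the test_group column, and the connection object con are assumptions for this example and do not come from any existing code.

```
library(yaml)

# Step 1: read the parameters from the (hypothetical) config file
config <- yaml.load_file("test_config.yaml")

# Step 2a: fetch the raw events using the query from the config; `con` stands
# in for whatever connection the pipeline opens to the log database
events <- DBI::dbGetQuery(con, config$sql_query)

# Step 2b: basic cleanup, e.g. keep only the test groups named in the config
events <- events[events$test_group %in% config$groups, ]

# Step 3 would hand `config` and `events` to the parameterized report template
# (see the render() sketch in the task description).
```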

Metrics:

  • Clickthrough
  • Zero result rate + credible intervals (a minimal sketch follows after this list)
  • PaulScore + bootstrapped confidence interval
  • Dwell time / survival (LD50 + confidence interval)
  • Scroll on visited page
  • First clicked result position
  • Maximum clicked result position
  • Did the user look at other pages of the search results?
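As an example of the second metric above, a zero result rate with a Bayesian credible interval can be computed from a Beta posterior. This is a generic sketch, not necessarily the exact method the tool will implement.

```
# Zero result rate (ZRR) with a credible interval from a Beta(1, 1) prior and
# a binomial likelihood; inputs are the number of zero-result searches and the
# total number of searches.
zrr_credible_interval <- function(n_zero, n_searches, level = 0.95) {
  alpha <- 1 + n_zero
  beta  <- 1 + n_searches - n_zero
  c(
    estimate = n_zero / n_searches,
    lower    = qbeta((1 - level) / 2, alpha, beta),
    upper    = qbeta(1 - (1 - level) / 2, alpha, beta)
  )
}

zrr_credible_interval(n_zero = 120, n_searches = 1500)  # made-up counts
```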

MVP

  • Configuration reader
  • Data fetcher + processor
  • Summary stats in a table (a small illustrative sketch follows after this list):
    • CTR
    • ZRR
    • First clicked
  • Generate Rmd/markdown of report
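A small illustrative sketch of the summary-stats table, using a toy per-search data frame. The column names here are made up for the example and are not the schema the MVP will actually use.

```
library(dplyr)

# Toy per-search data: which group the search was in, whether it returned
# results, whether anything was clicked, and the position of the first click.
searches <- data.frame(
  test_group           = c("control", "control", "test", "test"),
  got_results          = c(TRUE, FALSE, TRUE, TRUE),
  clicked              = c(TRUE, FALSE, FALSE, TRUE),
  first_click_position = c(1, NA, NA, 3)
)

summary_table <- searches %>%
  group_by(test_group) %>%
  summarize(
    CTR           = mean(clicked),                              # clickthrough rate
    ZRR           = mean(!got_results),                         # zero result rate
    first_clicked = median(first_click_position, na.rm = TRUE)  # typical first clicked position
  )

knitr::kable(summary_table)  # rendered as a markdown table in the report
```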
debt raised the priority of this task from Lowest to Medium. Jun 23 2017, 8:49 PM

Yay, thanks for taking the initiative on this!

chelsyx set the point value for this task to 10.

First prototype: https://github.com/chelsyx/auto_test_analysis
@mpopov Since pandoc hasn't been upgraded on stat1002, we can't test it on the analysis cluster. But feel free to test it locally: in bash, run Rscript run.R

Example report: https://people.wikimedia.org/~chelsyx/reports/Example1.html
The yaml file with the parameters for this example report is: https://people.wikimedia.org/~chelsyx/reports/Example1.yaml

Next, I will 1) run the report on datasets from previous A/B tests to see if there are any problems; 2) work on built-in interpretations.

Can some definitions be added to the document? I.e., what does 'checkin' mean vs. 'click' vs. 'esclick' vs. 'iwclick' vs. 'ssclick'?

Also, I'm not sure how the sister project snippets are part of the explore similar test. Were we still collecting that data, and did we want to compare the usage of the sister project clickthroughs vs. the explore similar links? If so, that needs more clarification in the document.

The overall document should have links to the tickets in Phabricator and a short summary of what the test was about. Terms like 'similar', 'languages', or 'categories' don't really mean much unless you have an idea of what the test was about.

Let me know if the above isn't clear. :)

Hi @debt ! Thank you for reviewing!

Can some definitions be added to the document? I.e., what does 'checkin' mean vs. 'click' vs. 'esclick' vs. 'iwclick' vs. 'ssclick'?

Sure! Will do.

Also, I'm not sure how the sister project snippets are part of the explore similar test. Were we still collecting that data, and did we want to compare the usage of the sister project clickthroughs vs. the explore similar links? If so, that needs more clarification in the document.

Sister project clickthroughs are unrelated to the explore similar test, but this example report is not a customized report for the explore similar test. The purpose of this tool is to compute as many tables/metrics as possible for all future A/B tests, so some metrics may be useful for a particular test and some may not. This example report shows everything the tool can provide right now. If there are metrics we don't want to see for a particular test, we can specify the relevant event actions in the yaml file. For example, for the explore similar test, instead of including all event actions, the yaml file can say: event_action: [searchResultPage, click, esclick, hover-on, hover-off].
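Illustrative only: how such an event_action whitelist could be read from the yaml file and applied to the fetched events. The toy data frame and column name below are assumptions for the example, not the actual code in chelsyx/auto_test_analysis.

```
library(yaml)

config <- yaml.load("event_action: [searchResultPage, click, esclick, hover-on, hover-off]")

# Toy stand-in for the fetched EventLogging data
events <- data.frame(
  event_action = c("searchResultPage", "click", "iwclick", "esclick", "ssclick"),
  stringsAsFactors = FALSE
)

# Keep only the whitelisted actions (here: searchResultPage, click, esclick)
events[events$event_action %in% config$event_action, , drop = FALSE]
```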

The overall document should have links to the tickets in Phabricator and a short summary of what the test was about. Terms like 'similar', 'languages', or 'categories' don't really mean much unless you have an idea of what the test was about.

I will add fields to the yaml file so that we can include the ticket link and a test description in the report. However, this auto-report will not replace the customized reports written by @mpopov or me. It's just a quick, incomplete, and possibly erroneous initial peek that lets us -- who know what the test was about -- see what was going on.

Did I answer your questions? ;)

Great job with this, @chelsyx!!! This is going to be such a useful tool when it's done! (Which it almost is! :P)

Status update: All Chelsy needs to do is review my pull request on GitHub and resolve a few issues, and it'll be ready for general beta-testing :D

Thank you very much @mpopov ! I'm reviewing now.

This comment was removed by greg.
chelsyx added subscribers: TJones, EBernhardson.

Hello @EBernhardson and @TJones, could you please help me review these examples generated by this report-generating tool?
Example 1 is the auto-generated report for the explore similar test: https://analytics.wikimedia.org/datasets/discovery/reports/Example1.html
Example 2 uses data from the second BM25 test and shows more by-wiki breakdowns than Example 1: https://analytics.wikimedia.org/datasets/discovery/reports/Example2.html

The code base is here: https://github.com/chelsyx/auto_test_analysis. You can test it on stat1005. The yaml files used to generate the above examples are https://people.wikimedia.org/~chelsyx/example1.yaml and https://people.wikimedia.org/~chelsyx/example2.yaml

The auto-generated report format is an easy way to get quick feedback on a test that was done - I think it looks good and we should continue to improve on it. The goal of this ticket was to generate something quickly and then follow up with a full-fledged report.

@chelsyx—So, I'm late to the party, but I wanted to say that this looks great! Having this much analysis done automatically is going to be very helpful.

A few minor comments:

  • Is it possible to automatically highlight possible discrepancies in Test Summary/Browser & OS? (Assuming that's a regular feature of the automated analysis.) In the explore similar example, the numbers are small, so I'm not sure if 4.2% vs 2.5% (for Mac OS X 10.11) is "interesting" or just a small sample size.
  • In Example2, are the "Sister Search" and "Explore Similar" tabs supposed to be there? I'm guessing it was a bit of config copied over from the first example.
  • Super minor UI nit-pick: when the page loads, or when a tab under Data Summary is selected, the default sub-tab has red text rather than the black text it gets when you actually click on it.

Looks great!

Thank you very much @TJones !

For your first comment, I added a Bayes factor to the table (see the example below). When the Bayes factor is >= 2, the discrepancy is considered substantial and the Bayes factor is highlighted.

CirrusSearch MLR AB test.png (screenshot, 177 KB)

The second and third comments are fixed too.
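For context, one way a Bayes factor for this kind of between-group discrepancy could be computed in R is with the BayesFactor package; the sketch below uses made-up counts and is not necessarily the exact method implemented in the report generator.

```
library(BayesFactor)

# Counts of sessions with / without some property (e.g. a particular OS) in
# each test group; the numbers here are made up for illustration.
tab <- matrix(c(42, 958,    # control: with property, without
                25, 975),   # test:    with property, without
              nrow = 2, byrow = TRUE,
              dimnames = list(c("control", "test"), c("yes", "no")))

bf <- contingencyTableBF(tab, sampleType = "indepMulti", fixedMargin = "rows")
extractBF(bf)$bf  # Bayes factor in favor of a difference between the groups
```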

@chelsyx - you should do a blog post on this! I think it's a great opportunity to provide an overview of why (and how) we do testing, along with examples of how our testing can show a marked improvement and/or point to potential problems that we didn't know we had (in regard to search). A portion of the blog could, of course, be all detailed and techie on the how side of things. :)

I think the final step is to transfer the repo from @chelsyx's personal GitHub account over to Gerrit (see Gerrit/New repositories for instructions) and add licensing info.

For the repo name… "wikimedia/discovery/autoreporter"?

Moving this back to waiting, @chelsyx ...to be sure we don't forget to do this step: T131795#3622998