
Decide how we move forward with the A/B test
Closed, Resolved · Public

Description

In T232237, we learned:

  • Why the two test buckets were unbalanced
  • What the implications are of this imbalance

This task is about deciding what we do next.

  1. Do we re-run the test?
  2. Do we re-run the test with modifications? If so, with what modifications?
  3. Do we end the test and analyze the results in a way that corrects for the test-bucket imbalance issues?

The team has been using this document to gather information and discuss this decision. It should be open for public commenting: Decision/Re-run VE default A/B test

Done

  • There is a clear plan that answers questions 1, 2, and 3 listed above

Event Timeline

I think we should re-run the test with no modifications to the design of the test, save for any changes @MNeisler deems to be necessary as a result of answering the "Resulting questions" below.

cc @Esanders


My thinking

For this test to be worthwhile, we need to be able to:

  1. Answer: What editing interface makes for a "better" editing experience for newer contributors?
  2. Use the data we gather during the A/B test to do further analysis to understand and act on the factors that could be contributing to the test's outcome(s). [1]

Considering that the data we've gathered in the initial test excludes all information about ~100,000+ contributors [2], or ~7% [3], of the total default-visual users, not re-running the test would mean relying on incomplete data for our "further analysis."

As to what – if any – other work we should do as part of re-running the test: I think we should limit this work to making sure we are tracking the data we will need to do the "further analysis" we are planning, which includes answering the questions listed here...

Questions:

  • How does edit completion rate vary by ____ ? (a sketch of this breakdown follows the list below)
    • Platform (iOS vs. Android)
    • Network connection
    • Editor switching
    • Wiki
    • Anonymous vs. registered
    • Country/region
    • Experience level
    • Abandon timing
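
For illustration only, here is a minimal sketch of how such a breakdown could be computed, assuming a hypothetical per-editing-session table (the column names below are illustrative, not the actual EditAttemptStep schema) and defining edit completion rate as saved sessions divided by all sessions:

```
import pandas as pd

# Hypothetical per-editing-session frame; columns are illustrative only.
sessions = pd.DataFrame({
    "bucket":       ["visualeditor", "wikitext", "visualeditor", "wikitext"],
    "platform":     ["iOS", "Android", "Android", "iOS"],
    "wiki":         ["enwiki", "enwiki", "dewiki", "dewiki"],
    "is_anonymous": [True, False, True, False],
    "saved":        [1, 0, 1, 1],  # 1 if the session ended in a successful save
})

def completion_rate(df, dims):
    """Edit completion rate (saved sessions / all sessions) split by `dims`."""
    grouped = df.groupby(["bucket", *dims])["saved"]
    return grouped.mean().rename("edit_completion_rate").reset_index()

# Example: completion rate by test bucket and platform.
print(completion_rate(sessions, ["platform"]))
```

Swapping "platform" for any of the other dimensions above would give the corresponding split.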

I think we should limit any interim work to the above mainly because...

The answers to the questions we've listed above, in T232175, and in this doc have more to do with our further analysis of the results. We should not conflate our drive to better understand what could be leading contributors using mobile VE to have a lower edit completion rate with our drive to make sure we have accurate data we can later use as part of this "further analysis."


Resulting questions for @MNeisler:

  • What – if any – modifications do you think we should make to the test to properly answer our key question: "What editing interface makes for a 'better' editing experience for newer contributors?"
    • e.g. Should edits that trigger abuse filters be excluded from our calculation of edit completion rate?
    • e.g. Should bot edits be excluded from our calculation of edit completion rate? Maybe this is moot considering there were only ~43 bot-tagged mobile web page edits in the past 30 days on en.wiki, so they are not likely to impact the test results.
  • What – if any – additional instrumentation and/or QA is needed to answer the "Questions" listed above?

  1. "Further analysis": T232175
  2. T229426#5468481
  3. (1,302,187 − 1,214,917) / 1,302,187 ≈ 7% | T229426

Resulting questions for @MNeisler:

  • What – if any – modifications do you think we should make to the test to properly answer our key question: "What editing interface makes for a 'better' editing experience for newer contributors?"
    • e.g. Should edits that trigger abuse filters be excluded from our calculation of edit completion rate?
    • e.g. Should bot edits be excluded from our calculation of edit completion rate? Maybe this is moot considering there were only ~43 bot-tagged mobile web page edits in the past 30 days on en.wiki, so they are not likely to impact the test results.

I don't believe any modifications are needed to the bucketing strategy and overall test design to answer our key question.

It definitely wouldn't hurt to filter out bot edits from the calculation of the edit completion rate this time. I doubt it would have any significant effect on the test results since there are likely few tagged bot edits and they should get randomized just like any other user; however, I can easily filter them out when I rerun the analysis.
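
As a hedged sketch of what that filtering might look like – not the actual analysis pipeline, and with an illustrative table whose `tags` column merely mimics MediaWiki change tags – excluding bot-tagged records before computing the rate could be as simple as:

```
import pandas as pd

# Hypothetical editing-attempt records; `tags` mimics MediaWiki change tags.
attempts = pd.DataFrame({
    "bucket": ["visualeditor", "wikitext", "visualeditor"],
    "tags":   [["mobile edit"], ["bot", "mobile edit"], []],
    "saved":  [1, 1, 0],
})

# Drop anything carrying the "bot" change tag before computing the
# edit completion rate, as discussed above.
non_bot = attempts[~attempts["tags"].apply(lambda t: "bot" in t)]
rate_by_bucket = non_bot.groupby("bucket")["saved"].mean()
print(rate_by_bucket)
```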

  • What – if any – additional instrumentation and/or QA is needed to answer the "Questions" listed above?

I can answer all of the questions listed above using existing instrumentation except for network connection; however, other available data such as geographical region, abort timing, and save failures should help provide insights into that question.

Thank you, Megan. A couple of quick comments in-line, below...

Resulting questions for @MNeisler:

  • What – if any – modifications do you think we should make to the test to properly answer our key question: "What editing interface makes for a 'better' editing experience for newer contributors?"
    • e.g. Should edits that trigger abuse filters be excluded from our calculation of edit completion rate?
    • e.g. Should bot edits be excluded from our calculation of edit completion rate? Maybe this is moot considering there were only ~43 bot-tagged mobile web page edits in the past 30 days on en.wiki, so they are not likely to impact the test results.

I don't believe any modifications are needed to the bucketing strategy and overall test design to answer our key question.

It definitely wouldn't hurt to filter out bot edits from the calculation of the edit completion rate this time. I doubt it would have any significant effect on the test results since there are likely few tagged bot edits and they should get randomized just like any other user; however, I can easily filter them out when I rerun the analysis.

Sounds great. Let's filter them out in the analysis, then. I've represented this thought in our analysis ticket: T221198

  • What – if any – additional instrumentation and/or QA is needed to answer the "Questions" listed above?

I can answer all of the questions listed above using existing instrumentation except for network connection; however, other available data such as geographical region, abort timing, and save failures should help provide insights into that question.

Noted. I've represented this thought and the "Questions" mentioned in T234277#5542833 in this ticket: T232175#5545364

cc @Esanders

Next steps

I think the only thing we're waiting on now before the test can officially "restart" is for this patch (T232237#5535452) to be:

  1. Merged. See: T232237#5545272
  2. QA'd. Which – in addition to making sure contributors are being bucketed properly – should include a test of what, if any, impact it has on how long it takes an article, in read mode, to load. See: T232237#5536319

Next steps

I think the only thing we're waiting on now before the test can officially "restart" is for this patch (T232237#5535452) to be:

  1. Merged. See: T232237#5545272
  2. QA'd. Which – in addition to making sure contributors are being bucketed properly – should include a test of what, if any, impact it has on how long it takes an article, in read mode, to load. See: T232237#5536319

The above actions are now represented in the description of this task: T235101: Rerun mobile VE as default A/B test