
Decide how we move forward with the A/B test
Closed, Resolved · Public

Description

In T232237, we learned:

  • Why the two test buckets were unbalanced
  • What the implications are of this imbalance

This task is about deciding what we do next.

  1. Do we re-run the test?
  2. Do we re-run the test with modifications? If so, with what modifications?
  3. Do we end the test and analyze the results in a way that corrects for the test-bucket imbalance issues?

The team has been using this document to gather information and discuss this decision. It should be open for public commenting: Decision/Re-run VE default A/B test

Done

  • There is a clear plan that answers questions 1, 2, and 3 listed above

Event Timeline

I think we should re-run the test with no modifications to the design of the test, save for any changes @MNeisler deems to be necessary as a result of answering the "Resulting questions" below.

cc @Esanders


My thinking

For this test to be worthwhile, we need to be able to:

  1. Answer: What editing interface makes for a "better" editing experience for newer contributors?
  2. Use the data we gather during the A/B test to do further analysis to understand and act on the factors that could be contributing to the test's outcome(s). [1]

Considering that the data we've gathered in the initial test excludes all information about ~100,000+ contributors [2], or ~7% [3], of the total default-visual users, not re-running the test would mean relying on incomplete data for our "further analysis."

As to what – if any – other work we should do as part of re-running the test: I think we should limit this work to making sure we are tracking the data we will need to do the "further analysis" we are planning, which includes answering the questions listed here...

Questions:

  • How does edit completion rate vary by ____ ? (a sketch of this breakdown follows the list below)
    • Platform (iOS vs. Android)
    • Network connection
    • Editor switching
    • Wiki
    • Anonymous vs. registered
    • Country/region
    • Experience level
    • Abandon timing
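
For illustration only, here is a minimal sketch of how such a breakdown could be computed, assuming a hypothetical per-editing-session table (the column names below are illustrative, not the actual EditAttemptStep schema) and defining edit completion rate as saved sessions divided by all sessions:

```
import pandas as pd

# Hypothetical per-editing-session frame; columns are illustrative only.
sessions = pd.DataFrame({
    "bucket":       ["visualeditor", "wikitext", "visualeditor", "wikitext"],
    "platform":     ["iOS", "Android", "Android", "iOS"],
    "wiki":         ["enwiki", "enwiki", "dewiki", "dewiki"],
    "is_anonymous": [True, False, True, False],
    "saved":        [1, 0, 1, 1],  # 1 if the session ended in a successful save
})

def completion_rate(df, dims):
    """Edit completion rate (saved sessions / all sessions) split by `dims`."""
    grouped = df.groupby(["bucket", *dims])["saved"]
    return grouped.mean().rename("edit_completion_rate").reset_index()

# Example: completion rate by test bucket and platform.
print(completion_rate(sessions, ["platform"]))
```

Swapping "platform" for any of the other dimensions above would give the corresponding split.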

I think we should limit any interim work to the above mainly because...

The answers to the questions we've listed above, in T232175, and in this doc have more to do with our further analysis of the results. We should not conflate our drive to better understand what could be leading contributors using mobile VE to have a lower edit completion rate with our drive to make sure we have accurate data we can later use as part of this "further analysis."


Resulting questions for @MNeisler:

  • What – if any – modifications do you think we should make to the test to properly answer our key question: "What editing interface makes for a 'better' editing experience for newer contributors?"
    • e.g. Should edits that trigger abuse filters be excluded from our calculation of edit completion rate?
    • e.g. Should bot edits be excluded from our calculation of edit completion rate? Maybe this is moot considering there were only ~43 bot-tagged mobile web page edits in the past 30 days on en.wiki, so they are not likely to impact the test results.
  • What – if any – additional instrumentation and/or QA is needed to answer the "Questions" listed above?

  1. "Further analysis": T232175
  2. T229426#5468481
  3. (1,302,187 − 1,214,917) / 1,302,187 ≈ 7% | T229426

Resulting questions for @MNeisler:

  • What – if any – modifications do you think we should make to the test to properly answer our key question: "What editing interface makes for a 'better' editing experience for newer contributors?"
    • e.g. Should edits that trigger abuse filters be excluded from our calculation of edit completion rate?
    • e.g. Should bot edits be excluded from our calculation of edit completion rate? Maybe this is moot considering there were only ~43 bot-tagged mobile web page edits in the past 30 days on en.wiki, so they are not likely to impact the test results.

I don't believe any modifications are needed to the bucketing strategy and overall test design to answer our key question.

It definitely wouldn't hurt to filter out bot edits from the calculation of the edit completion rate this time. I doubt it would have any significant effect on the test results since there are likely few tagged bot edits and they should get randomized just like any other user; however, I can easily filter them out when I rerun the analysis.
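
As a hedged sketch of what that filtering might look like – not the actual analysis pipeline, and with an illustrative table whose `tags` column merely mimics MediaWiki change tags – excluding bot-tagged records before computing the rate could be as simple as:

```
import pandas as pd

# Hypothetical editing-attempt records; `tags` mimics MediaWiki change tags.
attempts = pd.DataFrame({
    "bucket": ["visualeditor", "wikitext", "visualeditor"],
    "tags":   [["mobile edit"], ["bot", "mobile edit"], []],
    "saved":  [1, 1, 0],
})

# Drop anything carrying the "bot" change tag before computing the
# edit completion rate, as discussed above.
non_bot = attempts[~attempts["tags"].apply(lambda t: "bot" in t)]
rate_by_bucket = non_bot.groupby("bucket")["saved"].mean()
print(rate_by_bucket)
```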

  • What – if any – additional instrumentation and/or QA is needed to answer the "Questions" listed above?

I can answer all of the questions listed above using existing instrumentation except for network connection; however, other available data such as geographical region, abort timing, and save failures should help provide insights into that question.

Thank you, Megan. A couple of quick comments in-line, below...

Resulting questions for @MNeisler:

  • What – if any – modifications do you think we should make to the test to properly answer our key question: "What editing interface makes for a 'better' editing experience for newer contributors?"
    • e.g. Should edits that trigger abuse filters be excluded from our calculation of edit completion rate?
    • e.g. Should bot edits be excluded from our calculation of edit completion rate? Maybe this is moot considering there were only ~43 bot-tagged mobile web page edits in the past 30 days on en.wiki, so they are not likely to impact the test results.

I don't believe any modifications are needed to the bucketing strategy and overall test design to answer our key question.

It definitely wouldn't hurt to filter out bot edits from the calculation of the edit completion rate this time. I doubt it would have any significant effect on the test results since there are likely few tagged bot edits and they should get randomized just like any other user; however, I can easily filter them out when I rerun the analysis.

Sounds great. Let's filter them out in the analysis, then. I've represented this thought in our analysis ticket: T221198

  • What – if any – additional instrumentation and/or QA is needed to answer the "Questions" listed above?

I can answer all of the questions listed above using existing instrumentation except for network connection; however, other available data such as geographical region, abort timing, and save failures should help provide insights into that question.

Noted. I've represented this thought and the "Questions" mentioned in T234277#5542833 in this ticket: T232175#5545364

cc @Esanders

Next steps

I think the only thing we're waiting on now before the test can officially "restart" is for this patch (T232237#5535452) to be:

  1. Merged. See: T232237#5545272
  2. QA'd. Which – in addition to making sure contributors are being bucketed properly – should include a test of what, if any, impact it has on how long it takes an article, in read mode, to load. See: T232237#5536319

Next steps

I think the only thing we're waiting on now before the test can officially "restart" is for this patch (T232237#5535452) to be:

  1. Merged. See: T232237#5545272
  2. QA'd. Which – in addition to making sure contributors are being bucketed properly – should include a test of what, if any, impact it has on how long it takes an article, in read mode, to load. See: T232237#5536319

The above actions are now represented in the description of this task: T235101: Rerun mobile VE as default A/B test