Check on VE as default A/B test results
Closed, Resolved · Public

Description

Overview

We should now have enough data to do an initial check on the A/B test results, considering we started gathering "good data" on 14 July.

Done

Generate graphs showing how the following metrics compare between the treatment and control test groups:

  • Edit completion rate
  • Total number of completed edits
  • Time to save an edit
  • Edit size
  • Edit quality
  • Editing interface switching

Event Timeline

ppelberg created this task. Jul 31 2019, 2:27 PM
ppelberg updated the task description.

Updating the task description to include the metrics we would like to check on. Those metrics are reflected in the task description's "Done" section.

Neil_P._Quinn_WMF triaged this task as Medium priority. Aug 26 2019, 4:17 PM
Neil_P._Quinn_WMF moved this task from Triage to Doing on the Product-Analytics board.

During our 28 August team planning meeting, we discussed the initial results from the A/B test. Below is an unranked list of the questions that surfaced in that conversation.

For now, consider these questions as notes.

Open questions

  • 1) Why is the number of contributors in each test bucket not more balanced? Potential explanation: when VE fails to load, we attempt to load the source editor instead, which might partially explain the increased source editor load attempts.
  • 2) What event should mark the "start" of an edit? Asked in the context of calculating edit completion rate.
  • 3) Are the differences in edit completion rate between default-source and default-visual statistically significant?
  • 4) How do the results vary across wikis? Some of the wikis we included in the test are significantly larger than others; perhaps they had an outsized impact on the outcome.
  • 5) Is there a qualitative difference in the kinds of edits being attempted in either editing interface?
  • 6) How does the quality of edits vary between the two test groups?
  • 7) Does source produce more quick follow-up edits to fix issues from syntax mistakes?
  • 8) How many contributors switched away from their assigned editing interface?
  • 9) 99+% of edit sessions don't result in a save. What is happening in those sessions?
  • 10) What did we record if a user immediately switched to wikitext and completed their edit there? This could happen quite a lot if an existing user edited on their phone without bothering to sign in.
  • 11) Are we doing any analysis on the effects of mobile VE on knowledge equity?
  • 12) Can we group the results based on the user’s network connection quality or, if that’s unavailable, by geographical region, abort timing, or something similar? Thinking: it looks like VE users are more successful if the editor loads correctly, but less successful overall. This might mean that the editor often fails to load (which would not be unexpected because it is larger and takes longer to load, and the user might get impatient and cancel).
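One possible way to approach question 3 is a permutation test on per-user completion rates. The sketch below is only an illustration under stated assumptions: the function name `permutation_test`, its parameters, and the sample rates are hypothetical and not taken from the actual analysis, and with more than a million users per bucket a real run would likely need a faster method or a sample.

```python
import random

def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in means.

    group_a / group_b: per-user completion rates (hypothetical inputs).
    Returns an estimated p-value.
    """
    rng = random.Random(seed)
    mean = lambda xs: sum(xs) / len(xs)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign users to buckets at random and recompute the difference.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one smoothing so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)

# Made-up per-user rates, for illustration only.
source_rates = [0.0, 0.0, 1.0, 0.0, 0.5, 0.0]
visual_rates = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
print(permutation_test(source_rates, visual_rates, n_permutations=2_000))
```

Permuting users rather than individual attempts keeps the unit of analysis at the user level, since attempts by the same user are not independent.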

@Neil_P._Quinn_WMF @ppelberg is the check-in component done? Are the notes from discussion potential future tasks?

Neil_P._Quinn_WMF closed this task as Resolved. Sep 5 2019, 3:55 PM

Megan and I have done the initial checks and found the following.

53.4% of the users (both registered and anonymous) who were bucketed ended up in the wikitext default bucket. It turns out that it would be incredibly unlikely (p << 10^-15) to get an imbalance this big if our random assignment were actually 50%–50%. So there's clearly a serious issue somewhere that we need to understand.

| bucket | users |
| --- | --- |
| source default | 1,302,187 |
| visual default | 1,214,917 |
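As a rough illustration of how such a balance check could be done, here is a minimal sketch using a normal approximation to the binomial; it is not necessarily the method we used, and the function name `balance_check` is hypothetical. With the counts in the table, the source-default share works out to about 51.7%, so the 53.4% quoted above may come from a different cut of the data; either way, the z-score is roughly 55 and the p-value is far below 10^-15.

```python
import math

# Counts copied from the bucket table above.
source_default = 1_302_187
visual_default = 1_214_917

def balance_check(a: int, b: int, expected_share: float = 0.5):
    """Normal-approximation test that `a` out of `a + b` matches expected_share."""
    n = a + b
    observed_share = a / n
    std_err = math.sqrt(expected_share * (1 - expected_share) / n)
    z = (observed_share - expected_share) / std_err
    # Two-sided p-value; this underflows to 0.0 for very large |z|.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return observed_share, z, p_value

share, z, p = balance_check(source_default, visual_default)
print(f"source-default share: {share:.4f}, z = {z:.1f}, p = {p:.3g}")
```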

With that said, this is a preliminary look at our key metrics. The table shows the value of each metric for the average user in the bucket (because users are independent of one another, but different attempts by the same user are not).

| bucket | attempts | completed edits | edit completion rate |
| --- | --- | --- | --- |
| source default | 1.399 | 0.040 | 0.87% |
| visual default | 1.403 | 0.037 | 0.79% |
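To make the "average user" framing concrete, here is a small sketch of per-user aggregation. The rows and the function name `per_user_metrics` are made up for illustration and are not the real event data or query. It also shows why the average of per-user completion rates generally differs from completed edits divided by attempts, which would be consistent with the table's rate column not equalling 0.040 / 1.399.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (user_id, completed) rows, one per edit attempt.
attempts = [
    ("u1", False), ("u1", False), ("u1", True), ("u1", False),
    ("u2", False),
    ("u3", True),
]

def per_user_metrics(rows):
    """Return (mean attempts, mean completed edits, mean completion rate) per user."""
    per_user = defaultdict(lambda: [0, 0])  # user -> [attempts, completed]
    for user, completed in rows:
        per_user[user][0] += 1
        per_user[user][1] += int(completed)
    n_attempts = [a for a, _ in per_user.values()]
    n_completed = [c for _, c in per_user.values()]
    rates = [c / a for a, c in per_user.values()]
    return mean(n_attempts), mean(n_completed), mean(rates)

avg_attempts, avg_completed, avg_rate = per_user_metrics(attempts)
pooled_rate = sum(c for _, c in attempts) / len(attempts)
print(avg_attempts, avg_completed, avg_rate, pooled_rate)
```

In this toy data the per-user average rate and the pooled rate differ because the user with the most attempts has a lower completion rate than the others.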

In addition, among completed edits in each bucket, the average editing time was:

| bucket | editing time |
| --- | --- |
| source default | 2 min, 12 s |
| visual default | 2 min, 46 s |

@ppelberg, @MNeisler, and I discussed this yesterday and, yes, at this point we're calling the preliminary check done. Megan will dig into some of the issues above as part of T221198.

...Are the notes from discussion potential future tasks?

@kzimmerman, yep. Although, I've created a task that [hopefully] makes this more explicit: T232175.

Side note: I wonder whether using Phabricator as a drafting space (as I'm doing in T232175) is appropriate. So please, if you see a better way, I'm all ears (or I guess eyes in this context).


@MNeisler + @Neil_P._Quinn_WMF, thanks for pulling these results together. A question [1] related to "Editing interface switching"...

In the following scenario, how would the test make sense of this contributor's editing session? Would the test consider this contributor as having abandoned their edit in default-visual?

Scenario
i. Contributor taps edit
ii. Contributor is bucketed into default-visual
    Visual editor loads (read: reaches ready)
iii. Contributor switches to wikitext
iv. Contributor makes some changes
v. Contributor publishes their changes


[1] Neil, we may have discussed this before, but a quick search of phab and our shared doc didn't surface anything...

Neil_P._Quinn_WMF added a comment. Edited Sep 6 2019, 12:25 PM

You're right, I don't think we've discussed this before. Since switching interface on mobile doesn't result in a page reload, it won't cause the creation of a new editing session ID. That means the scenario you described would be treated as a single completed attempt. Deciding which interface to credit for the completion is complex, but deciding which bucket is simple since switching interfaces shouldn't affect the bucket.
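To illustrate the logic described above, here is a minimal sketch of grouping events by editing session ID. The `Event` fields, the action names, and `summarize_sessions` are all hypothetical stand-ins, not the real logging schema or analysis code. The point is only that a mid-session interface switch stays within one session, and the session's bucket is fixed at assignment.

```python
from dataclasses import dataclass

@dataclass
class Event:
    session_id: str  # hypothetical editing session ID
    bucket: str      # "default-visual" or "default-source"
    interface: str   # interface active when the event fired
    action: str      # hypothetical action name, e.g. "init", "ready", "saveSuccess"

def summarize_sessions(events):
    """Collapse events into one record per editing session."""
    sessions = {}
    for e in events:
        s = sessions.setdefault(
            e.session_id,
            {"bucket": e.bucket, "interfaces": [], "completed": False},
        )
        # Track every interface seen; switching does not start a new session.
        if e.interface not in s["interfaces"]:
            s["interfaces"].append(e.interface)
        if e.action == "saveSuccess":
            s["completed"] = True
    return sessions

# The scenario from the question: bucketed into default-visual, VE loads,
# the contributor switches to wikitext and publishes.
events = [
    Event("s1", "default-visual", "visualeditor", "init"),
    Event("s1", "default-visual", "visualeditor", "ready"),
    Event("s1", "default-visual", "wikitext", "switch"),
    Event("s1", "default-visual", "wikitext", "saveSuccess"),
]
print(summarize_sessions(events))
```

Under these assumptions the scenario yields a single completed attempt in the default-visual bucket, with both interfaces recorded on the session.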

Understood, ok. Thanks, Neil.

As a mental note: in the instance described in T229426#5470251, we would count this contributor as having completed their edit in the test bucket they were initially assigned to.