
VE mobile default: analyze A/B test results
Closed, ResolvedPublic

Description

This task is about re-analyzing the results of the mobile VE as default A/B test. [i]

Decision to be made

The decision this analysis is intended to help us make:
What editing interface should be shown to people who do not have a preference already set?

Decision points

These are the scenarios this analysis could surface, and the actions we intend to take in each case:

  • Scenario 1: People who are shown the visual editor by default are more likely to continue editing.
    • Condition: default-visual has a higher retention rate than default-source.
    • Actions:
      1. Confirm the revert rate meets current levels.
      2. Confirm the ready rate is not significantly lower than that of the other editing interface.
      3. Create a proposal to make the VisualEditor the default mobile editing interface on all wikis.
      4A. For wikis where showing newcomers and logged-out editors the visual editor by default on mobile would differ from what is done on desktop (e.g. en.wiki), we would treat the A/B test results as the beginning of a conversation with that wiki. The purpose of that conversation would be to decide, together, how the mobile editing interfaces should be configured for newcomers.
      4B. For wikis where showing newcomers and logged-out editors the visual editor by default on mobile would match what is done on desktop (e.g. ar.wiki), we would inform these wikis of the results, share our plan to make the visual editor the default on mobile for newcomers and logged-out users, and invite people to share evidence they think should lead us to reconsider this plan of action.
  • Scenario 2: People who are shown the source editor by default are more likely to continue editing.
    • Condition: default-source has a higher retention rate than default-visual.
    • Action: Investigate what could have contributed to this outcome (reason: the empirical studies we've done to date clearly show newcomers have more success and are more comfortable with the visual editor).
  • Scenario 3: People who are shown the source editor by default and people who are shown the visual editor by default are equally likely to continue editing.
    • Condition: default-source has a retention rate similar to default-visual's.
    • Action: Compare the interfaces across the other test metrics: the total number of edits made, revert rate, edit completion rate, and % of editors who make at least one successful edit.

Variables that will influence decision

The editing interface that is shown by default on mobile is the interface that:

  • Causes more people to continue editing Wikipedia
    • Metric: editor retention
    • Definition: 2nd week / 2nd month / 6 months
  • Causes more people to complete the edits they set out to make
    • Metrics: edit completion rate; % of editors who make at least 1 successful edit
    • Definitions: edit completion rate = the proportion of ready events that reach saveSuccess; successful edits = edits that reach saveSuccess and are not reverted
  • Does not cause more vandalism
    • Metric: revert rate
    • Definition: the percentage of completed edits reverted within 48 hours [ii]

Metrics

These are the metrics we intended to calculate as part of this analysis:

  • Editor retention: 2nd week / 2nd month / 6 months
  • Edit completion rate: the proportion of ready events that reach saveSuccess
  • % of editors who make at least 1 successful edit: the percentage of editors with at least one edit that reaches saveSuccess and is not reverted within 48 hours [ii]
  • Edit quality (read: revert rate): the percentage of completed edits reverted within 48 hours [ii]
  • Disruption: the percentage of edits completed in an editing interface other than the one shown by default
  • Total number of completed edits: the total number of edits that reach saveSuccess and are not reverted within 48 hours [ii]
  • Load time: the amount of time that elapses between when someone presses edit (init) and when the editor is ready (ready)
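
To make the funnel definitions above concrete, here is a minimal sketch of how edit completion rate and load time might be computed, assuming a pandas DataFrame of editing-funnel events with illustrative `session_id`, `event`, and `timestamp` columns (the column names are assumptions, not the actual schema):

```python
import pandas as pd

def edit_completion_rate(events: pd.DataFrame) -> float:
    """Proportion of sessions with a ready event that also reach saveSuccess."""
    by_session = events.groupby("session_id")["event"].agg(set)
    ready_sessions = by_session[by_session.apply(lambda s: "ready" in s)]
    return ready_sessions.apply(lambda s: "saveSuccess" in s).mean()

def load_time_ms(events: pd.DataFrame) -> pd.Series:
    """Per-session milliseconds between init (edit pressed) and ready (editor ready)."""
    pivot = events.pivot_table(index="session_id", columns="event",
                               values="timestamp", aggfunc="min")
    return (pivot["ready"] - pivot["init"]).dt.total_seconds() * 1000
```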

Open questions

  • 1. Considering the number of "exploratory" clicks, might it be better to define edit completion rate as: the proportion of firstChange events that reach saveSuccess?
  • 2. How much higher does the revert rate in either mode need to be for it to be "significant"?
  • 3. Is 90 days worth of data sufficient to confidently make this decision?

i. https://www.mediawiki.org/wiki/VisualEditor_on_mobile/VE_mobile_default
ii. While I recall "48 hours" being the standard window within which an edit is considered to have been reverted or not, I have yet to find documentation that corroborates this.
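
On open question 2, one conventional way to decide whether an observed revert-rate gap is statistically significant is a two-proportion z-test; a minimal sketch with placeholder counts (not data from this test):

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical counts per bucket: [default-visual, default-source].
reverted = [410, 330]     # completed edits reverted within 48 hours
completed = [1000, 1000]  # completed edits overall

stat, p_value = proportions_ztest(count=reverted, nobs=completed)
print(f"z = {stat:.2f}, p = {p_value:.4f}")  # "significant" at e.g. p < 0.05
```

Statistical significance alone doesn't settle the question, though: the team would still need to choose the smallest revert-rate increase it considers practically meaningful.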


Notes

  • The AHT team will be implementing a change that could lead to more anonymous editors starting editing sessions on mobile. We should keep this in mind if we notice any changes in the A/B test results around anons.
    • Details about this work can be found in T189717

Related Objects

Event Timeline

nshahquinn-wmf renamed this task from VE mobile default: analyze impact to VE mobile default: analyze A/B test results. May 21 2019, 9:55 PM
nshahquinn-wmf moved this task from Triage to Epics on the Product-Analytics board.
kzimmerman triaged this task as Medium priority. Aug 20 2019, 6:43 PM

Reassigning to @MNeisler to discuss with @ppelberg as they go through priorities

Moving to blocked pending completion of the A/B test rerun.

kzimmerman added a subscriber: Mayakp.wiki.

Moving to backlog per prioritization discussions with @Mayakp.wiki and @ppelberg; we will not dig into this in Q3 and will revisit in Q4.

Update on parent ticket: Test is currently running
T235101#5782877

@MNeisler, below are the metrics I think are important for us to calculate as part of our analysis of the mobile VE as default A/B test.

You'll notice there are some additional metrics in the list below that were not present in our initial analysis. [i.][ii.]

A couple of resulting questions for you:

  • How much time do you estimate it will take to complete an analysis of the "Existing metrics"?
  • How much time do you estimate it will take to complete an analysis of the "Existing metrics" and "Additional metrics"?

Key metrics

Ranked by importance. Most important = 1; least important = 7.

  1. Editor retention: Are contributors in one test group more likely to come back to edit again in subsequent periods than contributors in the other test group? (See the retention sketch after this list.)
  2. Edit quality: Are contributors’ edits in one test group more likely to be reverted than contributors’ edits in another test group?
  3. Disruption: Are contributors in one test group switching to and completing edits in an editing interface that was not the default?
  4. Edit completion rate: of the contributors who start an edit, what percentage successfully save/publish their edits, and how do those rates compare across the two test groups and experience levels?
  5. Total number of completed edits: Do contributors in one test group complete more edits than contributors in the other test group?
  6. Time the editing interface takes to load
  7. Time to save an edit: Do contributors in one test group complete their edits more quickly than contributors in the other test group? This is a metric we would need to look at alongside other measures, like the size of the edits being made.
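
A minimal sketch of how metric 1 (second-week retention) might be computed, assuming a pandas DataFrame of saved edits with illustrative `user_id` and `timestamp` columns; the 7-14 day window is an assumption based on the "2nd week" definition in the task description:

```python
import pandas as pd

def second_week_retention(edits: pd.DataFrame) -> float:
    """Share of editors with at least one edit 7-14 days after their first edit."""
    first = (edits.groupby("user_id")["timestamp"].min()
                  .rename("first_edit").reset_index())
    merged = edits.merge(first, on="user_id")
    delta = merged["timestamp"] - merged["first_edit"]
    week2 = merged[(delta >= pd.Timedelta(days=7)) & (delta < pd.Timedelta(days=14))]
    return week2["user_id"].nunique() / first["user_id"].nunique()
```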

Additional metrics

  • % of editors who make at least 1 successful edit in the test and control groups
  • Average bytes added by editors in the test and control groups

i. T221195
ii. T229426

@MNeisler, I made some changes to T221198#6183436. Those changes are also below so they're [ideally] easier for you to detect:

  • ADDED to Key metrics: "Time the editing interface takes to load"
    • Rationale: The first iteration of this test showed load times varied between editing interfaces.
  • REMOVED from Key Metrics: "Time to save an edit: Do contributors in one test group complete their edits more quickly than contributors in the other test group?"
    • Rationale: this seems like a noisy metric that does not offer much insight into how a particular editing interface affects the likelihood someone will start and continue editing over a longer period of time.
  • CHANGED "Key metrics" order

@ppelberg

Thanks for clarifying the scope. See answers to your open questions as well as some follow-up questions below:

How much time do you estimate it will take to complete an analysis of the "Existing metrics"?

There are existing queries, or similar queries that can be modified, to calculate a number of these existing metrics. I would estimate 1.5 to 2 weeks to complete the analysis, accounting for the time needed to update/modify the queries, run the code, and QA the results.

How much time do you estimate it will take to complete an analysis of the "Existing metrics" and "Additional metrics"?

The additional metrics do not add much complexity to the analysis. I would estimate another 3 days, so together I think it would take 2 to 2.5 weeks to calculate all the listed metrics.

Follow-up Questions

  • What are the decision points for this analysis? (i.e., what results will determine which actions the team will take?) I don't remember whether this has already been decided or documented somewhere, but I recommend we confirm it prior to completing the analysis. Happy to discuss with you and/or the team further to get consensus.
  • Edit completion rate: Are we defining it as the proportion of ready events that are saved? This was the definition used for the edit card analysis, but I just want to confirm.
  • % of editors who make at least 1 successful edit: To confirm, are we defining success as edits that reach saveSuccess, as edits that are not reverted, or both?

1.5-2 weeks for "Key metrics" and 2-2.5 weeks if we include "Additional metrics"...understood. Thank you, @MNeisler.

Answers to the follow-up questions you posed below...


Follow-up Questions

  • What are the decision points for this analysis? (i.e., what results will determine which actions the team will take?) I don't remember whether this has already been decided or documented somewhere, but I recommend we confirm it prior to completing the analysis. Happy to discuss with you and/or the team further to get consensus.

Good call. Originally, we'd defined the decision points through the frame of edit completion rate [i][ii]. However, I no longer think this is the correct key metric to use.

I will draft a new scenario plan for us (you, the team and I) to discuss.

  • Edit completion rate: Are we defining it as the proportion of ready events that are saved? This was the definition used for the edit card analysis, but I just want to confirm.

Originally, we'd defined edit completion rate as the percent of init events that reach saveSuccess [iii]. I think this definition should be changed to what you described: the proportion of ready events that reach saveSuccess.

This is not to say we will not consider load times when determining which editing interface is "better," but rather that we will calculate load times independently of edit completion rates.
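
A minimal sketch of the difference between the two definitions, using the same kind of illustrative events frame as above (`session_id` and `event` columns are assumptions): measuring from ready rather than init drops sessions that never finish loading, which is what makes completion rate independent of load time.

```python
import pandas as pd

def completion_rates(events: pd.DataFrame) -> dict:
    """Compare init-based and ready-based edit completion rates."""
    sessions = events.groupby("session_id")["event"].agg(set)
    rate = lambda name: (sessions[sessions.apply(lambda s: name in s)]
                         .apply(lambda s: "saveSuccess" in s).mean())
    return {"init-based": rate("init"), "ready-based": rate("ready")}
```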

  • % of editors who make at least 1 successful edit: To confirm, are we defining success as edits that reach saveSuccess, as edits that are not reverted, or both?

Let's define them as edits that reach saveSuccess and are not reverted.


Next steps

  • @ppelberg to draft scenario plan for potential test result outcomes.

i. https://www.mediawiki.org/wiki/VisualEditor_on_mobile/VE_mobile_default#Test_scenarios
ii. T226687
iii. https://nbviewer.jupyter.org/github/wikimedia-research/2019-07-mobile-interfaces-experiment/blob/master/analysis.ipynb

Moving this task to Upcoming Quarter on the Product Analytics board as confirmed in weekly 1:1 with Peter and Megan.

ppelberg added a subscriber: Esanders.

Non sequitur

  • Adding a "Notes" section to the task description to document the potential for an increase in anon editing sessions as a result of AHT starting work on T189717. Thank you for raising this, @Esanders
ppelberg updated the task description. (Show Details)

Next steps

  • @ppelberg to draft scenario plan for potential test result outcomes.

The task description has been updated with a provisional decision tree.


Below are the decisions we came to in these conversations:

  • DECIDED: we will expand the "Decision points" so they take into account the editing defaults wikis have in place on desktop.
    • Reason: the "Decision points" and resulting courses of action should be as specific as possible.
    • Resulting action: the "Action" column of the "Decision points" table in the task description has been updated to reflect this.
  • DECIDED: the behavior of people who have edited on desktop before, but not on mobile, should NOT influence the decision about what editing interface is shown on mobile by default to people who have not edited Wikipedia before.
    • Reason: This test is about determining what editing interface is more likely to cause new contributors to continue editing Wikipedia. As such, the behavior of people who have edited Wikipedia, albeit on a different platform, should not influence this decision.
    • Resulting action: The segment of users who have edited on desktop before should not impact the decision about what mobile editing interface is shown to people who have not edited Wikipedia before.

Below are the questions that surfaced in these conversations:

  • How much higher does the revert rate in either mode need to be for it to be significant?
    • The above has been added to the "Open questions" section of the task description.

Below are the questions that surfaced in these conversations:

  • How much higher does the revert rate in either mode need to be for it to be significant?

@MNeisler and I talked about the revert rate some more today. While the question above remains open, we did advance the conversation about the A/B test. The ways in which we "advanced" the conversation are noted below.

Notes

  • DECIDED: in the context of "Scenario 3," we are considering revert rate to be a guardrail metric.
    • Where "guardrail metric" means we do not want to see the revert rate go up at the expense of optimizing other goal metrics which, in this scenario, are the following:
      • % of editors who make at least one successful edit
      • Edit completion rate
      • The total number of edits made
    • Rationale: the goal of this experiment [and the potential changes to editing interface defaults it precipitates] is to grow the number of people who continue making productive edits [i] to Wikipedia. The goal is not to optimize the revert rate; rather, the intent is to ensure that any change in editing interface does not negatively affect it to a significant extent.
  • DECIDED: in order to determine whether the revert rates in either test bucket are notable, we need a baseline revert rate to compare them to. Today, we came to define the "baseline revert rate" as the revert rate of edits that meet the conditions listed below (a sketch of this filter follows the footnote). The work to calculate this revert rate will happen in this task: T259196
    • Edits made on desktop web
    • Edits made using the visual editor or the wikitext editor
    • Edits made by people
      • Edits made by people who are logged out or who have made 0 cumulative edits
    • Edits made to Wikipedia
      • Edits made to a content namespace

i. Where "productive edits" mean edits that are not reverted within 48 hours of being made.

18-Sep update
Megan and I are finalizing the draft update that we will post to the VisualEditor on mobile/VE mobile default page.


✅ Done: https://w.wiki/er7

LGoto lowered the priority of this task from Medium to Low. Mar 15 2021, 4:30 PM
LGoto moved this task from Upcoming Quarter to Backlog on the Product-Analytics board.

Removing the task assignee due to inactivity, as this open task has been assigned for more than two years. See the email sent to the task assignee on February 6th, 2022 (and T295729).

Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome.

If this task has been resolved in the meantime, or should not be worked on ("declined"), please update its task status via "Add Action… 🡒 Change Status".

Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips on how to best manage your individual work in Phabricator.

ppelberg moved this task from FY 18-19 Q3/Q4 to Triaged on the VisualEditor board.

UPDATE
We've started analyzing the data from this A/B test and will update this ticket with more details about what we're finding in the coming weeks.

MNeisler edited projects, added Product-Analytics (Kanban); removed Product-Analytics.

We have completed the analysis of the A/B test and the results have been shared on the project page.

Below is a summary of the key results:

The test ran on 20 Wikipedias from 1 November 2019 through 26 September 2022 and included new contributors, both logged-in and logged-out, who had not edited on any platform before.

Results

  • The editing interface shown by default on mobile did not significantly increase or decrease the likelihood that a person would return to publish an edit on mobile after their first mobile edit; none of the differences below were statistically significant.
    • 1.7% of people who were shown visual editor as the default editor returned to make one or more additional mobile edits 2 weeks after their first mobile edit compared to 2.0% of people shown wikitext as the default editor.
    • 2.3% of people who were shown visual editor as the default editor returned to make one or more additional mobile edits 2 months after their first mobile edit compared to 2.5% of people shown wikitext as the default editor.
    • 0.9% of registered people who were shown visual editor as the default editor returned to make one or more additional mobile edits 6 months after their first mobile edit compared to 1.0% of registered people shown wikitext as the default editor.
  • People shown visual editor as the default editor were slightly more successful at publishing the edits they started, and slightly less successful at publishing non-reverted edits. These differences are statistically significant, but the absolute differences are small (under 1 percentage point).
    • People who were shown the visual editor as the default editor published the edits they started at a rate (3.1%) that was 11.6% higher than the rate (2.8%) for people who were shown wikitext as the default editor.
    • Excluding all reverted edits, people who were shown the visual editor as the default editor published the edits they started at a rate (1.85%) that was 2.0% lower than the rate (1.88%) for people who were shown wikitext as the default editor.
  • People shown visual editor as the default editing interface were slightly more likely to successfully publish at least 1 edit. This difference is statistically significant and the absolute difference is medium.
    • 1.3% of people who were shown visual editor as the default editor successfully published at least one mobile edit during the A/B test, compared to 0.9% of people shown wikitext as the default editor. This represents a 44% increase.
  • People shown the visual editor as the default editing interface were more likely to be reverted. The difference is statistically significant and the absolute difference is medium.
    • Edits by people shown the visual editor as the default editing interface were reverted at a rate (40.1%) 26% higher than the rate for people shown wikitext (31.9%).
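
For reference, the relative differences quoted above are ratios of the two groups' rates, e.g. for the revert rates in the last bullet:

```python
ve_rate, wikitext_rate = 0.401, 0.319
relative_increase = (ve_rate - wikitext_rate) / wikitext_rate
print(f"{relative_increase:.0%}")  # 26%, matching the figure above
```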

The full A/B test report can be found here: Mobile VE as Default AB Test Analysis Report