
Re-run metrics from VE on mobile report
Closed, Declined · Public

Description

Background

We'd like to update some of the metrics from the VisualEditor on mobile report, as some of them are now a year old.

1. Desktop edit completion rate
We'd like to know the edit completion rate on desktop, broken out by editing interface, user experience, and whether VE is available by default to logged-out users on the wiki (see "Done" for more details).

2. Mobile edit completion rate
One of the questions we are trying to answer through the VE as default A/B test is: "Does making VE the default mobile editing interface increase the overall mobile edit completion rate?"

To answer this question, we'll need a baseline measure of the current overall edit completion rate, a number we do not currently have (this metric was last updated in May 2018).

3. Mobile edit sessions
We'd like to know the experience level of the contributors attempting to make edits on mobile. We think this measure will be helpful in figuring out where best to intervene. Are most edit sessions from newer contributors? If so, maybe we should focus on onboarding... Are most edit sessions from more experienced contributors? If so, maybe we should focus on optimizing "deeper" parts of the edit funnel...

"Done"

  • 1. A table that contains desktop edit completion rate by interface and user experience. Wikis where VE is not available by default on desktop (i.e. en.wiki, es.wiki, nl.wiki) should be broken out and evaluated separately from wikis where VE is available by default on desktop. [1]
  • 2. A table that contains mobile edit completion rate by interface [2] and user experience [3] (see: T202134)
  • 3. A table that contains mobile edit sessions by interface and user experience

---

  1. /VisualEditor/Rollouts
  2. Interface: wikitext / VE
  3. User experience: number of edits

Event Timeline

ppelberg renamed this task from "Re-run edit completion rate for mobile VE" to "Re-run metrics from VE on mobile report". May 20 2019, 10:10 PM
ppelberg updated the task description.
ppelberg updated the task description.
ppelberg added a subscriber: DannyH.
kzimmerman subscribed.

Adjusting priority to High.

As I understand it, the most urgent need is for Desktop edit completion rate segmented by whether VE is default or not, in addition to editing interface and user experience.

As I understand it, the most urgent need is for Desktop edit completion rate segmented by whether VE is default or not, in addition to editing interface and user experience.

That's correct. Thank you for clarifying, @kzimmerman.

Thanks to @Neil_P._Quinn_WMF, we have a draft analysis of the following metrics:
A. Desktop edit completion rate
B. Mobile edit completion rate
C. Mobile edit sessions

A couple questions:

  • 1. For both "Completion rate" metrics – Out[88] and Out[92] – are we able to see the overall completion rate for each interface and platform (i.e. overall edit completion rate for desktop VE, overall edit completion rate for desktop wikitext, overall edit completion rate for mobile VE, overall edit completion rate for mobile wikitext)?
  • 2. For Out[92], "Desktop completion rate by interface, experience, and VE default status," are we able to see how wikitext and VE edit completion rates compare for wikis where VE is and is not the default?
  • 3. Is it possible for the user_experience level bucketing of these latest metrics to match the user_experience level bucketing in the VisualEditor on mobile report so we can compare the two?
  • 4. Related to "3." above, what considerations should we have in mind when comparing these latest numbers with those from May 2018? For example: does the fact that we're comparing two different months of data (16 May - 17 June, 2019 vs. May 2018) impact how we compare them? Are we defining edit completion rate now the same way we did last year?
  • 5. What sessions do we categorize as being "edit attempts"? Are they any sessions that reach init?

[Draft] Metrics

Updated version:

A couple questions:

  • 1. For both "Completion rate" metrics – Out[88] and Out[92] – are we able to see the overall completion rate for each interface and platform (i.e. overall edit completion rate for desktop VE, overall edit completion rate for desktop wikitext, overall edit completion rate for mobile VE, overall edit completion rate for mobile wikitext)?

Yes, I've added totals to the edges of the tables. Interested to hear if that is comprehensible or if there are better ways to present it.

  • 2. For Out[92], "Desktop completion rate by interface, experience, and VE default status," are we able to see how wikitext and VE edit completion rates compare for wikis where VE is and is not the default?

Not sure what you're asking—in the first draft, I showed the completion rate for desktop attempts in three buckets: visual editor on VE-default wikis, visual editor on non-VE default wikis, and wikitext editor on all wikis. If you're asking for totals, I've added those now. If you're asking to split the wikitext editor into two buckets based on VE default status, let me know and I can do that.

  • 3. Is it possible for the user_experience level bucketing of these latest metrics to match the user_experience level bucketing in the VisualEditor on mobile report so we can compare the two?

Yeah, good point. I used the new buckets here because they're an emerging standard for assessing user experience on our team, but comparability with the previous numbers is more important. Fixed in this version.

  • 4. Related to "3." above, what considerations should we have in mind when comparing these latest numbers with those from May 2018? For example: does the fact that we're comparing two different months of data (16 May - 17 June, 2019 vs. May 2018) impact how we compare them? Are we defining edit completion rate now the same way we did last year?

Yes, edit completion uses the same definition in both reports: the proportion of attempts (anything that reaches "ready") that reach save success. Otherwise, I can't think of anything significant. There haven't been any significant changes in which editors are the default. There might be a bit of seasonal variation, but overall I doubt it. Any significant changes in the edit completion rate are probably "real".

Next week, I'm going to try to do a graph of edit completion rate over time, but it won't go back to the time period of the baseline report because our data format has changed so much. I'll look at trying to manually add a comparison with the baseline values.

  • 5. What sessions do we categorize as being "edit attempts"? Are they any sessions that reach init?

Any that reach ready. This corrects for the issue in T212253 and also means that our number of attempts is the denominator of the edit completion rate.
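
For illustration, a minimal pandas sketch of that definition, using a hypothetical session-level frame (the "reached_ready" and "saved" column names are made up for this example, not actual EditAttemptStep schema fields):

```python
import pandas as pd

# Hypothetical session-level data: one row per editing session.
sessions = pd.DataFrame({
    "interface":     ["visualeditor", "visualeditor", "wikitext", "wikitext"],
    "reached_ready": [True, True, True, False],    # editor finished loading
    "saved":         [True, False, False, False],  # session reached save success
})

# Attempts are sessions that reach "ready"; the edit completion rate is
# the share of those attempts that reach save success.
attempts = sessions[sessions["reached_ready"]]
completion_rate = attempts.groupby("interface")["saved"].mean()
print(completion_rate)  # visualeditor: 0.5, wikitext: 0.0
```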

Thanks, Neil. Comments below.

In T223339#5274612, @Neil_P._Quinn_WMF wrote:

Updated version:

A couple questions:

  • 1. For both "Completion rate" metrics – Out[88] and Out[92] – are we able to see the overall completion rate for each interface and platform (i.e. overall edit completion rate for desktop VE, overall edit completion rate for desktop wikitext, overall edit completion rate for mobile VE, overall edit completion rate for mobile wikitext)?

Yes, I've added totals to the edges of the tables. Interested to hear if that is comprehensible or if there are better ways to present it.

This looks good. Assuming I am understanding the table correctly [1], it is comprehensible to me: the bottom-most row of Out[277] represents the overall edit completion rate by platform and editing interface, across experience levels. The right-most column of Out[277] represents the overall edit completion rate by experience level, across platforms and editing interfaces.

  • 2. For Out[92], "Desktop completion rate by interface, experience, and VE default status," are we able to see how wikitext and VE edit completion rates compare for wikis where VE is and is not the default?

Not sure what you're asking—in the first draft, I showed the completion rate for desktop attempts in three buckets: visual editor on VE-default wikis, visual editor on non-VE default wikis, and wikitext editor on all wikis. If you're asking for totals, I've added those now. If you're asking to split the wikitext editor into two buckets based on VE default status, let me know and I can do that.

Ah, sorry this wasn't clearer. What you are describing here is what I'm asking for: "If you're asking to split the wikitext editor into two buckets based on VE default status..."

  • 3. Is it possible for the user_experience level bucketing of these latest metrics to match the user_experience level bucketing in the VisualEditor on mobile report so we can compare the two?

Yeah, good point. I used the new buckets here because they're an emerging standard for assessing user experience on our team, but comparability with the previous numbers is more important. Fixed in this version.

Got it. That context is helpful. And to your second point, yes: "comparability with the previous numbers is more important" captures our priority with this analysis.

See the addition of questions "6." + "7." (below), which relate to this "comparability" point.

  • 4. Related to "3." above, what considerations should we have in mind when comparing these latest numbers with those from May 2018? For example: does the fact that we're comparing two different months of data (16 May - 17 June, 2019 vs. May 2018) impact how we compare them? Are we defining edit completion rate now the same way we did last year?

Yes, edit completion uses the same definition in both reports: the proportion of attempts (anything that reaches "ready") that reach save success. Otherwise, I can't think of anything significant. There haven't been any significant changes in which editors are the default. There might be a bit of seasonal variation, but overall I doubt it. Any significant changes in the edit completion rate are probably "real".

Noted. Thank you for explaining this.

See "7." below RE "There haven't been any significant changes in which editors are the default.

Next week, I'm going to try to do a graph of edit completion rate over time, but it won't go back to the time period of the baseline report because our data format has changed so much. I'll look at trying to manually add a comparison with the baseline values.

Awesome. Seeing edit completion rate, by platform (mobile, desktop) and editing interface (wikitext, VE), graphed over time would be helpful. I created a new task for this: T226573. Though if you'd rather keep this work contained within this task, let me know and I'll close T226573.

  • 5. What sessions do we categorize as being "edit attempts"? Are they any sessions that reach init?

Any that reach ready. This corrects for the issue in T212253 and also means that our number of attempts is the denominator of the edit completion rate.

Noted.

  • 6. For the sake of comparability, are you able to consolidate "wikitext-2017" and "wikitext" into one metric, across all three tables: Out[276], Out[277] and Out[330]?
  • 7. For the sake of comparability, are you able to exclude edit attempts initiated by clicks on red links from all desktop VE edit attempt and completion rate metrics?* I ask this question with the following in mind:
    • a. On ~19-May, we made a change that made it so clicks on red links open the user's preferred editor (read: the last editor they used). See: T223793
      • Before this patch was deployed, all red link clicks opened the wikitext editor.
    • b. We think this change resulted in a 2x increase in VE editor loads. See: T225684
      • These "loads" (read: edit attempts) would not have been included in last year's VE edit completion rate numbers.
    • c. It is reasonable to assume contributors clicking on red links are not intending to edit
    • d. I understand "a." through "c." to mean:
      • Our desktop VE edit completion rate now incorporates edit attempts that are not actually edit attempts and
      • Our 2018 and 2019 desktop VE edit completion rates are less comparable than we first thought

*Please tell me if you see issues with this logic.

Related: As we think about creating a definition for edit completion rate, do you think it makes sense to exclude all desktop editor loads (for both wikitext and VE) initiated by clicks on red links? See: T223498#5247438

@Neil_P._Quinn_WMF Just to make sure we're clear on Q2:

What we want is this table -- https://www.mediawiki.org/wiki/VisualEditor_on_mobile_report#Desktop -- split into two.

First table, for VE-default wikis, 2 columns of numbers: Visual Editor / Wikitext. 7 rows of experience level: Overall / IP editor / 0 edits / 1-9 edits / 10-99 edits / 100-999 edits / 1000+ edits.

Second table, for wikitext-default wikis, 2 columns of numbers: Visual Editor / Wikitext. 7 rows of experience level: Overall / IP editor / 0 edits / 1-9 edits / 10-99 edits / 100-999 edits / 1000+ edits.

We need VE and Wikitext numbers for each case (not combined) because we're comparing whether people do better or worse using Visual Editor compared to wikitext on those wikis. If the default is different, then we've got a different population of people using VE and wikitext -- more newbies will be using the default, so we'd expect higher chance of abandonment for the default on each wiki.

Also, please don't include 2017-wikitext here -- I don't know how many people are using it, and we didn't split that out originally, so it'll make it harder to compare the new numbers with last year's.
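
For illustration, one way the two tables could be assembled in pandas. This is a sketch only; invented column names like "ve_default" and "experience" stand in for whatever the real session data provides:

```python
import pandas as pd

# Hypothetical session-level data; all column names are illustrative.
sessions = pd.DataFrame({
    "ve_default":    [True, True, True, False, False, False],
    "interface":     ["visualeditor", "wikitext", "visualeditor",
                      "wikitext", "visualeditor", "wikitext"],
    "experience":    ["IP editor", "0 edits", "1-9 edits",
                      "10-99 edits", "100-999 edits", "1000+ edits"],
    "reached_ready": [True] * 6,
    "saved":         [True, False, True, True, False, True],
})

def completion_table(df: pd.DataFrame) -> pd.DataFrame:
    """Completion rate with experience buckets as rows and interfaces as columns."""
    attempts = df[df["reached_ready"]]
    table = attempts.pivot_table(index="experience", columns="interface",
                                 values="saved", aggfunc="mean")
    # Prepend the "Overall" row, aggregated across all experience levels.
    overall = attempts.groupby("interface")["saved"].mean().rename("Overall")
    return pd.concat([overall.to_frame().T, table])

# One table per rollout status, as requested above.
ve_default_table = completion_table(sessions[sessions["ve_default"]])
wikitext_default_table = completion_table(sessions[~sessions["ve_default"]])
```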

Quick update: I was out sick Monday through Wednesday last week, and since then I have been working primarily on verifying that the mobile VE experiment has been deployed correctly (T221197), since @ppelberg identified that as the highest priority. I will pick this up after that is taken care of, and I expect to have it done by the end of the week.

This looks good. Assuming I am understanding the table correctly [1], it is comprehensible to me: the bottom-most row of Out[277] represents the overall edit completion rate by platform and editing interface, across experience levels. The right-most column of Out[277] represents the overall edit completion rate by experience level, across platforms and editing interfaces.

Yes, exactly!

Seeing edit completion rate, by platform (mobile, desktop) and editing interface (wikitext, VE), graphed over time would be helpful. I created a new task for this: T226573. Though if you'd rather keep this work contained within this task, let me know and I'll close T226573.

No, it's helpful to have it as a separate task to help avoid scope creep. Thanks!

  • 6. For the sake of comparability, are you able to consolidate "wikitext-2017" and "wikitext" into one metric, across all three tables: Out[276], Out[277] and Out[330]?

Actually, in that report, I excluded any sessions that involved the 2017 wikitext editor entirely, because a lot of them involved switching editors. In this case, I came up with a better way to handle sessions involving switches (assigning them to the last editor used in the session), so I decided to include them as useful data. Since sessions involving switches are an edge case for all the other editors, the wikitext column here is still comparable with the one in the previous report. I can easily remove it, though, and I have no problem doing that since it's not a focus of this analysis.
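
As a sketch of that switch-handling rule (attributing a session to the last editor used), with hypothetical event and column names:

```python
import pandas as pd

# Hypothetical per-event data: one row per editor load within a session.
events = pd.DataFrame({
    "session_id": ["a", "a", "b", "c"],
    "timestamp":  [1, 2, 1, 1],
    "editor":     ["wikitext-2017", "visualeditor", "wikitext", "visualeditor"],
})

# Attribute each session to the last editor used in it.
last_editor = (events.sort_values("timestamp")
                     .groupby("session_id")["editor"]
                     .last())
print(last_editor)  # a -> visualeditor, b -> wikitext, c -> visualeditor
```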

  • 7. For the sake of comparability, are you able to exclude edit attempts initiated by clicks on red links from all desktop VE edit attempt and completion rate metrics?* I ask this question with the following in mind:
    • a. On ~19-May, we made a change that made it so clicks on red links open the user's preferred editor (read: the last editor they used). See: T223793
      • Before this patch was deployed, all red link clicks opened the wikitext editor.
    • b. We think this change resulted in a 2x increase in VE editor loads. See: T225684
      • These "loads" (read: edit attempts) would not have been included in last year's VE edit completion rate numbers.
    • c. It is reasonable to assume contributors clicking on red links are not intending to edit
    • d. I understand "a." through "c." to mean:
      • Our desktop VE edit completion rate now incorporates edit attempts that are not actually edit attempts and
      • Our 2018 and 2019 desktop VE edit completion rates are less comparable than we first thought

*Please tell me if you see issues with this logic.

Related: As we think about creating a definition for edit completion rate, do you think it makes sense to exclude all desktop editor loads (for both wikitext and VE) initiated by clicks on red links? See: T223498#5247438

I can definitely see how this would impact the comparability of our numbers, but I don't really think excluding red link clicks from the definition is worth it. To the extent there has been a shift in desktop VE's edit completion rates, that just reflects a real shift caused by making VE available in more cases. There will definitely be a similar shift in mobile VE's overall edit completion where it's turned on by default, because inexperienced people are more likely to use it. I think the best ways of dealing with the fact that the underlying population of users is shifting because of various rollout decisions we're making are:

  • doing A/B tests to give us (snapshots of) the true comparative usability of editors, which edit completion is an imperfect proxy for
  • looking at edit completion segmented by user experience
  • doing more to exclude users who are just exploring and have no real intention of editing from the denominator of the edit completion rate. This would mean changing the definition, for example by changing the denominator from "sessions reaching ready" to "sessions where the user makes at least one change", which in my opinion is better than an ad-hoc exclusion tailored to this one change (see the sketch at the end of this comment).

I could also re-run the analysis to see how it differs without red link clicks (assuming that we are logging the data to identify red link clicks—the schema stipulates that, but there might be an instrumentation issue), but it doesn't seem critical since the issue is on desktop rather than mobile.
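
To make the two candidate denominators concrete, a hedged sketch (the "made_change" flag is invented for this example; it is not a real schema field):

```python
import pandas as pd

# Hypothetical per-session flags; column names are illustrative only.
sessions = pd.DataFrame({
    "reached_ready": [True, True, True, True],
    "made_change":   [False, True, True, False],
    "saved":         [False, True, False, False],
})

# Current definition: save successes / sessions reaching ready.
ecr_ready = sessions.loc[sessions["reached_ready"], "saved"].mean()

# Proposed alternative: save successes / sessions with at least one change,
# which drops users who open the editor with no real intention of editing.
ecr_change = sessions.loc[sessions["made_change"], "saved"].mean()

print(f"ready-based: {ecr_ready:.0%}, change-based: {ecr_change:.0%}")
# ready-based: 25%, change-based: 50%
```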

@Neil_P._Quinn_WMF Just to make sure we're clear on Q2:

What we want is this table -- https://www.mediawiki.org/wiki/VisualEditor_on_mobile_report#Desktop -- split into two.

First table, for VE-default wikis, 2 columns of numbers: Visual Editor / Wikitext. 7 rows of experience level: Overall / IP editor / 0 edits / 1-9 edits / 10-99 edits / 100-999 edits / 1000+ edits.

Second table, for wikitext-default wikis, 2 columns of numbers: Visual Editor / Wikitext. 7 rows of experience level: Overall / IP editor / 0 edits / 1-9 edits / 10-99 edits / 100-999 edits / 1000+ edits.

We need VE and Wikitext numbers for each case (not combined) because we're comparing whether people do better or worse using Visual Editor compared to wikitext on those wikis. If the default is different, then we've got a different population of people using VE and wikitext -- more newbies will be using the default, so we'd expect higher chance of abandonment for the default on each wiki.

Thanks for clarifying that. I'm happy to do that (it's a very simple tweak), but for the record, I don't think it will change the story much.

When I did the initial analysis, I was very aware of the difference in VE's rollout status—for example, see my original notebook from 2018, where I wrote: "However, on desktop this isn't the fairest of comparisons, since the visual editor isn't available on many wikis and in many namespaces. How does the picture look when we limit it to VE-default wikis? To get the list of VE-default wikis, we start with the "visualeditor-nondefault" database list. On top of that, we need to add the wikis where visual editor is demoted using settings in InitialiseSettings.php: wmgVisualEditorSingleEditTabSecondaryEditor (enwiki and frwiktionary) and wmgVisualEditorDisableForAnons (enwiki and eswiki)."

But I didn't focus on that because looking at registered editors only (where the rollout status is quite a bit more straightforward) told a very consistent story (desktop VE having a higher completion rate than desktop wikitext in all experience brackets, but mobile VE having a lower rate in all experience brackets) and because our focus is on mobile.

Also, please don't include 2017-wikitext here -- I don't know how many people are using it, and we didn't split that out originally, so it'll make it harder to compare the new numbers with last year's.

See my reply to @ppelberg above.

Update: I've been delayed on this because T221197 has been taking longer than expected—and I just found another concerning data issue there that I haven't been able to explain, so I don't know when exactly I will get back to this task.

kzimmerman lowered the priority of this task from High to Medium. Aug 20 2019, 6:31 PM

Remaining work for this has been deprioritized below other work.

kzimmerman added subscribers: MNeisler, nshahquinn-wmf.

Reassigning to @MNeisler since she's now supporting Editing

There are a number of reasons why an edit init might not reach edit ready. However I cannot think of ANY scenario where it would be appropriate to penalize the quick-loading wikitext editor with an edit fail while discarding the data for a VE edit fail. Calculating edit completion rate using ready misleadingly inflates the success percentages for VE.

I am likely to cite this data at some point during community discussions. Some community members might not be inclined to grant the Foundation an assumption of good faith on this. Can someone please either fix the data to use init, or provide a really really good explanation why we shouldn't log a VE edit fail when a particularly slow VE load causes the user to quit on us?

There are a number of reasons why an edit init might not reach edit ready. However I cannot think of ANY scenario where it would be appropriate to penalize the quick-loading wikitext editor with an edit fail while discarding the data for a VE edit fail. Calculating edit completion rate using ready misleadingly inflates the success percentages for VE.

I am likely to cite this data at some point during community discussions. Some community members might not be inclined to grant the Foundation an assumption of good faith on this. Can someone please either fix the data to use init, or provide a really really good explanation why we shouldn't log a VE edit fail when a particularly slow VE load causes the user to quit on us?

The reason the Editing team and I originally defined the edit completion rate as the ratio of completed edits to sessions reaching ready was that it would allow us to focus on the utility of the editing interface, while the ratio of sessions reaching ready to sessions reaching init (which I called the ready rate) would provide a separate metric more focused on load performance. Both, of course, are important to the overall utility of the editor.

In mid-2018, my analysis of the ready rate showed that it didn't differ much between the two mobile editors (95.4% for VE and 97% for wikitext), which led us to focus on the edit completion rate rather than the ready rate in future work. However, in mid-2019, I discovered that the ready rate for VE was actually much lower; a bug in its instrumentation had been fixed in the meantime (T221197#5321675). That made us refocus on VE's loading performance, although Ed Sanders did some analysis suggesting that people were aborting at the same rate as they would have in the wikitext editor; the delay in getting to ready simply meant those aborts happened before ready rather than after it, still without the user making changes (T227930#5338910).

Still, when I analyzed data from our initial A/B test of the two mobile editors, I looked at the full ratio of completed sessions to initiated sessions, just as you are asking for it to be defined, and found that wikitext had a somewhat higher ratio (T229426#5468481). I'm not working on rerunning the A/B test (T235101); we are rerunning it because we found an instrumentation issue in the initial test that might have artificially excluded VE sessions that took a long time to load (T232237). However, I believe @MNeisler and @ppelberg, who are working on it, are planning to look at that full ratio as the key metric for the experiment.
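
For clarity, a toy sketch of how the three ratios discussed in this thread (the ready rate, the edit completion rate, and the full init-to-save ratio) relate; the flags are hypothetical:

```python
import pandas as pd

# Hypothetical sessions, all of which reached init.
sessions = pd.DataFrame({
    "reached_ready": [True, True, True, False, False],
    "saved":         [True, True, False, False, False],
})

ready_rate = sessions["reached_ready"].mean()                              # ready / init
completion_rate = sessions.loc[sessions["reached_ready"], "saved"].mean()  # save / ready
full_ratio = sessions["saved"].mean()                                      # save / init

# The full init-to-save ratio factors into the other two.
assert abs(full_ratio - ready_rate * completion_rate) < 1e-9
print(ready_rate, completion_rate, full_ratio)  # 0.6, ~0.667, 0.4
```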

We aren't going to update the visual editor on mobile report since it was a snapshot of our thinking at a specific point in time, but I'm confident that when/if the metrics themselves are re-run (which is what this ticket is about, but isn't a priority), whoever does them will not leave out the full init-to-completion ratio.

We're going to resolve this task, per the discussion @MNeisler and I had today.

Reason being: if/when we decide to prioritize work on reevaluating people's experiences with the mobile visual editor, we will create a new task for that work.