VE mobile default: A/B test post-deployment data checks
Closed, ResolvedPublic

Description

This task covers making sure we are logging data in such a way that we will be able to analyze the impact of the VE-as-default A/B test.

Outstanding issues

Test metrics

Measurement | Status
Edit completion rate | ✅ Confirmed - data is being logged as we expect to measure this
Total number of completed edits | ✅ Confirmed - data is being logged as we expect to measure this
Time to save an edit | ⏳ Not confirmed - we will know after we re-run the revisionID query next week
Edit size | ⏳ Not confirmed - we will know after we re-run the revisionID query next week
Editor retention | ✅ Confirmed - data is being logged as we expect to measure this
Edit quality | ⏳ Not confirmed - we will know after we re-run the revisionID query
Editing interface switching | ✅ Confirmed - data is being logged as we expect to measure this

Checks

See my Jupyter notebook.

Event Timeline


In order to look at the size of edits and whether they were reverted, saveSuccess events need to have the new revision ID in the revision_id field. Since an attempt's events before saveSuccess should have the parent revision ID in that field, a quick way to check this is to make sure that all saveSuccess events have a larger revision ID than all previous events in the session.

with saves as (
    select
        event.editing_session_id as attempt_id,
        event.revision_id as revision_id,
        event.platform as platform,
        event.editor_interface as editor
    from event.editattemptstep
    where
        event.action = "saveSuccess" and
        year = 2019 and month = 6 and
        -- Remove Flow and other non-standard edits
        event.integration = "page"
),
pre_saves as (
    select
        event.editing_session_id as attempt_id,
        max(event.revision_id) as max_revision_id
    from event.editattemptstep
    where
        event.action != "saveSuccess" and
        year = 2019 and month = 6 and
        -- Remove Flow and other non-standard edits
        event.integration = "page"
    group by event.editing_session_id
)
select
    platform,
    editor,
    concat(round((
        sum(cast(saves.revision_id > pre_saves.max_revision_id as int)) * 100 / count(*)
    ), 1), "%") as save_has_greater_revision_id
from saves 
left join pre_saves
on saves.attempt_id = pre_saves.attempt_id
group by
    platform,
    editor

And it looks like we don't always log the new revision ID on desktop wikitext and never log it on other platforms.

platform | editor | saveSuccess has larger revision ID
desktop | visualeditor | 0.0%
desktop | wikitext | 96.4%
desktop | wikitext-2017 | 0.0%
phone | visualeditor | 0.0%
phone | wikitext | 0.0%

Issue filed as T226847.

See also T221191#5297909, where I confirmed @DLynch's crucial finding that our instrumentation of switching between interfaces was totally broken. I will want to repeat that query once his patches have landed.

nshahquinn-wmf moved this task from Triage to Doing on the Product-Analytics board.

Thanks to @DLynch's work in T226847, the revisionID issues on mobile and desktop should now be resolved. This means we will be able to measure edit revert rates and edit sizes in the test and control groups.

One big initial finding:

At some point on 25 June, we entirely stopped logging events for registered users on mobile (across all wikis, not just ones in the A/B test). Desktop users seem unaffected, and unregistered mobile users seem to have been partially affected.

Events by platform and registration status:

select
    date_format(dt, "yyyy-MM-dd") as date,
    sum(cast(event.platform = "desktop" and event.user_id != 0 as int)) as registered_desktop,
    sum(cast(event.platform = "desktop" and event.user_id = 0 as int)) as anonymous_desktop,
    sum(cast(event.platform = "phone" and event.user_id != 0 as int)) as registered_phone,
    sum(cast(event.platform = "phone" and event.user_id = 0 as int)) as anonymous_phone
from event.editattemptstep
where
    year = 2019 and (
        (month = 6 and day > 23) or
        (month = 7)
    )
group by date_format(dt, "yyyy-MM-dd")

date | registered_desktop | anonymous_desktop | registered_phone | anonymous_phone
2019-06-24 | 108487 | 234554 | 33555 | 299146
2019-06-25 | 111791 | 222482 | 26893 | 227859
2019-06-26 | 107107 | 232324 | 55 | 1410
2019-06-27 | 101278 | 212925 | 11 | 480
2019-06-28 | 100611 | 211608 | 0 | 7600
2019-06-29 | 96483 | 179354 | 0 | 62548
2019-06-30 | 104681 | 189258 | 0 | 91495
2019-07-01 | 114615 | 227053 | 0 | 97624
2019-07-02 | 109774 | 227518 | 0 | 106471

Another, less urgent issue I noticed is that we're logging all mobile VE events, but only 1/16 mobile wikitext events. This means we're throwing away a lot of data and reducing our statistical power. Luckily, @DLynch thinks it will be easy to turn on that oversampling.
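
In the meantime, one way to factor out the sampling is to count only non-oversampled events and scale up. This is a sketch that assumes a 1-in-16 base sampling rate and that the extra VE events are flagged in event.is_oversample:

-- Sketch: estimate total inits per editor from the 1/16 base sample.
-- Assumes event.is_oversample marks the extra VE events, so
-- non-oversampled events are a uniform 1-in-16 sample for both editors.
select
    event.editor_interface as editor,
    count(*) * 16 as estimated_inits
from event.editattemptstep
where
    event.action = "init" and
    event.platform = "phone" and
    not event.is_oversample and
    year = 2019 and month = 7
group by event.editor_interface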

Change 520645 had a related patch set uploaded (by DLynch; owner: DLynch):
[mediawiki/extensions/MobileFrontend@master] schemaEditAttemptStep: only set bucket and anonymous-user-token on defaults if non-null

https://gerrit.wikimedia.org/r/520645

To copy from chat:

Post-mortem: The original patch didn't account for EventLogging's validation behavior. It assumed that key: null/undefined was equivalent to key not being set in the data in the first place for non-required schema items. Unfortunately, this was not the case, and so the event would fail validation for people who weren't in the A/B test.
Which failed to be caught in testing in two ways: one, it only happened if you were the very opposite of the group the patch was meant to affect, and two, you have to jump through quite a few hoops to see EventLogging validation messages.

Change 520645 merged by jenkins-bot:
[mediawiki/extensions/MobileFrontend@master] schemaEditAttemptStep: only set bucket and anonymous-user-token on defaults if non-null

https://gerrit.wikimedia.org/r/520645

Change 520649 had a related patch set uploaded (by Jforrester; owner: DLynch):
[mediawiki/extensions/MobileFrontend@wmf/1.34.0-wmf.11] schemaEditAttemptStep: only set bucket and anonymous-user-token on defaults if non-null

https://gerrit.wikimedia.org/r/520649

Change 520649 merged by jenkins-bot:
[mediawiki/extensions/MobileFrontend@wmf/1.34.0-wmf.11] schemaEditAttemptStep: only set bucket and anonymous-user-token on defaults if non-null

https://gerrit.wikimedia.org/r/520649

Mentioned in SAL (#wikimedia-operations) [2019-07-03T23:00:00Z] <jforrester@deploy1001> Synchronized php-1.34.0-wmf.11/extensions/MobileFrontend/resources/dist/: T221197 schemaEditAttemptStep: only set bucket and anonymous-user-token on defaults if non-null (duration: 00m 51s)

To copy from chat:

Post-mortem: The original patch didn't account for EventLogging's validation behavior. It assumed that key: null/undefined was equivalent to key not being set in the data in the first place for non-required schema items. Unfortunately, this was not the case, and so the event would fail validation for people who weren't in the A/B test.
Which failed to be caught in testing in two ways: one, it only happened if you were the very opposite of the group the patch was meant to affect, and two, you have to jump through quite a few hoops to see EventLogging validation messages.

Wow, this is very counter-intuitive behavior. To amplify, our bucket and anonymous_user_token fields are defined as non-required string fields in the schema. But apparently, sending null values for those fields fails validation (which means the entire event is thrown away) since nulls are not strings. This happens even though not sending a value at all is perfectly fine and results in a null value in the database (I actually could've easily figured this out by checking the eventerror data stream, but didn't think of it).
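
For future reference, a check of that stream might look something like this (a sketch: the field names event.schema and event.message are assumptions that would need checking against the actual eventerror table):

-- Sketch: recent validation failures for EditAttemptStep, by message.
-- Field names (event.schema, event.message) are assumptions to verify.
select
    event.message as error_message,
    count(*) as errors
from event.eventerror
where
    event.schema = "EditAttemptStep" and
    year = 2019 and month = 7
group by event.message
order by errors desc
limit 20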

I can't find this explained on any of the EventLogging documentation pages, although even if it were there, I wouldn't have expected anybody to read it, since those pages are generally so badly out of date.

At any rate, thanks @DLynch for the quick fix and post-mortem!

nshahquinn-wmf renamed this task from VE mobile default: post-deployment QA A/B test to VE mobile default: post-deployment data checks. Jul 5 2019, 11:09 AM
nshahquinn-wmf renamed this task from VE mobile default: post-deployment data checks to VE mobile default: A/B test post-deployment data checks.

Okay, I've now done all the initial checks I'm planning to do (although I will still re-run some of these checks after the fixes land). If you have any ideas for other checks I should do, please let me know!

Awesome, Neil. A few quick questions/confirmations:

Meta
Would I be correct to assume the outcome of our checks suggests the following are true?

Measurement | Status
Edit completion rate | ✅ Confirmed - data is being logged as we expect to measure this
Total number of completed edits | ✅ Confirmed - data is being logged as we expect to measure this
Time to save an edit | ⏳ Not confirmed - we will know after we re-run the revisionID query next week
Edit size | ⏳ Not confirmed - we will know after we re-run the revisionID query next week
Editor retention | ✅ Confirmed - data is being logged as we expect to measure this
Edit quality | ⏳ Not confirmed - we will know after we re-run the revisionID query next week
Editing interface switching | ⏳ Not confirmed - we will know after we re-run the revisionID query next week

inits vs. ready
Would I be correct to think a check of ready events is unnecessary because we can assume that if inits are being logged properly, ready events will be too? I ask this to be doubly sure we have everything we need to calculate the edit completion rate among the two test groups.

Editor Switching
Is the below the right way to read the table included in the ticket?

  • On 2019-07-05, 4.5% of mobile VE edit sessions that reached ready involved a contributor switching to the mobile wikitext editor.

Awesome, Neil. A few quick questions/confirmations:

Meta
Would I be correct to assume the outcome of our checks suggests the following are true?

Yes, except that:

Editing interface switching | ⏳ Not confirmed - we will know after we re-run the revisionID query next week

should actually be confirmed, since the switches query shows that we're logging this data for the mobile editors. We don't need revision ID, and although I do want to re-run the query, that's just to check that the desktop data (which we don't need for this experiment) starts appearing as expected.

inits vs. ready
Would I be correct to think a check of ready events is unnecessary because we can assume that if inits are being logged properly, ready events will be too? I ask this to be doubly sure we have everything we need to calculate the edit completion rate among the two test groups.

Yeah, my semi-conscious thinking was that it wasn't necessary, but I do agree that it's good to be doubly sure and so I've added a check of readies (and good thing too—I found an issue which I'll describe in my next comment).

Editor Switching
Is the below the right way to read the table included in the ticket?

  • On 2019-07-05, 4.5% of mobile VE edit sessions that reached ready involved a contributor switching to the mobile wikitext editor.

Yes, correct!

@DLynch, while doing an extra check of ready counts as suggested by @ppelberg, I noticed that the default-visualeditor bucket is only logging 60% as many ready events as the default-source bucket among our test population (when factoring out oversampling). That's big and concerning, whether it's due to real user behavior or instrumentation issues.

This prompted me to look back at the overall counts of init and saveSuccess events in the two buckets and, although the difference is smaller, the visual editor group has fewer of each too. It has only 90% as many inits and 75% as many saves.
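
A query along these lines would surface those counts (a sketch: it assumes the buckets appear in event.bucket as "default-source" and "default-visual", and that event.is_oversample marks the oversampled events):

-- Sketch: raw event counts per bucket, excluding oversampled sessions,
-- so the two buckets are comparable.
select
    event.bucket as bucket,
    sum(cast(event.action = "init" as int)) as inits,
    sum(cast(event.action = "ready" as int)) as readies,
    sum(cast(event.action = "saveSuccess" as int)) as saves
from event.editattemptstep
where
    year = 2019 and month = 7 and
    event.platform = "phone" and
    event.bucket in ("default-source", "default-visual") and
    not event.is_oversample
group by event.bucket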

I don't know what to make of it. This seems too large a difference to be due to user behavior, particularly when our data on edit completion rates by experience level suggest nothing of the sort. On the other hand, the fact that it's directionally consistent but not identical across different event types suggests it's not an instrumentation issue.

I'll keep investigating tomorrow, but in the meantime, see if you have any ideas 🙂

I also re-ran the "events by platform and registration status" check to get several more days of data and make a graph. That confirms that the issue with registered mobile editing is fixed.

In T221197#5315023, @Neil_P._Quinn_WMF wrote:

Meta
Would I be correct to assume the outcome of our checks suggests the following are true...

Yes, except that:

Editing interface switching | ⏳ Not confirmed - we will know after we re-run the revisionID query next week

should actually be confirmed, since the switches query shows that we're logging this data for the mobile editors...

Ok, great. Your response is leading me to realize I mistakenly asked about re-running the query once the revisionID patch landed when I'd meant to ask about the switching patch: T221191#5297909. Either way, it sounds like we're all good here. ✅

inits vs. ready
Would I be correct to think a check of ready events...

Yeah, my semi-conscious thinking was that it wasn't necessary, but I do agree that it's good to be doubly sure and so I've added a check of readies (and good thing too—I found an issue which I'll describe in my next comment).

Good catch. I'm going to respond to T221197#5315064 separately.

Editor Switching
Is the below the right way to read the table included in the ticket?

Yes, correct!

Excellent.

In T221197#5315083, @Neil_P._Quinn_WMF wrote:

I also re-ran the "events by platform and registration status" check to get several more days of data and make a graph. That confirms that the issue with registered mobile editing is fixed.

Ok, great. Reference: T221197#5305019

In summary, this looks like our current state:

Measurement | Status
Edit completion rate | ⏳ Not confirmed - see: T221197#5315064
Total number of completed edits | ✅ Confirmed - data is being logged as we expect to measure this
Time to save an edit | ⏳ Not confirmed - we will know after we re-run the revisionID query this week
Edit size | ⏳ Not confirmed - we will know after we re-run the revisionID query this week
Editor retention | ✅ Confirmed - data is being logged as we expect to measure this
Edit quality | ⏳ Not confirmed - we will know after we re-run the revisionID query this week
Editing interface switching | ✅ Confirmed - data is being logged as we expect to measure this

In T221197#5315064, @Neil_P._Quinn_WMF wrote:

...I noticed that the default-visualeditor bucket is only logging 60% as many ready events as the default-source bucket among our test population (when factoring out oversampling). That's big and concerning, whether it's due to real user behavior or instrumentation issues.

Does a ready event fire before or after a contributor taps the "Start editing" button at the bottom of the "Welcome to Wikipedia" overlay [1]? I ask wondering: to what extent could this additional step be the reason why we are seeing fewer default VE ready events than default source ready events? I appreciate that you go on to say, "...This seems too large a difference to be due to user behavior...", but I just want to be sure.


  1. "Welcome to Wikipedia" overlay: image.png (screenshot attachment, 386 KB)

@DLynch, while doing an extra check of ready counts as suggested by @ppelberg, I noticed that the default-visualeditor bucket is only logging 60% as many ready events as the default-source bucket among our test population (when factoring out oversampling). That's big and concerning, whether it's due to real user behavior or instrumentation issues.

The notable difference between source and visual is that visual has a longer loading time between init and ready. Users who're quickly abandoning their idea to edit would be reflected in that more easily for visual.

Does a ready event fire before or after a contributor taps the "Start editing" button at the bottom of the "Welcome to Wikipedia" overlay [1]?

Before. Those overlays are covering up the loaded editor, not delaying the load.

Though as a confounding factor for overall edit-success rates, visual does get two of those initial dialogs when source only gets one...

The notable difference between source and visual is that visual has a longer loading time between init and ready. Users who're quickly abandoning their idea to edit would be reflected in that more easily for visual.

What would be a way for us to test whether this is the cause of what's happening? My first thought: do we see a similar proportion of contributors drop off between init and ready events on desktop on wikis where VE is the default?

...we looked at drop-off along the edit funnel in last year's VE on mobile report, though I'm not currently seeing these numbers explicitly for wikis where VE is the desktop default.

Before. Those overlays are covering up the loaded editor, not delaying the load.

Understood.

Though as a confounding factor for overall edit-success rates, visual does get two of those initial dialogs when source only gets one...

Noted. Following up on this. First in chat where we had an active thread about the "Welcome" overlay.


Edit: @kzimmerman raised a good question in relation to @DLynch's mention of load times and their potential impact on the high drop-off rate between init and ready in mobile VE:

Does the Editing Team have data on mobile VE load performance on lower-powered devices and slower internet connections?

👆Can you think of anywhere else – besides the tickets below – where this information lives?

In T221197#5315064, @Neil_P._Quinn_WMF wrote:

@DLynch, while doing an extra check of ready counts as suggested by @ppelberg, I noticed that the default-visualeditor bucket is only logging 60% as many ready events as the default-source bucket among our test population (when factoring out oversampling). That's big and concerning, whether it's due to real user behavior or instrumentation issues.

I worry this might be caused by T213214: Visual Editor gets stuck opening article (net::ERR_SPDY_PROTOCOL_ERROR 200/Loading failed for the <script> with source ...). I often encounter the issue when trying to edit on a wiki I visit for the first time. I hadn't connected this before, but since new users are typically visiting for the first time, it would presumably affect them too.

Change 521575 had a related patch set uploaded (by Bartosz Dziewoński; owner: Bartosz Dziewoński):
[mediawiki/extensions/VisualEditor@master] Break up our massive load.php request to work around network issues

https://gerrit.wikimedia.org/r/521575

After this is deployed, we should check the effects (separately for the A/B test users, for mobile in general, and for desktop in general):

  • Does it increase the number of VE sessions that get to the 'ready' step?
  • Does it increase the VE load times, and if so, by how much?

If it doesn't improve the problem with missing 'ready' events, then we have to look for another solution (and we should revert it).

If it does improve it, then we should look at the load times. If they are not really affected, we should just keep it this way, and somehow implement a load.php response size limit more permanently in our configuration for everything else. If they are made worse, then we should ask SRE to look into T213214 again and find a better solution for it.

Okay, so, here are my latest findings. I've stopped trying to keep the description updated; instead, looking at my Jupyter notebook will be the best way to see my full process.

In the default-visual bucket, a lot fewer sessions reach ready (let's call the proportion that do the "ready rate"). The daily ready rates in each bucket (excluding oversampled sessions) are as follows:

date | default-source | default-visual
2019-06-28 | 98.8% | 77.0%
2019-06-29 | 99.6% | 68.3%
2019-06-30 | 99.7% | 64.2%
2019-07-01 | 99.5% | 65.4%
2019-07-02 | 99.5% | 63.0%
2019-07-03 | 99.4% | 63.3%
2019-07-04 | 99.5% | 65.6%
2019-07-05 | 99.4% | 66.6%
2019-07-06 | 99.7% | 65.9%
2019-07-07 | 99.5% | 66.5%
2019-07-08 | 99.5% | 64.4%
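
For reference, the daily ready rate can be computed along these lines (a sketch: it assumes the buckets appear in event.bucket and that event.is_oversample flags oversampled events; sessions spanning midnight are counted on each day they log events):

-- Sketch: share of sessions logging a ready event, per bucket and day.
with sessions as (
    select
        date_format(dt, "yyyy-MM-dd") as date,
        event.bucket as bucket,
        event.editing_session_id as attempt_id,
        max(cast(event.action = "ready" as int)) as reached_ready
    from event.editattemptstep
    where
        year = 2019 and (
            (month = 6 and day >= 28) or
            (month = 7)
        ) and
        event.platform = "phone" and
        event.bucket in ("default-source", "default-visual") and
        not event.is_oversample
    group by
        date_format(dt, "yyyy-MM-dd"),
        event.bucket,
        event.editing_session_id
)
select
    date,
    bucket,
    concat(round(sum(reached_ready) * 100 / count(*), 1), "%") as ready_rate
from sessions
group by date, bucket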

This is quite different from what I found in the VisualEditor on mobile report; the ready rate for the mobile visual there was about 95%. The difference may be that the old number was based on a self-selected group of VE users, while this new number is based on a more representative group. Alternatively, it could be the result of a regression during the past year.

In any case, the reason is probably performance: the mobile visual editor takes much longer to load than the mobile wikitext editor. Here are the key percentiles of ready_time values from the A/B test population; the results across all phone sessions are pretty much identical. Times are in milliseconds and oversampled sessions are included.

percentile | visualeditor | wikitext
10th_percentile | 1136 | 218
median | 2868 | 560
90th_percentile | 7917 | 2020
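
Percentiles like these can be pulled with Hive's percentile_approx; this is a sketch that assumes the load time is recorded in milliseconds in the event.ready_timing field:

-- Sketch: approximate 10th/50th/90th percentile ready timings (ms)
-- per mobile editor. Assumes timings live in event.ready_timing.
select
    event.editor_interface as editor,
    percentile_approx(event.ready_timing, array(0.1, 0.5, 0.9)) as p10_median_p90
from event.editattemptstep
where
    event.action = "ready" and
    event.platform = "phone" and
    year = 2019 and month = 7
group by event.editor_interface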

The lower ready rate in the default-visual bucket won't affect the edit completion rate, because any sessions that fail to reach ready are not factored into it (although this issue raises the question of whether that should be the case).

However, I'm also seeing an odd phenomenon in the edit completion rate (aggregated since 4 July, to remove possible effects of the null-bucket bug). When I look at it without oversampled sessions, the default-visual bucket has an advantage:

bucket | edit_completion_rate
default-source | 3.1%
default-visual | 4.0%

But when I include oversampled sessions, the default-source bucket has a massive advantage:

bucket | edit_completion_rate
default-source | 8.0%
default-visual | 2.7%

This doesn't make sense to me. Currently, the only sessions we are oversampling are a random 15/16 of mobile VE sessions. Including those shouldn't change the rate, because the sampling should be random and because the rate should be indifferent to the total number of sessions. Moreover, even if including them did change the rate, it should only affect the visual editor completion rate (and therefore have only a tiny impact on the default-source bucket).
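
One way to pin this down would be to make the oversample flag an explicit dimension of the completion rate query. A sketch, assuming event.is_oversample is constant within an editing session:

-- Sketch: edit completion rate (saveSuccess per session reaching ready)
-- per bucket, split by the oversample flag. Assumes the flag is
-- constant within an editing session.
with sessions as (
    select
        event.bucket as bucket,
        event.is_oversample as oversampled,
        event.editing_session_id as attempt_id,
        max(cast(event.action = "ready" as int)) as reached_ready,
        max(cast(event.action = "saveSuccess" as int)) as saved
    from event.editattemptstep
    where
        year = 2019 and month = 7 and day >= 4 and
        event.platform = "phone" and
        event.bucket in ("default-source", "default-visual")
    group by
        event.bucket,
        event.is_oversample,
        event.editing_session_id
)
select
    bucket,
    oversampled,
    concat(round(sum(saved) * 100 / sum(reached_ready), 1), "%") as edit_completion_rate
from sessions
group by bucket, oversampled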

Overall, I'm confused, but I'll keep working on this tomorrow. Some of my ideas about what to do next:

  • Check whether including or excluding oversampled sessions affects the ready rate or ready timings
  • Check loading time and ready rate over time, to see if there was a performance regression in mobile VE
  • Confirm my assumptions about which sessions are being oversampled
  • See if some of these phenomena differ by user experience level

Please add any thoughts you have, and let me know if there's any line of investigation you think I should prioritize.

In T221197#5319337, @Neil_P._Quinn_WMF wrote:

In the default-visual bucket, a lot fewer sessions reach ready (let's call the proportion that do the "ready rate").

...

This is quite different from what I found in the VisualEditor on mobile report; the ready rate for the mobile visual there was about 95%. The difference may be that the old number was based on a self-selected group of VE users, while this new number is based on a more representative group. Alternatively, it could be the result of a regression during the past year.

In any case, the reason is probably performance: the mobile visual editor takes much longer to load than the mobile wikitext editor.

It turns out this decline in mobile VE's ready rate is due to a change during the past year, but it was a metrics improvement rather than a performance regression. In May 2019, Ed fixed T217825: Editing timing data incorrect for second load (which made all ready timings, not just the second one in a row, incorrectly low).

So it looks like the markedly worse performance numbers and ready rate are real—they just didn't appear that way a year ago because of instrumentation problems. Unfortunately, that throws this whole A/B test into question because we now know that mobile VE's load performance is a lot worse than mobile wikitext's, and that's having a major negative impact on the edit funnel (even if it wouldn't show up in our key metric of edit completion rate).

ready rate.png (455×870 px, 37 KB)

ready timings.png (455×875 px, 69 KB)

T220697: Editing timing data incorrect/meaningless when switching editor on mobile may also play some role here, although probably not too much, because this performance issue didn't change much when we rolled out the A/B test and added a bunch of mobile VE users who never had to switch to VE to use it.

I worry this might be caused by T213214: Visual Editor gets stuck opening article (net::ERR_SPDY_PROTOCOL_ERROR 200/Loading failed for the <script> with source ...). I often encounter the issue when trying to edit on a wiki I visit for the first time. I hadn't connected this before, but since new users are typically visiting for the first time, it would presumably affect them too.

By the way, thank you for this suggestion! It turned out not to be the reason (although it's certainly possible it's having a separate effect on our metrics), but this is exactly the kind of idea I wanted when I posted :)

I've just discovered this bug, which is probably having a fairly big impact on performance: T227897. Basically, the first edit pencil opens the editor for the whole page (it shouldn't), which is many times slower than just loading a section.

After a week of new discoveries, I'd like to make sure we are all on the same page about our answers to the following questions:

  1. What issues have not yet been resolved?
  2. What are our next steps? Who is taking these steps?
  3. What metrics can we currently measure?

Please, @Neil_P._Quinn_WMF and @DLynch, if you see something in the below that's unclear/inaccurate/unexpected, say something...

Issue #1: "ready rate" varies significantly between test buckets

Cause(s): The mobile visual editor takes 3-4x longer to reach ready than the mobile wikitext editor does.
Implications:

  • Technically, the test's core metric – edit completion rate – is unaffected by this issue.
  • There could be a difference in the "type" of contributors who end up using default-visual and default-source, considering the difference in load times between the two interfaces

Additional context: T221197#5319337
Open questions + next steps:

Action item | Owner | Status
What – if any – changes do we make to the A/B test metrics? | @ppelberg | Open
What – if any – improvements listed in T227930 should block the start of this test? | @ppelberg | T227897

Issue #2: oversampling changes edit completion rate

Cause(s): ⚠️Not certain
Implications: If our randomization is not working properly (read: it is not truly random), the accuracy/legitimacy of all of our other test metrics is put into question.
Additional context: T221197#5319337
Open questions + next steps

Action item | Owner | Ticket
More investigation | @Neil_P._Quinn_WMF | T227931 [1]

  1. T227931: Neil, I broke this out into a separate ticket assuming this would make it "easier" to think about/work on. Please merge it in with this one if this is not the case.

cc @Esanders

In T221197#5319337, @Neil_P._Quinn_WMF wrote:
date | default-source | default-visual
2019-07-08 | 99.5% | 64.4%

I've followed up on this on T227930.

TL;DR: The source editor has 99.5% conversion from init to ready because there is no abandon button so users can't stop it loading. If we look at no-change abandons instead, the numbers are comparable between the two editors.
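
For reference, a no-change abandon comparison might be built along these lines (a sketch: it assumes abort events record an event.abort_type field, with "nochange" marking aborts where nothing was changed, which should be verified against the schema):

-- Sketch: share of sessions ending in a no-change abort, per bucket.
-- The event.abort_type field and its "nochange" value are assumptions.
with sessions as (
    select
        event.bucket as bucket,
        event.editing_session_id as attempt_id,
        max(cast(event.action = "abort" and event.abort_type = "nochange" as int)) as nochange_abort
    from event.editattemptstep
    where
        year = 2019 and month = 7 and
        event.platform = "phone" and
        event.bucket in ("default-source", "default-visual")
    group by
        event.bucket,
        event.editing_session_id
)
select
    bucket,
    concat(round(sum(nochange_abort) * 100 / count(*), 1), "%") as nochange_abandon_rate
from sessions
group by bucket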

This is quite different from what I found in the VisualEditor on mobile report; the ready rate for the mobile visual there was about 95%. The difference may be that the old number was based on a self-selected group of VE users, while this new number is based on a more representative group. Alternatively, it could be the result of a regression during the past year.

I think this is because this patch, which fixes the timing of the ready event on mobile, landed between that report and this one: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/502992

nshahquinn-wmf added a subscriber: MNeisler.

I've checked all the outstanding issues and we're ready to go.

  • We are almost always logging the saved revision ID. Since 12 July, around 96% of save success events in each interface have included new revision IDs (that is, ones greater than those in previous events in the session). I've filed T231024 since it would be good to investigate it at some point, but the issue is rare enough that it won't significantly affect the analysis of this test.
  • We have not affected non-test wikis. We have not recorded any events in the default-visual or default-source bucket at non-test wikis, and the rate of mobile visual edits at non-test wikis did not increase when we deployed the bucket.
  • Oversampling is no longer affecting the edit completion rate. Since 12 July, when we started oversampling all the mobile wikitext sessions at our target wikis, the relationship between edit completion rate in the two buckets has been the same whether the oversampled sessions are included or not. Although this still suggests that our sampling has some issues, we can be confident that they won't affect the A/B test since we're logging all the relevant data rather than sampling. I will deprioritize T227931 but keep it open for possible future work.

This is quite different from what I found in the VisualEditor on mobile report; the ready rate for the mobile visual there was about 95%. The difference may be that the old number was based on a self-selected group of VE users, while this new number is based on a more representative group. Alternatively, it could be the result of a regression during the past year.

I think this is because this patch, which fixes the timing of the ready event on mobile, landed between that report and this one: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/502992

Yup, that's what I found! The change in the ready rate matches up exactly with the deployment of that patch. I'll follow up with more details at T227930.

Change 521575 merged by jenkins-bot:
[mediawiki/extensions/VisualEditor@master] Break up our massive load.php request to work around network issues

https://gerrit.wikimedia.org/r/521575