VE mobile default: A/B test post-deployment data checks
Closed, ResolvedPublic

Description

This task covers making sure we are logging data in such a way that we will be able to analyze the impact of the VE-as-default A/B test.

Outstanding issues

Test metrics

Measurement | Status
Edit completion rate | ✅ Confirmed - data is being logged as we expect to measure this
Total number of completed edits | ✅ Confirmed - data is being logged as we expect to measure this
Time to save an edit | ⏳ Not confirmed - we will know after we re-run the revisionID query next week
Edit size | ⏳ Not confirmed - we will know after we re-run the revisionID query next week
Editor retention | ✅ Confirmed - data is being logged as we expect to measure this
Edit quality | ⏳ Not confirmed - we will know after we re-run the revisionID query
Editing interface switching | ✅ Confirmed - data is being logged as we expect to measure this

Checks

See my Jupyter notebook.

Event Timeline


In order to look at the size of edits and whether they were reverted, saveSuccess events need to have the new revision ID in the revision_id field. Since an attempt's events before saveSuccess should have the parent revision ID in that field, a quick way to check this is to make sure that all saveSuccess events have a larger revision ID than all previous events in the session.

with saves as (
    select
        event.editing_session_id as attempt_id,
        event.revision_id as revision_id,
        event.platform as platform,
        event.editor_interface as editor
    from event.editattemptstep
    where
        event.action = "saveSuccess" and
        year = 2019 and month = 6 and
        -- Remove Flow and other non-standard edits
        event.integration = "page"
),
pre_saves as (
    select
        event.editing_session_id as attempt_id,
        max(event.revision_id) as max_revision_id
    from event.editattemptstep
    where
        event.action != "saveSuccess" and
        year = 2019 and month = 6 and
        -- Remove Flow and other non-standard edits
        event.integration = "page"
    group by event.editing_session_id
)
select
    platform,
    editor,
    concat(round((
        sum(cast(saves.revision_id > pre_saves.max_revision_id as int)) * 100 / count(*)
    ), 1), "%") as save_has_greater_revision_id
from saves 
left join pre_saves
on saves.attempt_id = pre_saves.attempt_id
group by
    platform,
    editor

And it looks like we don't always log the new revision ID on desktop wikitext and never log it on other platforms.

platform | editor | saveSuccess has larger revision ID
desktop | visualeditor | 0.0%
desktop | wikitext | 96.4%
desktop | wikitext-2017 | 0.0%
phone | visualeditor | 0.0%
phone | wikitext | 0.0%

Issue filed as T226847.

See also T221191#5297909, where I confirmed @DLynch's crucial finding that our instrumentation of switching between interfaces was totally broken. I will want to repeat that query once his patches have landed.

nshahquinn-wmf moved this task from Triage to Doing on the Product-Analytics board.

Thanks to @DLynch's work in T226847, the revisionID issues on mobile and desktop should now be resolved. This means we will be able to measure edit revert rates and edit sizes in the test and control groups.

One big initial finding:

At some point on 25 June, we entirely stopped logging events for registered users on mobile (across all wikis, not just ones in the A/B test). Desktop users seem unaffected, and unregistered mobile users seem to have been partially affected.

Events by platform and registration status:

select
    date_format(dt, "yyyy-MM-dd") as date,
    sum(cast(event.platform = "desktop" and event.user_id != 0 as int)) as registered_desktop,
    sum(cast(event.platform = "desktop" and event.user_id = 0 as int)) as anonymous_desktop,
    sum(cast(event.platform = "phone" and event.user_id != 0 as int)) as registered_phone,
    sum(cast(event.platform = "phone" and event.user_id = 0 as int)) as anonymous_phone
from event.editattemptstep
where
    year = 2019 and (
        (month = 6 and day > 23) or
        (month = 7)
    )
group by date_format(dt, "yyyy-MM-dd")

date | registered_desktop | anonymous_desktop | registered_phone | anonymous_phone
2019-06-24 | 108487 | 234554 | 33555 | 299146
2019-06-25 | 111791 | 222482 | 26893 | 227859
2019-06-26 | 107107 | 232324 | 55 | 1410
2019-06-27 | 101278 | 212925 | 11 | 480
2019-06-28 | 100611 | 211608 | 0 | 7600
2019-06-29 | 96483 | 179354 | 0 | 62548
2019-06-30 | 104681 | 189258 | 0 | 91495
2019-07-01 | 114615 | 227053 | 0 | 97624
2019-07-02 | 109774 | 227518 | 0 | 106471

Another, less urgent issue I noticed is that we're logging all mobile VE events, but only 1/16 mobile wikitext events. This means we're throwing away a lot of data and reducing our statistical power. Luckily, @DLynch thinks it will be easy to turn on that oversampling.
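
In the meantime, one way to factor out the sampling is to count only non-oversampled events and scale up. This is a sketch that assumes a 1-in-16 base sampling rate and that the extra VE events are flagged in event.is_oversample:

-- Sketch: estimate total inits per editor from the 1/16 base sample.
-- Assumes event.is_oversample marks the extra VE events, so
-- non-oversampled events are a uniform 1-in-16 sample for both editors.
select
    event.editor_interface as editor,
    count(*) * 16 as estimated_inits
from event.editattemptstep
where
    event.action = "init" and
    event.platform = "phone" and
    not event.is_oversample and
    year = 2019 and month = 7
group by event.editor_interface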

Change 520645 had a related patch set uploaded (by DLynch; owner: DLynch):
[mediawiki/extensions/MobileFrontend@master] schemaEditAttemptStep: only set bucket and anonymous-user-token on defaults if non-null

https://gerrit.wikimedia.org/r/520645

To copy from chat:

Post-mortem: The original patch didn't account for EventLogging's validation behavior. It assumed that key: null/undefined was equivalent to key not being set in the data in the first place for non-required schema items. Unfortunately, this was not the case, and so the event would fail validation for people who weren't in the A/B test.
Which failed to be caught in testing in two ways: one, it only happened if you were the very opposite of the group the patch was meant to affect, and two, you have to jump through quite a few hoops to see EventLogging validation messages.

Change 520645 merged by jenkins-bot:
[mediawiki/extensions/MobileFrontend@master] schemaEditAttemptStep: only set bucket and anonymous-user-token on defaults if non-null

https://gerrit.wikimedia.org/r/520645

Change 520649 had a related patch set uploaded (by Jforrester; owner: DLynch):
[mediawiki/extensions/MobileFrontend@wmf/1.34.0-wmf.11] schemaEditAttemptStep: only set bucket and anonymous-user-token on defaults if non-null

https://gerrit.wikimedia.org/r/520649

Change 520649 merged by jenkins-bot:
[mediawiki/extensions/MobileFrontend@wmf/1.34.0-wmf.11] schemaEditAttemptStep: only set bucket and anonymous-user-token on defaults if non-null

https://gerrit.wikimedia.org/r/520649

Mentioned in SAL (#wikimedia-operations) [2019-07-03T23:00:00Z] <jforrester@deploy1001> Synchronized php-1.34.0-wmf.11/extensions/MobileFrontend/resources/dist/: T221197 schemaEditAttemptStep: only set bucket and anonymous-user-token on defaults if non-null (duration: 00m 51s)

To copy from chat:

Post-mortem: The original patch didn't account for EventLogging's validation behavior. It assumed that key: null/undefined was equivalent to key not being set in the data in the first place for non-required schema items. Unfortunately, this was not the case, and so the event would fail validation for people who weren't in the A/B test.
Which failed to be caught in testing in two ways: one, it only happened if you were the very opposite of the group the patch was meant to affect, and two, you have to jump through quite a few hoops to see EventLogging validation messages.

Wow, this is very counter-intuitive behavior. To amplify, our bucket and anonymous_user_token fields are defined as non-required string fields in the schema. But apparently, sending null values for those fields fails validation (which means the entire event is thrown away) since nulls are not strings. This happens even though not sending a value at all is perfectly fine and results in a null value in the database (I actually could've easily figured this out by checking the eventerror data stream, but didn't think of it).
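
For future reference, a check of that stream might look something like this (a sketch: the field names event.schema and event.message are assumptions that would need checking against the actual eventerror table):

-- Sketch: recent validation failures for EditAttemptStep, by message.
-- Field names (event.schema, event.message) are assumptions to verify.
select
    event.message as error_message,
    count(*) as errors
from event.eventerror
where
    event.schema = "EditAttemptStep" and
    year = 2019 and month = 7
group by event.message
order by errors desc
limit 20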

I can't find this explained on any of the EventLogging documentation pages, although even if it were there, I wouldn't have expected anybody to read it, since those pages are generally so badly out of date.

At any rate, thanks @DLynch for the quick fix and post-mortem!

nshahquinn-wmf renamed this task from VE mobile default: post-deployment QA A/B test to VE mobile default: post-deployment data checks. Jul 5 2019, 11:09 AM
nshahquinn-wmf renamed this task from VE mobile default: post-deployment data checks to VE mobile default: A/B test post-deployment data checks.

Okay, I've now done all the initial checks I'm planning to do (although I will still re-run some of these checks after the fixes land). If you have any ideas for other checks I should do, please let me know!

Awesome, Neil. A few quick questions/confirmations:

Meta
Would I be correct to assume the outcome of our checks suggests the following are true?

Measurement | Status
Edit completion rate | ✅ Confirmed - data is being logged as we expect to measure this
Total number of completed edits | ✅ Confirmed - data is being logged as we expect to measure this
Time to save an edit | ⏳ Not confirmed - we will know after we re-run the revisionID query next week
Edit size | ⏳ Not confirmed - we will know after we re-run the revisionID query next week
Editor retention | ✅ Confirmed - data is being logged as we expect to measure this
Edit quality | ⏳ Not confirmed - we will know after we re-run the revisionID query next week
Editing interface switching | ⏳ Not confirmed - we will know after we re-run the revisionID query next week

inits vs. ready
Would I be correct to think a check of ready events is unnecessary because we can assume that if inits are being logged properly, ready events will be too? I ask this to be doubly sure we have everything we need to calculate the edit completion rate among the two test groups.

Editor Switching
Is the below the right way to read the table included in the ticket?

  • On 2019-07-05, 4.5% of mobile VE edit sessions that reached ready involved a contributor switching to the mobile wikitext editor.

Awesome, Neil. A few quick questions/confirmations:

Meta
Would I be correct to assume the outcome of our checks suggests the following are true?

Yes, except that:

Editing interface switching | ⏳ Not confirmed - we will know after we re-run the revisionID query next week

should actually be confirmed, since the switches query shows that we're logging this data for the mobile editors. We don't need revision ID, and although I do want to re-run the query, that's just to check that the desktop data (which we don't need for this experiment) starts appearing as expected.

inits vs. ready
Would I be correct to think a check of ready events is unnecessary because we can assume that if inits are being logged properly, ready events will be too? I ask this to be doubly sure we have everything we need to calculate the edit completion rate among the two test groups.

Yeah, my semi-conscious thinking was that it wasn't necessary, but I do agree that it's good to be doubly sure and so I've added a check of readies (and good thing too—I found an issue which I'll describe in my next comment).

Editor Switching
Is the below the right way to read the table included in the ticket?

  • On 2019-07-05, 4.5% of mobile VE edit sessions that reached ready involved a contributor switching to the mobile wikitext editor.

Yes, correct!

@DLynch, while doing an extra check of ready counts as suggested by @ppelberg, I noticed that the default-visualeditor bucket is only logging 60% as many ready events as the default-source bucket among our test population (when factoring out oversampling). That's big and concerning, whether it's due to real user behavior or instrumentation issues.

This prompted me to look back at the overall counts of init and saveSuccess events in the two buckets and, although the difference is smaller, the visual editor group has fewer of each too. It has only 90% as many inits and 75% as many saves.
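
A query along these lines would surface those counts (a sketch: it assumes the buckets appear in event.bucket as "default-source" and "default-visual", and that event.is_oversample marks the oversampled events):

-- Sketch: raw event counts per bucket, excluding oversampled sessions,
-- so the two buckets are comparable.
select
    event.bucket as bucket,
    sum(cast(event.action = "init" as int)) as inits,
    sum(cast(event.action = "ready" as int)) as readies,
    sum(cast(event.action = "saveSuccess" as int)) as saves
from event.editattemptstep
where
    year = 2019 and month = 7 and
    event.platform = "phone" and
    event.bucket in ("default-source", "default-visual") and
    not event.is_oversample
group by event.bucket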

I don't know what to make of it. This seems too large a difference to be due to user behavior, particularly when our data on edit completion rates by experience level suggest nothing of the sort. On the other hand, the fact that it's directionally consistent but not identical across different event types suggests it's not an instrumentation issue.

I'll keep investigating tomorrow, but in the meantime, see if you have any ideas 🙂

I also re-ran the "events by platform and registration status" check to get several more days of data and make a graph. That confirms that the issue with registered mobile editing is fixed.

In T221197#5315023, @Neil_P._Quinn_WMF wrote:

Meta
Would I be correct to assume the outcome of our checks suggests the following are true...

Yes, except that:

Editing interface switching | ⏳ Not confirmed - we will know after we re-run the revisionID query next week

should actually be confirmed, since the switches query shows that we're logging this data for the mobile editors...

Ok, great. Your response is leading me to realize I mistakenly asked about re-running the query once the revisionID patch landed when I'd meant to ask about the switching patch: T221191#5297909. Either way, it sounds like we're all good here. ✅

inits vs. ready
Would I be correct to think a check of ready events...

Yeah, my semi-conscious thinking was that it wasn't necessary, but I do agree that it's good to be doubly sure and so I've added a check of readies (and good thing too—I found an issue which I'll describe in my next comment).

Good catch. I'm going to respond to T221197#5315064 separately.

Editor Switching
Is the below the right way to read the table included in the ticket?

Yes, correct!

Excellent.

In T221197#5315083, @Neil_P._Quinn_WMF wrote:

I also re-ran the "events by platform and registration status" check to get several more days of data and make a graph. That confirms that the issue with registered mobile editing is fixed.

Ok, great. Reference: T221197#5305019

In summary, this looks like our current state:

Measurement | Status
Edit completion rate | ⏳ Not confirmed - see: T221197#5315064
Total number of completed edits | ✅ Confirmed - data is being logged as we expect to measure this
Time to save an edit | ⏳ Not confirmed - we will know after we re-run the revisionID query this week
Edit size | ⏳ Not confirmed - we will know after we re-run the revisionID query this week
Editor retention | ✅ Confirmed - data is being logged as we expect to measure this
Edit quality | ⏳ Not confirmed - we will know after we re-run the revisionID query this week
Editing interface switching | ✅ Confirmed - data is being logged as we expect to measure this

In T221197#5315064, @Neil_P._Quinn_WMF wrote:

...I noticed that the default-visualeditor bucket is only logging 60% as many ready events as the default-source bucket among our test population (when factoring out oversampling). That's big and concerning, whether it's due to real user behavior or instrumentation issues.

Does a ready event fire before or after a contributor taps the "Start editing" button at the bottom of the "Welcome to Wikipedia" overlay [1]? I ask wondering: to what extent could this additional step be the reason why we are seeing fewer default VE ready events than default source ready events? I appreciate that you go on to say, "...This seems too large a difference to be due to user behavior...", but I just want to be sure.


  1. "Welcome to Wikipedia" overlay: image.png (screenshot attachment, 386 KB)

@DLynch, while doing an extra check of ready counts as suggested by @ppelberg, I noticed that the default-visualeditor bucket is only logging 60% as many ready events as the default-source bucket among our test population (when factoring out oversampling). That's big and concerning, whether it's due to real user behavior or instrumentation issues.

The notable difference between source and visual is that visual has a longer loading time between init and ready. Users who're quickly abandoning their idea to edit would be reflected in that more easily for visual.

Does a ready event fire before or after a contributor taps the "Start editing" button at the bottom of the "Welcome to Wikipedia" overlay [1]?

Before. Those overlays are covering up the loaded editor, not delaying the load.

Though as a confounding factor for overall edit-success rates, visual does get two of those initial dialogs when source only gets one...

The notable difference between source and visual is that visual has a longer loading time between init and ready. Users who're quickly abandoning their idea to edit would be reflected in that more easily for visual.

What would be a way for us to test whether this is the cause of what's happening? My first thought: do we see a similar proportion of contributors drop off between init and ready events on desktop on wikis where VE is the default?

...we looked at drop-off along the edit funnel in last year's VE on mobile report, though I'm not currently seeing these numbers explicitly for wikis where VE is the desktop default.

Before. Those overlays are covering up the loaded editor, not delaying the load.

Understood.

Though as a confounding factor for overall edit-success rates, visual does get two of those initial dialogs when source only gets one...

Noted. Following up on this. First in chat where we had an active thread about the "Welcome" overlay.


Edit: @kzimmerman raised a good question in relation to @DLynch's mention of load times and their potential impact on the high drop-off rate between init and ready in mobile VE:

Does the Editing Team have data on mobile VE load performance on lower-powered devices and slower internet connections?

👆Can you think of anywhere else – besides the tickets below – where this information lives?

In T221197#5315064, @Neil_P._Quinn_WMF wrote:

@DLynch, while doing an extra check of ready counts as suggested by @ppelberg, I noticed that the default-visualeditor bucket is only logging 60% as many ready events as the default-source bucket among our test population (when factoring out oversampling). That's big and concerning, whether it's due to real user behavior or instrumentation issues.

I worry this might be caused by T213214: Visual Editor gets stuck opening article (net::ERR_SPDY_PROTOCOL_ERROR 200/Loading failed for the <script> with source ...). I often encounter the issue when trying to edit on a wiki I visit for the first time. I hadn't connected this before, but since new users are typically visiting for the first time, it would presumably affect them too.

Change 521575 had a related patch set uploaded (by Bartosz Dziewoński; owner: Bartosz Dziewoński):
[mediawiki/extensions/VisualEditor@master] Break up our massive load.php request to work around network issues

https://gerrit.wikimedia.org/r/521575

After this is deployed, we should check the effects (separately for the A/B test users, for mobile in general, and for desktop in general):

  • Does it increase the number of VE sessions that get to the 'ready' step?
  • Does it increase the VE load times, and if so, by how much?

If it doesn't improve the problem with missing 'ready' events, then we have to look for another solution (and we should revert it).

If it does improve it, then we should look at the load times. If they are not really affected, we should just keep it this way, and somehow implement a load.php response size limit more permanently in our configuration for everything else. If they are made worse, then we should ask SRE to look into T213214 again and find a better solution for it.

Okay, so, here are my latest findings. I've stopped trying to keep the description updated; instead, looking at my Jupyter notebook will be the best way to see my full process.

In the default-visual bucket, a lot fewer sessions reach ready (let's call the proportion that do the "ready rate"). The daily ready rates in each bucket (excluding oversampled sessions) are as follows:

date | default-source | default-visual
2019-06-28 | 98.8% | 77.0%
2019-06-29 | 99.6% | 68.3%
2019-06-30 | 99.7% | 64.2%
2019-07-01 | 99.5% | 65.4%
2019-07-02 | 99.5% | 63.0%
2019-07-03 | 99.4% | 63.3%
2019-07-04 | 99.5% | 65.6%
2019-07-05 | 99.4% | 66.6%
2019-07-06 | 99.7% | 65.9%
2019-07-07 | 99.5% | 66.5%
2019-07-08 | 99.5% | 64.4%
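
For reference, the daily ready rate can be computed along these lines (a sketch: it assumes the buckets appear in event.bucket and that event.is_oversample flags oversampled events; sessions spanning midnight are counted on each day they log events):

-- Sketch: share of sessions logging a ready event, per bucket and day.
with sessions as (
    select
        date_format(dt, "yyyy-MM-dd") as date,
        event.bucket as bucket,
        event.editing_session_id as attempt_id,
        max(cast(event.action = "ready" as int)) as reached_ready
    from event.editattemptstep
    where
        year = 2019 and (
            (month = 6 and day >= 28) or
            (month = 7)
        ) and
        event.platform = "phone" and
        event.bucket in ("default-source", "default-visual") and
        not event.is_oversample
    group by
        date_format(dt, "yyyy-MM-dd"),
        event.bucket,
        event.editing_session_id
)
select
    date,
    bucket,
    concat(round(sum(reached_ready) * 100 / count(*), 1), "%") as ready_rate
from sessions
group by date, bucket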

This is quite different from what I found in the VisualEditor on mobile report; the ready rate for the mobile visual there was about 95%. The difference may be that the old number was based on a self-selected group of VE users, while this new number is based on a more representative group. Alternatively, it could be the result of a regression during the past year.

In any case, the reason is probably performance: the mobile visual editor takes much longer to load than the mobile wikitext editor. Here are the key percentiles of ready_time values from the A/B test population; the results across all phone sessions are pretty much identical. Times are in milliseconds and oversampled sessions are included.

percentile | visualeditor | wikitext
10th_percentile | 1136 | 218
median | 2868 | 560
90th_percentile | 7917 | 2020
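
Percentiles like these can be pulled with Hive's percentile_approx; this is a sketch that assumes the load time is recorded in milliseconds in the event.ready_timing field:

-- Sketch: approximate 10th/50th/90th percentile ready timings (ms)
-- per mobile editor. Assumes timings live in event.ready_timing.
select
    event.editor_interface as editor,
    percentile_approx(event.ready_timing, array(0.1, 0.5, 0.9)) as p10_median_p90
from event.editattemptstep
where
    event.action = "ready" and
    event.platform = "phone" and
    year = 2019 and month = 7
group by event.editor_interface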

The lower ready rate in the default-visual bucket won't affect the edit completion rate, because any sessions that fail to reach ready are not factored into it (although this issue raises the question of whether that should be the case).

However, I'm also seeing an odd phenomenon in the edit completion rate (aggregated since 4 July, to remove possible effects of the null-bucket bug). When I look at it without oversampled sessions, the default-visual bucket has an advantage:

bucket | edit_completion_rate
default-source | 3.1%
default-visual | 4.0%

But when I include oversampled sessions, the default-source bucket has a massive advantage:

bucket | edit_completion_rate
default-source | 8.0%
default-visual | 2.7%

This doesn't make sense to me. Currently, the only sessions we are oversampling are a random 15/16 of mobile VE sessions. Including those shouldn't change the rate, because the sampling should be random and because the rate should be indifferent to the total number of sessions. Moreover, even if including them did change the rate, it should only affect the visual editor completion rate (and therefore have only a tiny impact on the default-source bucket).
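
One way to pin this down would be to make the oversample flag an explicit dimension of the completion rate query. A sketch, assuming event.is_oversample is constant within an editing session:

-- Sketch: edit completion rate (saveSuccess per session reaching ready)
-- per bucket, split by the oversample flag. Assumes the flag is
-- constant within an editing session.
with sessions as (
    select
        event.bucket as bucket,
        event.is_oversample as oversampled,
        event.editing_session_id as attempt_id,
        max(cast(event.action = "ready" as int)) as reached_ready,
        max(cast(event.action = "saveSuccess" as int)) as saved
    from event.editattemptstep
    where
        year = 2019 and month = 7 and day >= 4 and
        event.platform = "phone" and
        event.bucket in ("default-source", "default-visual")
    group by
        event.bucket,
        event.is_oversample,
        event.editing_session_id
)
select
    bucket,
    oversampled,
    concat(round(sum(saved) * 100 / sum(reached_ready), 1), "%") as edit_completion_rate
from sessions
group by bucket, oversampled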

Overall, I'm confused, but I'll keep working on this tomorrow. Some of my ideas about what to do next:

  • Check whether including or excluding oversampled sessions affects the ready rate or ready timings
  • Check loading time and ready rate over time, to see if there was a performance regression in mobile VE
  • Confirm my assumptions about which sessions are being oversampled
  • See if some of these phenomena differ by user experience level

Please add any thoughts you have, and let me know if there's any line of investigation you think I should prioritize.

In T221197#5319337, @Neil_P._Quinn_WMF wrote:

In the default-visual bucket, a lot fewer sessions reach ready (let's call the proportion that do the "ready rate").

...

This is quite different from what I found in the VisualEditor on mobile report; the ready rate for the mobile visual there was about 95%. The difference may be that the old number was based on a self-selected group of VE users, while this new number is based on a more representative group. Alternatively, it could be the result of a regression during the past year.

In any case, the reason is probably performance: the mobile visual editor takes much longer to load than the mobile wikitext editor.

It turns out this decline in mobile VE's ready rate is due to a change during the past year, but it was a metrics improvement rather than a performance regression. In May 2019, Ed fixed T217825: Editing timing data incorrect for second load (which made all ready timings, not just the second one in a row, incorrectly low).

So it looks like the markedly worse performance numbers and ready rate are real—they just didn't appear that way a year ago because of instrumentation problems. Unfortunately, that throws this whole A/B test into question because we now know that mobile VE's load performance is a lot worse than mobile wikitext's, and that's having a major negative impact on the edit funnel (even if it wouldn't show up in our key metric of edit completion rate).

ready rate.png (455×870 px, 37 KB)

ready timings.png (455×875 px, 69 KB)

T220697: Editing timing data incorrect/meaningless when switching editor on mobile may also play some role here, although probably not too much, because this performance issue didn't change much when we rolled out the A/B test and added a bunch of mobile VE users who never had to switch to VE to use it.

I worry this might be caused by T213214: Visual Editor gets stuck opening article (net::ERR_SPDY_PROTOCOL_ERROR 200/Loading failed for the <script> with source ...). I often encounter the issue when trying to edit on a wiki I visit for the first time. I hadn't connected this before, but since new users are typically visiting for the first time, it would presumably affect them too.

By the way, thank you for this suggestion! It turned out not to be the reason (although it's certainly possible it's having a separate effect on our metrics), but this is exactly the kind of idea I wanted when I posted :)

I've just discovered this bug, which is probably having a fairly big impact on performance: T227897. Basically, the first edit pencil opens the editor for the whole page (it shouldn't), which is many times slower than just loading a section.

After a week of new discoveries, I'd like to make sure we are all on the same page about our answers to the following questions:

  1. What issues have not yet been resolved?
  2. What are our next steps? Who is taking these steps?
  3. What metrics can we currently measure?

Please, @Neil_P._Quinn_WMF and @DLynch, if you see something in the below that's unclear/inaccurate/unexpected, say something...

Issue #1: "ready rate" varies significantly between test buckets

Cause(s): The mobile visual editor takes 3-4x longer to reach ready than the mobile wikitext editor does.
Implications:

  • Technically, the test's core metric – edit completion rate – is unaffected by this issue.
  • There could be a difference in the "type" of contributors who end up using default-visual and default-source, considering the difference in load times between the two interfaces

Additional context: T221197#5319337
Open questions + next steps:

Action item | Owner | Status
What – if any – changes do we make to the A/B test metrics? | @ppelberg | Open
What – if any – improvements listed in T227930 should block the start of this test? | @ppelberg | T227897

Issue #2: oversampling changes edit completion rate

Cause(s): ⚠️Not certain
Implications: If our randomization is not working properly (read: it is not truly random), the accuracy/legitimacy of all of our other test metrics is put into question.
Additional context: T221197#5319337
Open questions + next steps

Action item | Owner | Ticket
More investigation | @Neil_P._Quinn_WMF | T227931 [1]

  1. T227931: Neil, I broke this out into a separate ticket assuming this would make it "easier" to think about/work on. Please merge it in with this one if this is not the case.

cc @Esanders

In T221197#5319337, @Neil_P._Quinn_WMF wrote:
date | default-source | default-visual
2019-07-08 | 99.5% | 64.4%

I've followed up on this on T227930.

TL;DR: The source editor has 99.5% conversion from init to ready because there is no abandon button so users can't stop it loading. If we look at no-change abandons instead, the numbers are comparable between the two editors.
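
For reference, a no-change abandon comparison might be built along these lines (a sketch: it assumes abort events record an event.abort_type field, with "nochange" marking aborts where nothing was changed, which should be verified against the schema):

-- Sketch: share of sessions ending in a no-change abort, per bucket.
-- The event.abort_type field and its "nochange" value are assumptions.
with sessions as (
    select
        event.bucket as bucket,
        event.editing_session_id as attempt_id,
        max(cast(event.action = "abort" and event.abort_type = "nochange" as int)) as nochange_abort
    from event.editattemptstep
    where
        year = 2019 and month = 7 and
        event.platform = "phone" and
        event.bucket in ("default-source", "default-visual")
    group by
        event.bucket,
        event.editing_session_id
)
select
    bucket,
    concat(round(sum(nochange_abort) * 100 / count(*), 1), "%") as nochange_abandon_rate
from sessions
group by bucket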

This is quite different from what I found in the VisualEditor on mobile report; the ready rate for the mobile visual there was about 95%. The difference may be that the old number was based on a self-selected group of VE users, while this new number is based on a more representative group. Alternatively, it could be the result of a regression during the past year.

I think this is because this patch, which fixes the timing of the ready event on mobile, landed between that report and this one: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/502992

nshahquinn-wmf added a subscriber: MNeisler.

I've checked all the outstanding issues and we're ready to go.

  • We are almost always logging the saved revision ID. Since 12 July, around 96% of save success events in each interface have included new revision IDs (that is, ones greater than those in previous events in the session). I've filed T231024 since it would be good to investigate it at some point, but the issue is rare enough that it won't significantly affect the analysis of this test.
  • We have not affected non-test wikis. We have not recorded any events in the default-visual or default-source bucket at non-test wikis, and the rate of mobile visual edits at non-test wikis did not increase when we deployed the bucket.
  • Oversampling is no longer affecting the edit completion rate. Since 12 July, when we started oversampling all the mobile wikitext sessions at our target wikis, the relationship between edit completion rate in the two buckets has been the same whether the oversampled sessions are included or not. Although this still suggests that our sampling has some issues, we can be confident that they won't affect the A/B test since we're logging all the relevant data rather than sampling. I will deprioritize T227931 but keep it open for possible future work.

This is quite different from what I found in the VisualEditor on mobile report; the ready rate for the mobile visual there was about 95%. The difference may be that the old number was based on a self-selected group of VE users, while this new number is based on a more representative group. Alternatively, it could be the result of a regression during the past year.

I think this is because this patch, which fixes the timing of the ready event on mobile, landed between that report and this one: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/502992

Yup, that's what I found! The change in the ready rate matches up exactly with the deployment of that patch. I'll follow up with more details at T227930.

Change 521575 merged by jenkins-bot:
[mediawiki/extensions/VisualEditor@master] Break up our massive load.php request to work around network issues

https://gerrit.wikimedia.org/r/521575