
Variant tests: C vs. D analysis
Closed, Resolved · Public

Description

After the Variant C vs. D test is deployed, we want it to run for about 5 weeks. As we did in T238888: Variant tests: "initiation" test (A vs. B), we will then analyze the results with the goal of choosing the "better" variant and then giving all newcomers that variant. We will compare the two variants on these metrics:

  • Visits (mobile only): what percent of homepage visitors view the full suggested edits module on mobile? This only applies on mobile because mobile users must tap the module preview or go through onboarding before getting to the full module.
  • Interaction: what percent of homepage visitors interact with the suggested edits module? Note that we want to only count interactions with the fully-initiated suggested edits module, meaning things that happen post-initiation in both variants. The onboarding overlays for Variant C don't count and the topic and difficulty screens for Variant D don't count. We mean just: interacting with the topic or difficulty filter buttons, navigating cards with the arrows, hovering on the "i", or selecting a task. The same goes for mobile.
  • Navigation: what percent of homepage visitors navigate to another task in the module?
  • Task selection: what percent of homepage visitors click on a task to do?
  • Edit success: what percent of newcomers save a suggested edit?

We want to split these metrics by wiki and platform. We should use all 17 of our Wikipedias for this analysis. They have all had the Growth features since Oct 19, when the experiment started.
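For concreteness, a minimal sketch of how these proportions could be computed is below. It assumes a pandas DataFrame with one row per homepage visitor and illustrative column names (the real analysis would build this table from the HomepageModule EventLogging data, and the actual field names may differ):

```python
# Hypothetical sketch: per-variant proportions split by wiki and platform.
# `visitors` is assumed to have one row per homepage visitor, with boolean
# flags for each metric; these column names are placeholders, not the schema's.
import pandas as pd

METRICS = ["viewed_full_module", "interacted", "navigated", "selected_task", "saved_edit"]

def variant_proportions(visitors: pd.DataFrame) -> pd.DataFrame:
    """Share of homepage visitors reaching each metric, by wiki, platform, and variant."""
    grouped = visitors.groupby(["wiki", "platform", "variant"])
    summary = grouped[METRICS].mean()        # boolean flags -> proportions
    summary["n_visitors"] = grouped.size()   # keep group sizes for context
    return summary.reset_index()

# Example usage:
# summary = variant_proportions(visitors)
# print(summary[summary["platform"] == "mobile"])
```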

When this analysis is finished, we'll want to present it along with baseline numbers from Variant A. An open question: should we use the Variant A numbers from T238888: Variant tests: "initiation" test (A vs. B)? Or should we recalculate new Variant A numbers from September 2020, because (a) a lot of time has passed since March 2020, and (b) the combination of wikis will be very different?

Event Timeline

Hi @MMiller_WMF - are we also able to track the following:

  • Interactions with onboarding and "i" info help screens on Variant C vs. Variant D – wondering if, for example, Variant D users tend to select more topics and task types because selecting them is part of their onboarding?
  • Breakdown of the different task types selected – again, to see whether awareness of filtering by task type (and therefore attempts at a greater variety of tasks) might be higher on Variant D because the filters are a more prominent part of its onboarding?
kzimmerman triaged this task as Medium priority. Nov 5 2020, 6:40 PM
kzimmerman moved this task from Triage to Current Quarter on the Product-Analytics board.

@MMiller_WMF : you've asked me to determine what the duration of the Variant C/D experiment should be. Here's what I've come up with.

1: How much time should we give users to interact with the Homepage and make a tagged edit? In the Discovery Analysis, we use "within 48 hours of registration" for a user's first visit to the Homepage. We've defined editor activation as editing within 24 hours. The "newcomer task" edit tag applies to edits within 7 days of clicking on a task. In the Variant A/B experiment analysis we used 14 days, which I now think is excessive; we're more interested in interactions that happen relatively soon after registration.

Of the various measurements we have, interactions with the Homepage should occur relatively quickly, whereas the tagged edit might lag somewhat. In this case, we're interested in the user's first tagged edit, so I did a quick analysis of that and found that the 90th percentile is just over 24 hours (the median is 20 minutes and the 75th percentile is an hour and a half, so most users edit quickly). Since that coincides with our definition of activation, I think we should limit all of the above measurements to activity within 24 hours of registration.
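For reference, a quantile check along these lines could look like the following sketch. It assumes a pandas DataFrame of tagged edits with hypothetical `user_id`, `registration_ts`, and `edit_ts` columns; it is not the notebook actually used for the analysis:

```python
# Sketch: hours from registration to each user's first "newcomer task" edit,
# assuming `edits` has one row per tagged edit with the placeholder columns above.
import pandas as pd

def time_to_first_tagged_edit(edits: pd.DataFrame) -> pd.Series:
    first_edit = edits.groupby("user_id").agg(
        registration_ts=("registration_ts", "first"),
        first_edit_ts=("edit_ts", "min"),
    )
    delta = first_edit["first_edit_ts"] - first_edit["registration_ts"]
    return delta.dt.total_seconds() / 3600  # hours

# hours = time_to_first_tagged_edit(edits)
# print(hours.quantile([0.5, 0.75, 0.9]))  # median, 75th, 90th percentile
```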

2: We know from T266610 that the impact of the missing "newcomer task" edit tag is substantial. Because the Variant C/D analysis looks at tagged edits, I think we should exclude the period the bug was in effect. The variant experiment started on Oct 19, but the bug fix was deployed on Oct 28, so we shift the start of the analysis window by a bit more than a week.

3: We're unsure how big the differences between the variants will be, which makes it difficult to estimate what statistical power we need. Given that the Homepage is now deployed on several large wikis, it's tempting to go with a short duration. I think we should be cautious, use four weeks (roughly a month), and revisit the question once the analysis is done.

Based on these points, we end up with an experiment start of 2020-10-28; four weeks later is 2020-11-25. We need an additional 24 hours per point 1 above, putting the end date for data gathering at 2020-11-26.
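Spelled out with the standard library, just to make the window arithmetic explicit:

```python
# Analysis window: start at the tag bug fix, run four weeks, then allow
# one extra day so the last registrations get their full 24 hours.
from datetime import date, timedelta

experiment_start = date(2020, 10, 28)
experiment_end = experiment_start + timedelta(weeks=4)     # 2020-11-25
data_collection_end = experiment_end + timedelta(days=1)   # 2020-11-26

print(experiment_start, experiment_end, data_collection_end)
```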

@MMiller_WMF : I've started working on this analysis and went through the questions in the description to figure out how to gather data for them. That left me with some questions of my own.

  1. These questions are phrased using "newcomers" as the basis for calculating proportions. In the Variant A/B analysis, we used visits to the Homepage, as reflected in the HomepageModule schema's data, as the basis. Continuing to do so would keep things consistent and also avoid issues such as users blocking EventLogging. Also, I think we're mainly interested in what users do on the Homepage if they visit it, not whether they visit it.
  2. When digging into the specifications of the four different variants to determine what "interacting with the module" really means, I wasn't sure what counts. Variant C on both desktop and mobile has the "popup, topic selection, difficulty selection" funnel, which I think should be ignored. Is that correct? For Variant D, users have to initiate the module first; should that process count as "interaction", or should we start the clock once they've activated the module?
  3. For counting edits, we have done this before in T253902. I'd like to reuse the same approach this time because it's the most accurate one we know of. Since it doesn't use HomepageModule as a basis, it won't be directly comparable to the other measurements and will be the one measurement that uses "newcomers" as the denominator. Is that going to be a problem?

@nettrom_WMF -- thanks for thinking this through and posting the questions. Here are my responses. I will update the task description.

  1. I agree that we should stay consistent with Variant A/B and just look at homepage visitors. We believe the vast majority of homepage visitors are newcomers. That said, I think we should restrict the analysis to users whose first visit to the homepage occurred after the experiment started. Users who created accounts before the experiment all have Variant D, and many of them had already completed initiation. I don't think we should count them. What should we do there? Restrict to users with registration dates after the experiment started? Or users whose first visit was after the experiment started?
  2. For interacting with the module, we mean things that happen post-initiation in both variants. The onboarding overlays for Variant C don't count and the topic and difficulty screens for Variant D don't count. We mean just: interacting with the topic or difficulty filter buttons, navigating cards with the arrows, hovering on the "i", or selecting a task. The same goes for mobile.
  3. It is fine that this will only count edits for newcomers.

One other point that I didn't specify in the task description: we should use all 17 of our Wikipedias for this analysis. They have all had the Growth features since Oct 19, when the experiment started.

> What should we do there? Restrict to users with registration dates after the experiment started? Or users whose first visit was after the experiment started?

The plan is to restrict it to users who registered during the experiment (Oct 28 through Nov 25, per the comment above) and only count their actions during the first 24 hours after registration. That ensures they're randomly assigned to a condition so we can use the data to understand the effects of the variants.
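As a sketch of those restrictions, the filtering could look like the snippet below. It assumes a pandas DataFrame of HomepageModule-style events with hypothetical `user_id`, `registration_ts`, `event_ts`, and `action` columns; the action names are placeholders for the post-initiation interactions described above, not the schema's real values:

```python
# Sketch: keep post-initiation interactions by users who registered during the
# experiment, limited to their first 24 hours after registration.
import pandas as pd

EXPERIMENT_START = pd.Timestamp("2020-10-28")
REGISTRATION_CUTOFF = pd.Timestamp("2020-11-26")  # exclusive: registrations through Nov 25

# Placeholder action names, not the HomepageModule schema's actual values.
POST_INITIATION_ACTIONS = {
    "topic_filter_click",      # topic filter button
    "difficulty_filter_click", # difficulty filter button
    "card_navigation",         # navigating cards with the arrows
    "info_icon_hover",         # hovering on the "i"
    "task_select",             # selecting a task
}

def experiment_interactions(events: pd.DataFrame) -> pd.DataFrame:
    registered_in_window = (
        (events["registration_ts"] >= EXPERIMENT_START)
        & (events["registration_ts"] < REGISTRATION_CUTOFF)
    )
    within_24h = events["event_ts"] <= events["registration_ts"] + pd.Timedelta(hours=24)
    post_initiation = events["action"].isin(POST_INITIATION_ACTIONS)
    return events[registered_in_window & within_24h & post_initiation]
```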

Everything else sounds good to me, and thanks for remembering that all 17 wikis are included in this experiment! I'll make sure that we use data from all of them.

I updated the last question in the task description to reflect that we're investigating whether users saved a suggested edit, as that is also what we measured in T253902.

@MMiller_WMF : I've updated the draft report with the findings from the experiment analysis, handing it off to you for review.

The draft report now includes analysis of all questions, including the proportion of users who view the full newcomer tasks module on mobile. It also has similar measurements from the Variant A/B experiment analysis for comparison. Handing this task off to @MMiller_WMF for final review and sign off while I go review all the notebooks and put them on GitHub.

All the notebooks related to this analysis are now in the Growth homepage repository. The filenames all start with the task number of this phab task.