
Personalized first day: experiments (Variation A)
Closed, Resolved · Public

Description

While it may not be a priority to experiment on the content or presentation of the new survey form, we will likely want to experiment on the presence of the form, and whether it depresses or increases activation rates.

The first step here is to decide whether and what we'll be experimenting on with this feature. Then we can create additional tasks for designing and implementing the experiments.

This task has become about experiments relating to Variation A.

A separate task has been created for the experiments relating to Variation C: T210868

Event Timeline

@nettrom_WMF -- in discussions with the team, it sounds like the main thing we want to know in the near term is simply whether the additional form has an effect on activation rate (or other important new account holder behavior). Learning about differences in question wording, ordering, quantity, or UI is not a priority right now.

Could you spend a few minutes thinking this through from your perspective? Basically, I think we don't want to deploy this to all new editors, and then suspect it might be depressing activation, but not really know for sure. In that vein, if we randomize who receives the form, we could see which of them make edits -- a path that requires a month-long (or longer) experiment, according to your calculations.

Perhaps another way to look at it is whether the form causes users to just bounce from the site. In that case, we would see via "Understanding first day" that they do not have any more pageviews.

Anyway, what do you think? Do you see a simple/straightforward way to get at what we want?

mpopov triaged this task as Medium priority. Oct 26 2018, 6:38 PM
mpopov moved this task from Triage to Doing on the Product-Analytics board.

The main question being asked here is whether the survey we are adding has a detrimental effect on user activity. I've discussed this with @MMiller_WMF and with the Product-Analytics team. Out of those meetings come the following recommendations and questions:

  1. What are our leading indicators?
  2. Write down a set of potential outcomes, with a plan of action for each.
  3. How much are we willing to sacrifice?
  4. What do we know so far about the effects of the survey?

Regarding the first part: activation rate might not be the leading indicator we are looking for. Instead, we should consider measuring the proportion of new accounts that skip the survey, and perhaps also the proportion of new accounts that abandon the site upon encountering the survey. These should give us faster signals than activation rate can.
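
As a purely illustrative sketch of how those two leading indicators could be computed (the column names and data below are hypothetical, not the actual schema or instrumentation):

```python
import pandas as pd

# Hypothetical per-account records; the real instrumentation will differ.
accounts = pd.DataFrame({
    "user_id":    [1, 2, 3, 4, 5],
    "saw_survey": [True, True, True, True, False],
    "skipped":    [True, False, False, True, False],
    "abandoned":  [False, False, True, True, False],
})

# Restrict to accounts that were actually shown the survey.
shown = accounts[accounts["saw_survey"]]

skip_rate = shown["skipped"].mean()
abandonment_rate = shown["abandoned"].mean()

print(f"Skip rate:        {skip_rate:.1%}")
print(f"Abandonment rate: {abandonment_rate:.1%}")
```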

That being said, we should map out the potential outcomes of deploying the survey and make a plan of action for each. This goes together with question 3: we should know what our limit is, if there is one. Say we deploy it and find that 10% of users abandon the process (meaning they leave the site) after encountering the survey. Are we willing to run with that for a month and then decide whether to take any action? If not, we should know that before deploying it; in other words, the plan of action for "strong indications of 10% abandonment" would be "we stop the survey".

Lastly, what do we know about the effects? The user testing that @RHo did suggests that users consider this a low-cost survey. To me, that translates to a low risk of abandonment and a low probability of a user skipping the survey. We can use that information to build a more informed prior for any statistical analysis we might do to inform our decisions (though such an analysis requires an investment in analytics resources).
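
To make the informed-prior idea concrete, here is a minimal Beta-Binomial sketch for the abandonment probability; the prior parameters, counts, and 10% threshold are illustrative assumptions, not estimates taken from the user testing.

```python
from scipy import stats

# Illustrative prior reflecting "low risk of abandonment": Beta(2, 18)
# has mean 0.10 and puts most of its mass below roughly 25%.
prior_alpha, prior_beta = 2, 18

# Hypothetical counts observed after a few days of deployment.
n_users = 200
n_abandoned = 30

# Conjugate update: Beta(alpha + abandoned, beta + not abandoned).
posterior = stats.beta(prior_alpha + n_abandoned,
                       prior_beta + (n_users - n_abandoned))

print(f"Posterior mean abandonment rate: {posterior.mean():.3f}")
print(f"95% credible interval: {posterior.ppf([0.025, 0.975]).round(3)}")

# Probability that abandonment exceeds a hypothetical 10% action threshold.
print(f"P(abandonment > 10%): {1 - posterior.cdf(0.10):.2f}")
```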

Another related topic is what the %-split between treatment (survey) and control should be in deployment. If the goal is to learn whether the survey affects user activity, we might consider a 50/50 split. But if the goal is to learn from new users, we would want to deploy it to a higher proportion, maybe 80/20.

To sum up, the following action items come out of this:

  1. Decide on leading indicators.
  2. Sketch a list of scenarios and what action we will take for each of them.
  3. Decide how long we are willing to run this even though we might see indications of negative impact.
  4. Decide on a %-split of survey/control users.

Hi @nettrom_WMF - I agree that looking at how many people complete vs skip the survey and then abandon the site is a good indicator. Other ideas that might be useful to consider:

  • How many people who skip the survey go directly to trying to edit a page? (It does not matter whether they successfully complete the edit.) This may be a better indicator that people don't want to complete the survey than outright abandonment of the site.
  • How many people click on "Getting started with editing" links (Tutorial and Help Desk) from within the survey (either on the RHS panel or on the post-submission page) regardless of survey completion?
    • What are the activation rates for those users?
  • Splits on activation rates or edit attempts for those who completed the survey, based on answers to Q1 & Q2... Would it be interesting to see whether those who originally created an account just to read, or didn't know Wikipedia was editable, end up trying to edit after exposure to this survey?

We wrote up our experiment plan and put it under our team pages on mw.org.

In summary, the primary goal is to measure how the survey affects editor activation rate, to determine whether it has a negative impact. We will accomplish that by deploying it for a month as an A/B test in which 50% of new users see the survey and the other 50% do not. Assignment to the survey/control groups is done randomly (ref T206371).
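
For illustration only, random assignment with a 50/50 split could look like the sketch below, which hashes a per-user token into a stable bucket; this is an assumed approach, not the actual implementation referenced in T206371.

```python
import hashlib

def assign_bucket(user_token: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user token to 'survey' or 'control'.

    Hashing gives a stable, roughly uniform value in [0, 1], so the same
    user always lands in the same bucket and the split approximates the
    requested treatment share.
    """
    digest = hashlib.sha256(user_token.encode("utf-8")).hexdigest()
    u = int(digest[:8], 16) / 0xFFFFFFFF
    return "survey" if u < treatment_share else "control"

# Sanity check: roughly half of a sample of tokens fall into each group.
sample = [assign_bucket(f"user-{i}") for i in range(10_000)]
print(sample.count("survey") / len(sample))
```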

We also proposed several leading indicators of negative effects in the experiment plan, together with specific courses of action.

Now that Variation C is almost ready, we should modify the written experiment plan to indicate what share of users should receive Variation A, Variation C, and no survey. That can be up to @nettrom_WMF and @SBisson.

We decided to create a separate task for experiments with Variation C: T210868

MMiller_WMF renamed this task from "Personalized first day: experiments" to "Personalized first day: experiments (Variation A)". Nov 30 2018, 6:26 PM
MMiller_WMF updated the task description.

We've completed our initial experiment and found no obvious detrimental effect from the survey. We've also run a second experiment against Variation C, and found that Variation A is preferable. Currently, we are running an experiment on Vietnamese Wikipedia with Variation A and a control group, to learn more about the abandonment rate on that wiki (ref T216668 and T216669).

@nettrom_WMF -- do you think enough time has gone by that we can look at the abandonment rates? Even if we don't yet have statistical significance on the activation rate? I would like us to find out as soon as we can whether the survey seems to be causing the stark abandonment rate that the Var A vs. Var C experiment suggested there might be.

I started this analysis on 2019-03-18, at which point we had 3,624 non-autocreated registrations since switching on the survey/control A/B test. Using the week of data prior to deployment, I had earlier estimated the overall abandonment rate at 17.2%. A power analysis indicated that, if the control group's abandonment rate equalled that estimate, we would be able to detect a significant change if the survey group's abandonment rate fell outside the [13%, 21%] range.
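
A power calculation along these lines can be sketched with statsmodels; the group sizes, baseline rate, and significance level below are assumptions chosen to mirror the figures quoted above, not the exact inputs of the original analysis.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.172          # pre-deployment estimate of the abandonment rate
n_per_group = 3624 // 2   # registrations split roughly 50/50
alpha = 0.05

analysis = NormalIndPower()

# Power to detect a shift from the baseline to each boundary of the range.
for survey_rate in (0.13, 0.21):
    effect = proportion_effectsize(survey_rate, baseline)
    power = analysis.solve_power(effect_size=effect, nobs1=n_per_group,
                                 alpha=alpha, ratio=1.0,
                                 alternative="two-sided")
    print(f"survey abandonment at {survey_rate:.0%}: power = {power:.2f}")
```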

I calculated abandonment overall for each group, as well as split by whether the account was registered on the desktop or mobile site. Overall, the results are:

| Group   | Did not abandon | %     | Did abandon | %     |
| ------- | --------------- | ----- | ----------- | ----- |
| Control | 1,489           | 84.3% | 278         | 15.7% |
| Survey  | 1,178           | 63.4% | 679         | 36.6% |

Overall, the survey group has a significantly larger abandonment rate (i.e. the difference is outside the range indicated by our power analysis). However, this is driven by abandonment of registrations on the desktop site:

| Desktop/mobile | Group   | Did not abandon | %     | Did abandon | %     |
| -------------- | ------- | --------------- | ----- | ----------- | ----- |
| Desktop        | Control | 1,062           | 82.7% | 222         | 17.3% |
| Desktop        | Survey  | 743             | 54.3% | 625         | 45.7% |
| Mobile         | Control | 427             | 88.4% | 56          | 11.6% |
| Mobile         | Survey  | 435             | 89.0% | 54          | 11.0% |

The 0.6pp difference between the control and survey group on mobile is clearly not significant (it is a much smaller sample, and the difference is far below the threshold identified in our power analysis). In other words, the survey appears to have no significant effect on abandonment for users who registered on the mobile site.

The 28.4pp difference between the control and survey group on desktop is statistically significant (X^2=244.4, df=1, p << 0.001). We're discussing how to dig further into this to understand what's going on.
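
For reference, the quoted test statistic can be checked directly against the contingency tables above; scipy's chi2_contingency with its default Yates continuity correction should reproduce roughly the same value.

```python
from scipy.stats import chi2_contingency

# Desktop registrations: rows are control / survey,
# columns are "did not abandon" / "did abandon" (counts from the table above).
desktop = [[1062, 222],
           [743, 625]]
chi2, p, dof, _ = chi2_contingency(desktop)
print(f"desktop: X^2 = {chi2:.1f}, df = {dof}, p = {p:.3g}")

# Same test for mobile registrations, where the difference is not significant.
mobile = [[427, 56],
          [435, 54]]
chi2, p, dof, _ = chi2_contingency(mobile)
print(f"mobile:  X^2 = {chi2:.2f}, df = {dof}, p = {p:.3g}")
```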

Leaving this in progress because the next step is to get these findings on mediawiki.org. This is not urgent.

kzimmerman lowered the priority of this task from Medium to Low. Sep 11 2019, 9:48 PM
kzimmerman subscribed.

Remaining documentation has been moved to a separate task; closing the analysis as resolved.