
Scale: power analysis for wiki expansion
Closed, Resolved, Public

Description

As we scale up to more wikis and larger wikis, we'll have more data coming in from more newcomers. That may mean that we won't have to keep data around for as long in order to achieve statistical significance for our results.

In this task, we want to calculate our experimental power as we add more wikis on top of the ones we have, to get a sense of how long activation and retention experiments will need to run. It would also be good to know how long simpler experiments would need to run, such as T238888: Variant tests: "initiation" test (A vs. B), which looks at a much more frequent activity than retention: clicking on a module on the homepage.

I think we should compare the power of these four groups of wikis:

  • Original target wikis: Czech, Korean, Arabic, Vietnamese
  • Current set: Czech, Korean, Arabic, Vietnamese, Ukrainian, Hungarian, Armenian, Basque
  • Adding just French: Czech, Korean, Arabic, Vietnamese, Ukrainian, Hungarian, Armenian, Basque, French
  • Adding our next set: Czech, Korean, Arabic, Vietnamese, Ukrainian, Hungarian, Armenian, Basque, French, Polish, Persian, Swedish, Danish, Indonesian, Italian, Portuguese.

Details

Due Date
Apr 21 2020, 7:00 AM

Event Timeline

We want to have a sense of these numbers by next week (Tues, Apr 21).

MMiller_WMF updated the task description.

This ended up taking a lot longer than expected, partly due to the current pandemic, and partly because the simulation code needed tweaking to run efficiently on stat1008. Since the latter was fixed it has been smooth sailing, but it still takes about a day to complete 3,000 simulations for 16 wikis (1,000 each for the 2%, 5%, and 10% effect sizes).

The simulations used data from 2019 as the basis for setting simulation parameters. As we've done before, we ran simulations for interventions affecting user activation (editing within 24 hours after registration) and user retention (editing on days 1–15, holding activation constant). As mentioned above, we simulate three effect sizes: +10%, +5%, and +2%. These are relative effect sizes, so for example +10% means that activation increases from 30% to 33%, and so on.
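To make the approach concrete, here is a minimal sketch of a simulation-based power estimate for activation. It is not the code from the repository: it assumes a simple two-group experiment compared with a pooled two-proportion z-test, and the function names, the sample sizes, and the 5% significance level are all illustrative assumptions. Only the 30% baseline and the relative +10% lift come from the example above.

```python
import math
import random


def relative_lift(baseline, lift):
    """Apply a relative effect size, e.g. 0.30 with +10% becomes 0.33."""
    return baseline * (1.0 + lift)


def two_proportion_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        return 1.0  # no variation at all, so nothing to detect
    z = (successes_b / n_b - successes_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2.0))


def simulated_power(baseline, lift, n_per_group, n_sims=1000, alpha=0.05, seed=0):
    """Share of simulated experiments in which the lift is detected."""
    rng = random.Random(seed)
    treated = relative_lift(baseline, lift)
    detected = 0
    for _ in range(n_sims):
        # Draw the number of activated users in each group.
        control = sum(rng.random() < baseline for _ in range(n_per_group))
        treatment = sum(rng.random() < treated for _ in range(n_per_group))
        if two_proportion_p_value(control, n_per_group, treatment, n_per_group) < alpha:
            detected += 1
    return detected / n_sims


if __name__ == "__main__":
    # Hypothetical group sizes; more users per group means more power.
    for n in (500, 5000):
        print(n, simulated_power(0.30, 0.10, n, n_sims=200))
```

Power here is simply the fraction of simulated experiments that reach significance, which is why more registrations per group (more wikis, or a longer run) push it upward.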

The code, datasets, simulation parameters, and graphs can all be found in this GitHub repository.

Activation results:

For activation, we simulate experiments running from 7 (one week) to 84 days (twelve weeks) in weekly increments.
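As a rough illustration of how experiment length translates into sample size, the sketch below builds the weekly duration grid and computes users per group. The daily registration figures are hypothetical placeholders, not numbers from the 2019 data, and the even two-way split is an assumption.

```python
def experiment_durations(first_day=7, last_day=84, step=7):
    """Durations in days: 7, 14, ..., 84 (one to twelve weeks)."""
    return list(range(first_day, last_day + 1, step))


def users_per_group(daily_registrations, days, n_groups=2):
    """Registrations accumulated over the run, split evenly across groups."""
    return (daily_registrations * days) // n_groups


# Hypothetical daily registration totals for two sets of wikis.
hypothetical_daily_registrations = {"smaller set": 400, "larger set": 1200}

for label, per_day in hypothetical_daily_registrations.items():
    sizes = [users_per_group(per_day, days) for days in experiment_durations()]
    print(label, "one week:", sizes[0], "twelve weeks:", sizes[-1])
```

A set of wikis with three times the daily registrations reaches a given group size in a third of the time, which is the mechanism behind the shorter durations for the larger sets.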

10% effect on activation:

Figure: activation_power_analysis_10perc.png

5% effect on activation:

Figure: activation_power_analysis_5perc.png

2% effect on activation:

Figure: activation_power_analysis_2perc.png

Summary: The results for the 10% effect size show that all scenarios give us a lot of statistical power. With our current set of target wikis, we're at >= 90% statistical power (the typical target used in analysis) after 2 weeks. The results for the 5% effect size are perhaps more interesting: with our target wikis it takes 5–6 weeks to reach 90% power. Adding French Wikipedia, we get there in 2 weeks, and comfortably in 4. Lastly, for the 2% effect size, we can now reach 90% power in 12 weeks by adding French Wikipedia, or in 6–8 weeks with all 16 wikis.

Retention results:

For retention, we simulate experiments running from 1 to 6 months, in half-month increments. Note that in these simulations we hold activation constant, meaning that we assume that the intervention only has a positive effect on retention. If it also has a positive effect on activation, then our statistical power will be higher than the simulation suggests.
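The last point can be illustrated with expected counts. This sketch assumes retention is modelled as a probability conditional on activation, which is one reading of "holding activation constant" and not confirmed by the simulation code. The group size and the 10% retention rate are hypothetical; the 30% activation rate is borrowed from the earlier effect-size example.

```python
def expected_retained(n_registered, p_activation, p_retention):
    """Expected retained users: registrations x P(activate) x P(retain | activated)."""
    return n_registered * p_activation * p_retention


n = 10_000     # hypothetical registrations per group
p_act = 0.30   # activation rate from the 30% example
p_ret = 0.10   # hypothetical retention rate among activated users
lift = 0.05    # relative effect size (+5%)

control = expected_retained(n, p_act, p_ret)
retention_only = expected_retained(n, p_act, p_ret * (1 + lift))
both = expected_retained(n, p_act * (1 + lift), p_ret * (1 + lift))

# Expected gap between treatment and control in each case.
gap_retention_only = retention_only - control
gap_both = both - control
print(gap_retention_only, gap_both)
```

When the intervention also lifts activation, the expected gap between the groups roughly doubles in this example, and a larger gap at the same sample size is easier to detect.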

10% effect on retention:

Figure: retention_power_analysis_10perc.png

5% effect on retention:

Figure: retention_power_analysis_5perc.png

2% effect on retention:

Figure: retention_power_analysis_2perc.png

Summary: For the 10% effect size, we reach 90% statistical power in 2 months with our original target wikis. The current set of 8 wikis gets us there a couple of weeks earlier, and adding French gets us there in a month. The 5% effect size results are again perhaps more interesting. With our original target wikis, we can reach 90% power for that effect size in 6 months. Using all our current 8 wikis possibly gets us there a month sooner. Adding a large wiki like French Wikipedia lets us do it in 3 months, and additional wikis help further: all 16 likely reach that mark in 1.5 months (and definitely in 2). Detecting a 2% increase in retention continues to be impossible in any of the scenarios, though.

TL;DR:

Adding French Wikipedia has a large, positive influence on our statistical power, for example by enabling us to detect a 5% increase in retention in 3 months. If we add all 16 proposed wikis, we're able to detect a 2% increase in activation in the same timeframe. A 2% change in retention continues to be impossible to detect.

@MMiller_WMF : Have a look at the graphs and the summaries, and let me know what questions these leave you with.

@nettrom_WMF -- thank you. This has helped us refine our data retention plans.