
MinT for Readers: A/B test analytics support
Open, Medium, Public, 2 Estimated Story Points

Description

Task to track the analytics-side steps for the MinT for Readers experiment

Steps involved:

  • MinT for Readers schema should be updated with the experiment fragment (T396900)
  • MinT for Readers A/A test (already started)
  • Register the experiment and the metrics to experiment-analytics-configs
    • experiments registry
    • metrics catalog (query template for automated analytics)
  • Launch the experiment

Event Timeline

Power analysis to determine sample size requirements

You might need to go without this because you don't have any baseline measurement of the primary metric (reader retention), so a power analysis here would be even more guesswork than usual.

NOTE: You also don't have accurate numbers on visitors to the target wikis, so if you do your power analysis and determine that you need X many subjects, it's hard to translate that X to a duration because there's no rate.

Some options:

  • Simulations. Evaluate a range of plausible baselines with, say, a 5% relative change and see what they generally tell you about the scale of the sample size you need.
  • To get a visitor rate (to translate total X subjects into an experiment duration), you could count total subject counts from T397138: Run a second synthetic A/A test (0.1% of English Wikipedia) for each day and compare those counts with daily unique devices for that period, to see how close the experiment counts are to 0.1% of daily unique devices. Obviously they won't be exact, but if they're somewhat similar then unique devices could be an okay proxy metric.
  • (Recommended) You could try a small-scale A/A test on the same target wikis just to obtain all the necessary measurements for a power analysis.
  • You could conduct the experiment with some reasonable parameters – e.g. 2 weeks, max allowed traffic allocation (0.1% on English Wikipedia, 10% everywhere else) and then perform a post-hoc power analysis:
library(tidyverse)
library(pwr)

# assumptions: a 10% lift from 13% retention rate
pre_analysis <- pwr.2p.test(
    h = ES.h(p1 = 0.143, p2 = 0.13),
    sig.level = 0.05, power = 0.9, alternative = "greater"
)
2 * pre_analysis$n # 23.9K needed total (assumes equal group sizes, ~11.9K/group)

sample_sizes <- list(
    very_underpowered = 12e3,
    underpowered = 20e3,
    powered = 24e3,
    overpowered = 30e3
)

set.seed(42)
experiment_data <- sample_sizes |>
    map(function(sample_size) {
        n_group <- sample_size / 2
        outcomes_control <- rbinom(n = 1, size = n_group, prob = 0.13)
        outcomes_treatment <- rbinom(n = 1, size = n_group, prob = 0.143)
        tibble(
            n_total = sample_size,
            p_control = outcomes_control / n_group,
            p_treatment = outcomes_treatment / n_group
        )
    }) |>
    bind_rows(.id = "scenario")

estimate_power <- function(n, p_c) {
    # could we detect a 10% lift in observed baseline given the sample size we obtained?
    pwr.2p.test(
        h = ES.h(1.1 * p_c, p_c), 
        n = n / 2,
        sig.level = 0.05, alternative = "greater"
    )$power
}

experiment_data |>
    mutate(estimated_power = estimate_power(n_total, p_control))
scenario            n_total  p_control  p_treatment  estimated_power
very_underpowered     12000      0.126        0.146            0.653
underpowered          20000      0.123        0.144            0.830
powered               24000      0.129        0.149            0.900
overpowered           30000      0.128        0.152            0.946

This after-the-fact power estimation does not depend on the treatment effect you actually observed. You can change outcomes_treatment to rbinom(n = 1, size = n_group, prob = 1.01 * 0.13) (a 1% lift) and, when you do your NHST, the results would not be significant across the board. That's because the power analysis is for detecting a 10% lift from a 13% baseline. To detect a 1% lift from a 13% baseline with 90% power you would actually need a total sample size of 2.2M subjects.
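The scale of these numbers can be cross-checked without R. Below is a pure-Python sketch of the same arcsine-based calculation that pwr::pwr.2p.test performs (normal approximation, stdlib only), which can also be used to scan a range of plausible baselines at a 5% relative lift as in the simulations option above. The baseline values scanned are illustrative assumptions, not measurements:

```python
import math
from statistics import NormalDist

nd = NormalDist()

def es_h(p1, p2):
    """Cohen's h effect size for two proportions (arcsine transform),
    mirroring R's pwr::ES.h."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

def n_total_required(p_control, rel_lift, alpha=0.05, power=0.9):
    """Total sample size (both groups combined) for a one-sided
    two-proportion test with equal group sizes, mirroring pwr.2p.test."""
    h = es_h(p_control * (1 + rel_lift), p_control)
    z_alpha, z_beta = nd.inv_cdf(1 - alpha), nd.inv_cdf(power)
    n_per_group = 2 * ((z_alpha + z_beta) / h) ** 2
    return 2 * math.ceil(n_per_group)

# 10% lift from a 13% baseline: ~23.9K total, matching the R result above
n_total_required(0.13, 0.10)

# scan plausible baselines at a 5% relative lift (illustrative values only)
scan = {p: n_total_required(p, 0.05) for p in (0.05, 0.10, 0.13, 0.20)}
```

Note that lower baselines require larger samples for the same relative lift, which is the kind of scale information the simulation option is after.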

@KCVelaga_WMF: Given potential delays with the MinT for Readers experiment, there is a great opportunity to do a small scale A/A test on the same wikis to gather:

  • Baseline retention rate to inform power analysis and yield an estimate of desired sample size
  • Rate of user traffic to inform duration based on desired sample size
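Once both measurements exist, translating the required sample size into a run time is simple arithmetic; a sketch with placeholder numbers (the rate and sample size here are made up, not measurements):

```python
import math

def enrollment_days(n_total_required, subjects_per_day):
    """Days of enrollment needed to reach the required total sample size
    at the observed daily traffic rate."""
    return math.ceil(n_total_required / subjects_per_day)

# e.g. ~24K subjects at ~2K enrolled subjects/day -> 12 days of enrollment,
# plus whatever retention/return windows extend past the last cohort
enrollment_days(24_000, 2_000)  # 12
```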

@mpopov thank you so much for taking the time to give input on this and for the detailed comparison of the various options! Whenever I had started this, I would have hit these issues and most likely spent a lot of time figuring them out. You saved me a ton of time :) I was initially thinking of starting with unique devices, making some assumptions, and going from there. As you said, that is a lot of guesswork.

As you said, given we have some time now, I also think it is a really good opportunity to conduct a small-scale A/A test. Another benefit I see is that the LPL engineers and I get familiar with the Experimentation Platform and Edge Uniques before the actual experiment.

For an A/A test, jotting down some broad steps/considerations:

  • Implement the page-visit instrument (just a simple page_visited event): T397600
  • The test targets the same population as the actual experiment, i.e. logged-out mobile web readers on the 13 pilot wikis.
    • Randomization will be based on the Edge Uniques identifier, into two groups.
  • Calculate the baseline retention rate for both groups (the expectation is that we shouldn't see any difference).
    • We can use automated analytics for this one.
  • I am thinking this should be a separate stream, not part of the main one. That way the two are clearly separated, and this stream can be removed after the A/A test.
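Randomizing on an identifier usually means deterministic hashing, so the same device always lands in the same group. A generic sketch of that idea (not the Experimentation Platform's actual assignment logic, which the platform handles for you; the identifier and experiment name are placeholders):

```python
import hashlib

def assign_group(subject_id: str, experiment: str, groups=("A1", "A2")) -> str:
    """Deterministically bucket an identifier into a group by hashing it
    together with the experiment name (as a salt), so assignments are stable
    within an experiment but independent across experiments."""
    digest = hashlib.sha256(f"{experiment}:{subject_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(groups)
    return groups[bucket]

assign_group("edge-unique-token-123", "mint-aa-test")
```

Hashing with a per-experiment salt avoids the correlated-assignment problem you'd get from reusing raw identifier bits across experiments.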

A couple of additional questions:

  • If we want 1 week as the return window for calculating retention, then the A/A test would have to run for 2 weeks.
    • We could reduce it further, but I am thinking 1 week to cover daily seasonality.
  • With the automated analytics MVP, I am thinking it would be helpful to calculate the retention rate for a different return window as well. Not a lot of additional work, but it would be good to check 3-day retention too.
    • At the moment we also don't have much insight into what a good return window is to start with, and this will help with that before the actual experiment starts.
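The "we shouldn't see any difference" expectation can be made concrete with a two-proportion z-test on the two groups' retention rates; a sketch with made-up counts (the real numbers would come from the automated analytics):

```python
import math
from statistics import NormalDist

def two_prop_z(successes_a, n_a, successes_b, n_b):
    """Two-sided two-proportion z-test with a pooled variance estimate;
    returns (z, p_value)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# illustrative A/A counts: in a healthy A/A test, retention should not
# differ significantly between the two groups
z, p = two_prop_z(1290, 10_000, 1_310, 10_000)
```

A significant difference here would point at an assignment or instrumentation problem rather than a real effect.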

Oooh, yeah you will need to do this for a variety of reasons. Right now the stream is configured with these contextual attributes:

'provide_values' => [
	'mediawiki_database',
	'mediawiki_site_content_language',
	'mediawiki_site_content_language_variant',
	'page_content_language',
	'agent_client_platform',
	'agent_client_platform_family',
	'performer_session_id',
	'performer_active_browsing_session_token',
	'performer_name',
	'performer_is_bot',
	'performer_is_logged_in',
	'performer_edit_count_bucket',
	'performer_groups',
	'performer_registration_dt',
	'performer_is_temp',
	'performer_language',
	'performer_language_variant',
	'performer_pageview_id',
],

And performer_name combined with an edge-unique-based subject ID and user agent info… that's not a great combination. Plus the stream is not enabled for edge uniques.

So yes, you will need to configure a separate stream mediawiki.product_metrics.translation_mint_for_readers.experiments with a different set of contextual attributes:

'provide_values' => [
	'mediawiki_database',
+	'mediawiki_skin',
	'mediawiki_site_content_language',
	'mediawiki_site_content_language_variant',
	'page_content_language',
	'agent_client_platform',
	'agent_client_platform_family',
	'performer_session_id',
	'performer_active_browsing_session_token',
-	'performer_name',
-	'performer_is_bot',
	'performer_is_logged_in',
-	'performer_edit_count_bucket',
-	'performer_groups',
-	'performer_registration_dt',
	'performer_is_temp',
	'performer_language',
	'performer_language_variant',
	'performer_pageview_id',
],

And enable edge uniques and opt out of UA collection (like we do with the base stream):

'eventgate' => [
	'enrich_fields_from_http_headers' => [
		// Don't collect the user agent
		'http.request_headers.user-agent' => false,
	],
	'use_edge_uniques' => true,
],

A couple of additional questions:

  • If we want 1 week as the return window for calculating retention, then the A/A test would have to run for 2 weeks.
    • We could reduce it further, but I am thinking 1 week to cover daily seasonality.

Hm… How long is the cohort window? If it's 1 week, the retention length is 1 week, and the return window is 1 week, then it would have to be 3 weeks. (I think?)

  • With the automated analytics MVP, I am thinking it would be helpful to calculate the retention rate for a different return window as well. Not a lot of additional work, but it would be good to check 3-day retention too.
    • At the moment we also don't have much insight into what a good return window is to start with, and this will help with that before the actual experiment starts.

Sure, good idea! I think @jwang was planning on something similar for her FY25/26 WE 3.1.5 hypothesis. Feel free to define multiple retention rates and use all in both the A/A and the A/B analysis. Just be sure to clearly differentiate them by name, for example:

  • Visitor retention (1-week)
  • Visitor retention (3-day)
  • Visitor retention (1-day)
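Computing the differently-named windows from the same event data is one pass; a sketch over (subject, timestamp) pairs with made-up visits (in practice this would be the automated-analytics query template, not ad-hoc Python):

```python
from datetime import datetime, timedelta

def retention_rates(visits, windows_days=(1, 3, 7)):
    """visits: iterable of (subject_id, datetime). A subject counts as
    retained for window w if any later visit falls within w days of
    their first visit."""
    first, later = {}, {}
    for sid, ts in sorted(visits, key=lambda v: v[1]):
        if sid not in first:
            first[sid] = ts
        else:
            later.setdefault(sid, []).append(ts)
    rates = {}
    for w in windows_days:
        win = timedelta(days=w)
        retained = sum(
            any(t - first[sid] <= win for t in later.get(sid, []))
            for sid in first
        )
        rates[f"{w}-day"] = retained / len(first)
    return rates

visits = [
    ("a", datetime(2025, 8, 1)), ("a", datetime(2025, 8, 2)),  # returns next day
    ("b", datetime(2025, 8, 1)), ("b", datetime(2025, 8, 6)),  # returns on day 5
    ("c", datetime(2025, 8, 1)),                               # never returns
]
rates = retention_rates(visits)  # 1-day: 1/3, 3-day: 1/3, 7-day: 2/3
```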

For data stewardship details maybe something like:

business_data_steward: >
  Primary - Jennifer Wang (Product Analytics);
  Secondary - Krishna Chaitanya Velaga (Product Analytics)
technical_data_steward: >
  Abijeet Patro (Language and Product Localization) - MinT for Readers instrumentation

And, importantly, the main stream is also allowlisted for event sanitization https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/+/refs/heads/master/static_data/sanitization/event_sanitized_analytics_allowlist.yaml

So yes, one more reason to split into a separate stream that's used exclusively for experimentation.

Hm… How long is the cohort window? If it's 1 week, the retention length is 1 week, and the return window is 1 week, then it would have to be 3 weeks. (I think?)

Oh yeah, there's the cohort window! It is 3 days, so that will be 7 + 7 + 3 = 17 days in total.
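The same arithmetic as a tiny helper, just restating the calculation above:

```python
def aa_test_length_days(cohort_window, retention_length, return_window):
    """Total run length: a subject enrolled at the end of the cohort
    window still needs the full retention length plus return window."""
    return cohort_window + retention_length + return_window

aa_test_length_days(cohort_window=3, retention_length=7, return_window=7)  # 17
```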

Change #1179120 had a related patch set uploaded (by Huei Tan; author: Huei Tan):

[operations/mediawiki-config@master] Add Metrics Platform stream configuration and registration for MinT for Wikipedia Readers Page visit instrumentation for experiment by Language and Product Localization team.

https://gerrit.wikimedia.org/r/1179120

Change #1179120 merged by jenkins-bot:

[operations/mediawiki-config@master] MinT: Add stream configuration and registration

https://gerrit.wikimedia.org/r/1179120

Mentioned in SAL (#wikimedia-operations) [2025-08-20T07:06:31Z] <kartik@deploy1003> Started scap sync-world: Backport for [[gerrit:1179120|MinT: Add stream configuration and registration (T397600 T397043)]]

Mentioned in SAL (#wikimedia-operations) [2025-08-20T07:08:39Z] <kartik@deploy1003> kartik, hueitan: Backport for [[gerrit:1179120|MinT: Add stream configuration and registration (T397600 T397043)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-08-20T07:17:52Z] <kartik@deploy1003> Finished scap sync-world: Backport for [[gerrit:1179120|MinT: Add stream configuration and registration (T397600 T397043)]] (duration: 11m 21s)

KCVelaga_WMF renamed this task from MinT for Readers: pre-experiment analytics setup to MinT for Readers: A/B test analytics support. Oct 8 2025, 1:01 PM
KCVelaga_WMF updated the task description.
mpopov triaged this task as Medium priority.
mpopov set the point value for this task to 2.
mpopov moved this task from Doing to Blocked on the Product-Analytics (Kanban) board.

Blocked until movement comms gives OK to proceed with the experiment