[WE 1.2] Measure collective impact on constructive activation
Open, MediumPublic

Description

This task involves evaluating the collective impact the WE 1.2 hypotheses have had on constructive activation.

WE 1.2

Constructive Activation: Widespread deployment of interventions shown to collectively cause a 10% relative increase (y-o-y) on mobile web and a 25% relative increase (y-o-y) on iOS of newcomers who publish ≥1 constructive edit in the main namespace on a mobile device, as measured by controlled experiments. | source
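
As a concrete reading of this target, here is a minimal sketch of the arithmetic; the activation rates below are hypothetical, not actual baselines:

```
# "Relative increase (y-o-y)" compares this year's constructive activation
# rate against last year's, as a fraction of the baseline (not in
# percentage points). All numbers here are hypothetical.

def relative_increase(current: float, baseline: float) -> float:
    """Year-over-year relative change, e.g. 0.10 == +10%."""
    return (current - baseline) / baseline

# Example: if 4.0% of mobile-web newcomers published >=1 constructive edit
# last year and 4.4% do this year, that is a 10% relative increase.
assert abs(relative_increase(0.044, 0.040) - 0.10) < 1e-9
```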

Requirements

@MNeisler to draft

Open question(s)

  1. Which "Approach" will we move forward with for evaluating the success of WE 1.2, and why?

Approach #3:

On a per-platform basis, we will calculate the proportion of interventions we deployed and evaluated through controlled experiments that met or exceeded the constructive activation targets we set at the outset of this year: ≥10% (mobile web) and ≥25% (iOS).

To incentivize teams to be bold while still supporting them if/when an intervention doesn't deliver the impact we intend, we'll consider ourselves to have been effective if >70% of the year's interventions meet or exceed the constructive activation improvement targets defined above.
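
A minimal sketch of how this success criterion could be computed; the per-intervention lifts below are hypothetical placeholders, not experiment results:

```
MOBILE_WEB_TARGET = 0.10  # >=10% relative increase in constructive activation
IOS_TARGET = 0.25         # >=25% relative increase

def proportion_meeting_target(lifts: list[float], target: float) -> float:
    """Share of interventions whose measured lift met or exceeded the target."""
    return sum(lift >= target for lift in lifts) / len(lifts)

# Hypothetical per-platform lifts, one per controlled experiment.
mobile_web_lifts = [0.12, 0.08, 0.15]
ios_lifts = [0.30, 0.22]

for platform, lifts, target in [
    ("mobile web", mobile_web_lifts, MOBILE_WEB_TARGET),
    ("iOS", ios_lifts, IOS_TARGET),
]:
    share = proportion_meeting_target(lifts, target)
    print(f"{platform}: {share:.0%} of interventions met the target; "
          f"effective: {share > 0.70}")
```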

Approaches

To evaluate the aggregate impact of the discrete interventions we've deployed throughout the 2024-2025 fiscal year, we're considering the following approaches (a short sketch comparing Approaches #1 and #2 follows this list):

  • Approach #1: Sum impact of interventions measured through controlled experiments
    • In this approach, we would sum the impacts each intervention was proven [via a controlled experiment] to have on constructive activation, on a per platform (iOS and mobile web) basis
    • E.g. "Success" would mean the following: (SUM(Intervention #1 impact, Intervention #2 impact, Intervention #3 impact)) ≥ 10% or 25% impact on constructive activation (on mobile web and iOS respectively).
  • Approach #2: Average impact of interventions measured through controlled experiments
    • In this approach, we would average the impacts each intervention was proven [via a controlled experiment] to have on constructive activation, on a per platform (iOS and mobile web) basis
    • E.g. "Success" would mean the following: (AVERAGE(Intervention #1 impact, Intervention #2 impact, Intervention #3 impact)) ≥ 10% or 25% impact on constructive activation (on mobile web and iOS respectively).
  • Approach #3: Count of interventions shown to impact constructive activation, measured through controlled experiments
    • In this approach, we would count the interventions proven [via a controlled experiment] to move constructive activation by at least the targets we set for each platform (mobile web: ≥10%; iOS: ≥25%)
    • E.g. "Success" would mean the following: 100% of the interventions deployed were shown to cause a ≥10% or ≥25% impact on constructive activation on mobile web and iOS respectively
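
For contrast with the Approach #3 sketch above, here is how Approaches #1 and #2 would aggregate the same hypothetical per-intervention lifts (illustrative numbers only):

```
from statistics import mean

# Hypothetical mobile-web lifts from three controlled experiments;
# the mobile-web target is a >=10% relative lift.
lifts = [0.12, 0.08, 0.15]
target = 0.10

# Approach #1: sum the measured impacts and compare to the target.
print("Approach #1 (sum):", sum(lifts) >= target)       # 0.35 >= 0.10 -> True

# Approach #2: average the measured impacts and compare to the target.
print("Approach #2 (average):", mean(lifts) >= target)  # ~0.117 >= 0.10 -> True
```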

Background

In T375926#10300625 (November 2024), we converged[i][ii] on measuring the impact of WE 1.2 (2024-2025) by completing "...a year-over-year comparison to measure the collective impact of these widely deployed interventions."

Revisiting this decision now, in April 2025, we're questioning whether this year-over-year measurement of the aggregate constructive activation rate (per platform) is viable. This question is prompted, in large part, by:

  1. Recognizing WE 1.2 interventions are not yet fully scaled to all newcomers at all wikis
  2. Appreciating that evaluating constructive activation with an impact analysis leaves the metric exposed to external factors (outside of our control) that could shift it.
  3. Accepting that running a controlled experiment to evaluate the collective impact of discrete interventions extends beyond the resources/capacity available to us

i. See discussion in Slack
ii. Also see Decision Log/WE 1.2 FY 24-25

Event Timeline

mpopov moved this task from Triage to Upcoming Quarter on the Product-Analytics board.

To be done at the end of Q4

MNeisler triaged this task as Medium priority. Apr 11 2025, 3:48 PM
MNeisler moved this task from Upcoming Quarter to Current Quarter on the Product-Analytics board.

Per today's offline discussion, I'm assigning this task over to @MNeisler to review the Approaches outlined in the task description and propose additional approach(es) should viable ones exist.

Hi @ppelberg - @MNeisler and I discussed the possible approaches today; here are notes from our discussion:

  • We should gather all the experiments and results so far before we decide on an approach.
  • Keep Apps completely separate from Mobile Web if we take any approach other than #3, since the sample sizes are very small.
  • If we use Approach #2 and average Mobile Web results, we would likely need to weight the averages; the weighting also depends on sample sizes and will need to be determined (see the sketch after this list).
  • Since we have determined that comparing results without a control limits our options, Approach #3 seems most expeditious if we want to complete this comparison quickly and focus on how we do it next time.
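
If Approach #2 were pursued, a sample-size-weighted average might look like the following (a sketch assuming each experiment reports a lift and a sample size; all numbers are hypothetical):

```
# Hypothetical (lift, sample_size) pairs, one per mobile-web experiment.
results = [(0.12, 8000), (0.08, 1500), (0.15, 3000)]

def weighted_mean(pairs: list[tuple[float, int]]) -> float:
    """Average of lifts, weighted by each experiment's sample size."""
    total_n = sum(n for _, n in pairs)
    return sum(lift * n for lift, n in pairs) / total_n

# Larger experiments pull the average toward their result:
print(f"Weighted average lift: {weighted_mean(results):.3f}")  # 0.122
```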

In our discussion we covered what experiments/results we have to work with; we should make sure this list is complete:
Growth has 2 (see next comment below)
Editing has 1 in progress
Apps has 1 (iOS Alt Text), with another ([[ https://phabricator.wikimedia.org/T391997 | Activity Tab ]]) releasing this week.

Growth has two WE 1.2 related experiments:

Initial constructive activation data has been shared, and a final published report (that includes retention data) is in progress:

Experiment is running, and @Iflorez will start analysis soon:

Decided

Per offline discussion with contributors to/members of WE 1.2...

To evaluate the aggregate impact of the discrete interventions we've deployed throughout the 2024-2025 fiscal year, we're going to do the following:

On a per-platform basis, we will calculate the proportion of interventions we deployed and evaluated through controlled experiments that met or exceeded the constructive activation targets we set at the outset of this year: ≥10% (mobile web) and ≥25% (iOS).

To incentivize teams to be bold while still supporting them if/when an intervention doesn't deliver the impact we intend, we'll consider ourselves to have been effective if >70% of the year's interventions meet or exceed the constructive activation improvement targets defined above.

Thinking
In moving forward with this approach, we are acknowledging the following:

  1. Ideally, we'd measure the collective impact of the discrete interventions we deploy through a year-end controlled experiment, as @MNeisler described in T375926#10300625
  2. At present, running a controlled experiment of the sort "1." describes is not feasible for reasons that include:
    1. Interventions are deployed at different wikis and at different times
    2. The time and effort required to set up an experiment of this sort means results would arrive after the window in which they would be most useful (to inform annual planning)

Moving forward
To move toward a future where we can evaluate our collective impact with more confidence, next year we will explore the following:

  1. Deploying interventions to the same wikis
  2. Using "holdout groups" so that we can maintain a global control group to compare against a global test group at year's end (see T392959 and the sketch below).
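
A minimal sketch of how a stable holdout assignment could work, assuming a persistent per-user identifier is available; the hashing scheme, salt, function name, and 5% holdout share are illustrative assumptions, not a decided design:

```
import hashlib

HOLDOUT_SHARE = 0.05  # Illustrative: reserve 5% of newcomers as a global control.

def in_holdout(user_id: str, salt: str = "we12-holdout") -> bool:
    """Deterministically assign a user to the year-long holdout group.

    Hashing (rather than per-experiment random draws) keeps assignment
    stable across interventions and wikis, so the same users remain in
    the global control group all year.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 10_000 < HOLDOUT_SHARE * 10_000

# Users in the holdout receive no WE 1.2 interventions; at year's end their
# constructive activation rate is compared against the test population's.
```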