[Revise Tone] Investigate observed impact on constructive activation
Closed, ResolvedPublic

Description

User story & summary:

As a stakeholder of the WE1 objective and outcomes from the Revise Tone Structured Task experiment, I want to understand whether the early impact on constructive activation is "real" (that is, not due to some error in the instrumentation or experiment setup) so that I can make an informed decision about next actions.

Background & research:

This task is important because we need to validate the observed impact on constructive activation and take action if necessary.

Constructive activation rate appears to be negatively impacted by revise tone:

image (6).png (289×702 px, 36 KB)

https://superset.wikimedia.org/superset/dashboard/p/D2EB53dNOyq/

  • Constructive activation rate: -32.7% (p value: <0.001)
  • Constructive activation rate (mobile web): -26.9% (p value: <0.001)

Control group and treatment group appear to be unbalanced, which could be a sign of enrollment/bucketing issues:

assigned    subject_count
control     4679
treatment   6711
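
A quick way to judge whether this imbalance is larger than chance would allow is a sample ratio mismatch check. The sketch below is illustrative only: it assumes the experiment was meant to split users 50/50 (the intended allocation is not stated in this task), and it uses a chi-square goodness-of-fit test with one degree of freedom on the counts above.

```python
import math


def srm_chi_square(control: int, treatment: int, expected_treatment_share: float = 0.5):
    """Chi-square goodness-of-fit test (1 degree of freedom) for sample ratio mismatch.

    expected_treatment_share is an ASSUMPTION: the intended split is not given in the task.
    """
    total = control + treatment
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = (
        (control - expected_control) ** 2 / expected_control
        + (treatment - expected_treatment) ** 2 / expected_treatment
    )
    # For 1 degree of freedom the chi-square survival function is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Counts taken from the table above.
chi2, p_value = srm_chi_square(control=4679, treatment=6711)
print(f"treatment share: {6711 / (4679 + 6711):.2%}")
print(f"chi2 = {chi2:.1f}, p = {p_value:.3g}")
```

A very small p-value here would only say that the tracked groups are unlikely to come from an even split; it would not by itself say whether bucketing or event logging is responsible.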

Screenshot 2026-02-02 at 9.10.30 AM.png (452×1 px, 101 KB)

Acceptance Criteria:
  • Verification of findings surfaced in Slack thread
  • Growth team & steering committee alignment on what actions to take with the experiment (if any)

Event Timeline

wiki_id   assigned    subject_count
arwiki    control     787
arwiki    treatment   774
enwiki    control     34215
enwiki    treatment   36732
frwiki    control     4875
frwiki    treatment   5060
ptwiki    control     1497
ptwiki    treatment   1601

Thank you for that data! It does rule out any relation to those strange "empty .experiment key" event validation errors that I looked into in the context of T415580, though, because those did not seem to occur on enwiki at all.

Percentages of treatment in the table above:

  • arwiki: 49.58%
  • enwiki: 51.77%
  • frwiki: 50.93%
  • ptwiki: 51.68%
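
For reference, the percentages above can be reproduced from the per-wiki table, together with a per-wiki check of how surprising each split is. This is a rough sketch, again assuming an intended 50/50 allocation, which this task does not state explicitly.

```python
import math

# (control, treatment) counts copied from the per-wiki table above.
counts = {
    "arwiki": (787, 774),
    "enwiki": (34215, 36732),
    "frwiki": (4875, 5060),
    "ptwiki": (1497, 1601),
}


def treatment_share(control: int, treatment: int) -> float:
    """Fraction of tracked subjects that are in the treatment group."""
    return treatment / (control + treatment)


def srm_p_value(control: int, treatment: int) -> float:
    """p-value of a 1-df chi-square test against an ASSUMED 50/50 split."""
    total = control + treatment
    # With equal expected counts the statistic simplifies to (c - t)^2 / (c + t).
    chi2 = (control - treatment) ** 2 / total
    return math.erfc(math.sqrt(chi2 / 2))


shares = {wiki: treatment_share(c, t) for wiki, (c, t) in counts.items()}
p_values = {wiki: srm_p_value(c, t) for wiki, (c, t) in counts.items()}

for wiki in counts:
    print(f"{wiki}: treatment {shares[wiki]:.2%}, SRM p-value {p_values[wiki]:.3g}")
```

Because the test is sensitive to sample size, a split that looks close to 50% on a large wiki can still be a stronger signal than the same split on a small one.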

I have a theory about what might be going on:

Currently, we're recording the Homepage-suggestions-enabled event in PHP when a user visits the homepage with suggested edits already activated:
https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/GrowthExperiments/+/2b53763db294d489320716e5173239dce26999e2/includes/Specials/SpecialHomepage.php#137

However, that activation (optionally picking topics and optionally changing task types) is something they first need to do, typically on their first visit to the homepage. But that then leads to a divergence:

  • if a control group user arrives for their first visit to the homepage and activates suggested edits, but then never returns to the homepage, then Homepage-suggestions-enabled (and other experiment events) will never fire for them
  • if a treatment group user arrives for their first visit to the homepage, then it seems Revise-tone-shown would (often) fire immediately
    • this is probably a bug, because at this stage the suggested edits module has not been activated yet, but the tasks have already been loaded with the default configuration in the background and that triggers the event being sent

So that would explain the likely source of divergence: control group users that never return to the homepage and a treatment-only event that fires too early.

This would also explain why the Revise Tone task has seemingly such bad constructive activation data: The control group users that never come back to the homepage and that we thus do not track would likely also not activate -> this makes the remaining control group users look better than they are.
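
To make the selection effect concrete, here is a small worked example with entirely made-up numbers. Both groups behave identically by construction; the only difference is which users end up being tracked.

```python
def activation_rate(activated: int, tracked: int) -> float:
    """Share of tracked users who constructively activated."""
    return activated / tracked


# HYPOTHETICAL cohort: both groups have the same size and the same true behaviour.
users_per_group = 1000
returning = 200                # users who visit the homepage again after activating
activated_returning = 60       # constructively activated among the returning users
activated_non_returning = 40   # constructively activated among the 800 who never return

# Treatment: the early Revise-tone-shown event means (nearly) everyone is tracked.
treatment_rate = activation_rate(
    activated_returning + activated_non_returning, users_per_group
)

# Control: only returning users fire Homepage-suggestions-enabled, so only they are tracked.
control_rate = activation_rate(activated_returning, returning)

# Negative even though the treatment has no effect at all in this made-up cohort.
relative_difference = (treatment_rate - control_rate) / control_rate

print(f"treatment: {treatment_rate:.1%}, control: {control_rate:.1%}")
print(f"apparent relative change: {relative_difference:+.1%}")
```

The size of the apparent drop depends entirely on the invented numbers; the point is only the direction of the bias.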

This would also explain why in @mpopov's spreadsheet above there are treatment group users that have "seen" a Revise Tone task, but have seemingly never been to the homepage. OTOH, I'm unsure how to explain the treatment group users that have been to the homepage but have never seen a Revise Tone task. Not sure how that might have happened. (Maybe a "Do not track" directive and that is why we do not track events triggered in javascript, but do track the events triggered in PHP? No idea.)

Fixing this properly is the more tricky challenge.

Not wanting to open up the discussion of the measurement plan / instrumentation spec, but I can see a few ways how to move forward from here:

  1. just go with logging exposure when people create an account
    • pro: this is the most comprehensive form of measuring "constructive activation", and comparable to what we measured in previous experiments; easy to do
    • con: this dilutes the effect with all sorts of influences, e.g. accounts created via the API (Apps) that never see the homepage (and thus never see revise tone tasks), etc.
  2. keep the existing Revise-tone-shown logging, but conversely also add a No-revise-tone-shown log in the same place, which should then record also for all the control group users
    • pro: relatively easy to do; same number of events for control and treatment; still relatively close to "real" exposure
    • con: still is logging "too early", before the Suggested Edits module is even activated, and so technically does not perfectly match exposure; another unexpected event
  3. only calculate constructive activation data from users with at least one Homepage-suggestions-enabled event
    • pro: no production code changes needed, only analytics code changes, we can use the data we already collected in this experiment
    • con: much smaller cohort (only maybe 1/6)
  4. actually figure out when the user activates suggested edits in the UI and also send the Homepage-suggestions-enabled event then. Or maybe a one-time Homepage-suggestions-activated event? Also figure out how to not send the Revise-tone-shown event before the Suggested Edits are actually activated during the first visit, but do send it at the moment when they do get activated and there is a Revise Tone task shown
    • pro: would be closest to what was originally envisioned, I think
    • con: complex, many interdependencies, high chance of making yet another mistake, not a quick fix

From my perspective any of those options seems reasonable, except perhaps number 4... which sounds complex and might still end up imperfect.

@mpopov or @MNeisler do either of you have a recommendation for how Growth should proceed?

@MNeisler and I discussed this and arrived at the following proposal:

  1. Remove Revise-tone-shown event entirely
    • We don't see a need for it and it's clearly messing up the analysis
  2. Replace action: page_visit, action_source: Homepage-suggestions-enabled event with a generic action: experiment_exposure event
    • This event is consistent with the upcoming platform improvement T414729: Add Experiment#sendExposure method to SDKs
    • This allows us to modify the constructive edit rate & constructive activation rate queries to only consider users who've been exposed to a variation, without using an event that is super specific to this experiment.
      • Keeps the metrics usable for analysis of other experiments
      • Makes results more accurate for this and other experiments
  3. After the change goes live (pending backports, deployment train delays, etc.), Experiment Platform engineers change the start & end dates in Test Kitchen's db
    • This is to make automated analytics ignore the data collected so far before the fix, yielding trustworthy results
    • Experiment is not interrupted – users who have been able to access Revise Tone continue being able to access Revise Tone
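
As a rough illustration of the analysis this proposal enables, the sketch below computes a constructive activation rate only among users with an experiment_exposure event on or after a cutoff date. The record layout, field names, user IDs, and cutoff date are all hypothetical and do not reflect the real event schema or the real metric queries.

```python
from datetime import date

# HYPOTHETICAL, simplified event records; field names are illustrative only.
events = [
    {"user": "u1", "group": "control", "action": "experiment_exposure", "day": date(2026, 2, 5)},
    {"user": "u2", "group": "control", "action": "experiment_exposure", "day": date(2026, 2, 11)},
    {"user": "u3", "group": "control", "action": "experiment_exposure", "day": date(2026, 2, 12)},
    {"user": "u4", "group": "treatment", "action": "experiment_exposure", "day": date(2026, 2, 11)},
    {"user": "u5", "group": "treatment", "action": "page_visit", "day": date(2026, 2, 11)},
    {"user": "u4", "group": "treatment", "action": "experiment_exposure", "day": date(2026, 2, 12)},
]
activated_users = {"u1", "u2", "u4"}  # HYPOTHETICAL set of constructively activated users


def activation_rate_by_group(events, activated_users, start):
    """Constructive activation rate among users exposed on or after `start`."""
    exposed = {}
    for event in events:
        if event["action"] == "experiment_exposure" and event["day"] >= start:
            # A set de-duplicates users who were exposed more than once.
            exposed.setdefault(event["group"], set()).add(event["user"])
    return {
        group: len(users & activated_users) / len(users)
        for group, users in exposed.items()
    }


rates = activation_rate_by_group(events, activated_users, start=date(2026, 2, 10))
print(rates)
```

Here u1 is ignored because the exposure predates the cutoff, and u5 is ignored because a page visit alone is not counted as exposure.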

@KStoller-WMF @Michael: What do you think?

Sounds good to me as long as Michael agrees that this approach sounds reasonable.
Thank you @mpopov and @MNeisler for taking the time to discuss and propose a fix so quickly!

@MNeisler and I discussed this and arrived at the following proposal:

  1. Remove Revise-tone-shown event entirely
    • We don't see a need for it and it's clearly messing up the analysis

Done.

  2. Replace action: page_visit, action_source: Homepage-suggestions-enabled event with a generic action: experiment_exposure event
    • This event is consistent with the upcoming platform improvement T414729: Add Experiment#sendExposure method to SDKs
    • This allows us to modify the constructive edit rate & constructive activation rate queries to only consider users who've been exposed to a variation, without using an event that is super specific to this experiment.
      • Keeps the metrics usable for analysis of other experiments
      • Makes results more accurate for this and other experiments

There is still some nuance here about what exactly counts as an exposure. But based on Kirsten's general support above, I'll go for the users visiting the homepage when SuggestedEdits are enabled for the wiki. Please let me know if that is not sufficiently precise!

As an aside: is there general guidance about what action_source should be in principle?

  3. After the change goes live (pending backports, deployment train delays, etc.), Experiment Platform engineers change the start & end dates in Test Kitchen's db
    • This is to make automated analytics ignore the data collected so far before the fix, yielding trustworthy results
    • Experiment is not interrupted – users who have been able to access Revise Tone continue being able to access Revise Tone

Thanks!

Change #1237223 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/GrowthExperiments@master] metrics(ReviseTone): send consistent experiment exposure event

https://gerrit.wikimedia.org/r/1237223

There is still some nuance here about what exactly counts as an exposure. But based on Kirsten's general support above, I'll go for the users visiting the homepage when SuggestedEdits are enabled for the wiki. Please let me know if that is not sufficiently precise!

But it's still only sent if the user (in either control or treatment) visits the homepage and the module has been activated, right? Earlier you mentioned

if a control group user arrives for their first visit to the homepage and activates suggested edits, but then never returns to the homepage, then Homepage-suggestions-enabled (and other experiment events) will never fire for them

so I just want to confirm that we can expect this instrumentation behavior in both groups.

After the user activates suggested edits, does that trigger a page refresh? Asking because it's very important that we do have this exposure event when the user is shown the (activated) module, which may have a Revise Tone task waiting for them if they're in treatment and a random task if they're in control.

As an aside: is there general guidance about what action_source should be in principle?

I just wrote some! https://wikitech.wikimedia.org/wiki/Test_Kitchen/Create_an_instrument#Guidelines_and_recommendations

Change #1237223 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] metrics(ReviseTone): send consistent experiment exposure event

https://gerrit.wikimedia.org/r/1237223

About https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/1237223/3/includes/Specials/SpecialHomepage.php#137…

SuggestedEdits::isActivated( $this->getContext()->getUser()

seems more accurate for action: experiment_exposure event than

SuggestedEdits::isEnabledForAnyone( $this->wikiConfig )

We want the exposure event to only fire if the suggested edits module has been activated and contains either the control or the treatment experience. We don't want exposure event on a homepage visit where the module hasn't been activated and so there is no actual exposure.
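
The distinction can be sketched as a small gating function. This is not GrowthExperiments code: the class, field, and function names are hypothetical stand-ins for "suggested edits is enabled on the wiki" and "this user has activated the module".

```python
from dataclasses import dataclass


@dataclass
class HomepageVisit:
    """HYPOTHETICAL summary of a homepage visit; not a real GrowthExperiments type."""
    suggested_edits_enabled_on_wiki: bool
    suggested_edits_activated_by_user: bool


def should_send_exposure(visit: HomepageVisit) -> bool:
    """Fire the exposure event only when the user can actually see the activated module."""
    return (
        visit.suggested_edits_enabled_on_wiki
        and visit.suggested_edits_activated_by_user
    )


# First visit: the module is enabled on the wiki but the user has not activated it yet.
first_visit = HomepageVisit(True, False)
# Later visit: the user has activated suggested edits, so a task card is shown.
later_visit = HomepageVisit(True, True)

print(should_send_exposure(first_visit), should_send_exposure(later_visit))
```

Gating on the wiki-level flag alone would treat both visits the same, which is the case this comment is trying to avoid.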

Again, does the homepage reload/refresh after the module is activated? If it does, it might make sense to move this from server-side event to client-side, no?

Change #1237852 had a related patch set uploaded (by Phuedx; author: Michael Große):

[mediawiki/extensions/GrowthExperiments@wmf/1.46.0-wmf.14] metrics(ReviseTone): send consistent experiment exposure event

https://gerrit.wikimedia.org/r/1237852

Change #1237852 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@wmf/1.46.0-wmf.14] metrics(ReviseTone): send consistent experiment exposure event

https://gerrit.wikimedia.org/r/1237852

Mentioned in SAL (#wikimedia-operations) [2026-02-09T09:21:03Z] <phuedx@deploy2002> Started scap sync-world: Backport for [[gerrit:1237851|metrics(ReviseTone): Use Experiment::send to send metrics (T416612)]], [[gerrit:1237852|metrics(ReviseTone): send consistent experiment exposure event (T416199)]]

Mentioned in SAL (#wikimedia-operations) [2026-02-09T09:25:05Z] <phuedx@deploy2002> phuedx: Backport for [[gerrit:1237851|metrics(ReviseTone): Use Experiment::send to send metrics (T416612)]], [[gerrit:1237852|metrics(ReviseTone): send consistent experiment exposure event (T416199)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-02-09T09:48:37Z] <phuedx@deploy2002> Finished scap sync-world: Backport for [[gerrit:1237851|metrics(ReviseTone): Use Experiment::send to send metrics (T416612)]], [[gerrit:1237852|metrics(ReviseTone): send consistent experiment exposure event (T416199)]] (duration: 27m 34s)

@MNeisler: The new instrumentation is live. I was able to test some of the new queries from the MR, but will have to wait until later today when there's more data collected & processed.

Now we just need someone from Experiment Platform to update the start date for the experiment to 2026-02-10, which should be done tomorrow (2026-02-10) after 14:30 UTC to ensure continuity of user experience as we soft-reset the experiment.

Etonkovidova subscribed.

Checking the user workflow for triggering events, two points do not look quite clear:

(1) on mobile a user sees only one task type on the initial Homepage screen. The task type is clearly indicated but it's recorded as
"taskTypes=;unavailableTaskTypes=;taskCount=2477;topics="

(2) a user just sees the Suggested edits topic selection, but the event is already triggered: edit "copyedit" on the article "Yugoslavia". If users leave the Homepage at this stage and never activate the Suggested Edits module, they are probably still counted as users who saw SE cards.

Screenshot 2026-02-23 at 4.58.00 PM.png (747×1 px, 284 KB)