
Surfacing "Add a link" Structured Tasks: Set up A/B Test
Closed, ResolvedPublic1 Estimated Story Points

Description

User story & summary:

As the Growth team Product Manager, I want to A/B test surfacing the "Add a link" task on pilot wikis, so that we can better understand the impact on newcomers and constructive activation.

Project Details
Experiment details:

Surfacing "Add a link" Structured Tasks Measurement Plan

Experiment Group: Receives access to the experiment (i.e. they may see an "Add a link" suggestion in Read mode if they visit an article with a suggestion)
Control Group: Will NOT have access to suggestions in Read mode, but may still have access to "Add a link" suggestions on the Homepage

Pilot wikis:
After an initial alpha test (T379976), we will complete a more robust A/B test on pilot wikis.
The Pilot Wikis included in this experiment are:

  • eswiki
  • frwiki
  • arzwiki
  • ruwiki
  • ptwiki
  • fawiki
  • idwiki
Acceptance Criteria:
  • Sample size: 100% of logged-in accounts with zero edits, split 50% control and 50% experiment.
  • Complete the A/B test set-up and release the experiment to a testing environment agreed upon with QA.

The actual deployment to pilot wikis is covered by: T385343: Surfacing "Add a link" Structured Tasks: Experiment Release (FY24/25 WE1.2.9)

Event Timeline

Restricted Application added a subscriber: Huji. · View Herald Transcript · Feb 7 2025, 7:05 PM
KStoller-WMF moved this task from Inbox to Backlog on the Growth-Team board.
KStoller-WMF set the point value for this task to 0.5.
KStoller-WMF changed the point value for this task from 0.5 to 1.

The description of the sample taken from the measurement plan doc states "Sample size: 100% of logged in accounts with zero edits. 50% in control and 50% in experiment." It's not possible to create such conditions with our current system. We can target zero-edit accounts for the treatment group, but we cannot ensure both groups will have the same population, since the control group is used as the default variant regardless of how many edits the account has. Is that ok from a data analysis pov? cc @nettrom_WMF @Iflorez

Change #1119536 had a related patch set uploaded (by Sergio Gimeno; author: Sergio Gimeno):

[mediawiki/extensions/GrowthExperiments@master] feat(SurfacingStructuredTasks): create surfacing-structured-task experiment variant

https://gerrit.wikimedia.org/r/1119536

Change #1119537 had a related patch set uploaded (by Sergio Gimeno; author: Sergio Gimeno):

[operations/mediawiki-config@master] beta: A/B test setup for surfacing structured tasks

https://gerrit.wikimedia.org/r/1119537

Change #1119536 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] feat(SurfacingStructuredTasks): create surfacing-structured-task experiment variant

https://gerrit.wikimedia.org/r/1119536

Change #1119537 merged by jenkins-bot:

[operations/mediawiki-config@master] cswiki beta: A/B test setup for surfacing structured tasks

https://gerrit.wikimedia.org/r/1119537

Mentioned in SAL (#wikimedia-operations) [2025-02-18T14:05:14Z] <urbanecm@deploy2002> Started scap sync-world: Backport for [[gerrit:1087539|Deploy DiscussionTools visual enhancements to top 10 wikis (exc. enwiki, ruwiki & zhwiki) (T379102)]], [[gerrit:1119537|cswiki beta: A/B test setup for surfacing structured tasks (T385903)]]

Mentioned in SAL (#wikimedia-operations) [2025-02-18T14:09:58Z] <urbanecm@deploy2002> urbanecm, esanders, sgimeno: Backport for [[gerrit:1087539|Deploy DiscussionTools visual enhancements to top 10 wikis (exc. enwiki, ruwiki & zhwiki) (T379102)]], [[gerrit:1119537|cswiki beta: A/B test setup for surfacing structured tasks (T385903)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-02-18T14:21:44Z] <urbanecm@deploy2002> Finished scap sync-world: Backport for [[gerrit:1087539|Deploy DiscussionTools visual enhancements to top 10 wikis (exc. enwiki, ruwiki & zhwiki) (T379102)]], [[gerrit:1119537|cswiki beta: A/B test setup for surfacing structured tasks (T385903)]] (duration: 16m 29s)

The description of the sample taken from the measurement plan doc states "Sample size: 100% of logged in accounts with zero edits. 50% in control and 50% in experiment." It's not possible to create such conditions with our current system. We can target zero-edit accounts for the treatment group, but we cannot ensure both groups will have the same population, since the control group is used as the default variant regardless of how many edits the account has. Is that ok from a data analysis pov? cc @nettrom_WMF @Iflorez

We discussed this topic in the 02/18/25 PA shared consultation hour, see notes in the shared notes doc, as well as on this Slack post.

Preferred approach, if a user is not already enrolled:

  1. Determine if a user is eligible for enrollment in the experiment. In the case of the Surfacing Structured Tasks experiment, that means asking "do they have zero edits?", and only if the answer is "yes" do we continue.
  2. Use some kind of randomization algorithm to assign them to a group, which in this case is either the Treatment or Control group. Store that somewhere sensible.
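The two steps above can be sketched roughly as follows. This is an illustrative sketch only, assuming an in-memory store; the function and field names (`is_eligible`, `enroll`, `edit_count`) are hypothetical, not the actual GrowthExperiments API.

```python
import random

VARIANTS = ["control", "treatment"]

def is_eligible(user):
    # Step 1: eligibility gate — only zero-edit accounts enter the experiment.
    return user["edit_count"] == 0

def enroll(user, store):
    # Users already enrolled keep their stored assignment.
    if user["id"] in store:
        return store[user["id"]]
    if not is_eligible(user):
        return None
    # Step 2: randomize 50/50 and persist the assignment somewhere
    # sensible, so it never changes even if the edit count later does.
    variant = random.choice(VARIANTS)
    store[user["id"]] = variant
    return variant

store = {}
enroll({"id": 1, "edit_count": 0}, store)  # "control" or "treatment", then stable
enroll({"id": 2, "edit_count": 5}, store)  # None — not eligible
```

The key property this sketch captures is that the roll is stored at enrollment time, which is exactly what the current system does not do (see below in the thread).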

As for existing accounts, we can restrict to just accounts that have the Homepage enabled and that are newer than when the Homepage launched.

Also, the team is considering making two experiments:

  • New accounts (user created in the last 24 hours) only
  • Existing accounts only

And generally, we may consider an MP feature request: target by edit count AND target by account-age bucket

Preferred approach, if a user is not already enrolled:

  1. Determine if a user is eligible for enrollment in the experiment. In the case of the Surfacing Structured Tasks experiment, that means asking "do they have zero edits?", and only if the answer is "yes" do we continue.

Due to data storage restrictions (related to T54777), we implemented the "existing accounts enrollment" (T376266) by computing user state on the fly. That means we cannot keep a user assigned to a variant based on a piece of information that changes during the experiment, e.g. edit count: once the user makes an edit, they would be moved to the control variant. In practice this should only be a problem for the control-group population, and it can be mitigated by filtering on edit_count = 0 when querying for control-group interactions.
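A minimal sketch of why on-the-fly evaluation behaves this way: the variant is recomputed from current user state on every request, with "control" as the fallback. The function and field names here are hypothetical, not the actual implementation.

```python
def compute_variant(user, in_treatment_bucket):
    # Conditions are re-evaluated on every interaction; nothing is stored.
    if user["edit_count"] == 0 and in_treatment_bucket:
        return "treatment"
    # Any user failing the conditions falls back to control,
    # regardless of edit count — hence the mixed control population.
    return "control"

u = {"edit_count": 0}
compute_variant(u, True)   # "treatment"
u["edit_count"] = 1        # user makes an edit mid-experiment
compute_variant(u, True)   # "control" — silently reassigned
```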

  2. Use some kind of randomization algorithm to assign them to a group, which in this case is either the Treatment or Control group. Store that somewhere sensible.

The randomization exists, but we don't store the dice roll. We re-roll the dice every time the user interacts with our application, obtaining the same result (unless conditions change).
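A common way to get a stable re-roll without storing anything is to hash a fixed user key into a bucket, so the same inputs always land in the same group. The hashing scheme below is an illustrative assumption, not the one MediaWiki actually uses:

```python
import hashlib

def roll(user_id: int, experiment: str,
         buckets=("control", "treatment")):
    # Deterministic "dice roll": same (experiment, user) pair always
    # hashes to the same bucket, so no assignment needs to be stored.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return buckets[int(digest, 16) % len(buckets)]

# Re-rolling is stable for the same inputs...
assert roll(42, "surfacing-structured-task") == roll(42, "surfacing-structured-task")
# ...but any change in the conditions (the inputs) changes the outcome.
```

This also illustrates the trade-off raised in the thread: determinism substitutes for storage, but only as long as the inputs to the roll never change.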

As for existing accounts, we can restrict to just accounts that have the Homepage enabled and that are newer than when the Homepage launched.

Limiting to homepage-enabled accounts seems reasonable, although the opposite could be an opportunity for a sub-experiment: how many homepage activations would such an experiment produce?

Also, the team is considering making two experiments:

  • New accounts (user created in the last 24 hours) only
  • Existing accounts only

The main constraint of our experiment system is that the "control" group is used as a fallback variant for any ongoing experiment. So interactions from control-group users with more than 0 edits are expected in other interfaces, for instance events in mediawiki_structured_task_article_link_suggestion_interaction or readmode_page. We can only guarantee that interactions in readmode_suggestion_dialog come from users with 0 edits (in both variants).
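On the analysis side, the mitigation is to filter instrumented events down to zero-edit users before comparing groups, since control acts as a fallback bucket. A minimal sketch, with hypothetical event field names:

```python
# Events outside readmode_suggestion_dialog may include fallback
# control users with edits; restrict the comparison to edit_count == 0
# so both variants describe the same population.
events = [
    {"schema": "readmode_page", "variant": "control", "edit_count": 3},
    {"schema": "readmode_page", "variant": "control", "edit_count": 0},
    {"schema": "readmode_suggestion_dialog", "variant": "treatment", "edit_count": 0},
]

comparable = [e for e in events if e["edit_count"] == 0]
len(comparable)  # 2 — the edit_count == 3 fallback event is excluded
```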

And generally, we may consider an MP feature request: target by edit count AND target by account-age bucket

+1

Since the feature is now available on testwiki, we should be able to refine the query from 1018 to validate a CTR or "edit-through" calculation. Does this help @Iflorez?

Hola @Sgs, a few follow up questions:

  2. Use some kind of randomization algorithm to assign them to a group, which in this case is either the Treatment or Control group. Store that somewhere sensible.

The randomization exists, but we don't store the dice roll. We re-roll the dice every time the user interacts with our application, obtaining the same result (unless conditions change).

What do you mean by "obtaining the same result"?

As for existing accounts, we can restrict to just accounts that have the Homepage enabled and that are newer than when the Homepage launched.

Limiting to homepage-enabled accounts seems reasonable, although the opposite could be an opportunity for a sub-experiment: how many homepage activations would such an experiment produce?

Perhaps @KStoller-WMF has a response here.
From a data analysis perspective, for the task at hand, limiting is preferred.

Also, the team is considering making two experiments:

  • New accounts (user created in the last 24 hours) only
  • Existing accounts only

The main constraint of our experiment system is that the "control" group is used as a fallback variant for any ongoing experiment. So interactions from control-group users with more than 0 edits are expected in other interfaces, for instance events in mediawiki_structured_task_article_link_suggestion_interaction or readmode_page. We can only guarantee that interactions in readmode_suggestion_dialog come from users with 0 edits (in both variants).

Can you say more here? Do you mean that there are limitations and so you are no longer considering two experiments (existing, new editors)?

Limiting to homepage-enabled accounts seems reasonable, although the opposite could be an opportunity for a sub-experiment: how many homepage activations would such an experiment produce?

Perhaps @KStoller-WMF has a response here.

Limiting to homepage enabled seems logical if we decide we can support testing existing accounts in this experiment. I think we need to aim to simplify this experiment so analysis doesn't get too complex.

Etonkovidova updated the task description.