Support delaying NotificationGetStartedJob differently based on user variant
Closed, Resolved · Public · 3 Estimated Story Points

Description

Background

GrowthExperiments currently sends a Getting Started notification 48 hours after a user account is registered, provided the user has not yet made enough(*) edits. This is implemented as a delayed job, using the jobReleaseTimestamp option. The delay is configurable in the server configuration (GELevelingUpGetStartedNotificationSendAfterSeconds), but it is the same for all users on a given wiki.

(*) No suggested edits and fewer than GELevelingUpGetStartedMaxTotalEdits total edits.
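
To make the background concrete, here is a minimal sketch of the two decisions described above: whether a user still qualifies for the notification, and when the delayed job should be released. The function names and the shape of the parameter array are hypothetical and are not taken from the GrowthExperiments code; only jobReleaseTimestamp and the two configuration names come from this task.

```php
<?php
// Hypothetical helpers for illustration; not the actual GrowthExperiments code.

/**
 * A user qualifies when they made no suggested edits and fewer than
 * $maxTotalEdits edits overall (the GELevelingUpGetStartedMaxTotalEdits value).
 */
function qualifiesForGetStarted( int $suggestedEdits, int $totalEdits, int $maxTotalEdits ): bool {
	return $suggestedEdits === 0 && $totalEdits < $maxTotalEdits;
}

/**
 * Build the parameters of the delayed job. The job queue is assumed to hold
 * the job back until jobReleaseTimestamp (a UNIX timestamp) has passed.
 */
function buildGetStartedJobParams( int $userId, int $delaySeconds, int $now ): array {
	return [
		'userId' => $userId,
		'jobReleaseTimestamp' => $now + $delaySeconds,
	];
}

// Example: a 48 hour delay counted from a fixed "now".
$params = buildGetStartedJobParams( 42, 48 * 3600, 1700000000 );
```

Passing the current time in as an argument, rather than calling time() inside the helper, keeps the computation easy to test.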

Problem

In the parent task, we want to A/B test a shorter delay for the notification. To be able to do that, the job delay must be able to differ between experiment groups.

Proposed solution

We can create a new user variant for this experiment, which would be factored into computing the delay. We can restructure the GELevelingUpGetStartedNotificationSendAfterSeconds option (currently an integer) by making it a map from user variant to delay, plus a default key that applies when nothing else matches. For example, the following configuration would delay the notification by 20 hours for users assigned to quick-gettingstarted and by 48 hours for any other user.

$wgGELevelingUpGetStartedNotificationSendAfterSeconds = [
     'quick-gettingstarted' => 20 * 3600,
     'default' => 48 * 3600,
];
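
A lookup against such a map could be sketched as follows. This is only an illustration of the proposal, assuming the map always carries a default key; getGetStartedDelay is a hypothetical name and the merged implementation may differ.

```php
<?php
// Hypothetical helper; the real lookup lives in GrowthExperiments and may differ.

/**
 * Return the delay for $variant, falling back to the 'default' key when the
 * variant has no entry of its own.
 *
 * @param array<string,int> $delays Map from variant name to delay in seconds
 */
function getGetStartedDelay( array $delays, string $variant ): int {
	if ( !array_key_exists( 'default', $delays ) ) {
		throw new InvalidArgumentException( "Delay map must contain a 'default' key" );
	}
	return $delays[$variant] ?? $delays['default'];
}

$delays = [
	'quick-gettingstarted' => 20 * 3600,
	'default' => 48 * 3600,
];

$quickDelay = getGetStartedDelay( $delays, 'quick-gettingstarted' );
$controlDelay = getGetStartedDelay( $delays, 'control' );
```

With this shape, a newly added variant needs no configuration change unless it should get a delay of its own.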

While we are at it, it would probably make sense to move the actual enqueueing call to a levelling-up-specific place (LevelingUpManager?) instead of HomepageHooks (which is already quite overcrowded).

Potential challenges
  • No reserved variant names: technically, we could create a user variant called default. This would then clash with the default key proposed above. Given that we already have a default variant (called control), I consider the introduction of a similarly named variant like default to be relatively unlikely. For this reason, I don't think we need to do anything special about this.
  • Variant querying is error-prone in onLocalUserCreated: The notification is enqueued within the onLocalUserCreated hook. Sometimes, accessing the central user IDs in this hook (which we would have to do) fails and returns a zero, which would break any bucketing we try to do here. See T380500: CentralAuthUser returning outdated data after user creation for more details.

Event Timeline

Restricted Application added a subscriber: Aklapper.
Potential challenges
  • No reserved variant names: technically, we could create a user variant called default. This would then clash with the default key proposed above. Given that we already have a default variant (called control), I consider the introduction of a similarly named variant like default to be relatively unlikely. For this reason, I don't think we need to do anything special about this.

I'm confused about this item, could you clarify? Why would we not use the "control" variant for the control group?

  • Variant querying is error-prone in onLocalUserCreated: The notification is enqueued within the onLocalUserCreated hook. Sometimes, accessing the central user IDs in this hook (which we would have to do) fails and returns a zero, which would break any bucketing we try to do here. See T380500: CentralAuthUser returning outdated data after user creation for more details.

Can't we do here what we did for the "surfacing structured tasks" experiment and use the local wiki-id? That seems to work fine, including the logging of the variant for analytics and Prometheus/Grafana, or am I missing something?

Potential challenges
  • No reserved variant names: technically, we could create a user variant called default. This would then clash with the default key proposed above. Given that we already have a default variant (called control), I consider the introduction of a similarly named variant like default to be relatively unlikely. For this reason, I don't think we need to do anything special about this.

I'm confused about this item, could you clarify? Why would we not use the "control" variant for the control group?

Sure! I suggested designing wgGELevelingUpGetStartedNotificationSendAfterSeconds as a list of special cases for individual user variants. For all other variants, we would use the duration from a key that exists for that purpose (I suggested calling it default, but the name is not important).

The problem is that the variant name can be an arbitrary string. If we used the default key to provide the default delay (if the variant is not listed explicitly), it could clash with a potential variant that would be (also) called default. As mentioned in the description, I consider this to be extremely unlikely, but it is something that could happen.

An alternative to using a default key would be to enumerate all variants in the config (and error out if we cannot find the user's). I'd like to avoid doing that, as it would force us to update wgGELevelingUpGetStartedNotificationSendAfterSeconds every time we add a variant. This is probably not something we want to do.
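
For comparison, the enumerate-everything alternative could be sketched roughly as below. The helper name and the exception type are assumptions made for illustration; the point is only that every variant without an entry becomes an error.

```php
<?php
// Hypothetical helper illustrating the alternative that was argued against.

/**
 * Return the delay for $variant and fail loudly when it is not listed.
 *
 * @param array<string,int> $delays Map from variant name to delay in seconds
 */
function getGetStartedDelayStrict( array $delays, string $variant ): int {
	if ( !array_key_exists( $variant, $delays ) ) {
		throw new OutOfBoundsException( "No delay configured for variant '$variant'" );
	}
	return $delays[$variant];
}

// Every variant has to be listed explicitly, including control.
$strictDelays = [
	'control' => 48 * 3600,
	'quick-gettingstarted' => 20 * 3600,
];

$strictControlDelay = getGetStartedDelayStrict( $strictDelays, 'control' );
```

Under this design, adding any new variant without touching the delay map would make the lookup throw for users in that variant, which is the maintenance cost described above.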

  • Variant querying is error-prone in onLocalUserCreated: The notification is enqueued within the onLocalUserCreated hook. Sometimes, accessing the central user IDs in this hook (which we would have to do) fails and returns a zero, which would break any bucketing we try to do here. See T380500: CentralAuthUser returning outdated data after user creation for more details.

Can't we do here what we did for the "surfacing structured tasks" experiment and use the local wiki-id? That seems to work fine, including the logging of the variant for analytics and Prometheus/Grafana, or am I missing something?

Of course, that is a solution for this point.
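
As a rough illustration of why an ID of zero is a problem and why a local ID avoids it, consider the sketch below. The modulo split is only a stand-in: this task does not show how GrowthExperiments actually assigns variants, and assignVariantByLocalId is a hypothetical name.

```php
<?php
// Hypothetical bucketing; a simple modulo split stands in for the real assignment logic.

/**
 * Pick a variant deterministically from a local user ID.
 * An ID of zero (or less) means the lookup failed, so bucketing is refused
 * instead of silently placing every such user in the same bucket.
 *
 * @param string[] $variants Non-empty list of variant names
 */
function assignVariantByLocalId( int $localUserId, array $variants ): string {
	if ( $localUserId <= 0 ) {
		throw new InvalidArgumentException( 'Cannot bucket a user without a valid ID' );
	}
	if ( $variants === [] ) {
		throw new InvalidArgumentException( 'At least one variant is required' );
	}
	$variants = array_values( $variants );
	return $variants[ $localUserId % count( $variants ) ];
}

$variants = [ 'control', 'quick-gettingstarted' ];
$variantForUser7 = assignVariantByLocalId( 7, $variants );
$variantForUser8 = assignVariantByLocalId( 8, $variants );
```

The local user ID is presumably already assigned when onLocalUserCreated runs, so a guard like this should not trigger there, whereas a central ID that comes back as zero would.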

Urbanecm_WMF set the point value for this task to 3. · May 26 2025, 4:06 PM
Makes sense, thank you for elaborating! I agree with both the general approach and the suggested naming. I also agree that the risk of a collision is very minor. Even in the unlikely event that we introduced a variant called default, it would only be a clash if that variant expected a delay other than the default one, which is probably negligible.

I think this question also highlights another important point: once this experiment is over and we have settled on a new time, we probably want to remove this scaffolding again to prevent any unexpected future clashes.

KStoller-WMF lowered the priority of this task from High to Medium. · May 27 2025, 4:49 PM

Change #1157556 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[mediawiki/extensions/GrowthExperiments@master] refactor: Schedule LevelingUp notifications from LevelingUpManager

https://gerrit.wikimedia.org/r/1157556

Change #1157597 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[mediawiki/extensions/GrowthExperiments@master] refactor(LevelingUpManagerTest): Clarify the purpose of $config

https://gerrit.wikimedia.org/r/1157597

Change #1157629 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[mediawiki/extensions/GrowthExperiments@master] feat(LevelingUpManager): Support different delays based on user variant

https://gerrit.wikimedia.org/r/1157629

Change #1157597 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] refactor(LevelingUpManagerTest): Clarify the purpose of $config

https://gerrit.wikimedia.org/r/1157597

Change #1157556 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] refactor: Schedule LevelingUp notifications from LevelingUpManager

https://gerrit.wikimedia.org/r/1157556

Change #1157629 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] feat(LevelingUpManager): Support different delays based on user variant

https://gerrit.wikimedia.org/r/1157629

Change #1163012 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[mediawiki/extensions/GrowthExperiments@wmf/1.45.0-wmf.6] Backport Getting Started notification code

https://gerrit.wikimedia.org/r/1163012

Change #1163012 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@wmf/1.45.0-wmf.6] Backport Getting Started notification code

https://gerrit.wikimedia.org/r/1163012

Mentioned in SAL (#wikimedia-operations) [2025-06-23T19:08:19Z] <urbanecm@deploy1003> Started scap sync-world: Backport for [[gerrit:1163012|Backport Getting Started notification code (T394957)]]

Mentioned in SAL (#wikimedia-operations) [2025-06-23T19:10:30Z] <urbanecm@deploy1003> urbanecm: Backport for [[gerrit:1163012|Backport Getting Started notification code (T394957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-06-23T19:20:22Z] <urbanecm@deploy1003> Finished scap sync-world: Backport for [[gerrit:1163012|Backport Getting Started notification code (T394957)]] (duration: 12m 03s)