Track variant assignment on account creation
Closed, Resolved (Public)

Description

To make it easier to count each group in the experiment, it has been requested that a single stream collect the user_id/user_variant pairs. Since GrowthExperiments has no instrument in place that collects this information (homepagemodule was previously used for this), it needs to be added to an existing instrument or collected by a new dedicated one.

Possible solutions for new accounts
For the surfacing experiment targeting new accounts, we'd want to track the variant assigned at the time of account creation (similar to what we do with statslib metrics in the account creation section of our Growth team KPI dashboard).

  1. [Discarded] Use the already approved and running serversideaccountcreation stream, which can track a campaign parameter (see CampaignsSecondaryAuthenticationProvider.php#L79-L90), and set the assigned user variant as the campaign value. That should later make it easy to count new accounts in each group. The variant could be added from the onAuthFormFieldsChange hook.
    1. Discarded because it is not clear how to feed such a parameter "during" account creation. onAuthFormFieldsChange is not an option since it runs too early in the process: the user id has not been created yet.
  2. Create a new dedicated stream to track an "experiment enrollment event on account creation", increment the metric using LocalUserCreated hook, along with the condition local-bucket-growth for user options.
  3. Re-use an available "account creation" interaction logger, maybe from T346327, although I'm not sure yet how the integration between Growth and WE would work or if it's a good idea
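
Option 2 can be sketched roughly as follows. This is a language-neutral illustration in Python, not the GrowthExperiments PHP code: `build_enrollment_event` is a hypothetical helper, the `autocreated` and `is_temp` flags are assumptions about what the hook handler can see, and the field names follow the sample event posted at the end of this task.

```python
from datetime import datetime, timezone
from typing import Optional


def build_enrollment_event(user_id: int, variant: str, wiki_db: str,
                           autocreated: bool = False,
                           is_temp: bool = False) -> Optional[dict]:
    """Return an enrollment event for a newly created account, or None.

    Hypothetical helper mirroring what a LocalUserCreated handler might do.
    """
    # Assumption: auto-created and temporary accounts are not enrolled,
    # so no event is produced for them.
    if autocreated or is_temp:
        return None
    return {
        "action": "experiment_enrollment",
        "action_source": "LocalUserCreatedHook",
        # UTC timestamp, second precision, ISO 8601 with a trailing Z.
        "dt": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "mediawiki": {"database": wiki_db},
        "performer": {"id": user_id},
        "experiments": {
            "enrolled": ["growth-experiments"],
            "assigned": {"growth-experiments": variant},
        },
    }
```

Because the hook runs once per account creation, an event built this way would be emitted at most once per new account.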

Possible solutions for existing accounts
For the surfacing experiment targeting existing accounts, we'd want to track variant assigned at the time of viewing a page. Ideally a page that has recommendations, but this is left for discussion, nice to have.

  1. Track user id and assigned variant in the BeforePageDisplayHook execution as an impression. This has the benefit of not adding extra JS payload to the response for "control" users, but the downside of potentially counting impressions of users who wouldn't be eligible for the experiment because of product or system constraints, e.g. JS didn't load, or the link recommendations API did not return any results.
  2. Track user id and assigned variant in the client using LinkSuggestionInteractionLogger. This has the benefit that only users who are able to load the feature and interact with it are counted. It has the downside of a bigger response payload. (Could be optimized with some lazy loading?)

Acceptance criteria

  • A data analyst can query the number of users enrolled in each experiment group at account creation
  • [Nice to have] A data analyst can query the number of users enrolled in each experiment group at page impression (including new accounts)

Event Timeline

Sgs renamed this task from Track variant assignment in serversideaccountcreation stream to Track variant assignment on account creation.
Sgs triaged this task as High priority.
Sgs moved this task from Incoming to Doing on the Growth-Team (Current Sprint) board.
Sgs updated the task description.

Historically, Growth experiments have only targeted new account holders and automatically enabled the homepage on account creation. Also, in prior experiments (before conditional defaults existed), we used to store the variant assignment on account creation in the user_properties table. That made analysis of variant assignments simpler (see get treatment/control assignments as an example of querying the user_properties table). That's no longer an option due to the long-standing T54777 problem, so we need somewhere else to store this information and make it available to data scientists.

The Metrics Platform approach is not to enroll a user and assign a variant at a given moment and store it in a MW table, as that would lead to scalability problems similar to T54777. Instead, the assigned variant is recalculated every time the system needs it and is only stored in the analytics storage with the rest of the event payload. In GrowthExperiments, since the introduction of conditional defaults to avoid storage problems, we have been looking to use a similar approach, but we have never done it before. In the Community updates experiment we partially validated the new approach: we tested that the variants returned in the homepage were indeed distributed correctly, but I think we only looked at user variant assignments on the homepagevisit/homepagemodule events.
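
To illustrate the general idea of recomputing a variant on demand instead of storing it, here is a minimal Python sketch. It is only an assumption-laden illustration: hashing the user id with SHA-256 is not how the Metrics Platform or conditional defaults actually derive the variant, and `assign_variant` is a hypothetical name.

```python
import hashlib
from typing import Sequence


def assign_variant(user_id: int, experiment: str, variants: Sequence[str]) -> str:
    """Deterministically map a user to one of the variants.

    The same (experiment, user_id) pair always yields the same variant,
    so nothing needs to be persisted in a MediaWiki table.
    """
    # Hash the experiment name together with the user id so that different
    # experiments bucket the same user independently.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer and reduce to a bucket index.
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]
```

With a scheme like this, the only durable record of the assignment is the one carried in the event payload sent to analytics storage.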

The MP approach can be followed in GE for new accounts within the experiment, but it won't work for existing accounts. We lack a reliable "single point of interaction" for existing users at which we could calculate and count the assigned variant, and only once.

Possible solutions for new accounts

For the surfacing experiment targeting new accounts, we'd want to track the variant assigned at the time of account creation (similar to what we do with statslib metrics in the account creation section of our Growth team KPI dashboard).

  1. Create a new dedicated stream to track an "experiment enrollment event on account creation", increment the metric using LocalUserCreated hook, along with the condition local-bucket-growth for user options.
  2. Re-use an available "account creation" interaction logger, maybe from T346327, although I'm not sure yet how the integration between Growth and WE would work or if it's a good idea

Possible solutions for existing accounts
TBD

Sgs updated the task description.

Thank you for looking into this and coming up with approaches!

Create a new dedicated stream to track an "experiment enrollment event on account creation", increment the metric using LocalUserCreated hook, along with the condition local-bucket-growth for user options.

I think that sounds good to me.

A similar event (that again only has user-id and variant, and maybe "has homepage enabled"?) could be something like "page view by eligible user" that would be fired in the surfacing onBeforePageView hook (we probably want to move the if-condition on the variant to later in the hook method). This should also give us a good sample of existing users (with 0 edits) in both treatment and control groups.

I'm working on building the event; however, I was planning to fire it in onLocalUserCreated as a way to ensure it only gets fired once for new accounts. I've been thinking about firing it in BeforePageDisplayHookHandler, especially to bring existing accounts into scope, but that would likely create duplicates and I'm not sure if that's fine. What do you think @Iflorez @Michael ?

I'm working on building the event, however I was planning to fire it onLocalUserCreated as a way to ensure it only gets fired once for new accounts.

In that hook, can we tell new, automatic, and temp account creations apart?

I've been thinking about firing it in BeforePageDisplayHookHandler, especially to bring existing accounts into scope, but that would likely create duplicates and I'm not sure if that's fine. What do you think @Iflorez @Michael ?

@Iflorez can tell me that I'm talking rubbish, but I would assume that duplications are fine. Also, the Homepage schema fires every time that someone visits the homepage, right? So I expect there to be a simple way to get the unique values by column from a set of events. That being said, I don't actually know anything about how data-analytics works here at Wikimedia (or in general), so I might very well be wrong.

the Homepage schema fires every time that someone visits the homepage, right?

yes

I would assume that duplications are fine.

Yes, during analysis I can select the first applicable entry per id. More discussion may be needed here to make sure we're on the same page about what is firing and thus applicable.
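
Selecting the first applicable entry per id could look roughly like the sketch below. It assumes events shaped like the ones in the growth_product_interaction stream (`performer.id`, `mediawiki.database`, `dt`), that all `dt` values share the same ISO 8601 UTC format so they sort correctly as strings, and that user ids are local to each wiki. `first_event_per_user` is a hypothetical helper name.

```python
from typing import Dict, Iterable, Tuple


def first_event_per_user(events: Iterable[dict]) -> Dict[Tuple[str, int], dict]:
    """Keep only the earliest event for each (wiki, user id) pair."""
    first: Dict[Tuple[str, int], dict] = {}
    for event in events:
        key = (event["mediawiki"]["database"], event["performer"]["id"])
        # ISO 8601 timestamps in one fixed format compare correctly as strings.
        if key not in first or event["dt"] < first[key]["dt"]:
            first[key] = event
    return first
```

Duplicates produced by a hook that fires on every page view then collapse to one row per user before any counting happens.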

A similar event (that again only has user-id and variant, and maybe "has homepage enabled"?) could be something like "page view by eligible user" that would be fired in the surfacing onBeforePageView hook (we probably want to move the if-condition on the variant to later in the hook method). This should also give us a good sample of existing users (with 0 edits) in both treatment and control groups.

Yes, we need to gather user-id and variant. It appears we also need to consider the page at the time of assignment. Because not all pages are link_suggestion applicable/eligible or have link_suggestions, we need a robust assignment that considers whether the page had link needs and would have been served to the user if they hadn't been in the control. See scenario C in this doc. I'll discuss this with @KStoller-WMF this afternoon and can jump on a meeting today or tomorrow as helpful.

Change #1123605 had a related patch set uploaded (by Sergio Gimeno; author: Sergio Gimeno):

[mediawiki/extensions/GrowthExperiments@master] analytics(HomepageHooks): log experiment_enrollment interaction on new accounts

https://gerrit.wikimedia.org/r/1123605

Change #1123606 had a related patch set uploaded (by Sergio Gimeno; author: Sergio Gimeno):

[operations/mediawiki-config@master] beta: add mediawiki.product_metrics.growth_product_interaction

https://gerrit.wikimedia.org/r/1123606

Change #1123607 had a related patch set uploaded (by Sergio Gimeno; author: Sergio Gimeno):

[operations/mediawiki-config@master] [Growth] Add mediawiki.product_metrics.growth_product_interaction stream config

https://gerrit.wikimedia.org/r/1123607

Change #1123606 abandoned by Sergio Gimeno:

[operations/mediawiki-config@master] beta: add mediawiki.product_metrics.growth_product_interaction

Reason:

No need for beta specific config, squashed in next change

https://gerrit.wikimedia.org/r/1123606

In that hook, can we tell new, automatic, and temp account creations apart?

Yes, there's a check for this at the very beginning of the hook execution, so we're not counting these users; see HomepageHooks.php#794. Is this what you were expecting from the experiment assignment point of view?

I've been thinking about firing it in BeforePageDisplayHookHandler, especially to bring existing accounts into scope, but that would likely create duplicates and I'm not sure if that's fine. What do you think @Iflorez @Michael ?

@Iflorez can tell me that I'm talking rubbish, but I would assume that duplications are fine. Also, the Homepage schema fires every time that someone visits the homepage, right? So I expect there to be a simple way to get the unique values by column from a set of events. That being said, I don't actually know anything about how data-analytics works here at Wikimedia (or in general), so I might very well be wrong.

From a data collection point of view, I'm not sure there's a lot of value in logging from BeforePageDisplayHookHandler runs as opposed to doing it in the client with the existing LinkSuggestionInteractionLogger. As I understand the "experiment specification" (I'm referring here to our shared understanding from meetings and @Iflorez 's document), the enrollment eligibility should not take into account users who are already ineligible because of a product requirement, for example users visiting pages that are not in the main namespace. In that spirit, system constraints such as not having recommendations for the article (which seems related to @Iflorez 's last comment) or JS not loading fast enough also look like experiment data noise. On the other hand, if those "impressions" are counted in BeforePageDisplayHookHandler, we definitely save some JS payload for many users in the "control" group. Restricting account age for the enrolled cohort seems very sensible just for this reason.

I think we're close to an understanding and solution that satisfies data collection patterns and manageable analysis. We just need an agreement on the point or points of interaction we want to capture. Feedback welcome, cc @Iflorez @Michael

Sgs updated the task description.

In that hook, can we tell new, automatic, and temp account creations apart?

Yes, there's a check for this at the very beginning of the hook execution, so we're not counting these users; see HomepageHooks.php#794. Is this what you were expecting from the experiment assignment point of view?

I was confused by what the documentation for the LocalUserCreated hook states about $autocreated when CentralAuth is used, but maybe that is outdated, because the code of that hook handler clearly gets executed.

I think we can collect it in the hook handler that decides whether the user gets the JavaScript at all: add a check for whether the current page has any suggestions, move the check for the variant to the very end, and then add the instrumentation just before the check for that variant. This should then be a close to 100% match to what we track in JavaScript. The missing fraction can then be estimated from the impression values in this event and in the one recorded in JavaScript.
The new GrowthInteractionLogger looks suitable for that, I think.
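
The proposed ordering of checks can be sketched as below. This is a Python illustration of the control flow only, not the actual PHP hook handler: `PageView`, `handle_page_display`, the `log_enrollment` callback, and the "control" variant name are all hypothetical, and the main-namespace check stands in for whatever product eligibility checks the real handler performs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PageView:
    user_id: int
    variant: str
    in_main_namespace: bool
    has_suggestions: bool


def handle_page_display(view: PageView,
                        log_enrollment: Callable[[int, str], None],
                        treatment_variant: str = "surfacing-structured-task") -> bool:
    """Return True if the client-side JavaScript should be loaded."""
    # 1. Product eligibility: pages outside the main namespace never count.
    if not view.in_main_namespace:
        return False
    # 2. System eligibility: skip pages that have no suggestions to show.
    if not view.has_suggestions:
        return False
    # 3. Instrumentation runs for every eligible user, treatment or control.
    log_enrollment(view.user_id, view.variant)
    # 4. The variant check comes last: only treatment users get the JS payload.
    return view.variant == treatment_variant
```

Control users are logged without receiving the JavaScript, which keeps the payload saving while still counting them under the same eligibility rules as the treatment group.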

Change #1124360 had a related patch set uploaded (by Sergio Gimeno; author: Sergio Gimeno):

[mediawiki/extensions/GrowthExperiments@wmf/1.44.0-wmf.18] analytics(HomepageHooks,BeforePageDisplayHandler): log experiment_enrollment interaction on new accounts

https://gerrit.wikimedia.org/r/1124360

Change #1124362 had a related patch set uploaded (by Sergio Gimeno; author: Sergio Gimeno):

[mediawiki/extensions/GrowthExperiments@wmf/1.44.0-wmf.19] analytics(HomepageHooks,BeforePageDisplayHandler): log experiment_enrollment interaction on new accounts

https://gerrit.wikimedia.org/r/1124362

Change #1124360 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@wmf/1.44.0-wmf.18] analytics(HomepageHooks,BeforePageDisplayHandler): log experiment_enrollment interaction on new accounts

https://gerrit.wikimedia.org/r/1124360

Change #1123605 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] analytics(HomepageHooks,BeforePageDisplayHandler): log experiment_enrollment interaction on new accounts

https://gerrit.wikimedia.org/r/1123605

Change #1123607 merged by jenkins-bot:

[operations/mediawiki-config@master] [Growth] Add mediawiki.product_metrics.growth_product_interaction stream config

https://gerrit.wikimedia.org/r/1123607

Mentioned in SAL (#wikimedia-operations) [2025-03-04T09:00:56Z] <sgimeno@deploy2002> Started scap sync-world: Backport for [[gerrit:1123607|[Growth] Add mediawiki.product_metrics.growth_product_interaction stream config (T387286)]]

Mentioned in SAL (#wikimedia-operations) [2025-03-04T09:05:22Z] <sgimeno@deploy2002> sgimeno: Backport for [[gerrit:1123607|[Growth] Add mediawiki.product_metrics.growth_product_interaction stream config (T387286)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-03-04T09:16:57Z] <sgimeno@deploy2002> Finished scap sync-world: Backport for [[gerrit:1123607|[Growth] Add mediawiki.product_metrics.growth_product_interaction stream config (T387286)]] (duration: 16m 01s)

Change #1124362 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@wmf/1.44.0-wmf.19] analytics(HomepageHooks,BeforePageDisplayHandler): log experiment_enrollment interaction on new accounts

https://gerrit.wikimedia.org/r/1124362

Mentioned in SAL (#wikimedia-operations) [2025-03-04T09:20:38Z] <sgimeno@deploy2002> Started scap sync-world: Backport for [[gerrit:1124362|analytics(HomepageHooks,BeforePageDisplayHandler): log experiment_enrollment interaction on new accounts (T387286)]]

Mentioned in SAL (#wikimedia-operations) [2025-03-04T09:23:28Z] <sgimeno@deploy2002> sgimeno: Backport for [[gerrit:1124362|analytics(HomepageHooks,BeforePageDisplayHandler): log experiment_enrollment interaction on new accounts (T387286)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-03-04T09:32:39Z] <sgimeno@deploy2002> Finished scap sync-world: Backport for [[gerrit:1124362|analytics(HomepageHooks,BeforePageDisplayHandler): log experiment_enrollment interaction on new accounts (T387286)]] (duration: 12m 01s)

Change #1124473 had a related patch set uploaded (by Sergio Gimeno; author: Sergio Gimeno):

[mediawiki/extensions/GrowthExperiments@master] analytics(GrowthExperimentsInteractionLogger): add mediawiki.database to event data

https://gerrit.wikimedia.org/r/1124473

Change #1124473 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] analytics(GrowthExperimentsInteractionLogger): add mediawiki.database to event data

https://gerrit.wikimedia.org/r/1124473

Change #1124493 had a related patch set uploaded (by Sergio Gimeno; author: Sergio Gimeno):

[mediawiki/extensions/GrowthExperiments@wmf/1.44.0-wmf.18] analytics(GrowthExperimentsInteractionLogger): add mediawiki.database to event data

https://gerrit.wikimedia.org/r/1124493

Change #1124494 had a related patch set uploaded (by Sergio Gimeno; author: Sergio Gimeno):

[mediawiki/extensions/GrowthExperiments@wmf/1.44.0-wmf.19] analytics(GrowthExperimentsInteractionLogger): add mediawiki.database to event data

https://gerrit.wikimedia.org/r/1124494

Change #1124493 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@wmf/1.44.0-wmf.18] analytics(GrowthExperimentsInteractionLogger): add mediawiki.database to event data

https://gerrit.wikimedia.org/r/1124493

Change #1124494 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@wmf/1.44.0-wmf.19] analytics(GrowthExperimentsInteractionLogger): add mediawiki.database to event data

https://gerrit.wikimedia.org/r/1124494

Mentioned in SAL (#wikimedia-operations) [2025-03-04T21:28:04Z] <jforrester@deploy2002> Started scap sync-world: Backport for [[gerrit:1124449|fix(surfacing): don't show highlights on protected pages]], [[gerrit:1124451|fix(surfacing): don't show highlights on protected pages]], [[gerrit:1124493|analytics(GrowthExperimentsInteractionLogger): add mediawiki.database to event data (T387286)]], [[gerrit:1124494|analytics(GrowthExperimentsInteractionLogger): add mediawiki.database to ev

Mentioned in SAL (#wikimedia-operations) [2025-03-04T21:32:53Z] <jforrester@deploy2002> sgimeno, jforrester, migr: Backport for [[gerrit:1124449|fix(surfacing): don't show highlights on protected pages]], [[gerrit:1124451|fix(surfacing): don't show highlights on protected pages]], [[gerrit:1124493|analytics(GrowthExperimentsInteractionLogger): add mediawiki.database to event data (T387286)]], [[gerrit:1124494|analytics(GrowthExperimentsInteractionLogger): add mediawiki.database to

Mentioned in SAL (#wikimedia-operations) [2025-03-04T21:41:32Z] <jforrester@deploy2002> Finished scap sync-world: Backport for [[gerrit:1124449|fix(surfacing): don't show highlights on protected pages]], [[gerrit:1124451|fix(surfacing): don't show highlights on protected pages]], [[gerrit:1124493|analytics(GrowthExperimentsInteractionLogger): add mediawiki.database to event data (T387286)]], [[gerrit:1124494|analytics(GrowthExperimentsInteractionLogger): add mediawiki.database to e

The variant assignment is logged at the time of account creation through the mediawiki.product_metrics.growth_product_interaction stream. An example event:

{
  "http": {
    "request_headers": {
      "user-agent": ""
    }
  },
  "meta": {
    "domain": "pt.wikipedia.org",
    "stream": "mediawiki.product_metrics.growth_product_interaction",
    "id": "39b3fb81-7cd3-4d96-b968-2fe5eeb23835",
    "dt": "2025-03-04T22:15:55.897Z",
    "request_id": "23bf7682-7f06-4ab5-b6c9-4bbcf147721a",
    "topic": "codfw.mediawiki.product_metrics.growth_product_interaction",
    "partition": 0,
    "offset": 4413
  },
  "dt": "2025-03-04T22:15:55Z",
  "mediawiki": {
    "database": "ptwiki"
  },
  "$schema": "/analytics/product_metrics/web/base/1.3.0",
  "action": "experiment_enrollment",
  "agent": {
    "client_platform": "mediawiki_php"
  },
  "action_source": "LocalUserCreatedHook",
  "performer": {
    "id": 1
  },
  "experiments": {
    "enrolled": [
      "growth-experiments"
    ],
    "assigned": {
      "growth-experiments": "surfacing-structured-task"
    }
  }
}

For the variant assignment at the time of impression of a page with recommendations, the events are logged through the same stream but with "action_source": "BeforePageDisplayHook". cc @Iflorez

Also the assignment can be observed in the Account creation section of the Growth KPIs dashboard.
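
As a rough sketch of how the per-group counts in the acceptance criteria could be derived from events of this shape, the Python below counts distinct users per variant for a given action_source. It is not an actual analysis query: `count_enrolled_users` is a hypothetical helper, and it assumes performer.id is a wiki-local user id, so users are keyed by database and id together.

```python
from collections import defaultdict
from typing import Dict, Iterable


def count_enrolled_users(events: Iterable[dict], action_source: str,
                         experiment: str = "growth-experiments") -> Dict[str, int]:
    """Count distinct users per assigned variant for one action_source."""
    users_by_variant = defaultdict(set)
    for event in events:
        if event.get("action") != "experiment_enrollment":
            continue
        if event.get("action_source") != action_source:
            continue
        variant = event.get("experiments", {}).get("assigned", {}).get(experiment)
        if variant is None:
            continue
        # A set removes duplicate events from the same user on the same wiki.
        users_by_variant[variant].add(
            (event["mediawiki"]["database"], event["performer"]["id"])
        )
    return {variant: len(users) for variant, users in users_by_variant.items()}
```

Passing "LocalUserCreatedHook" gives the account creation counts, and "BeforePageDisplayHook" gives the page impression counts, where repeated events per user are expected.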

@Sgs Thank you for the details.
I've updated the instrumentation spec to reflect the information you've noted.
The sample event and dashboard link are appreciated.