Decide which bucketing/variant assignment system we should use
Closed, Resolved, Public

Description

The scope of the task is to decide between different alternatives for assigning feature variants to users.

Context: the current GrowthExperiments experiment tooling is only set up for newly created accounts (the variant is assigned in the onLocalUserCreated hook and stored in the user property growthexperiments-homepage-variant). For the CommunityUpdates module experiment we also want to target existing accounts, but as things stand in GE that is a no-go because of the impact it could have on the user_properties table (see T54777 user_properties table bloat). The ultimate goal is to re-use the tooling provided by the Metrics Platform (T373406 Add experiment enrollment functionality to the Metrics Platform extension and T368163). However, that work is still in progress and might not be ready in time for the Sprinthackular week.

Option 0: continue using the existing variant assignment system in GrowthExperiments, which will create user property rows (around 12K; see the sample bounds in T369908: Estimate Community Updates module experiment sample size and the Sample Size estimate doc). Before running the experiment, the Growth team will remove old, unnecessary rows of the property 'growthexperiments-homepage-variant' on the experiment target wikis, arwiki and eswiki. Removing all “control” rows and setting this value as the user default should free many more rows than the experiment will create. See the eswiki numbers:

wikiadmin2023@10.64.0.47(eswiki)> select count(*) from user_properties where up_property='growthexperiments-homepage-variant';
+----------+
| count(*) |
+----------+
|   544967 |
+----------+
1 row in set (0.562 sec)

wikiadmin2023@10.64.0.47(eswiki)> select count(*) from user_properties where up_property='growthexperiments-homepage-variant' and up_value='control';
+----------+
| count(*) |
+----------+
|   464608 |
+----------+
1 row in set (13.518 sec)
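
To make the comparison concrete, the arithmetic on the eswiki figures above can be spelled out. This is only a back-of-the-envelope sketch: it assumes, pessimistically, that all of the roughly 12K new rows would land on eswiki alone, and the variable names are made up for illustration.

```python
# Figures copied from the two eswiki queries above.
total_rows = 544_967
control_rows = 464_608

# Assumption: the ~12K estimate from the task description, all counted against eswiki.
estimated_new_rows = 12_000

remaining_rows = total_rows - control_rows       # rows left once "control" rows are removed
net_freed = control_rows - estimated_new_rows    # rows freed even after the experiment adds its own
control_share = control_rows / total_rows        # fraction of rows that only store the default

print(f"rows left after cleanup: {remaining_rows}")
print(f"net rows freed: {net_freed}")
print(f"share of 'control' rows: {control_share:.1%}")
```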

Option 1: make use of $wgConditionalUserOptions and add a new condition like CUDCOND_BUCKET, which would statically and randomly divide users into variants based on their ID (a user would be considered to be within variant Y if user_id % X == Y, where X and Y are parameters). This would prevent new rows from being created unless the value is changed for a user after being assigned.
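
As a rough illustration of the user_id % X == Y rule described in Option 1, here is a minimal Python sketch. It is not the ConditionalDefaults implementation (which lives in MediaWiki core, in PHP); the function names and the two variant names are assumptions made for the example.

```python
from collections import Counter

def in_bucket(user_id: int, num_buckets: int, bucket: int) -> bool:
    """True if the user falls into `bucket` out of `num_buckets` (user_id % X == Y)."""
    if not 0 <= bucket < num_buckets:
        raise ValueError("bucket must be in [0, num_buckets)")
    return user_id % num_buckets == bucket

# Hypothetical mapping of buckets to variants for a 50/50 split (X = 2).
VARIANTS = {0: "control", 1: "treatment"}

def variant_for(user_id: int) -> str:
    """Derive the variant from the ID alone; nothing is stored per user."""
    return VARIANTS[user_id % len(VARIANTS)]

# Over a contiguous range of IDs the split is exactly even.
counts = Counter(variant_for(uid) for uid in range(1, 1001))
print(counts)
```

Because the variant is a pure function of the user ID, a stored row would only be needed when a user's variant is changed away from the computed one.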

Option 2: avoid storing anything in the user properties table by assigning feature variants with the tooling provided in T373406: Add experiment enrollment functionality to the Metrics Platform extension. Similar to Option 1, the user_id is used to create a uniformly distributed hash, which is transformed into a number used for bucketing. The trade-off of not storing the assigned variant is that it needs to be re-computed on each page render. However, the computation is cheap.
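
The general idea behind Option 2 (hash the user ID into a uniformly distributed number, then bucket on that number) can be sketched as follows. This is not the Metrics Platform algorithm from T373406; the choice of SHA-256, the key format, and the names uniform_hash and assign_variant are assumptions for illustration only.

```python
import hashlib
from typing import Dict

def uniform_hash(user_id: int, experiment: str) -> float:
    """Map (experiment, user_id) deterministically to a float in [0, 1)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Keep the top 53 bits of the first 8 bytes so the division is exact in a double.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53

def assign_variant(user_id: int, experiment: str, weights: Dict[str, float]) -> str:
    """Pick a variant by walking cumulative weights (expected to sum to 1)."""
    point = uniform_hash(user_id, experiment)
    cumulative = 0.0
    for variant, weight in weights.items():
        cumulative += weight
        if point < cumulative:
            return variant
    return variant  # guards against floating-point rounding in the weights

weights = {"control": 0.5, "treatment": 0.5}
# Re-computing on every page render always yields the same answer for the same user.
print(assign_variant(123, "example-experiment", weights))
```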

Event Timeline

@Sgs: Assuming this task is about the GrowthExperiments-Homepage codebase, hence adding that project tag so other people who don't know or don't care about some shortlived internal WMF stuff can also find this task when searching via projects or looking at workboards. Please set appropriate project tags when possible. Thanks!

Thank you 🙏 I'll keep it in mind. (For context, it is hard to decide on the codebase tag for this task as it has multiple repositories in scope, but the homepage is definitely a good enough entry point.)

Per discussion with @Urbanecm_WMF and @phuedx, the agreement is that the option with the least friction and the clearest path for the scope of the WMF-SDS 2 Sprinthackular 2024 is Option (0), as Option (1) and Option (2) require broader consensus and the remaining work goes beyond the sprint timeline. @phuedx will investigate further how Option 1 can influence the design of the ongoing work for Option 2 in T373406.

Meeting notes transcript

Pros and Cons

MU: If we add a different solution in GE we will have two competing systems. From the GE pov, doing something on top of user properties would be the easiest. The choice is between storing data in the DB and doing the computation in CD (which is in core). Doing this in core would be trickier because the algorithm now lives in an extension that cannot be called from core.
SS: agreeing entirely with the above. From Data Products, the desire to use Option (2) is that it also includes logged-out users. In the interest of getting something shipped, looking for the option with the least friction.
MU: I wonder how Option (2) deals with the anon user case. Because afaik, the problem with anon is not bucketing but CDNs.
SS: the algorithm can be implemented in JS too and run in the browser (it is cheap). Features delivered by JS can be experimented on using option 2 as the bucketing system. The CDN has the idea of a unique user agent. The algorithm can also be run in the CDN relatively cheaply, without hitting the app servers to do the computation. Option 2 is desirable but out of scope.
SS: is there any abstraction in GE on what is doing the experiments?
MU: we have an experiment user manager. The problem of that manager is it assumes at all times you only have one experiment. The EM needs to be updated to do multi experiment.
MU: if we are to make EMS aware of MetricsPlatform. One capability is to be able to set the variant which is something we call several times and would need to be changed.
SS: that is correct, the MP-way would be read-only. We are aware that writing the variant is a requirement. Worked before with some teams that needed the same requirement and solved it in ad-hoc way.
SS: there’s a good point from MU on the QA use case, it’s important

Conclusions
MU: building a variant assignment system that does not build on top of user properties. But I understand the problem of anons the MP is trying to solve. However, that problem seems to go away with temporary accounts. Not sure how many experiments on readers we run. From a strictly Growth pov I don’t see the benefit, as changing for 1 experiment does not seem worth it.
MU: I wonder if it would make sense for MP to allow extensions to implement their own bucketing algorithm.
SS: a takeaway from this is that there should be a mechanism to do something else after the bucketing has been done.
SS: there’s an overlap between CD and Data products and bucketing systems. I’d be satisfied with an experiment that does not use the MP bucketing system.
SS: One thing I’d like to try is a PoC patch of the bucketing algorithm integrated with the ConditionalDefaults code. One thing that becomes obvious is that doing bucketing in an extension is tricky because of how to call it and integrate it. I’m thinking of providing some interface in core that could then be used broadly.
MU: will review next week (after Sept 24)

While I know a decision has been made, I wanted to add a few notes about option 1 in case someone comes back to this in the future. From the current task description:

Option 1: make use of $wgConditionalUserOptions and adding a new condition like CUDCOND_BUCKET, which would statically and randomly divide users into variants based on their ID (user would be considered to be within variant Y if user_id % X == Y, where X and Y are parameters). This would prevent new rows from being created unless the value is changed for a user after being assigned.

This has been done at least once before, and we considered something similar when initially designing the approach for GrowthExperiments. I know of one experiment where user_id % 2 was used, and someone afterwards complained that it wasn't really random. From my perspective it's random in the sense of "one specific permutation of random", but it also forces consecutive IDs into buckets in a consistent manner, whereas in a random selection you'd expect some consecutive IDs to land in the same bucket.
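
The difference described above can be seen in a small Python sketch that compares user_id % 2 with a hash-based split over consecutive IDs. The hash variant is only a stand-in for "a random selection"; SHA-256 and the salt string are assumptions chosen for the example.

```python
import hashlib
from typing import List

def modulo_bucket(user_id: int) -> int:
    """Bucket by parity of the ID, as in the user_id % 2 experiment."""
    return user_id % 2

def hashed_bucket(user_id: int, salt: str = "example-experiment") -> int:
    """Bucket by one hashed byte, which behaves like a per-user coin flip."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return digest[0] % 2

def longest_run(buckets: List[int]) -> int:
    """Length of the longest streak of identical consecutive values."""
    longest = current = 1
    for previous, value in zip(buckets, buckets[1:]):
        current = current + 1 if value == previous else 1
        longest = max(longest, current)
    return longest

ids = range(1, 201)
# Parity strictly alternates, so two consecutive IDs never share a bucket.
print(longest_run([modulo_bucket(i) for i in ids]))
# A hash-based split produces streaks of neighbours in the same bucket.
print(longest_run([hashed_bucket(i) for i in ids]))
```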

We also considered an approach where if the last digit of the ID was <x we'd assign it to one group, and >=x assign it to the other group. This would also result in consecutive IDs landing in a specific group, and it would also be problematic for smaller wikis in that registrations during certain times of day are more (or less) likely to land in a specific group, which is not what we want. So in the end we went with coin flips instead, as far as I remember.

What I have recommended, at least for new experiments, is to use consistent hashing of the user ID and avoid storing anything in user_properties. Just do it on the fly. E.g. if the user ID is 123 and you want to target 20%, compute hash("123" + "experiment name") modulo 5 and take the user if the result is one chosen value (say 0). This is quite a common method in A/B testing. You don't even need to use conditional defaults.
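
A minimal sketch of that recipe in Python, assuming SHA-256 as the hash and 0 as the "take the user" remainder (the comment above does not specify either). The ":" separator is also an addition, so that user 12 with experiment "3x" does not hash the same as user 123 with experiment "x".

```python
import hashlib

def is_enrolled(user_id: int, experiment_name: str, one_in: int = 5) -> bool:
    """Deterministically enroll roughly 1/one_in of users (20% for one_in=5)."""
    key = f"{user_id}:{experiment_name}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % one_in
    return bucket == 0

# Nothing is stored: the same inputs always give the same answer.
print(is_enrolled(123, "experiment name"))
```

Including the experiment name in the hashed key means a user's result in one experiment says nothing about their result in another.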

While this is a good system in general, it makes it impossible to adjust buckets for specific users manually. In Growth's experience, this has been useful in several scenarios:

  • Development and QA: Currently, people are able to run ge.utils.setUserVariant('control') in the console and switch between buckets as they please. This means we can avoid creating dozens of accounts and hoping to get into the "right" bucket (or worse, trying to change the user ID).
  • Outreach events: In the past, we've been asked by affiliates to make it possible to provide a single experience for all attendees of their event. We fulfilled that need with geForceVariant, which is a GET parameter for Special:CreateAccount. This need is especially important if the organisers actively talk about the A/B tested feature at the event (which is nearly impossible to do if half of the attendees do not have the feature available).

Using conditional defaults makes it possible to keep the best from both approaches. For the vast majority of users, we calculate the bucket on the fly (for example, using the method you mentioned). For the edge cases where that is not advisable for one reason or another, we still retain the possibility to manually move users around (and only in those cases we actually use any storage).
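
The "best of both approaches" idea can be sketched like this: an explicit override wins when one exists, otherwise the variant is computed on the fly. The dictionary stands in for the few stored user_properties rows, and all names here are hypothetical rather than the GrowthExperiments or ConditionalDefaults API.

```python
import hashlib
from typing import Dict

VARIANTS = ("control", "treatment")  # hypothetical variant names

def computed_variant(user_id: int, experiment: str) -> str:
    """Default path: derive the variant from a hash, with no storage involved."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    return VARIANTS[digest[0] % len(VARIANTS)]

def resolve_variant(user_id: int, experiment: str, overrides: Dict[int, str]) -> str:
    """Use a stored override if it is valid, otherwise fall back to the computed variant."""
    forced = overrides.get(user_id)
    if forced in VARIANTS:
        return forced
    return computed_variant(user_id, experiment)

# Only manually moved users (QA, event attendees) ever get an entry.
overrides: Dict[int, str] = {}
default = computed_variant(42, "example-experiment")
other = next(v for v in VARIANTS if v != default)
overrides[42] = other  # e.g. a QA tester switching buckets
print(resolve_variant(42, "example-experiment", overrides) == other)
```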

You can easily define an override user_property that would be only written if there is an explicit override done by QA or events.

Or even just a cookie (maybe even a session cookie).

You can easily define an override user_property that would be only written if there is an explicit override done by QA or events.

Isn't that basically the same as using conditional defaults for this though? I'm failing to see the practical difference between using conditional defaults and building the override proposal you mentioned. In both cases, I need to make a database query when determining the bucket (to see if an appropriate user_properties row exists) and in both cases, the bucket is determined on-the-fly when there is no row. But, maybe I'm missing something obvious here, I'd like to hear more.

Or even just a cookie (maybe even a session cookie).

A cookie would probably work for developers and QA engineers, but for event participants, this would suddenly change their experience at some point (possibly even when they return home and log in on their own computer).

You can easily define an override user_property that would be only written if there is an explicit override done by QA or events.

Isn't that basically the same as using conditional defaults for this though? I'm failing to see the practical difference between using conditional defaults and building the override proposal you mentioned. In both cases, I need to make a database query when determining the bucket (to see if an appropriate user_properties row exists) and in both cases, the bucket is determined on-the-fly when there is no row. But, maybe I'm missing something obvious here, I'd like to hear more.

You can cache that user_property lookup in memcached, and such lookups are cheap. Even if you end up doing a lot of reads, it's much better than writing the rows and still doing the read again.
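
A read-through cache for the override lookup might look roughly like the sketch below. A plain dictionary stands in for memcached (so there is no expiry or invalidation here), and the class and callback names are invented for the example.

```python
from typing import Callable, Dict, Optional

class CachedOverrideLookup:
    """Read-through cache in front of a (hypothetical) user_properties lookup."""

    def __init__(self, load_from_db: Callable[[int], Optional[str]]) -> None:
        self._load = load_from_db
        self._cache: Dict[int, Optional[str]] = {}
        self.db_reads = 0  # counts how often the backing store was actually hit

    def get(self, user_id: int) -> Optional[str]:
        if user_id not in self._cache:
            self.db_reads += 1
            # "No override" (None) is cached too, since that is the common case.
            self._cache[user_id] = self._load(user_id)
        return self._cache[user_id]

rows = {7: "control"}  # stand-in for the handful of stored override rows
lookup = CachedOverrideLookup(rows.get)
lookup.get(7)
lookup.get(7)
print(lookup.db_reads)  # the second call is served from the cache
```

A real deployment would also need to drop the cached entry whenever an override is written.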

What I have recommended, at least for new experiments is to use consistent hashing of user id and avoid storing anything in user_property. Just do it on the fly. e.g. if user id is 123, and you want to target 20%, hash("123" + "experiment name") and then modulo 5, take the user. This is a quite common method in A/B testing. You don't even need to use conditional defaults.

Thanks for the recommendation, this is in fact the approach that T373406: Add experiment enrollment functionality to the Metrics Platform extension implements: the user's assigned bucket is always computed on the fly, and it only gets written to the "analytics storage" for querying.

...

While this is a good system in general, it makes it impossible to adjust buckets for specific users manually. In Growth's experience, this has been useful in several scenarios:

  • Development and QA: Currently, people are able to run ge.utils.setUserVariant('control') in console and switch between buckets as they please. This means we can avoid creating dozens of accounts, hoping you get into the "right" bucket (or worse, trying to change the user ID).

This is a relevant use case, but not a blocker for GrowthExperiments to adopt the proposed bucketing from T373406. In conversations with @phuedx we talked about mechanisms to achieve this. It would require overriding whatever variant MP initially assigns with a URL parameter value and/or a JS utility that sets the value in client storage, but it seems doable. Should we file a task for this requirement under T370880: [EPIC] FY 24/25 SDS 2.1.7 | Alpha Release of Instrument Configuration System (MPIC) so we don't forget it? cc @phuedx

  • Outreach events: In the past, we've been asked by affiliates to make it possible to provide single experience for all attendees of their event. We fulfilled that need with geForceVariant, which is a GET parameter for Special:CreateAccount. This need is especially important if the organisers actively talk about the A/B tested feature at the event (it is nearly impossible to do if half attendees do not have the feature available).

Using conditional defaults makes it possible to keep the best from both approaches. For the vast majority of users, we calculate the bucket on the fly (for example, using the method you mentioned). For the edge cases where that is not advisable for one reason or another, we still retain the possibility to manually move users around (and only in those cases we actually use any storage).

While I understand why a campaign organizer would want to customize the experience of their attendees, doing that through an experiment seems to conflict with the definition of an experiment itself, in which we want everyone in the treatment group to be treated equally. Mixing metrics from users who receive in-person support with users who don't does not seem good for result analysis. In order to prevent a divergent experience at a campaign event, attendees shouldn't be enrolled in any experiment at all. I'm not sure an experimentation platform should be used to create customized experiences for groups of users. The use case for programmatically enrolling users into a treatment group remains open on the MetricsPlatform side. On the GrowthExperiments side, we could provide the same functionality for campaigns as we do now by overriding the enrolled experiments and assigned variants when a user registers through a campaign. What do you think? @Urbanecm_WMF @phuedx

Per discussions with Growth engineers, the experiment and variant assignment manager used for the Community Updates feature will remain the GrowthExperiments one. That is because adopting the MP variant assignment as it is now would introduce more tech debt than benefits. In particular, there are features that the existing manager provides that we would like to be able to use with the MP manager. These are:

  • Allow to (force) set an experiment and experiment variant for a given user.
    • Use case 1: GrowthExperiments provides a convenience JS utility to the team's QA testers so they can test all variants in beta and testwiki before we actually start the experiment on the target wikis, e.g. ge.utils.setUserVariant('somevariant').
    • Use case 2: Growth team has conducted experiments targeting a particular group of users, for instance the attendees to an offline campaign event. In this scenario, GE provides a way for event organizers to ensure all the attendees that register within the event get the same experiment variant assigned. This way a consistent experience during the event is ensured and the Growth team can experiment with a particular group of newcomers. Currently this is done via a query parameter in the create account page, eg: Special:CreateAccount?campaign=some-campaign&geForceVariant=some-experiment-variant
  • Allow to target only new accounts, existing accounts, existing accounts of X age
    • Use case: Growth team is heavily focused on newcomers, hence the need to run some experiments including only newly registered accounts. On the other hand, targeting existing accounts is also desired for other experiments. The ability to configure this would be ideal.
  • Per platform variant sampling rate
    • Growth variant assignment system can be configured on a per-platform basis. In this context platform refers to our desktop and mobile site. This is because each platform receives different traffic and the experiment may want to adjust for this. But it could also be because a feature is primarily designed for one platform or the other. An example of this config in GE could be:
'control' => [
    'mobile' => 30,
    'desktop' => 50,
],
'some-variant' => [
    'mobile' => 70,
    'desktop' => 50,
],
  • Nice to have: UI dashboard to create experiments. This is not a feature GE currently has, but it is a very interesting one that would make adopting MP experiment management very attractive for the team. Currently, setting up an experiment in GE requires quite some manual engineering work. Aside from creating the instrument(s), the experiment variants need to be set in PHP constants, then the config needs to be updated and backported, which is what determines the experiment start. It would be ideal if a PM or data scientist could start an experiment with the least engineering effort through some dashboard.
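
One way to read the per-platform sampling config from the list above is sketched below in Python. It is an illustration, not GE's code: GE assigns variants at account creation, whereas this sketch uses a deterministic hash so that the example is reproducible, and the function name and salt are assumptions.

```python
import hashlib
from typing import Dict

# Mirrors the example config: percentage per variant and platform.
VARIANT_RATES: Dict[str, Dict[str, int]] = {
    "control": {"mobile": 30, "desktop": 50},
    "some-variant": {"mobile": 70, "desktop": 50},
}

def pick_variant(user_id: int, platform: str) -> str:
    """Choose a variant using the percentages configured for the given platform."""
    weights = {name: rates[platform] for name, rates in VARIANT_RATES.items()}
    if sum(weights.values()) != 100:
        raise ValueError(f"rates for {platform!r} must add up to 100")
    digest = hashlib.sha256(f"per-platform-example:{user_id}".encode("utf-8")).digest()
    point = int.from_bytes(digest[:4], "big") % 100  # a number from 0 to 99
    threshold = 0
    for name, weight in weights.items():
        threshold += weight
        if point < threshold:
            return name
    raise AssertionError("unreachable when the rates add up to 100")

print(pick_variant(123, "mobile"), pick_variant(123, "desktop"))
```

Note that the same user can land in different variants depending on the platform, so the platform would have to be fixed at the moment of assignment.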

Let me know if this feedback is good enough for Product-Analytics cc @phuedx @VirginiaPoundstone @WDoranWMF. I'm happy to elaborate and discuss each feature/requirement further.

  • Allow to (force) set an experiment and experiment variant for a given user.
    • Use case 1: GrowthExperiments provides a convenience JS utility to the team QA testers so they can test all variants in beta and teswiki before we actually start the experiment in target wikis, eg: ge.utils.setUserVariant('somevariant').
    • Use case 2: Growth team has conducted experiments targeting a particular group of users, for instance the attendees to an offline campaign event. In this scenario, GE provides a way for event organizers to ensure all the attendees that register within the event get the same experiment variant assigned. This way a consistent experience during the event is ensured and the Growth team can experiment with a particular group of newcomers. Currently this is done via a query parameter in the create account page, eg: Special:CreateAccount?campaign=some-campaign&geForceVariant=some-experiment-variant

Covered in T375900: Allow users to override experiment enrollment. I'll update that task to cover documenting that the overrides are temporary (at least for logged-out users).
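
For reference, reading a forced variant from a URL such as the Special:CreateAccount example quoted above could be sketched as follows. The host name, the allow-list and the function name are assumptions; this is neither GE's implementation nor the design tracked in T375900.

```python
from typing import Optional, Sequence
from urllib.parse import parse_qs, urlsplit

def forced_variant(url: str, allowed: Sequence[str]) -> Optional[str]:
    """Return the geForceVariant value if present and allowed, else None."""
    query = parse_qs(urlsplit(url).query)
    values = query.get("geForceVariant", [])
    if values and values[0] in allowed:
        return values[0]
    return None  # missing or unknown values fall back to normal assignment

url = (
    "https://example.org/wiki/Special:CreateAccount"
    "?campaign=some-campaign&geForceVariant=some-experiment-variant"
)
print(forced_variant(url, ["control", "some-experiment-variant"]))
```

Checking the value against an allow-list keeps arbitrary strings in the URL from being stored as a variant.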

  • Allow to target only new accounts, existing accounts, existing accounts of X age
    • Use case: Growth team is heavily focused on newcomers, hence the need to run some experiments including only newly registered accounts. On the other hand, targeting existing accounts is also desired for other experiments. The ability to configure this would be ideal.
  • Per platform variant sampling rate
    • Growth variant assignment system can be configured on a per-platform basis. In this context platform refers to our desktop and mobile site. This is because each platform receives different traffic and the experiment may want to adjust for this. But it could also be because a feature is primarily designed for one platform or the other. An example of this config in GE could be:
'control' => [
  'mobile' => 30,
  'desktop' => 50,
],
'some-variant' => [
    'mobile' => 70,
    'desktop' => 50
]

Custom segmentation is something that we're thinking about right now and will be exploring in the near future. We're monitoring Growth's improvements to the Conditional User Defaults mechanism as well.

  • Nice to have: UI dashboard to create experiments. This is not a current feature GE has but definitely a very interesting one that would make the adoption of MP experiment management very interesting for the team. Currently, setting up an experiment in GE requires quite some manual engineering work. Aside from creating the instrument(s), the experiment variants need to be set in PHP constants, then the config needs to updated and backported, which is what determines the experiment start. It would be ideal if a PM or data scientist could start an experiment with the least engineering effort through some dashboard.

We're currently working on this 🎉 You can see an early version of this at https://mpic-next.wikimedia.org/.

Custom segmentation is something that we're thinking about right now and will be exploring in the near future. We're monitoring Growth's improvements to the Condition User Defaults mechanism as well.

Further conversations with @nettrom_WMF have revealed that per-platform distribution was used in experiments where the feature under evaluation was only available on one platform. I believe that in this scenario, using the bucket distribution to prevent users from being enrolled in an experiment on one platform is not ideal. If a feature is not available on one platform, there should be a feature flag that prevents showing it, rather than depending on users being in the control group. In conclusion, I believe this particular feature request is invalid. On the other hand, @nettrom_WMF suggested that the ideal enrollment mechanism would assign a variant only to users that meet some criteria, e.g. visiting the Homepage. I believe this is possible, but it is the responsibility of the consumers of Metrics Platform rather than something that is built in at this point. But I'm curious to hear your thoughts, @phuedx.
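
The separation argued for above (a feature flag decides availability, eligibility criteria decide enrollment, and only then is a variant assigned) could be sketched on the consumer side like this. Everything here is hypothetical: the flag, the criteria, the variant names and the hashing are illustration only and not part of Metrics Platform.

```python
import hashlib
from typing import Optional, Set

# Hypothetical feature flag: the feature only exists on desktop.
FEATURE_PLATFORMS: Set[str] = {"desktop"}

def variant_if_eligible(user_id: int, platform: str, visited_homepage: bool) -> Optional[str]:
    """Assign a variant only to users who can see the feature and meet the criteria."""
    if platform not in FEATURE_PLATFORMS or not visited_homepage:
        # Not enrolled at all, instead of being silently counted as "control".
        return None
    digest = hashlib.sha256(f"example-experiment:{user_id}".encode("utf-8")).digest()
    return "treatment" if digest[0] % 2 else "control"

print(variant_if_eligible(1, "mobile", True))    # feature unavailable, so no enrollment
print(variant_if_eligible(1, "desktop", False))  # criteria not met, so no enrollment
```

Keeping ineligible users out of the experiment entirely avoids diluting the control group with users who could never have seen the feature.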

We're currently working on this 🎉 You can see an early version of this at https://mpic-next.wikimedia.org/.

Cool <3, could you point me to the Phab task/epic that tracks that work? Ty!