Page MenuHomePhabricator

Update Metrics Platform Client Libraries to accept experiment membership
Closed, ResolvedPublic5 Estimated Story Points

Description

T370880

In order to associate user interactions (event data) to experiments, we need a way of recording that membership data. We also need that way to support membership in multiple concurrent experiments. After evaluating two potential data modeling approaches based on complexity of analysis and implementation, ability to extend the model in the future, and performance considerations, we decided that membership should be recorded using the following data model:

experiments: struct<
  enrolled: array<string>,
  assigned: map<string, string>
  [, [property]: map<string, [type]>]
>

where:

  • enrolled is an array of names of all experiments that the subject is currently enrolled in
  • assigned is a map of which experiment groups (variants) the subject has been assigned to
    • the key is the experiment name present in the enrolled array
    • the value is the assigned experiment group (variant)
  • any additional properties would store additional data about the experiments using the same model as assigned, where the keys match the values found in the enrolled array.

More details in the decision record.

Technical details

Schema updates

A new experiment fragment must be created ( (according to the instructions above) ) to be added to the common one. After that, the new common fragment will be used to create a new version for base schemas (app and web). Integration with common and base fragments will be done at T366802: Update Metrics Platform Base Schemas to include instrument name

JS Client library
  • We need to create a new function to get the current user experiments membership to be able to add that information automatically to every produced event. The function must be created there where Metrics Platform is integrated, so we need to define it in the Integration interface and provide an implementation where needed. A default one for DefaultIntegration.js and specific ones for MediaWikiMetricsClientIntegration in the case of the JS client library. The new function will be used as a part of the submitInteracion() function to be able to add that information to every event row.
  • We also need to create another function that provides the assigned bucket value for a specific experiment for the current user. This function will be used by instrument developers to know which variant has been assigned to the current user and, that way, they will know what needs to be implemented for every case. Something like getExperimentBucketValue(experimentName) that returns the bucket value that is assigned for the current user. Also as a part of the integration with Metrics Platform, we should add this function to the same place as before
PHP Client library

TBD

Requirements

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

What if the user is enrolled in multiple, non-overlapping (hopefully) experiments that are running on the same page? Should we rather add an experiments property, which is an array of experiment objects? If so, how would that impact querying the data?

That's a great point!

I did some tests with two different ways to model this data: https://gist.github.com/bearloga/b0ca0b3ebd7ca427beeb68fe920a66e7 and what would Spark SQL and Presto queries look like in each case.

I'll check with my team if there's a strong preference. (Slack thread)

Is there a preference for version 1 or 2 from an engineering perspective?

Andrew brought up an excellent point (in the same thread) which is that https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas/Guidelines#Complex_array_element_and_map_value_type_evolution_is_not_well_supported so if we did go the array-of-objects (version 1) route, we MAY want array<map<string,string>> not array<struct<id:int,name:string,group:string>> in case there is anything we would need to add later. Like, if we needed to include a new field for each experiment it'd be easy with the former and impossible with the latter.

@phuedx: I see in the Figma designs that instruments/experiments will have user-provided unique names with machine-readable versions auto-generated. Will those instruments/experiments have numeric IDs in the database that also uniquely identify them?

Maybe:

  • For MPIC-managed experiments we can record numeric ID and the unique name (to avoid joining against MPIC's table of experiments to get the name from ID)
  • For non-MPIC-managed experiments (e.g. the A/B test that Android team is planning) we would omit the numeric ID (there is none) and have the client fill in the experiment name

Question for @MNeisler & @nettrom_WMF: for MPIC-managed experiments, would you prefer the name as entered by the user in MPIC (which would be visible in MPIC) OR the machine-readable version generated by MPIC? Consider which version you might prefer to use in WHERE statements for selecting interaction data from a specific experiment:

WHERE experiment.name = 'Personalized search results (WE 3.1.5)'

-- OR --

WHERE experiment.name = 'personalized_search_results_we_3-1-5'

(Totally guessing at what the machine-readable version would look like.)

@phuedx: I see in the Figma designs that instruments/experiments will have user-provided unique names with machine-readable versions auto-generated. Will those instruments/experiments have numeric IDs in the database that also uniquely identify them?

Maybe:

  • For MPIC-managed experiments we can record numeric ID and the unique name (to avoid joining against MPIC's table of experiments to get the name from ID)
  • For non-MPIC-managed experiments (e.g. the A/B test that Android team is planning) we would omit the numeric ID (there is none) and have the client fill in the experiment name

Agreed. Name is required. ID is optional.

To your question to @MNeisler and @nettrom_WMF: Should we include both names here – the machine- and human-readable ones?

@phuedx: I reviewed the modeling approaches in PA team sharing (notes w/ link to transcript & recording) and Morten brought up a great point which is the computational cost of using virtual table generation (LATERAL VIEW EXPLODE() in Spark SQL and CROSS JOIN UNNEST() in Presto) for version 1 which reminded him of evaluating custom data approach of monoschema MP.

There's some preference for the neatness of querying with version 2 (which I agree with). I think the performance considerations about version 1 are very valid and both of these make version 2 a more favorable candidate at the moment. (But still waiting for more feedback from Megan & Morten before making the final selection.)

I don't think we need to invest time into benchmarking these unless there is a very strong preference for version 1 from engineering perspective and we need to demonstrate practical differences in query execution time when the dataset is, say, 100K events. (The hypothesis is that version 1 would be substantially slower / more expensive to query.)

To your question to @MNeisler and @nettrom_WMF: Should we include both names here – the machine- and human-readable ones?

I suppose the only issue (if you'd even want to even call it that) with #2 is that we would need to update the schema if we wanted to capture new information about experiments., e.g:

In the comment above, I suggested that we capture both machine- and human-readable names of an experiment. This would require the experiments struct be updated to

experiments: struct<
  enrolled: array<string>,
  assigned: map<string, string>,
  names: map<string, string>
>

I really don't see this (needing to update schemas) as a huge issue from an engineering perspective.

Collected various discussions & points into this decision brief (viewable by public). Will follow up here when the decision is made.

mpopov added a subscriber: VirginiaPoundstone.

@phuedx @VirginiaPoundstone: Okie dokie, I updated the task description with the data model specification and updated the requirements based on a guess of what is involved.

Sfaci set the point value for this task to 5.Aug 27 2024, 8:26 AM

Change #1067306 had a related patch set uploaded (by Santiago Faci; author: Santiago Faci):

[schemas/event/secondary@master] Adding a new 'experiments' fragment to collect data about which experiments is a subject enrolled in

https://gerrit.wikimedia.org/r/1067306

Change #1067306 merged by jenkins-bot:

[schemas/event/secondary@master] Adding a new 'experiments' fragment to collect data about which experiments is a subject enrolled in

https://gerrit.wikimedia.org/r/1067306

Sfaci updated the task description. (Show Details)

@VirginiaPoundstone is it ok to spin off the corresponding Java lib updates AC into another task for a later sprint? since we narrowed scope to MW as the primary use case for Growth's experiment, we're focused on the JS/PHP libraries for now.

Yes.

We may want to also include the rollback required for the char count when we do the Java library.

Change #1073484 had a related patch set uploaded (by Jennifer Ebe; author: Jennifer Ebe):

[mediawiki/extensions/EventLogging@master] T368326-update-metrics-platform-to-accept-experiment-membership

https://gerrit.wikimedia.org/r/1073484

Apart from the EventLogging change above, there is also a metrics-platform one ready for review: https://ggitlab.wikimedia.org/repos/data-engineering/metrics-platform/-/merge_requests/71

Change #1073484 had a related patch set uploaded (by Santiago Faci; author: Jennifer Ebe):

[mediawiki/extensions/EventLogging@master] T368326-update-metrics-platform-to-accept-experiment-membership

https://gerrit.wikimedia.org/r/1073484

Change #1073484 merged by jenkins-bot:

[mediawiki/extensions/EventLogging@master] T368326-update-metrics-platform-to-accept-experiment-membership

https://gerrit.wikimedia.org/r/1073484

Change #1080053 had a related patch set uploaded (by Santiago Faci; author: Santiago Faci):

[mediawiki/extensions/EventLogging@master] MediaWikiMetricsClientIntegration: quick fix for isCurrentUserEnrolled() function

https://gerrit.wikimedia.org/r/1080053

Change #1080053 merged by jenkins-bot:

[mediawiki/extensions/EventLogging@master] MediaWikiMetricsClientIntegration: quick fix for isCurrentUserEnrolled() function

https://gerrit.wikimedia.org/r/1080053

Change #1080068 had a related patch set uploaded (by Jennifer Ebe; author: Jennifer Ebe):

[mediawiki/extensions/EventLogging@master] Create-Tests-For-Metrics-Platform

https://gerrit.wikimedia.org/r/1080068

Change #1080068 merged by jenkins-bot:

[mediawiki/extensions/EventLogging@master] Create Tests: Add tests for MediaWikiMetricsClientIntegation#isCurrentUserEnrolled()

https://gerrit.wikimedia.org/r/1080068

It looks like the documentation updates specified in the task description haven't been done, assuming this change includes user-facing library changes