
MPIC (aka EPIC): Create a plan for dogfooding the alpha release
Closed, Resolved | Public | 5 Estimated Story Points

T370880

Description

Create a dogfooding plan for the MPIC alpha release.

The goal is to test, and gain confidence in, MP's CTR instrument and experiment end to end. MPIC holds the instrument and experiment configuration. A MediaWiki extension uses that configuration to sample and bucket users in an A/B test. Events carrying instrument and experiment tracking data are then sent to MP's base web stream and, eventually, to MP's monotable in Hive.

While some parts may be tested in isolation and via automation (e.g. testing API endpoints), the plan must include an approach for end-to-end testing from producer to consumer of the data.

For the MPIC prototype, we used an existing instrument's production stream config to test MP's overriding functionality using a ToolForge instance of a MediaWiki installation.

Questions / Considerations:

  • Do we want to test an actual instrument and see it send data to Hive? to the beta stream? Yes
  • Will a ToolForge instance suffice? No - we want to test in a production environment before the first product team uses the Experiment Platform apparatus
  • Do we deploy to TestWiki with production stream and producer config?
  • Do we create a fake experiment with MP's CTR instrument with production config and send dummy data to Hive? to beta stream?
  • What are the metrics/scope for success?
    • a validated event? does scoping event to the beta stream suffice?
    • data in Hive? do we send test data to a production table?

Technical Notes

See the Proposed Changes/Additions section of the MPIC alpha implementation plan to understand interdependent parts of the system that will enable a simple A/B test:

EPIC (fka MPIC)

frontend - UI for instrument and experiment configuration

backend - API for querying instrument and experiment configs:

Code repo - https://gitlab.wikimedia.org/repos/data-engineering/mpic

Metrics Platform extension (rename TBD - T381285: Create work plan for renaming of Metrics Platform to Experimentation Lab)

Function:

  • Fetches instrument and experiment configs from MPIC public APIs
  • Merges instrument configs into EventStreamConfig's API for exporting event stream configs.
  • Sets config var for user's experiments' enrollment data
  • Buckets logged-in users who are enrolled in an active experiment into control and treatment cohorts

Code repo - https://gerrit.wikimedia.org/g/mediawiki/extensions/MetricsPlatform
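
As an illustration of the bucketing step listed above, here is a minimal sketch of deterministic, hash-based assignment of users to cohorts. It is not the MetricsPlatform extension's code: the function name `assign_bucket`, its parameters, the experiment name used in the usage note, and the hashing scheme are all assumptions made for the example.

```python
import hashlib
from typing import Optional


def assign_bucket(user_id: int, experiment_name: str, sample_rate: float = 1.0) -> Optional[str]:
    """Return "control", "treatment", or None when the user is not sampled.

    Hypothetical helper: the name, arguments, and hashing scheme are
    assumptions for illustration, not the extension's implementation.
    """
    key = f"{experiment_name}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Reduce the first 4 bytes of the hash to a position in [0, 1).
    position = (int.from_bytes(digest[:4], "big") % 10_000) / 10_000
    if position >= sample_rate:
        return None  # outside the sampled population
    # Split the sampled range evenly between the two cohorts.
    return "control" if position < sample_rate / 2 else "treatment"
```

Because the assignment depends only on the user ID and the experiment name, calling `assign_bucket(42, "exlab-test-1")` repeatedly yields the same cohort, which keeps a user's experience stable across page loads.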

Metrics Platform client libraries (rename TBD - T381285: Create work plan for renaming of Metrics Platform to Experimentation Lab)

Function:

  • Accepts instrument name as a unique identifier for MP's monotable
  • Fetches experiment enrollment data for sending with events by checking for config var set by Metrics Platform extension.
  • Contains standardized instruments for collecting data on Clickthrough Rates
  • Submits events to EventGate intake service

Work in progress:

Code repo - https://gitlab.wikimedia.org/repos/data-engineering/metrics-platform
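
To make the client library's role more concrete, the following sketch shows how an event might combine an instrument name with enrollment data before submission. The field names (`instrument_name`, `action`, `dt`, `experiments`) and the shape of the enrollment data are assumptions; the sketch does not reproduce the real product_metrics schema or the EventGate request format.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def build_event(
    instrument_name: str,
    action: str,
    user_experiments: Optional[Dict[str, Any]] = None,
    now: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Assemble an event dict (hypothetical field names, for illustration only)."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    event: Dict[str, Any] = {
        "instrument_name": instrument_name,  # assumed identifier field
        "action": action,                    # e.g. "impression" or "click"
        "dt": timestamp,
    }
    # Attach enrollment data only when the config var provided any.
    if user_experiments:
        event["experiments"] = dict(user_experiments)
    return event
```

An empty or missing enrollment value, as was later observed for wgMetricsPlatformUserExperiments on wikitech, would produce an event without any experiment data in this sketch.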

Metrics Platform schemas (rename TBD - T381285: Create work plan for renaming of Metrics Platform to Experimentation Lab)

Function:

  • Validates events sent via MP client libraries
  • Enforces data contract of MP-based instruments

Code repo - https://gitlab.wikimedia.org/repos/data-engineering/schemas-event-secondary/-/tree/master/jsonschema/analytics/product_metrics
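
The idea of a schema enforcing a data contract can be sketched with a small hand-rolled check. The required fields below are assumptions chosen for the example and are not taken from the product_metrics schemas; the real validation is done against JSON Schema, which this sketch does not attempt to replicate.

```python
from typing import Any, Dict, List

# Hypothetical contract: field name -> expected Python type.
REQUIRED_FIELDS = {"instrument_name": str, "action": str, "dt": str}


def validate_event(event: Dict[str, Any]) -> List[str]:
    """Return a list of contract violations; an empty list means the event passes."""
    problems: List[str] = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in event:
            problems.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            problems.append(f"wrong type for {field}: expected {expected_type.__name__}")
    return problems
```

Returning every violation at once, rather than stopping at the first, makes it easier to debug an instrument that sends several malformed fields.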

Acceptance Criteria

  • A plan is documented
  • Necessary tickets are created

Related Objects

Status | Assigned
Resolved | cjming
Resolved | cjming
Resolved | phuedx
Resolved | phuedx
Resolved | JVanderhoop-WMF
Resolved | phuedx
Resolved | cjming
Declined | cjming
Resolved | cjming
Invalid | None
Resolved | cjming
Invalid | None
Declined | None
Invalid | None
Invalid | None
Duplicate | None
Declined | None
Declined | None
Resolved | phuedx
Resolved | phuedx
Resolved | cjming

Event Timeline


Mentioned in SAL (#wikimedia-operations) [2025-01-20T14:36:19Z] <lucaswerkmeister-wmde@deploy2002> Started scap sync-world: Backport for [[gerrit:1112707|Add dedicated experimentation lab test module (T373715)]]

Mentioned in SAL (#wikimedia-operations) [2025-01-20T14:40:46Z] <lucaswerkmeister-wmde@deploy2002> lucaswerkmeister-wmde, cjming: Backport for [[gerrit:1112707|Add dedicated experimentation lab test module (T373715)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-01-20T14:48:31Z] <lucaswerkmeister-wmde@deploy2002> Finished scap sync-world: Backport for [[gerrit:1112707|Add dedicated experimentation lab test module (T373715)]] (duration: 12m 12s)

Change #1113185 had a related patch set uploaded (by Clare Ming; author: Clare Ming):

[mediawiki/extensions/WikimediaEvents@master] Fix schema version for CTR instrument

https://gerrit.wikimedia.org/r/1113185

Change #1113185 abandoned by Clare Ming:

[mediawiki/extensions/WikimediaEvents@master] Fix schema version for CTR instrument

Reason:

wrong branch

https://gerrit.wikimedia.org/r/1113185

Verified data in Hive after backport of schema version change with the following query:

select * from product_metrics_web_base where year=2025 and month=1 and day=21 and hour>20;
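
Verification queries of this kind filter on the table's year/month/day/hour partitions. As a convenience, here is a small sketch that builds such a query from a timestamp. The helper name `verification_query` and its parameters are hypothetical, and it uses `hour>=` so that the hour of the deployment itself is included.

```python
from datetime import datetime


def verification_query(table: str, since: datetime, select_expr: str = "count(*)") -> str:
    """Build a partition-filtered query covering `since`'s hour to the end of that day.

    Hypothetical helper for illustration; it only formats a query string.
    """
    where = (
        f"year={since.year} and month={since.month} "
        f"and day={since.day} and hour>={since.hour}"
    )
    return f"select {select_expr} from {table} where {where};"
```

For example, `verification_query("product_metrics_web_base", datetime(2025, 1, 21, 21), "*")` produces a query equivalent to the one above, since `hour>=21` and `hour>20` select the same partitions.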

Virginia's checklist (see T373715#10459084)

Answer the following questions for logged-in users:

Do docs support the process? WIP
Is CTR instrument data collection accurate / data correctness? in review
Does our system support the load we would expect to see from wikis? uncertain
Does our bucketing feature work as expected? in review

Change #1113511 had a related patch set uploaded (by Clare Ming; author: Clare Ming):

[operations/mediawiki-config@master] Enable ExLab test 1 experiment to wikitech

https://gerrit.wikimedia.org/r/1113511

Change #1113512 had a related patch set uploaded (by Clare Ming; author: Clare Ming):

[operations/mediawiki-config@master] Add a few more contextual attributes to web base

https://gerrit.wikimedia.org/r/1113512

Things that I've observed/noted whilst working on this with @Sfaci and @cjming:

  1. We can't rely on browser DevTools (see T384307). In certain circumstances, browser DevTools report beacon requests as either not happening or failing, even though the requests succeed in the background. We need to increase confidence in the browser-side part of the analytics event submission pipeline, both for ourselves and for feature teams
    • This could include simplifying the event submission pipeline by removing BackgroundQueue entirely
  2. The MetricsPlatform extension doesn't have much debug logging. Figuring out why the MetricsPlatform extension wasn't working was difficult with only a single warning to go on
  3. We're logging performance metrics to Prometheus. We should create a dashboard

Change #1113511 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable ExLab test 1 experiment to wikitech

https://gerrit.wikimedia.org/r/1113511

Mentioned in SAL (#wikimedia-operations) [2025-01-22T21:07:34Z] <cjming@deploy2002> Started scap sync-world: Backport for [[gerrit:1113511|Enable ExLab test 1 experiment to wikitech (T373715)]]

Mentioned in SAL (#wikimedia-operations) [2025-01-22T21:13:52Z] <cjming@deploy2002> cjming: Backport for [[gerrit:1113511|Enable ExLab test 1 experiment to wikitech (T373715)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-01-22T21:20:56Z] <cjming@deploy2002> Finished scap sync-world: Backport for [[gerrit:1113511|Enable ExLab test 1 experiment to wikitech (T373715)]] (duration: 13m 22s)

Change #1113512 merged by jenkins-bot:

[operations/mediawiki-config@master] Add a few more contextual attributes to web base

https://gerrit.wikimedia.org/r/1113512

Mentioned in SAL (#wikimedia-operations) [2025-01-22T21:22:30Z] <cjming@deploy2002> Started scap sync-world: Backport for [[gerrit:1113512|Add a few more contextual attributes to web base (T373715)]]

Mentioned in SAL (#wikimedia-operations) [2025-01-22T21:27:12Z] <cjming@deploy2002> cjming: Backport for [[gerrit:1113512|Add a few more contextual attributes to web base (T373715)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-01-22T21:34:11Z] <cjming@deploy2002> Finished scap sync-world: Backport for [[gerrit:1113512|Add a few more contextual attributes to web base (T373715)]] (duration: 11m 41s)

  1. We can't rely on browser DevTools (see T384307). In certain circumstances, browser DevTools report beacon requests as either not happening or failing, even though the requests succeed in the background. We need to increase confidence in the browser-side part of the analytics event submission pipeline, both for ourselves and for feature teams
    • This could include simplifying the event submission pipeline by removing BackgroundQueue entirely

I don't think I understand the problem or the solution well enough to write up a ticket. Should it be a spike?

  2. The MetricsPlatform extension doesn't have much debug logging. Figuring out why the MetricsPlatform extension wasn't working was difficult with only a single warning to go on

Filed T384562: MetricsPlatform: Add more debug logging

  3. We're logging performance metrics to Prometheus. We should create a dashboard

Filed T384563: Create dashboard for performance logging to Prometheus

I'm seeing strange behavior on testwiki and wikitech in Chrome:

Screenshot 2025-01-22 at 9.04.39 PM.png (1×3 px, 811 KB)

I can't figure out if this is just me; it doesn't seem to happen in Firefox. I also don't know how these wikis normally behave. I wanted to trigger events for the ExLab Test 1 experiment, but each site seems to choke with an ERR_INTERNET_DISCONNECTED error after I reload the page a few times. Will keep investigating...

Hi @phuedx and @Sfaci -- I am not seeing events for the following queries after I deployed the ExLab Test 1 experiment to wikitech at 21:00 UTC:

select count(*) from product_metrics_web_base where year=2025 and month=1 and day=22 and hour>20;

select count(*) from product_metrics_web_base where year=2025 and month=1 and day=23;

I'm also not sure that I'm triggering events properly, because of the buggy behavior I noted in T373715#10487297

No events yet at this time!

I tried this morning (testing wikitech) and I don't see any impression or click events being sent from wikitech. In fact, our wgMetricsPlatformUserExperiments config variable is empty for that wiki. After seeing that, I took a look at the logs (as Sam showed me the other day using the WikimediaDebug extension) and found the following in Logstash, which is the same message we saw when testing testwiki for the first time. It seems that wikitech is loading an old version of the MetricsPlatform extension, right?:
Dependencies not met for the Metrics Platform Instrument Configs Fetcher.

Screenshot 2025-01-23 at 10.40.29.png (92×1 px, 20 KB)

What I don't know is what the errors you showed before mean, because I see no requests related to sending events in the Network tab when testing wikitech, at least not when reloading the page and clicking the hide button of the main menu.

I have just realized that the name of the wiki that appears related to the error I have pasted above is labswiki. Should we use that name in this context?

I tried this morning (testing wikitech) and I don't see any impression or click events being sent from wikitech. In fact, our wgMetricsPlatformUserExperiments config variable is empty for that wiki. After seeing that, I took a look at the logs (as Sam showed me the other day using the WikimediaDebug extension) and found the following in Logstash, which is the same message we saw when testing testwiki for the first time. It seems that wikitech is loading an old version of the MetricsPlatform extension, right?:
Dependencies not met for the Metrics Platform Instrument Configs Fetcher.

Confirmed. Looking at https://versions.toolforge.org/, Wikitech (DBname: labswiki) is still running the -wmf.12 branch.

Verified that, after we rolled out the test experiment to wikitech (labswiki), we have impressions and clicks!

Screenshot 2025-01-24 at 2.47.19 PM.png (240×2 px, 65 KB)

More verification queries for labswiki:

Screenshot 2025-01-28 at 10.11.36 PM.png (1×2 px, 499 KB)

Sample event data - impressions and clicks on labswiki:

Superset dashboard for testwiki and labswiki:
https://superset.wikimedia.org/superset/dashboard/p/D2EB5pJvyqp/

Change #1118810 had a related patch set uploaded (by Phuedx; author: Phuedx):

[operations/mediawiki-config@master] [Experiment Platform]: Disable experiments

https://gerrit.wikimedia.org/r/1118810

Change #1118810 merged by jenkins-bot:

[operations/mediawiki-config@master] [Experiment Platform]: Disable experiments

https://gerrit.wikimedia.org/r/1118810

Mentioned in SAL (#wikimedia-operations) [2025-02-11T14:22:02Z] <urbanecm@deploy2002> Started scap sync-world: Backport for [[gerrit:1118810|[Experiment Platform]: Disable experiments (T373715 T383801)]], [[gerrit:1118811|refactor(AddLink): ignore rows with null in Store (T382270)]]

Mentioned in SAL (#wikimedia-operations) [2025-02-11T14:24:59Z] <urbanecm@deploy2002> phuedx, migr, urbanecm: Backport for [[gerrit:1118810|[Experiment Platform]: Disable experiments (T373715 T383801)]], [[gerrit:1118811|refactor(AddLink): ignore rows with null in Store (T382270)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-02-11T14:33:39Z] <urbanecm@deploy2002> Finished scap sync-world: Backport for [[gerrit:1118810|[Experiment Platform]: Disable experiments (T373715 T383801)]], [[gerrit:1118811|refactor(AddLink): ignore rows with null in Store (T382270)]] (duration: 11m 36s)

To close out this ticket, we have a slide deck with an embedded demo that documents the success of the dogfooding experiment: xLab Technical Details
(demo on last slide)

Change #1122226 had a related patch set uploaded (by Clare Ming; author: Clare Ming):

[mediawiki/extensions/WikimediaEvents@master] Start test experiment for all enrolled users.

https://gerrit.wikimedia.org/r/1122226

Change #1122226 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] Start test experiment for all enrolled users.

https://gerrit.wikimedia.org/r/1122226

Change #1122236 had a related patch set uploaded (by Clare Ming; author: Clare Ming):

[mediawiki/extensions/WikimediaEvents@wmf/1.44.0-wmf.17] Start test experiment for all enrolled users.

https://gerrit.wikimedia.org/r/1122236

Change #1122236 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@wmf/1.44.0-wmf.17] Start test experiment for all enrolled users.

https://gerrit.wikimedia.org/r/1122236

Mentioned in SAL (#wikimedia-operations) [2025-02-24T22:59:30Z] <cjming@deploy2002> Started scap sync-world: Backport for [[gerrit:1122236|Start test experiment for all enrolled users. (T373715)]]

Mentioned in SAL (#wikimedia-operations) [2025-02-24T23:02:14Z] <cjming@deploy2002> cjming: Backport for [[gerrit:1122236|Start test experiment for all enrolled users. (T373715)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-02-24T23:09:03Z] <cjming@deploy2002> Finished scap sync-world: Backport for [[gerrit:1122236|Start test experiment for all enrolled users. (T373715)]] (duration: 09m 32s)

Change #1123492 had a related patch set uploaded (by Santiago Faci; author: Santiago Faci):

[mediawiki/extensions/WikimediaEvents@master] ExLabTest1: Fixing wrong order of parameters when creating the instrument

https://gerrit.wikimedia.org/r/1123492

Change #1123492 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] ExLabTest1: Fixing wrong order of parameters when creating the instrument

https://gerrit.wikimedia.org/r/1123492

I've been bold and removed T384506: Update event debug logging in EventLogging extension from the list of subtasks of this task. At best, it was an outcome of this work and feedback from the Web team as they worked on an implementation of SessionTick for their most recent experiment but not strictly a requirement of the initial delivery of xLab (AKA MPIC).