
Could not hoist data into experiment.subject_id for event
Open, Needs Triage · Public · PRODUCTION ERROR

Description

Error message
	Could not hoist data into experiment.subject_id for event event of schema at /analytics/product_metrics/web/base/2.0.0 destined to stream mediawiki.product_metrics.contributors.experiments: x-experiment-enrollments header does not have a matching enrollment in event data: One of 'experiment.enrolled' or 'experiment.assigned' fields does not have matching experiment or group name in header.
event
{
    "instrument_name": "ClickThroughRateInstrument",
    "funnel_entry_token": "redacted",
    "element_friendly_name": "Sign up",
    "experiment": {
        "enrolled": "growthexperiments-editattempt-anonwarning",
        "assigned": "control",
        "subject_id": "awaiting",
        "sampling_unit": "edge-unique",
        "coordinator": "default"
    },
    "$schema": "/analytics/product_metrics/web/base/2.0.0",
    "meta": {
        "domain": "ru.wikipedia.org",
        "stream": "mediawiki.product_metrics.contributors.experiments",
        "id": "4f2e6558-710c-44fb-b24e-84d96c8daad0",
        "dt": "2026-03-24T15:32:59.787Z",
        "request_id": "02a675d7-0116-4c51-8266-c7502abdef2f"
    },
    "dt": "2026-03-24T15:32:55.426Z",
    "action": "impression",
    "agent": {
        "client_platform": "mediawiki_js",
        "client_platform_family": "mobile_browser"
    },
    "performer": {
        "is_logged_in": false,
        "is_bot": false,
        "pageview_id": "redacted",
        "active_browsing_session_token": "redacted",
        "id": 0
    },
    "mediawiki": {
        "skin": "minerva",
        "database": "ruwiki"
    }
}
Impact

55 events in the last 24h. Data for these users is not being collected for the experiment.

Notes

This is only happening on wikis where the growthexperiments-editattempt-anonwarning experiment is running. Note the "subject_id": "awaiting" value: for some reason, eventgate-wikimedia did not find a matching enrollment for the experiment (see eventgate-wikimedia.js#L666).
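The failure can be pictured with a small model of the hoisting step that the error message describes: EventGate looks for an entry in the x-experiment-enrollments header whose experiment and group names match experiment.enrolled and experiment.assigned, and copies that entry's subject ID into experiment.subject_id. This is a hypothetical TypeScript sketch, not the eventgate-wikimedia code. The header format ("experiment:group:subjectId" entries separated by ";"), the treatment of "awaiting" as a placeholder, and the Enrollment, ExperimentField, parseEnrollmentsHeader and hoistSubjectId names are all assumptions made for illustration.

```typescript
// HYPOTHETICAL model of one x-experiment-enrollments entry. The real header
// format is not shown in this task; "experiment:group:subjectId" entries
// separated by ";" are assumed here purely for illustration.
interface Enrollment {
  experiment: string;
  group: string;
  subjectId: string;
}

// Parse the assumed header format, skipping entries that do not have
// exactly three non-empty parts.
function parseEnrollmentsHeader(header: string): Enrollment[] {
  const enrollments: Enrollment[] = [];
  for (const entry of header.split(";")) {
    const parts = entry.trim().split(":");
    if (parts.length !== 3 || parts.some((p) => p === "")) {
      continue;
    }
    const [experiment, group, subjectId] = parts as [string, string, string];
    enrollments.push({ experiment, group, subjectId });
  }
  return enrollments;
}

// The event fields involved in hoisting; subject_id is assumed to hold the
// placeholder "awaiting" until a subject ID is hoisted into it.
interface ExperimentField {
  enrolled: string;
  assigned: string;
  subject_id: string;
}

// Return a copy of the field with subject_id taken from the header entry that
// matches BOTH the experiment name and the group name; throw if none matches.
function hoistSubjectId(experiment: ExperimentField, header: string): ExperimentField {
  const match = parseEnrollmentsHeader(header).find(
    (e) => e.experiment === experiment.enrolled && e.group === experiment.assigned,
  );
  if (match === undefined) {
    throw new Error(
      "x-experiment-enrollments header does not have a matching enrollment in event data",
    );
  }
  return { ...experiment, subject_id: match.subjectId };
}
```

In this model, if the header only lists some other experiment, hoistSubjectId throws for the event above and the event is rejected with subject_id still set to "awaiting", which matches the logged event.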

Event Timeline

Restricted Application added a subscriber: Aklapper.

Change #1261381 had a related patch set uploaded (by Manvi Kesarwani; author: Manvi Kesarwani):

[mediawiki/extensions/WikimediaEvents@master] T421152: Host experiment subject_id before CTR impressions

https://gerrit.wikimedia.org/r/1261381

I have uploaded a patchset. Can you review it?

Change #1261381 had a related patch set uploaded (by Pppery; author: Manvi Kesarwani):

[mediawiki/extensions/WikimediaEvents@master] Host experiment subject_id before CTR impressions

https://gerrit.wikimedia.org/r/1261381

Change #1262074 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/WikimediaEvents@master] fix(SpecialCreateAccount): load ext.wikimediaEvents.testKitchen before accessing

https://gerrit.wikimedia.org/r/1262074

I've filtered out validation errors of this type from the Eventgate validation errors dashboard so that folks (1) don't re-report this issue and (2) don't think that there's something wrong with their experiment.

I have uploaded a patchset. Can you review it?

Thank you for taking the time to submit that patch. However, it won't fix the issue.

I'm not 100% sure where the issue is right now, but I'm confident that it's not in the code for the Logged Out Warning Message experiment. I'm confident in this because the validation errors are happening for events flowing on multiple event streams, for multiple experiments and groups.

We're seeing a large number of event validation errors for experiment-related analytics events. The validation error message looks like:

Could not hoist data into experiment.subject_id for event {event}: x-experiment-enrollments header does not have a matching enrollment in event data: One of 'experiment.enrolled' or 'experiment.assigned' fields does not have matching experiment or group name in header.

The validation errors are happening across multiple streams, all of which have their producers.eventgate.use_edge_uniques property set to true.

Based on the validation error message, we know that the events:

And that the x-experiment-enrollments request headers:

We also know that:

We haven't collected a sample of the x-experiment-enrollments header values yet. I think that this is our next move.

Questions:

Could this be browsers rejecting cookies? No. If the browser rejects the Edge Unique cookie, then a new one will be generated on every request, including the requests to send experiment-related analytics events. Per the above, a request to send an experiment-related analytics event will be rejected if the Edge Unique cookie is fresh.

Could the JS SDK have come out of sync with the initial enrollment config? The initial enrollment config is read by MediaWiki and sent to the browser as part of the initial response. Currently, there is no mechanism by which the JS SDK can update its enrollment configuration. If the user were to refresh the page and their enrollment configuration changed, then they would be served a different initial response.

Why don't we log the value of the header? The x-experiment-enrollments header contains subject IDs, which are derived from the Edge Unique cookie. They are meant to be stored in the Data Lake for 90 days and then purged. Previously, we weren't comfortable with sending subject IDs to OpenSearch, which has a different retention policy.

However, per the above, we already know that the subject IDs are valid. We could log the value of the header with the subject IDs removed or obscured.
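One way to log the header without subject IDs is to keep the experiment and group names, which are what the matching check compares, and replace each subject ID with a fixed marker. This TypeScript sketch assumes, hypothetically, that the header consists of "experiment:group:subjectId" entries separated by ";"; the real format is not shown in this task, and redactEnrollmentsHeader is a hypothetical helper rather than an existing EventGate function.

```typescript
// HYPOTHETICAL helper: redact subject IDs from an x-experiment-enrollments
// header value before logging it. Assumes "experiment:group:subjectId"
// entries separated by ";" (format not confirmed in this task).
function redactEnrollmentsHeader(header: string): string {
  return header
    .split(";")
    // Drop empty entries (e.g. from an empty header or a trailing ";").
    .filter((entry) => entry.trim() !== "")
    .map((entry) => {
      const parts = entry.trim().split(":");
      if (parts.length !== 3) {
        // Unknown shape: replace the whole entry rather than risk leaking an ID.
        return "[unparseable]";
      }
      const [experiment, group] = parts as [string, string, string];
      // Keep the names the validation compares; drop the subject ID.
      return `${experiment}:${group}:[redacted]`;
    })
    .join(";");
}
```

Entries that do not match the assumed shape are replaced wholesale, erring on the side of not sending anything that might be a subject ID to OpenSearch.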

Do we know if the events are coming from a specific geo or browser? No. Validation errors are logged with limited information. We do have *some* information about geo from the meta.domain.

P89957
P89958
P89959

Could this be related to traffic that is enrolled in overlapping experiments? This would explain why the validation errors aren't happening all the time. Currently, the PHP and JS SDKs don't include enrollment information about all enrolled experiments 😱
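The overlapping-experiments hypothesis can be modelled in a few lines. This is a hypothetical TypeScript sketch, not Test Kitchen SDK code: it assumes that the enrollment information reaching EventGate is derived from a list of enrollments supplied by the SDK, and the ExperimentEnrollment type, the hasMatch, carriedCurrentOnly and carriedAll helpers, and the second experiment name are invented for illustration. Under that assumption, a user enrolled in two experiments only produces validation errors for events about the experiment whose enrollment was left out, which would fit errors that appear intermittently rather than on every event.

```typescript
// HYPOTHETICAL model of the overlapping-experiments hypothesis.
interface ExperimentEnrollment {
  experiment: string;
  group: string;
}

// True if the carried enrollment information has an entry matching both the
// experiment name and the group name of an event.
function hasMatch(carried: ExperimentEnrollment[], experiment: string, group: string): boolean {
  return carried.some((e) => e.experiment === experiment && e.group === group);
}

// Assumed current behaviour: only one experiment's enrollment is carried.
function carriedCurrentOnly(all: ExperimentEnrollment[], current: string): ExperimentEnrollment[] {
  return all.filter((e) => e.experiment === current);
}

// Proposed behaviour: carry every experiment the user is enrolled in.
function carriedAll(all: ExperimentEnrollment[]): ExperimentEnrollment[] {
  return all.slice();
}

// A user enrolled in two overlapping experiments (second name is made up).
const userEnrollments: ExperimentEnrollment[] = [
  { experiment: "growthexperiments-editattempt-anonwarning", group: "control" },
  { experiment: "some-other-experiment", group: "treatment" },
];

const carriedBefore = carriedCurrentOnly(userEnrollments, "some-other-experiment");
const carriedAfter = carriedAll(userEnrollments);

// An event for the experiment that was left out: no match before, match after.
console.log(hasMatch(carriedBefore, "growthexperiments-editattempt-anonwarning", "control")); // false
console.log(hasMatch(carriedAfter, "growthexperiments-editattempt-anonwarning", "control")); // true
```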

Actions:

  1. Update EventGate to log the value of the x-experiment-enrollments header (with subject IDs removed or obscured/redacted) for these validation errors
  2. Update the PHP and JS SDKs to include enrollment information about all enrolled experiments

Thank you for investigating this, @phuedx! I'm moving this to our tracking column to let you and your team work on it. Please let us know if there is anything that Growth should be doing!

Change #1261381 abandoned by Sergio Gimeno:

[mediawiki/extensions/WikimediaEvents@master] Host experiment subject_id before CTR impressions

Reason:

Does not fix the issue, T421152#11758361

https://gerrit.wikimedia.org/r/1261381

Change #1262074 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] fix(SpecialCreateAccount): load ext.wikimediaEvents.testKitchen before accessing

https://gerrit.wikimedia.org/r/1262074

Change #1262195 had a related patch set uploaded (by TChin; author: TChin):

[operations/deployment-charts@master] [eventgate-analytics-external] Bump to v1.29.0

https://gerrit.wikimedia.org/r/1262195

Change #1262195 merged by jenkins-bot:

[operations/deployment-charts@master] [eventgate-analytics-external] Bump to v1.29.0

https://gerrit.wikimedia.org/r/1262195

Change #1262194 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/WikimediaEvents@master] fix: correct parameters for mw.loader.using

https://gerrit.wikimedia.org/r/1262194

Change #1262194 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] fix: correct parameters for mw.loader.using

https://gerrit.wikimedia.org/r/1262194

Change #1269253 had a related patch set uploaded (by Phuedx; author: Phuedx):

[mediawiki/extensions/TestKitchen@master] JS SDK: Include enrollment information for all other experiments

https://gerrit.wikimedia.org/r/1269253

Change #1269253 merged by jenkins-bot:

[mediawiki/extensions/TestKitchen@master] JS SDK: Include enrollment information for all other experiments

https://gerrit.wikimedia.org/r/1269253

I've moved this into BLOCKED because T425096 ("ConfigsFetcher cache is not multi-DC") and T419513 ("JS SDK: Read everyone experiment enrollment from the WMF-Uniq server timing header") are rolling out this week and next week. I'd like to review this bug after those two tasks have been resolved.