In the last phase of the synth A/A we:
- Received 2686280 events
- Logged 2551075 "x-experiment-enrollments header is malformed" log lines
That's an apparent data loss of ~48.71%
This task covers the investigation of the above.
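For reference, the ~48.71% figure is consistent with counting each "malformed" log line as one dropped event and dividing by the sum of received and dropped events. A minimal sketch of that arithmetic, assuming that one-to-one mapping holds (the function name is illustrative only):

```python
def apparent_loss(received: int, malformed: int) -> float:
    """Fraction of events apparently lost.

    Assumes each malformed-header log line corresponds to exactly one
    event that was dropped and is not included in `received`.
    """
    total = received + malformed
    return malformed / total


loss = apparent_loss(received=2_686_280, malformed=2_551_075)
print(f"{loss:.2%}")  # 48.71%
```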
| phuedx | Jun 10 2025, 12:18 PM |
| F62286281: Screenshot 2025-06-10 at 3.50.04 PM.png | Jun 10 2025, 9:15 PM |
| F62286282: Screenshot 2025-06-10 at 3.48.56 PM.png | Jun 10 2025, 9:15 PM |
| F62286283: Screenshot 2025-06-10 at 3.47.14 PM.png | Jun 10 2025, 9:15 PM |
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | Milimetric | T396474 EventGate: Investigate data loss during the SDS 2.4.11 Synthetic A/A Test experiment |
| Resolved | | Ottomata | T396359 EventGate: Log unparseable X-Experiment-Enrollments headers to an error stream |
@BBlack @Vgutierrez: Further to the small amount of detail in the task description, we saw what appears to be a significant rate of rejections of the X-Experiment-Enrollments header by EventGate.
The code that parses the header is here: https://gitlab.wikimedia.org/repos/data-engineering/eventgate-wikimedia/-/blob/master/lib/experiments.js?ref_type=heads#L53 . I'll keep investigating but if there's anything that jumps out at you, then please let me know.
Could getXExperimentEnrollments be executed for requests where the original path isn't /evt-103e/v2/events? I'm asking this because:
So it's entirely possible that requests headed to intake-analytics.wm.o with the WMFUniq cookie set, but not targeting /evt-103e/v2/events, would end up on your backend service with valid X-Experiment-Enrollments (X-E-E) content that doesn't match the regex in getXExperimentEnrollments.
Noted. Thanks!
Unfortunately, this doesn't seem to be the case. All of the requests are to send events to the product_metrics.web_base stream, which is only used by the experiment code, which is configured to send events via the /evt-103e/v2/events path.
@BBlack noted that the hashed edge unique values are base64url-encoded, not plain base64. @tchin created a patch for data-engineering/eventgate-wikimedia!17 and deployed it. I re-activated the A/A test and took a first look at the data flowing in before and after the deployment. The change is roughly in line with what we would expect from this update, although a couple of data points should be validated further in the next day or two:
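The distinction matters because base64url (RFC 4648, section 5) replaces `+` and `/` with `-` and `_`, so a validator written for the standard alphabet rejects any value that contains either character. The sketch below illustrates this with two hypothetical patterns; they are not the actual expressions in lib/experiments.js, and the sample bytes are made up to force the two encodings to differ.

```python
import base64
import re

# Hypothetical patterns, for illustration only.
STANDARD_B64 = re.compile(r"[A-Za-z0-9+/]+={0,2}")
URLSAFE_B64 = re.compile(r"[A-Za-z0-9_-]+={0,2}")

# Made-up bytes whose 6-bit groups are all 62 or 63, i.e. exactly the
# positions where the two alphabets disagree.
raw = bytes([0xFB, 0xFF, 0xBE])

std = base64.b64encode(raw).decode("ascii")          # "+/++"
url = base64.urlsafe_b64encode(raw).decode("ascii")  # "-_--"

# A standard-alphabet check rejects the base64url value ...
print(bool(STANDARD_B64.fullmatch(url)))  # False
# ... while a base64url-aware check accepts it.
print(bool(URLSAFE_B64.fullmatch(url)))   # True
```

Values that happen to contain neither `-` nor `_` pass both checks, which would explain why only a portion of requests were rejected rather than all of them.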
From an update I posted a bit earlier on Slack:
Events seem to be coming in at a good steady clip, and the malformed events subsided. Grabbing a couple points for event rate - 16.3/s before the change and 29.6/s after the change - it seems within reason for what we were hoping for. We should do a fuller accounting tomorrow, of course, but this looks good so far. I'll post the attached graphs to the task.
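As a rough cross-check of those two rates, one can compare the observed increase with the increase implied by the counts in the task description. This is only a sketch: it assumes the apparent loss applied uniformly before the fix and that traffic was otherwise comparable, and two spot readings are noisy.

```python
# Counts from the task description.
received = 2_686_280
malformed = 2_551_075

# Spot readings of the event rate, in events per second.
rate_before = 16.3
rate_after = 29.6

# If every previously rejected event is now accepted, throughput should
# rise by (received + malformed) / received.
expected_uplift = (received + malformed) / received
observed_uplift = rate_after / rate_before

print(f"expected ~{expected_uplift:.2f}x (~{rate_before * expected_uplift:.1f}/s)")
print(f"observed ~{observed_uplift:.2f}x ({rate_after}/s)")
```

Under these assumptions the expected increase is about 1.95x and the observed increase is about 1.82x, which is in the same range.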
Thanks @tchin for the fix and deployment, and thanks @phuedx and @BBlack for spotting the problem and the apparent solution. Thanks also @Vgutierrez for your investigation up above!
The SDS 2.4.11 Synthetic A/A Test experiment was re-enabled at approx. 2025-06-10T18:35:00Z. The fix for the X-Experiment-Enrollments header was deployed at or around 20:00:00Z the same day. You can see the rapid fall-off in "x-experiment-enrollments header is malformed" errors following the deployment here: https://logstash.wikimedia.org/goto/3a83d4479f4a963206196042e6113dcf
LGTM!