EventGate: Investigate data loss during the SDS 2.4.11 Synthetic A/A Test experiment
Closed, Resolved | Public | 2 Estimated Story Points

Description

In the last phase of the synthetic A/A test, we:

  • Received 2686280 events
  • Logged 2551075 "x-experiment-enrollments header is malformed" log lines

That's an apparent data loss of ~48.71%, counting each malformed-header log line as one dropped event: 2551075 / (2686280 + 2551075).
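For reference, here is a minimal sketch of the arithmetic behind that percentage. It assumes that each "header is malformed" log line corresponds to exactly one dropped event and that the received count covers only accepted events; the task does not state either point explicitly, and the variable names are illustrative.

```javascript
// Counts taken from the task description.
const received = 2686280;  // events received
const malformed = 2551075; // "x-experiment-enrollments header is malformed" log lines

// Assumption: one malformed-header log line == one dropped event,
// and "received" counts only the events that were accepted.
const attempted = received + malformed;
const lossRate = malformed / attempted;

console.log(`attempted events: ${attempted}`);                  // 5237355
console.log(`apparent loss: ${(lossRate * 100).toFixed(2)}%`);  // 48.71%
```

If the received count instead included the rejected events, the ratio would be 2551075 / 2686280, which is far higher, so the first reading is the one that matches the quoted figure.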

This task covers the investigation of the above.

Event Timeline

@BBlack @Vgutierrez: Further to the small amount of detail in the task description, we saw what appears to be a significant rate of rejections of the X-Experiment-Enrollments header by EventGate.

The code that parses the header is here: https://gitlab.wikimedia.org/repos/data-engineering/eventgate-wikimedia/-/blob/master/lib/experiments.js?ref_type=heads#L53 . I'll keep investigating but if there's anything that jumps out at you, then please let me know.

Could getXExperimentEnrollments be executed for requests where the original path isn't /evt-103e/v2/events? I'm asking this because:

  • varnish sets X-E-E content based on the original URI Path
  • ATS will rewrite /evt-103e/v2/events to /v1/events

So it's entirely possible that requests headed to intake-analytics.wm.o with the WMFUniq cookie set, but not targeting /evt-103e/v2/events, would end up on your backend service with valid X-E-E content that doesn't match the regex in getXExperimentEnrollments.

Noted. Thanks!

Unfortunately, this doesn't seem to be the case. All of the requests are to send events to the product_metrics.web_base stream, which is only used by the experiment code, which is configured to send events via the /evt-103e/v2/events path.

phuedx triaged this task as High priority. Jun 10 2025, 1:42 PM
phuedx moved this task from Incoming to Backlog on the Test Kitchen board.

@BBlack noted that the hashed edge unique values are base64url encoded, not plain base64. @tchin created a patch (data-engineering/eventgate-wikimedia!17) and deployed it. I re-activated the A/A test and took a first look at the data flowing in before and after the deployment. The numbers look roughly in line with what we would expect from this update, although a couple of data points should be validated further in the next day or two:

  1. That the A/A ratios still look good and that they're appropriately proportional for the user traffic. @mpopov probably best if you have a look, I think.
  2. That the event loss now matches expectations. @phuedx probably best if you have a look, I think.
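To illustrate the base64 versus base64url distinction mentioned above, here is a small sketch. The two patterns are hypothetical and written only for this example; they are not the expressions used in eventgate-wikimedia, and the sample bytes are arbitrary. It relies on Node's built-in 'base64url' Buffer encoding, which omits padding.

```javascript
// Two bytes chosen so that the two encodings differ: 0xfb 0xff.
const bytes = Buffer.from([0xfb, 0xff]);

const plain = bytes.toString('base64');      // "+/8="
const urlSafe = bytes.toString('base64url'); // "-_8" (no padding)

// Hypothetical patterns, for illustration only. These are NOT the
// expressions used in eventgate-wikimedia.
const BASE64_ONLY = /^[A-Za-z0-9+\/]+={0,2}$/;
const BASE64URL = /^[A-Za-z0-9_-]+={0,2}$/;

console.log(plain, BASE64_ONLY.test(plain));     // +/8= true
console.log(urlSafe, BASE64_ONLY.test(urlSafe)); // -_8 false
console.log(urlSafe, BASE64URL.test(urlSafe));   // -_8 true
```

A validator that only allows the plain base64 alphabet accepts any base64url value that happens to contain neither "-" nor "_", and rejects the rest. That would produce a partial rejection rate rather than a total one.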

From an update I posted a bit earlier on Slack:

Events seem to be coming in at a good steady clip, and the malformed events subsided. Grabbing a couple points for event rate - 16.3/s before the change and 29.6/s after the change - it seems within reason for what we were hoping for. We should do a fuller accounting tomorrow, of course, but this looks good so far. I'll post the attached graphs to the task.
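As a rough sanity check on those two sample rates, here is a sketch that compares them with the ~48.71% apparent loss from the task description. It assumes that the loss rate applied to the "before" sample and that underlying traffic was otherwise unchanged between the two points. Both are simplifications, so treat the result as indicative only.

```javascript
const rateBefore = 16.3; // events/s before the deployment (from the update above)
const rateAfter = 29.6;  // events/s after the deployment (from the update above)

// Assumption: the apparent loss from the task description applied to the
// "before" sample, and traffic was otherwise steady between the samples.
const lossRate = 2551075 / (2686280 + 2551075);

// Rate we would expect once no events are dropped.
const expectedAfter = rateBefore / (1 - lossRate);

// Loss implied by the two sampled rates alone.
const impliedLoss = 1 - rateBefore / rateAfter;

console.log(`expected rate with no loss: ${expectedAfter.toFixed(1)}/s`);             // 31.8/s
console.log(`loss implied by the samples: ${(impliedLoss * 100).toFixed(1)}%`);       // 44.9%
```

The observed 29.6/s is a little below the ~31.8/s this predicts, which is in the same range given that these are single sampled points.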

Thanks @tchin for the fix and deployment, and thanks @phuedx and @BBlack for spotting the problem and the apparent solution. Thanks also @Vgutierrez for your investigation up above!

Screenshot 2025-06-10 at 3.47.14 PM.png (549 KB)

Screenshot 2025-06-10 at 3.48.56 PM.png (462 KB)

Screenshot 2025-06-10 at 3.50.04 PM.png (198 KB)

  1. That the event loss now matches expectations. @phuedx probably best if you have a look, I think.

The SDS 2.4.11 Synthetic A/A Test experiment was re-enabled at approx. 2025-06-10T18:35:00Z. The fix for the X-Experiment-Enrollments header was deployed at or around 20:00:00Z the same day. You can see the rapid fall-off in "x-experiment-enrollments header is malformed" errors following the deployment here: https://logstash.wikimedia.org/goto/3a83d4479f4a963206196042e6113dcf

LGTM!

Milimetric set the point value for this task to 2. Jun 12 2025, 4:36 PM
Milimetric claimed this task.