Page MenuHomePhabricator

EventGate: Log unparseable X-Experiment-Enrollments headers to an error stream
Closed, ResolvedPublic3 Estimated Story Points

Description

Background

From my notes:

  • EventGate failed to parse the X-Experiment-Enrollments header sent by Varnish for approx. 48.71% of the events received
  • There were 152 validation errors caused by an otherwise-valid X-Experiment-Enrollments header sent by Varnish not including the required experiment or the required group. I will try to determine whether its one or both of those scenarios
enrolled                    assigned.     n
"sds2-4-11-synth-aa-test"   "control"     77
"sds2-4-11-synth-aa-test"   "control-2"   75


SELECT
  json_extract(raw_event, '$.experiment.enrolled') AS enrolled,
  json_extract(raw_event, '$.experiment.assigned') AS assigned,
  COUNT(*) AS n
FROM
   event.eventgate_analytics_external_error_validation 
WHERE
  year = 2025
  AND (
    (day = 6 AND hour >= 11)
    OR day >= 7
)

AND errored_stream_name = 'product_metrics.web_base'
GROUP BY
  1, 2
ORDER BY
  n DESC
;

Without seeing the value of the X-Experiment-Enrollments header, it's hard to understand precisely what's going on. I propose logging the [raw] value and, if possible, the parsed value without subject IDs as context

AC

  • If the header is unparseable, then log an error event to the error stream, e.g. eventgate-analytics-external.error.validation stream.

Notes

  1. The header may contain subject IDs. Logging subject IDs to the new error stream conveys as much risk as logging them to the product_metrics.web_base stream
  2. If this mechanism is still enabled when we begin to run a large number of experiments, there is a chance that we might log subject IDs for concurrent experiments. Logging these errors to a separate error stream allows us to purge them manually or automatically on a different schedule
  3. It would be nice if the stream was named something other than eventgate-analytics-external.error.validation since it will not have more than strictly ValidationErrors. Doing this will be a bit of work. A TODO comment has been added in the code.

Details

Related Changes in Gerrit:
Related Changes in GitLab:
TitleReferenceAuthorSource BranchDest Branch
Don't run ExP related logic if use_edge_uniques is not enabledrepos/data-engineering/eventgate-wikimedia!22ottoT396359_exp_malformedheader_errormaster
Add metric for experiment hoisting errorsrepos/data-engineering/eventgate-wikimedia!21ottoT398922_hoisting_metricmaster
Customize query in GitLab

Event Timeline

phuedx renamed this task from EventGate: Log redacted X-Experiment-Enrollments header as context to EventGate: Log unparseable X-Experiment-Enrollments headers to a distinct error stream.Jun 10 2025, 12:54 PM
phuedx updated the task description. (Show Details)
Milimetric triaged this task as Medium priority.Jul 1 2025, 3:39 PM
Milimetric set the point value for this task to 3.

@phuedx do you have a strong feeling we still need the raw header value? Or, now that we have the base64url format enshrined is it sufficient to just rely upon the message and error_type fields of event.eventgate_analytics_external_error_validation from the targeted HoistingError logging from eventgate-wikimedia.js in your view?

I know I've had an instinctive aversion to dumping the raw header values. But I do understand the potential diagnostic value and understand the threat class equivalence consideration.

I do see the stream goes back to just under 10 days...

$ kafkacat -C -b kafka-jumbo1007.eqiad.wmnet:9092 -t eqiad.eventgate-analytics-external.error.validation -o beginning -c 2

And the ingested validation errors go back 90 days per retention settings.

show partitions eventgate_analytics_external_error_validation;

Thinking out loud: I'm not sure, is there a prescribed way to more rapidly declaratively purge error rows without too much custom stuff? I suppose we could use plain old HQL in data_purge.pp if not; I don't think refinery-drop-older-than / refinery.py supports this, although it would be a nice thing to have available. (Now, I guess if a specific stream materializes into its own table we could have a standalone job that purges that table more regularly without having to change much).

As an aside: I hope we never encounter this, but @phuedx / @Ottomata for MalformedHeaderError error log lines from experiments.js is there a quick way at this for queries from a stat box shell or Superset so as to get the rough equivalent of the logstash OpenSearch Dashboard ? I wasn't sure if there may be a way to cheaply ksql over some stream in kafka-logging or something similar (I'm mindful of T347430#11005224 , but thinking about near term). I understand it's possible, although not always encouraged, to probe the OpenSearch dashboard data, but I'm wondering more for the sake of being able to use plain old SQL-like syntax.

@phuedx do you have a strong feeling we still need the raw header value? Or, now that we have the base64url format enshrined is it sufficient to just rely upon the message and error_type fields of event.eventgate_analytics_external_error_validation from the targeted HoistingError logging from eventgate-wikimedia.js in your view?

If you'd have asked me while the base64 / base64url dataloss incident was ongoing, then I would have said yes. However, we have since validated the integration between Varnish and EventGate with a number of everyone experiments. MalformedHeaderError events occurring is more a signal that the two components have de-synced and we should look at both sides to see what has changed (remind me to note this on wikitech:Experimentation_Lab/Troubleshooting). We don't need to know the raw values in order to know that the two components have de-synced. Further, logging the values doesn't change the threat class but it does increase surface area. With all of that said, I'm inclined to say no.

is there a quick way at this for queries from a stat box shell or Superset so as to get the rough equivalent of the logstash OpenSearch Dashboard ?

No, that would be T291645: Produce ECS formatted logstash logs to Event Platform, allowing them to be queried in the WMF Data Lake with SQL, and more generally T390323: Data Platform, SRE Observability, overlaps, use cases, and potential.

I wasn't sure if there may be a way to cheaply ksql over some stream in kafka-logging or something similar

This has been requested before, but for event platform streams that are emitting to kafka logging clusters. T304373: Also intake Network Error Logging events into the Analytics Data Lake.

Ottomata renamed this task from EventGate: Log unparseable X-Experiment-Enrollments headers to a distinct error stream to EventGate: Log unparseable X-Experiment-Enrollments headers to an error stream.Jul 29 2025, 1:50 PM
Ottomata updated the task description. (Show Details)

Deployed and tested code in beta, but I can't directly test MalformedHeaderErrors there.

I'm unsure if this is working fully, although it seems to be in part.

It seems that experiments.js's getXExperimentEnrollments method may have a logic issue that, while guarding against bad headers, may not have the exact behavior that was intended.

If the X-Experiment-Enrollments header bears the value which doesn't match the regex (e.g., X-Experiment-Enrollments: boo), the regex may have no matches, and I believe thus experiments.js's getEnrolledExperiment may return an empty value, and then a HoistingError will be thrown instead of a MalformedHeaderError.

Now, if sending a header that passes the regular expression (so has at least one match) but that fails further validation (in this case the subject_id component is not of sufficient length, for example)...

deployment-deploy04:~$ cat beta.bad.malformed.bash 
curl -i -s -H "X-Experiment-Enrollments:foo=bar/1234;" -H "Content-Type: application/json"  -X POST http://deployment-eventgate-4.deployment-prep.eqiad1.wikimedia.cloud:8492/v1/events -d @beta.bad.subject_id_ok_though.json

deployment-deploy04:~$ bash beta.bad.malformed.bash 
HTTP/1.1 500 1 out of 1 events had failures and were not accepted. (0 invalid and 1 errored).
access-control-allow-origin: *
access-control-allow-headers: accept, x-requested-with, content-type
access-control-expose-headers: etag
access-control-allow-methods: post
Content-Type: application/json; charset=utf-8
Content-Length: 699
Vary: Accept-Encoding
Date: Wed, 30 Jul 2025 00:39:34 GMT
Connection: keep-alive
Keep-Alive: timeout=5

{"invalid":[],"error":[{"status":"error","event":{"$schema":"/analytics/product_metrics/web/base/1.4.2","dt":"2025-07-29T11:12:11.111Z","meta":{"stream":"product_metrics.web_base","id":"82f62ee7-6afc-4269-ac93-44acdf30d35d","dt":"2025-07-30T00:39:34.298Z","request_id":"a9ed0360-6cdd-11f0-b0da-dd91866a4461"},"action":"clicked","experiment":{"enrolled":"conveniencestore","assigned":"energydrinks","subject_id":"awaiting","sampling_unit":"edge-unique","coordinator":"forced","other_assigned":{"ice_cream_cone":"yes"}},"agent":{"client_platform":"mediawiki_js","client_platform_family":"desktop_browser","release_status":"dev"}},"context":{"message":"x-experiment-enrollments header is malformed"}}]}

...it does seem to trigger a MalformedHeaderError, but it doesn't seem (?) to be showing up as a logged error in the stream?

deployment-kafka-jumbo-9:~$ kafkacat -C -b deployment-kafka-jumbo-9.deployment-prep.eqiad1.wikimedia.cloud:9092 -t eqiad.eventgate-analytics-external.error.validation

# ...previous errors, not the MalformedHeaderError one, though, it seems

Although the apparent (?) logic issue for an empty set regex match maybe needs a new task filed (or a patch added to this ticket if that's okay), as for the MalformedHeaderError showing up in a stream, I'm wondering what gives? I may be mistaken and need to look at this when I'm fresher, but wanted to pass this along.

I'll try to reproduce the MalformedHeaderError in error stream thing.

Feel free to add patch to this task to fix logic.

Ah! It is failing because when trying to produce the error event, the same ExP header checking logic is running. It is still in the same request context, so the incoming ExP headers are set. The error event schema doesn't have e.g. a subject_id field, so when producing we get this error:

Could not hoist data into experiment.subject_id: Schema does not have a experiment field.

Hm, what to do? I don't think we should mutate the incoming request headers. What can we add to this conditional to ensure that only the right events enter it?

Oh wait! That block is mutating the request headers. So this is not happening in prod for ValidationErrors. Hm. I think this is happening on MalformedHeaderError because that is thrown elsewhere in experiment.js. So, when throwing MalformedHeaderError, should the X-Experiment-Enrollments header be removed too?

Somehow mutating the request headers like this seems wrong. Getting the conditional right seems better. @dr0ptp4kt whatchya think?

Could we just put if useEdgeUniques in the conditional? That is a stream related setting.

Could we just put if useEdgeUniques in the conditional? That is a stream related setting.

Did it: https://gitlab.wikimedia.org/repos/data-engineering/eventgate-wikimedia/-/merge_requests/22

I also took the liberty of not mutating request.header. I hope that is okay!

Ah! It is failing because when trying to produce the error event, the same ExP header checking logic is running. It is still in the same request context, so the incoming ExP headers are set. The error event schema doesn't have e.g. a subject_id field, so when producing we get this error:

Could not hoist data into experiment.subject_id: Schema does not have a experiment field.

Oh ha, that's right, this was one of the things we hit that was addressed in https://gitlab.wikimedia.org/repos/data-engineering/eventgate-wikimedia/-/merge_requests/15/diffs (I should have remembered this could happen again :) ).

Hm, what to do? I don't think we should mutate the incoming request headers. What can we add to this conditional to ensure that only the right events enter it?

Oh wait! That block is mutating the request headers. So this is not happening in prod for ValidationErrors. Hm. I think this is happening on MalformedHeaderError because that is thrown elsewhere in experiment.js. So, when throwing MalformedHeaderError, should the X-Experiment-Enrollments header be removed too?

Somehow mutating the request headers like this seems wrong. Getting the conditional right seems better. @dr0ptp4kt whatchya think?

I see the patch posted via the subsequent comments / Slack. I'll have a short think on this. It's good to have visibility into the different unexpected things that might happen, but I gotta reason through this just a little.

BTW, the /v1/events API allows for POSTing an array of events, each of which could potentially be a for a different stream or schema. So, handling of stream specific settings like producers.eventgate.use_edge_uniques really should be targeted for when processing an event destined to that stream.

BTW, the /v1/events API allows for POSTing an array of events, each of which could potentially be a for a different stream or schema. So, handling of stream specific settings like producers.eventgate.use_edge_uniques really should be targeted for when processing an event destined to that stream.

Right. Although the internet-facing URL used in an xLab JS eventing context (which gets rewritten) is specific so that is supposed to be avoided. I added a question in the code review about if there's way to have some sort of visibility into if we're in a situation where X-Experiment-Enrollments is coming across the wire but it's for a stream that isn't supposed to be doing that.

Added logging for when X-Experiment-Enrollments is set but use_edge_uniques is not. I log at warn level usually, but if the event is destined to the error_stream I log at debug level. This does mean that a baddy could POST to the error_stream with X-Experiment-Enrollments set and we wouldn't see this log message normally, but perhaps this is okay? We will see other log messages and/or error events in the stream.

(I wonder if we should disable external POSTing of error events, hm...)

Change #1174745 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate-analytics-external - bump to 1.19.0 to pick up recent ExP changes

https://gerrit.wikimedia.org/r/1174745

Change #1174745 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate-analytics-external - bump to 1.19.0 to pick up recent ExP changes

https://gerrit.wikimedia.org/r/1174745

Mentioned in SAL (#wikimedia-operations) [2025-07-31T15:59:57Z] <ottomata> deploying eventgate-analytics external for T398922 and T396359

Deployed to eventgate-analytics-external in prod!

As noted in T398922#11060648 it's working in prod. Thanks @Ottomata! Now, there's a TODO for the following that I/we will need to enqueue in a task, but not relevant to this here task.

...
It seems that experiments.js's getXExperimentEnrollments method may have a logic issue that, while guarding against bad headers, may not have the exact behavior that was intended.

If the X-Experiment-Enrollments header bears the value which doesn't match the regex (e.g., X-Experiment-Enrollments: boo), the regex may have no matches, and I believe thus experiments.js's getEnrolledExperiment may return an empty value, and then a HoistingError will be thrown instead of a MalformedHeaderError.
....

I verified this again by setting a simple X-Experiment-Enrollments header of a plain alphabetical string (i.e., no = or / separators or ; record delimiter/terminator).

Anyway, thanks again @Ottomata!