Page MenuHomePhabricator

X-Experiment-Enrollments EventGate handling reinforcement for MalformedHeaderError cases
Closed, ResolvedPublic5 Estimated Story PointsBUG REPORT

Description

In T401705#11335589 it's noted that the routine at experiments.js#L44 could be made to be less forgiving. Presently if the initial regex doesn't pass it returns an empty set, not throwing a MalformedHeaderError, whereas if any of the three components of an experiment record in the X-Experiments-Enrollment is actually malformed that does throw a MalformedHeaderError.

Suggested updated approach:

  1. If the header value is empty or consists of whitespace throw a MalformedHeaderError with a message of "x-experiment-enrollments is empty or whitespace only"
  2. If the initial regex fails otherwise, throw a MalformedHeaderError with a message of "x-experiment-enrollments header is malformed, not matching initial regular expression"
  3. If the initial regex succeeds but one or more of the specific components turns out to be problematic, specify which via the MalformedHeaderError:

"x-experiment-enrollments header is malformed due to: <experiment name:[experiment name], group name:[group name], subject ID:[length V, WX...YZ ]>" - where the subject ID V value is a number, and W, X, Y, and Z are the first two and last two characters if there are 4 characters and the rest is redacted (otherwise if the value is three or fewer characters, whatever is in the string should be logged to Logstash).

Furthermore, the routine at experiments.js#L44 should be modified to log in a more helpful way (still being careful to avoid unnecessary data exposure).

Unfortunately, if we hit a scenario where there are multiple concurrent experiments for which the user is enrolled and just one of them is proving to be a problem in the header records, that means we'll get an error affecting all of them, which is suboptimal for data collection.

We should for now probably continue to be strict in this case by throwing the exception; but just be mindful that we may later need to apply both stringency and leniency here (e.g., in addition to the upfront initial regex check, merely log the issue for any record for which an experiment name / group name / subject ID is problematic and discard that record, but allowing any good enrollments into the resultant array)...but, again, for now it's probably best to stay on the stringent side.

For this to be complete, there are two testing requirements:

  1. the unit tests should be updated
  2. the work should be tested by making a shell script which tests the classes of issues available by creating malformed headers on the eventgate instance

Details

Related Changes in Gerrit:
Related Changes in GitLab:
TitleReferenceAuthorSource BranchDest Branch
Distinguish the source of MalformedHeaderErrorsrepos/data-engineering/eventgate-wikimedia!31phuedxwork/phuedx/experiment_error_loggingmaster
Customize query in GitLab

Event Timeline

Cross-linking an item discussed in chat from before the break: T413275: Data instruments failing to hoist Sticky Headers experiment setup

This sort of symptom is affecting an SLO, see the following for example:

T412493: ErrorBudgetBurn
T410835: ErrorBudgetBurn
T412467: ErrorBudgetBurn (sticky-headers, part 2) (this one contains links bearing Prometheus / Thanos stuff, and the PromQL may be handy while working this here task)
T412448: ErrorBudgetBurn (sticky-headers) (closed in favor of ^)

Milimetric subscribed.

looks good, please re-add us if you need support

KReid-WMF set the point value for this task to 5.

Change #1248507 had a related patch set uploaded (by TChin; author: TChin):

[operations/deployment-charts@master] [eventgate] bump to v1.28.0

https://gerrit.wikimedia.org/r/1248507

Change #1248507 merged by jenkins-bot:

[operations/deployment-charts@master] [eventgate] bump to v1.28.0

https://gerrit.wikimedia.org/r/1248507

Eventgate v1.28.0 is now deployed