
Implement debugging for events in the JavaScript SDK
Closed, ResolvedPublic8 Estimated Story Points

Description

Estimate client-side event loss: when a page view should trigger an event but it doesn’t reach the backend.

  • Start with NEL-based ratio - calculate loss ratio (Chrome):
NEL_error_reports (evt-103e path) / (NEL_success_sample * (1/success_sampling_rate)) ?
// or
NEL_error_reports / evt-103e_success_count (Chrome UA filtered) ?
  • Send NEL data to Prometheus/Grafana dashboard
  • Validate common failure modes (DNS blocks, uBlock Origin, extension blocking)
  • Implement proposal in T403507: Baseline rate of user agents that successfully load xLab client
    • Stage 1: fire beacon immediately inside the header in the context of an experiment from a client-side script via ResourceLoader
    • Stage 2: fire a second event when the page is loaded >> this metric is the successful load rate
    • From these events, we calculate a rate of the number of times loading failed:
(Stage 1 - Stage 2) / Stage 1 ?
// or
Stage 2 / (Stage 1 - number of times SDK loading failed) ?
    • The combination of NEL data + SDK load successes will give us an idea of how many events we expect to lose.
  • Extract domains from the NEL data via Airflow job or Superset dashboard
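As a sanity check on the two candidate formulas above, here is a minimal sketch (Python; all counts and the sampling rate are illustrative assumptions, not real data):

```python
def nel_loss_ratio(nel_error_reports, nel_success_sample, success_sampling_rate):
    # Option 1: NEL samples successes, so scale the sampled count back up
    # by 1 / success_sampling_rate to estimate the true success count.
    estimated_successes = nel_success_sample * (1 / success_sampling_rate)
    return nel_error_reports / estimated_successes


def backend_loss_ratio(nel_error_reports, evt_success_count):
    # Option 2: compare against backend-confirmed evt-103e successes
    # (filtered to Chrome UAs, since only Chromium reports NEL).
    return nel_error_reports / evt_success_count


# Illustrative numbers only:
print(nel_loss_ratio(50, 1_000, 0.01))    # 0.0005
print(backend_loss_ratio(50, 100_000))    # 0.0005
```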

Technical Notes

Telemetry options

  • NEL (Network Error Logging) and Probenet already exist and likely provide “good-enough” signals without new identifiers.
  • NEL is Chrome/Chromium only, but that’s acceptable for trend detection. NEL can report errors and (sampled) successes.
    • Since it's implemented in the browser, there's nothing new to build and no user script to load, so there's no performance penalty. It does not cover non-Chromium browsers (which are excluded from analysis), but coverage is high since the majority of traffic comes from Chromium-based browsers.
  • Counting evt-103e successes vs NEL error reports for that path yields an actionable failure ratio - watch for spikes rather than chase an absolute truth.
  • We can create Prometheus dashboards from existing NEL data.

Additional Notes

Acceptance Criteria

  • Design gateway for JS SDK → Prometheus ingestion
  • Implement counters in StatsLib for JavaScript SDK emissions
  • Add validation/trust model to prevent garbage data
  • Implement LocalStorage buffering/flushing & retry logic

Related Objects

Event Timeline


You know, ^ reminds me of distributed tracing.

Absolutely.

Most HTTP requests in our infra get an X-Request-Id header. This header is propagated through to all events via eventgate as the meta.request_id field.

This would require that a request ID be generated at the edge which is readable via JavaScript. AIUI the edge does generate such a header, but it doesn't do so in a way that's readable via JavaScript.

readable via JavaScript

Why would it need to be readable by JavaScript? You'd just need to correlate X-Request-Id header in webrequest with meta.request_id in event (which is set by eventgate)

(Um, although, reading that code, I don't see the meta.request_id setting logic. Did it accidentally get removed?!?!?! investigating...)
Oh, duh, it was moved to stream config as part of T382173: Enable Event Platform streams to opt out of collecting User-Agent data.

Best if @phuedx confirms when back, although I'll chime in to keep the conversation going.

IIUC, I think @phuedx's point here is that, provided the user got something that looks like an HTML page (and thus has attendant ResourceLoader stuff, assuming we make the instrument a standard part of all ResourceLoader served for HTML pages), we could determine at the backend of EventGate, in Hive stored records, whether the ratios match up. I think what @phuedx is saying is that because the request ID generated by Varnish isn't logged in webrequest (I think?) and isn't (apparently / to our knowledge?) available to the JS client code that sends events, it wouldn't be possible to measure the ratio in a terribly reliable way. For dtracing, I don't see this sort of thing [EDIT: that is to say, the Varnish piece; obviously it's there for EventGate, MW PHP, and some other stuff] in place in Jaeger, although that may be partially because no one has gotten around to it yet...and I understand there are also maybe some historical considerations because of the full scale of web traffic, even if sampled in a minimalist way (as an aside, for this matter, 1:128 sampling might be a really nice case for tracking the full-ish stack for bona fide HTML responses!).

I think @Ottomata you may be suggesting that looking for the ratio of Hive stored records of events to webrequest stored requests for said events we would know how successful that is. I think that's true! I had run a query a while back against raw webrequest logs that suggested that, as long as Varnish got a proper request for the v2 beacon URL, the stored record count in Hive records for these events was incredibly close, so much so as to be something to not worry about, at least during that time frame (we're aware of ways that data could go bad, but it seems to be okay, at least as of that timeslice). That, of course, may not always hold.

There are some complications around all of this, in that it's sometimes tricky to determine if a webpage is an HTML response that's supposed to have events coming from it, even for a very standardized pageview instrument. We could confine the scope of this to things that are 200s on /wiki/ without query parameters or something, and then qualify that in the client-side JS (e.g., only send if a variable like wgWMESendPageLoadedEvent is set to 1). We do need to remember that highly popular pages could have an outsize influence on what gets sent from the browser, and that could technically skew ratios because of the composition of traffic with respect to such material (for example, BTS fans are probably more likely on their cellphones, as opposed to manifold analysis fans being more on their desktops on Ethernet or whatever). But we can also wave our hands and say this is likely to wash out somewhat, too.

Now, if we had a nontamperable token available in all of (a) webrequest, (b) the RL JS payload, and (c) the event sent from the browser, that would probably be most ideal (perhaps we need it for the HTML page, and the HTML page making the request for the RL needs to tag that in as well if we want to be more robust), because we'd be able to avoid double counting and other problems. And we could make that independent of Experimentation Lab facilities if we really want to (for that matter, we can just tabulate for Experimentation Lab if we want to see that its behavior is in line with other event logging ratios under such a counting scheme).

One way to do this might be for RL to encode the value in a different bespoke header, so that Varnish doesn't have to crack open the RL payload but can instead take that value and plop it into X-Analytics. Because the value can be in both the RL payload and the bespoke header, we're then in okay shape. This doesn't play nicely with web-cached RL resources, though, because the origin RL isn't going to get pinged for every single client's JS fetch of such a resource (that would defeat the whole purpose of web caching).

I can imagine how we could use ServerTiming (a la other data workarounds) at the Varnish edge so that, if we really want to, we're capable of having a pseudorandom value emitted to the browser for an RL serve - and thus capable of being stitched into the event payload - and present in the X-Analytics data. I'm aware of other headers we may be able to repurpose for client-side interpretation without using ServerTiming (at least for any contemporary UA), although maybe better to not go there as it's more stuff to think about.

It did come to mind that sleeping tabs and other fun browser behaviors could result in double counting of things (EDIT: that's probably what the LocalStorage-related checklist item is about?). If there's a pseudorandom value, though, duplicates can be cleared out in any Hive queries when using Hive-based mechanisms for counting. For simpler counting (i.e., in the application logging interfaces), that doesn't work super well, at least unless we require the data to not be received beyond a reasonable window (it's easy enough to put an expires-by type of value in a pseudorandom value for this purpose).
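A tiny sketch of the expires-by idea (Python; the token format is entirely hypothetical, and a real nontamperable token would also need a signature, e.g. an HMAC, which is omitted here):

```python
import base64
import json
import os
import time


def make_token(ttl_seconds=3600):
    # Pseudorandom value plus an explicit expires-by timestamp, so simple
    # counting interfaces can drop events received past a reasonable window.
    payload = {
        "rand": base64.urlsafe_b64encode(os.urandom(12)).decode(),
        "expires_by": int(time.time()) + ttl_seconds,
    }
    return base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()


def is_fresh(token, now=None):
    now = int(time.time()) if now is None else now
    payload = json.loads(base64.urlsafe_b64decode(token))
    return now <= payload["expires_by"]


token = make_token(ttl_seconds=60)
print(is_fresh(token))                                # True
print(is_fresh(token, now=int(time.time()) + 3600))   # False
```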

Now, over in T402994: Add Monitoring & Logging for Edge Uniques Event Requests there's discussion about looking at the various tiers where failure can happen and potentially putting in logging at the different places so that we can diagnose things and do so robustly. Wearing my infrastructure engineering hat I might think we want to put all the things in all the places; wearing the platform engineering hat I might say that's reasonable; and putting on the YAGNI hat I might say let's start with something fairly indicative of the thing we care about most: that JS-capable web browsing on webpages gets events all the way to the backend...because if we see that 95-99+% (whatever is seen empirically) of that is coming through crisply, we're probably okay in the aggregate (I'm aware some UAs under certain spatiotemporal factors may be more prone to problems, of course).

Massive fluctuation on one ratio (e.g., Hive stored records divided by HTTP 200 RL payloads served that come bearing the instrument) that defies expected timeseries cycles is the thing that might trigger us to look more closely if things are going haywire, and that may induce us to add more logging in more places (which exacts more maintenance toll if we're being robust).

Now, as to the point of having confidence as a platform user (i.e., typically an application engineer and maybe a site reliability engineer conducting an A/B test), I suspect the webpage (or app) and stored record in Hive is the main point of interest for ratios - at least until we want to conduct experimentation that is more germane only to the network topology that sits in the middle, or even farther out at the DNS/LVS/haproxy area (or in case we see apparent data loss we can't explain without investigating those areas).


Thanks for clarifying that, @dr0ptp4kt. Yes. That's what I was trying to say. Sorry.

Right, hm: I wrote:

I'm not certain that X-Request-Id is set by the CDN for pass-thru/beacon style requests. If it was, and if we also captured X-Request-Id in the webrequest dataset, you'd be able to correlate POSTs to eventgate in webrequest with events.

If it is not possible to have varnish add a different x-request-id into webrequest for pass-thru requests, then all above makes sense. Is that not possible?


It may be possible? I see the link in the comment to T221976: Have CDN edge set the `X-Request-Id` header for incoming external requests. I think there could be some questions about the degree to which we decouple it from other data fields. The original scope of this stuff was around Experimentation Lab, although I think a general purpose thing could work just as well to the point of confidence...maybe we'd just want some simple tag for A/B-based test events, as that's one part of the code we care about most, and knowing that the ratios for those A/B-based test events checked out just as well as the non-A/B test events (i.e., other event logging on the web) could be useful for confidence about the different effective code execution paths (both at the client and at the tiers of the servers, especially MediaWiki land and, transitively, EventGate).

If we don't want to allow Jaeger traces or other forms of correlation (for various reasons), then we may need to iterate a bit. I actually would like to have Jaeger traces in some fashion, but I think we need to consider that HTML articles are prone to edge caching, so there's the initial page generation piece and there's the case where an HTML page is being served from cache.

@phuedx and I are meeting with @Vgutierrez and @BBlack today and could discuss this some, I think (assuming we have time, and I think we should probably make the time). I also have a call with RelEng folks about other anomaly detection concerning the hypothesis I'm working on (which is more about finding instability that surfaces from A/B tests; and, as a pie-in-the-sky stretch, maybe finding when A/B tests become unstable due to other code/config pushes) and could bounce the idea off of them based on what we discuss with @Vgutierrez and @BBlack.

I know @cjming is considering weighing some pros and cons of where we might add counting facilities and the degree to which we might want to be able to correlate, as opposed to simply "trusting" the counts as conveyed at the various tiers (to the point of trust, any token value received from the client is the thing where we don't have the same level of control, obviously, although there can be ways to avoid accidental counting: behavior that looks like a bona fide UA doing a real interaction with the site should be counted, while other behavior shouldn't be; of course we control things interior to our network).

Thanks all for the comments/info!

Based on a recent conversation with @Vgutierrez @BBlack @phuedx @dr0ptp4kt, seems like we already have the data to come up with a measurement/ratio that will give us a good enough signal for understanding event lossiness.

I'll try to summarize how we landed there and what our next steps might be after further discussion with Sam:


Problem to solve
Estimate client-side event loss: when a page view should trigger an event but it doesn’t reach the backend.

Disclaimer:
We all understand that it's impossible to capture every failure reason (browser crashes, machine crashes, network failures, etc., causing loss of state) - these are unknowable, unmeasurable, and to be expected.

BUT we still want to provide confidence while being aware that it’s a lossy pipeline. We need sample sizes big enough that event lossiness doesn’t matter. We can take advantage of the simplest measure first and then refine if we need more information.
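One way to make "a big enough sample size" concrete (an illustrative sketch of my own, not from the discussion: it simply inflates a required sample to compensate for an assumed loss rate):

```python
import math


def inflate_sample_size(required_n, expected_loss_rate):
    # If a fraction of events is expected to be lost in transit, enroll
    # enough subjects that the surviving sample still meets the requirement.
    return math.ceil(required_n / (1 - expected_loss_rate))


# With an assumed 2% loss rate, a 10,000-event requirement needs ~10,205 sent:
print(inflate_sample_size(10_000, 0.02))  # 10205
```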

Telemetry options

  • NEL (Network Error Logging) and Probenet already exist and likely provide “good-enough” signals without new identifiers.
  • NEL is Chrome/Chromium only, but that’s acceptable for trend detection. NEL can report errors and (sampled) successes.
    • Since it's implemented in the browser, there's nothing new to build and no user script to load, so there's no performance penalty. It does not cover non-Chromium browsers (which are excluded from analysis), but coverage is high since the majority of traffic comes from Chromium-based browsers.
  • Counting evt-103e successes vs NEL error reports for that path yields an actionable failure ratio - watch for spikes rather than chase an absolute truth.
  • We can create Prometheus dashboards from existing NEL data.

IDs

  • No X-Request-ID at CDN logs today
  • Envoy does some IDs but too late for this use
  • HAProxy (which forwards to Kafka) has a monotonic sequence counter - not globally unique, and it resets on restart
  • A true UUID-style request ID via HAProxy is possible but is extra work.
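For reference, a sketch of what that could look like (hypothetical HAProxy configuration; the uuid() fetch exists in HAProxy 2.1+ and the unique-id directives are long-standing, but the placement here is an assumption and untested):

```
frontend fe_text
    # Option 1: mint a true UUID per request and expose it as a header.
    http-request set-header X-Request-Id %[uuid()]

    # Option 2: HAProxy's built-in quasi-unique ID (cheaper, but not
    # globally unique, and the counter component resets on restart).
    unique-id-format %{+X}o\ %ci:%cp_%fi:%fp_%Ts_%rt:%pid
    unique-id-header X-Unique-Id
```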

Privacy & data design

  • Idea surfaced to propagate experiment group to webrequest to avoid an extra beacon—but privacy risks (small cohorts, re-identification) mean this needs security/privacy review and stronger xLab constraints (population minimums, duration caps, rate/volume limits).
  • Are there plans to ship more “edge-uniques” metadata to webrequest (age/frequency) for bot/quality analysis that are still privacy-safe?

Notable links

Next steps

  • Start with NEL-based ratio - calculate loss ratio (Chrome):
    • NEL_error_reports (evt-103e path) / (NEL_success_sample * (1/success_sampling_rate)) ?
    • NEL_error_reports / evt-103e_success_count (Chrome UA filtered) ?
  • Send NEL data to Prometheus/Grafana dashboard
  • Validate common failure modes (DNS blocks, uBlock Origin, extension blocking)
  • Implement proposal in T403507: Baseline rate of user agents that successfully load xLab client
    • Stage 1: fire beacon immediately inside the header in the context of an experiment from a client-side script via ResourceLoader
    • Stage 2: fire a second event when the page is loaded >> this metric is the successful load rate
    • From these events, we calculate a rate of the number of times loading failed
      • (Stage 1 - Stage 2) / Stage 1 ?
      • Stage 2 / (Stage 1 - number of times SDK loading failed) ?
  • Combo of NEL data + SDK load successes gives us an idea of how many events we expect to lose
  • Extract domains from the NEL data -- inquire with DPE if we can get the domain from these events:
    • Create an Airflow job?
    • Superset dashboard?
  • If we want more details after this, we'll consider revisiting IDs at HAProxy.
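The two Stage 1 / Stage 2 candidate formulas can be sketched as follows (Python; counts are illustrative):

```python
def sdk_load_failure_rate(stage1_count, stage2_count):
    # Option A above: fraction of Stage 1 beacons never followed by Stage 2.
    return (stage1_count - stage2_count) / stage1_count


def sdk_load_success_rate(stage1_count, stage2_count):
    # The complementary "successful load rate".
    return stage2_count / stage1_count


print(sdk_load_failure_rate(10_000, 9_700))  # 0.03
print(sdk_load_success_rate(10_000, 9_700))  # 0.97
```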

Change #1190981 had a related patch set uploaded (by Clare Ming; author: Clare Ming):

[mediawiki/extensions/MetricsPlatform@master] Add beacons based on config

https://gerrit.wikimedia.org/r/1190981

I set the patch to ready-for-review even though I'm still trying to figure out whether it's possible to send GET query string counters to Prometheus (actively researching).

In the meantime, the rest of the patch can be reviewed for both the image beacon and the xLab load event.

TESTING INSTRUCTIONS

I was able to test this by executing the following curl command (or navigating to the URL in a browser) and seeing corresponding counter increments in Thanos:

  • Run the following curl cmd:
curl -v 'https://en.wikipedia.org/beacon/statsv?mediawiki_page_load_head_pixel_total:1%7Cc'
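The query string is StatsD-style "name:value|type" ("c" for counter) with the pipe percent-encoded. A small helper to build such beacon URLs (Python; a sketch for illustration, not an existing utility):

```python
from urllib.parse import quote


def statsv_beacon_url(metric, value=1, metric_type="c",
                      base="https://en.wikipedia.org/beacon/statsv"):
    # statsv accepts a StatsD-style "name:value|type" string in the query;
    # the "|" must be percent-encoded as %7C.
    return f"{base}?{metric}:{value}{quote('|')}{metric_type}"


print(statsv_beacon_url("mediawiki_page_load_head_pixel_total"))
# https://en.wikipedia.org/beacon/statsv?mediawiki_page_load_head_pixel_total:1%7Cc
```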

Ginormous props and s/o to @CDanis for their generous knowledge sharing about how to grok NEL data, dashboarding, Prometheus, etc ⭐ 🌟 ✨

The main takeaway I got from Chris was that we should be looking at Logstash or the data lake for NEL data rather than Prometheus, which is not advised for high-cardinality data like NEL. Since we want to drill down on evt-103e per wiki URI host to understand event loss from experiments, particularly for anonymous traffic, we can filter by URL for that partial path - here are results from the last 7 days:
https://logstash.wikimedia.org/goto/cc7b7c396888cdab68af7ec38b6e5371

If we ignore reports with a high age in milliseconds (i.e. the browser having larger connectivity issues), the numbers are even lower (last 7 days):
https://logstash.wikimedia.org/goto/0d52f658bcc78063bd6e74f4c1f40753

This data is also subject to the number of experiments we have running at any given time. So currently, we have 1 experiment running for logged-out users - https://mpic.wikimedia.org/api/v1/experiments?format=config&authority=varnish

The logstash dashboard shows very few errors from the last 7 days (NEL report age in milliseconds between 0 and 60,000):

Screenshot 2025-09-26 at 6.04.00 PM.png (478×1 px, 74 KB)

Right now, with the minimal number of experiments we are running, we don't have sufficient data to really make conclusions about errors. SRE sets a failure_fraction of 0.05, meaning 5% of failures that occurred on NEL-implementing browsers are reported, which is a small sample given how few experiments are active. For example, last week there were 17 reports; since 5% is 1/20, multiplying 17 * 20 gives roughly 340 cumulative errors for that week. I would feel more comfortable making those calculations if we had a lot more data to work with. Over time, though, I think we'll get there.
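The extrapolation above can be written down directly (Python; the failure_fraction and report count come from the comment, the function name is mine):

```python
def extrapolate_failures(sampled_reports, failure_fraction):
    # Each sampled NEL failure report stands in for 1 / failure_fraction
    # actual failures (e.g. 20x at the 5% failure_fraction SRE configured).
    return sampled_reports / failure_fraction


# Last week's 17 reports imply roughly 340 failures in total:
print(round(extrapolate_failures(17, 0.05)))  # 340
```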

Apparently if we want to have this data in Prometheus and be able to alert on it, once we have the queries, we can get the exporter microservice (statsd-exporter) to make metrics for us too.

The main takeaway I got from Chris was that we should be looking at Logstash or the data lake for NEL data rather than Prometheus, which is not advised for high-cardinality data like NEL.

Re: the Data Lake as an option, work has begun on T304373: Also intake Network Error Logging events into the Analytics Data Lake which is very exciting ✨

Apparently if we want to have this data in Prometheus and be able to alert on it, once we have the queries, we can get the exporter microservice (statsd-exporter) to make metrics for us too.

Slight correction: for exporting NELs matching a certain query to Prometheus, you'd be using prometheus-es-exporter, writing a config file kinda like this one.

Change #1190981 merged by Clare Ming:

[mediawiki/extensions/MetricsPlatform@master] Send signals for initial page load and xLab load

https://gerrit.wikimedia.org/r/1190981

@Milimetric @phuedx cc @JVanderhoop-WMF where, how much (sample rate), how long do we want to have this running?

I set up the related experiment in xLab: https://mpic.wikimedia.org/experiment/xlab-mw-module-loaded
with the following config:

  • Traffic set to 0.1% on enwiki
  • Collect data for a month (?) starting 10/31 (we can start sooner but figure we need some config deployed on the target wikis)

Do we expect data loss to differ across wikis? Ideally we would be able to geolocate to have a better idea of global patterns of data loss; at this point our best proxy for geography is to cover a diversity of languages.

@mpopov to help compile a list of wikis

@cjming okay, @JVanderhoop-WMF and I looked at some data and came up with this list of language editions of Wikipedia:

  • Persian (fawiki)
  • Afrikaans (afwiki)
  • Arabic (arwiki)
  • Armenian (hywiki)
  • Bulgarian (bgwiki)
  • Burmese (mywiki)
  • Hebrew (hewiki)
  • Japanese (jawiki)
  • Italian (itwiki)
  • Azerbaijani (azwiki)
  • Portuguese (ptwiki)
  • Nepali (newiki)
  • Vietnamese (viwiki)

Let's enroll 10% on those

Added those wikis to the experiment

Change #1197344 had a related patch set uploaded (by Clare Ming; author: Clare Ming):

[operations/mediawiki-config@master] Add config for xLab MW Module experiment

https://gerrit.wikimedia.org/r/1197344

Added those wikis to the experiment

tysm!

Example query for counting requests to statsd beacon from server-side instrumentation:

SELECT COUNT(1) AS beacon_request_count
FROM wmf.webrequest 
WHERE webrequest_source = 'text'
  AND year = 2025 AND month = 10 AND day = 22 AND hour = 1
  AND starts_with(uri_path, '/beacon/statsv')
  AND starts_with(uri_query, '?mediawiki_page_load_head_pixel_total')

Example query for counting events from client-side instrumentation:

SELECT COUNT(1) AS event_count
FROM event.product_metrics_web_base
WHERE year = 2025 AND month = 10 AND day = 22 AND hour = 1
  AND experiment.enrolled = 'xlab-mw-module-loaded'
  AND action = 'xLab-loaded'

Change #1197344 merged by jenkins-bot:

[operations/mediawiki-config@master] Add config for xLab MW Module experiment

https://gerrit.wikimedia.org/r/1197344

Mentioned in SAL (#wikimedia-operations) [2025-10-23T20:03:15Z] <cjming@deploy2002> Started scap sync-world: Backport for [[gerrit:1197344|Add config for xLab MW Module experiment (T401705)]]

Mentioned in SAL (#wikimedia-operations) [2025-10-23T20:07:32Z] <cjming@deploy2002> cjming: Backport for [[gerrit:1197344|Add config for xLab MW Module experiment (T401705)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-10-23T20:14:38Z] <cjming@deploy2002> Finished scap sync-world: Backport for [[gerrit:1197344|Add config for xLab MW Module experiment (T401705)]] (duration: 11m 23s)

Change #1198591 had a related patch set uploaded (by Clare Ming; author: Clare Ming):

[operations/mediawiki-config@master] Add config for xLab MW Module experiment

https://gerrit.wikimedia.org/r/1198591

Change #1198591 merged by jenkins-bot:

[operations/mediawiki-config@master] Add config for xLab MW Module experiment

https://gerrit.wikimedia.org/r/1198591

Mentioned in SAL (#wikimedia-operations) [2025-10-27T20:05:56Z] <cjming@deploy2002> Started scap sync-world: Backport for [[gerrit:1198591|Add config for xLab MW Module experiment (T401705)]], [[gerrit:1198949|hCaptcha: Define HCaptchaSiteKey in CommonSettings.php (T405586)]]

Mentioned in SAL (#wikimedia-operations) [2025-10-27T20:10:01Z] <cjming@deploy2002> kharlan, cjming: Backport for [[gerrit:1198591|Add config for xLab MW Module experiment (T401705)]], [[gerrit:1198949|hCaptcha: Define HCaptchaSiteKey in CommonSettings.php (T405586)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-10-27T20:21:27Z] <cjming@deploy2002> Finished scap sync-world: Backport for [[gerrit:1198591|Add config for xLab MW Module experiment (T401705)]], [[gerrit:1198949|hCaptcha: Define HCaptchaSiteKey in CommonSettings.php (T405586)]] (duration: 15m 31s)

Here is the prometheus data in thanos for the mediawiki_page_load_head_pixel_total counter (this is the image beacon that gets sent if users are in the treatment group of xlab-mw-module-loaded) from the last day i.e. 10/29/25:

https://thanos.wikimedia.org/graph?g0.expr=mediawiki_page_load_head_pixel_total&g0.tab=0&g0.stacked=0&g0.range_input=1d&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=1&g0.store_matches=%5B%5D&g0.engine=prometheus&g0.analyze=0&g0.tenant=

Screenshot 2025-10-29 at 11.10.39 PM.png (1×2 px, 407 KB)

Screenshot 2025-10-29 at 11.10.22 PM.png (1×2 px, 261 KB)

I also ran the following query in the event table in Hive:

SELECT count(action)
FROM event.product_metrics_web_base
WHERE instrument_name = 'XLabMediaWikiModuleLoaded' AND year = 2025 AND month = 10 AND day = 29;

which returned 3028810

I need to figure out if the time frames/windows are close enough between the two to compare and if I'm reading the data correctly, but right now it looks like in the last day there are around 4.6M events for mediawiki_page_load_head_pixel_total (image beacon) compared to ~3M validated events related to the xLab-loaded action for the XLabMediaWikiModuleLoaded experiment.

Even if these numbers are in the ballpark, I'm a little flummoxed as to how to interpret this data. Does it seem reasonable that, of 4.6M initial page loads, fewer than 3/4 of those page visits result in the JS SDK being loaded?
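For concreteness, the ratio implied by those counts (both numbers as reported above, so treat the result as approximate):

```python
beacon_count = 4_600_000  # mediawiki_page_load_head_pixel_total (approx.)
event_count = 3_028_810   # validated xLab-loaded events from the Hive query

print(round(event_count / beacon_count, 3))  # 0.658, i.e. roughly 2/3
```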

As for NELs (remember that NEL applies to chromium browsers only which is the vast majority of traffic fwiw), when I look at the last 24 hours, those numbers seem negligible: https://logstash.wikimedia.org/goto/644fe81464747c09c2f1e5fed6795819
{F68650998}

Lastly, there are a number of EventGate validation errors related to this experiment - https://logstash.wikimedia.org/goto/ed4ee4253701e229baf5f3b58d88a301

About ~2500 errors in the last 24 hours.

Error message:

"Could not hoist data into experiment.subject_id for event event of schema at /analytics/product_metrics/web/base/1.4.2 destined to stream product_metrics.web_base: x-experiment-enrollments header does not have a matching enrollment in event data: One of 'experiment.enrolled' or 'experiment.assigned' fields does not have matching experiment or group name in header."

Here is the JSON from a sample errored event:

{
  "action": "xLab-loaded",
  "instrument_name": "XLabMediaWikiModuleLoaded",
  "experiment": {
    "enrolled": "xlab-mw-module-loaded",
    "assigned": "treatment",
    "subject_id": "awaiting",
    "sampling_unit": "edge-unique",
    "coordinator": "xLab"
  },
  "$schema": "/analytics/product_metrics/web/base/1.4.2",
  "agent": {
    "client_platform": "mediawiki_js",
    "client_platform_family": "desktop_browser"
  },
  "performer": {
    "is_logged_in": false,
    "pageview_id": "a16dd1106fe4c30ebf3b",
    "active_browsing_session_token": "ed276bb5711aa4dcb224"
  },
  "mediawiki": {
    "skin": "vector-2022",
    "database": "bgwiki"
  },
  "dt": "2025-10-30T07:06:28.243Z",
  "meta": {
    "stream": "product_metrics.web_base",
    "domain": "bg.wikipedia.org",
    "id": "17f4eb0d-a642-466f-8dd9-b0949b618ba8",
    "dt": "2025-10-30T07:06:31.540Z",
    "request_id": "efba9704-fbea-4b98-a8a7-fb6778c2905b"
  }
}

Will dig into this and try to figure out what's going on.

For the following, @cjming and I were corresponding via chat. These errors are thought to relate to cookies expiring at just the wrong time, although they could pertain to a number of other circumstances where something with the cookie is ever so slightly off, such that header enrichment and the subsequent hoisting into the payload are also off. When these errors occur at the end of an A/B test, that's more expected, too, although here the A/B test (which is for data baselining purposes) is in flight.

Looking at just one wiki during a somewhat less busy stretch of a few hours, for example: it looked like we may have been getting one to three of these errors per hour for bgwiki based on Logstash at https://logstash.wikimedia.org/goto/ce1d23e90dd9a1d6f7369679ea0cf74b whereas we were getting quite a lot more successful events coming through. That's plausible enough overall behavior given the ratio @cjming noted (somewhere around 0.05-0.08%, taking 2500 errors against the 4.6M beacon / ~3M event counts above; the ratio in this window on bgwiki was apparently higher).

spark-sql (default)> select hour, count(*) from event.product_metrics_web_base where year = 2025 and month = 10 and day = 29 and hour < 3  and instrument_name = 'XLabMediaWikiModuleLoaded' and mediawiki.database = 'bgwiki' group by hour;
hour	count(1)
0	402
2	197
1	302

There is some moderately complex C code that underpins some pieces, and I'll check in group chat about whether there may be cases where we want to further suppress the low-level hum of these errors.


The error is thrown when EventGate is trying to extract the subject ID for the experiment after the header has been parsed into experiment name/group name/subject ID 3-tuples. The header is valid but doesn't contain the experiment name/group name 2-tuple. Hrrm.
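The matching step described above can be sketched as follows. This is a simplified illustration, not EventGate's actual implementation: the `name:group:subject` serialization of the `x-experiment-enrollments` header is an assumed format for demonstration only.

```python
# Hypothetical sketch of the hoisting step: parse the x-experiment-enrollments
# header into (experiment, group, subject_id) 3-tuples, then hoist subject_id
# only if the event's experiment.enrolled / experiment.assigned pair matches.
# The ";"-separated "name:group:subject" header format is an assumption.
from typing import Optional

def hoist_subject_id(header: str, event: dict) -> Optional[str]:
    enrollments = [tuple(part.split(":")) for part in header.split(";") if part]
    enrolled = event.get("experiment", {}).get("enrolled")
    assigned = event.get("experiment", {}).get("assigned")
    for name, group, subject_id in enrollments:
        if name == enrolled and group == assigned:
            return subject_id
    # No matching experiment/group 2-tuple in the header -> validation error.
    return None

event = {"experiment": {"enrolled": "xlab-mw-module-loaded", "assigned": "treatment"}}
print(hoist_subject_id("xlab-mw-module-loaded:treatment:abc123", event))  # abc123
print(hoist_subject_id("other-experiment:control:xyz", event))            # None
```

In the failing case observed here, the header parses cleanly but contains no tuple matching the event's `enrolled`/`assigned` pair, so the lookup falls through and the error is raised.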

Change #1202794 had a related patch set uploaded (by Clare Ming; author: Clare Ming):

[mediawiki/extensions/MetricsPlatform@master] Re-run xLab MW Module Loaded experiment v2

https://gerrit.wikimedia.org/r/1202794

Change #1202794 merged by jenkins-bot:

[mediawiki/extensions/MetricsPlatform@master] Re-run xLab MW Module Loaded experiment v2

https://gerrit.wikimedia.org/r/1202794

Change #1202807 had a related patch set uploaded (by Clare Ming; author: Clare Ming):

[mediawiki/extensions/MetricsPlatform@wmf/1.46.0-wmf.1] Re-run xLab MW Module Loaded experiment v2

https://gerrit.wikimedia.org/r/1202807

Change #1202807 merged by jenkins-bot:

[mediawiki/extensions/MetricsPlatform@wmf/1.46.0-wmf.1] Re-run xLab MW Module Loaded experiment v2

https://gerrit.wikimedia.org/r/1202807

Mentioned in SAL (#wikimedia-operations) [2025-11-06T23:02:29Z] <cjming@deploy2002> Started scap sync-world: Backport for [[gerrit:1201826|Use wikimedia.org as the "server" for the wiki-agnostic RESTbase specs]], [[gerrit:1202766|Use prefixed 'sub' field in OAuth 2 access tokens on beta cluster (T399199)]], [[gerrit:1202807|Re-run xLab MW Module Loaded experiment v2 (T401705)]]

Mentioned in SAL (#wikimedia-operations) [2025-11-06T23:04:47Z] <cjming@deploy2002> cjming, tgr, aaron: Backport for [[gerrit:1201826|Use wikimedia.org as the "server" for the wiki-agnostic RESTbase specs]], [[gerrit:1202766|Use prefixed 'sub' field in OAuth 2 access tokens on beta cluster (T399199)]], [[gerrit:1202807|Re-run xLab MW Module Loaded experiment v2 (T401705)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there

Mentioned in SAL (#wikimedia-operations) [2025-11-06T23:11:02Z] <cjming@deploy2002> Finished scap sync-world: Backport for [[gerrit:1201826|Use wikimedia.org as the "server" for the wiki-agnostic RESTbase specs]], [[gerrit:1202766|Use prefixed 'sub' field in OAuth 2 access tokens on beta cluster (T399199)]], [[gerrit:1202807|Re-run xLab MW Module Loaded experiment v2 (T401705)]] (duration: 08m 34s)

I wrote up a report that hopefully captures some observations, insights, and next steps:
Traffic, Instrumentation, Experiment, and Data Analysis for Event Loss. Questions, comments, and challenges are welcome: anyone at WMF can comment, and I'll open it up for editing before publishing (to Wikitech, perhaps).

It felt too long to paste into a Phab comment. I closed the related hypothesis for this body of work based on the report's findings, which outline some questions and next steps we can pursue if there is appetite to understand more.

As mentioned, I will also publish this actionable intelligence somewhere more durable so it doesn't just live in a doc.

Closing this ticket as done for now.