Description
Estimate client-side event loss: cases where a page view should trigger an event but the event never reaches the backend.
- Start with an NEL-based loss ratio (Chrome only). Open question, which formula to use:
NEL_error_reports (evt-103e path) / (NEL_success_sample * (1 / success_sampling_rate))
or
NEL_error_reports / evt-103e_success_count (Chrome UA filtered)
- Send NEL data to Prometheus/Grafana dashboard
- Validate common failure modes (DNS blocks, uBlock Origin, extension blocking)
- Implement proposal in T403507: Baseline rate of user agents that successfully load xLab client
- Stage 1: fire a beacon immediately from the page header, in the context of an experiment, from a client-side script delivered via ResourceLoader
- Stage 2: fire a second event once the page has loaded; this gives the successful load rate
- From these events, calculate the rate at which loading failed. Open question, which formula to use:
(Stage 1 - Stage 2) / Stage 1
or
Stage 2 / (Stage 1 - number of times SDK loading failed)
- The combination of NEL data + SDK load successes will give us an idea of how many events we expect to lose.
- Extract domains from the NEL data via Airflow job or Superset dashboard
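
The candidate formulas above can be compared side by side on the same counts. The following is a minimal sketch, not a settled definition: all function and parameter names are hypothetical, the inputs are assumed to be plain counts already filtered to the evt-103e path and Chrome user agents, and which formula to adopt remains the open question noted in the list.

```python
from typing import Optional


def _safe_ratio(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is not positive."""
    if denominator <= 0:
        return None
    return numerator / denominator


def nel_loss_ratio_scaled(error_reports: int, success_sample: int,
                          success_sampling_rate: float) -> Optional[float]:
    """NEL errors over sampled successes scaled up by 1 / success_sampling_rate."""
    if not 0 < success_sampling_rate <= 1:
        raise ValueError("success_sampling_rate must be in (0, 1]")
    estimated_successes = success_sample * (1 / success_sampling_rate)
    return _safe_ratio(error_reports, estimated_successes)


def nel_loss_ratio_direct(error_reports: int, success_count: int) -> Optional[float]:
    """NEL errors over the backend's own success count for the same path."""
    return _safe_ratio(error_reports, success_count)


def load_failure_rate(stage1: int, stage2: int) -> Optional[float]:
    """(Stage 1 - Stage 2) / Stage 1."""
    return _safe_ratio(stage1 - stage2, stage1)


def load_success_rate_excluding_sdk_failures(stage1: int, stage2: int,
                                             sdk_failures: int) -> Optional[float]:
    """Stage 2 / (Stage 1 - number of times SDK loading failed)."""
    return _safe_ratio(stage2, stage1 - sdk_failures)


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(nel_loss_ratio_scaled(20, 100, 0.25))
    print(nel_loss_ratio_direct(20, 400))
    print(load_failure_rate(1000, 950))
    print(load_success_rate_excluding_sdk_failures(1000, 950, 50))
```

Returning None for an empty denominator keeps a quiet time window from showing up as a spurious 0% or a division error on a dashboard.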
Technical Notes
Telemetry options
- NEL (Network Error Logging) and Probenet already exist and likely provide “good-enough” signals without new identifiers.
- NEL is Chrome/Chromium only, but that’s acceptable for trend detection. NEL can report errors and (sampled) successes.
- Because NEL is implemented in the browser, nothing new needs to be built and no user script needs to load, so there is no performance penalty. Non-Chromium browsers are not covered (and are excluded from the analysis), but coverage is still high because the majority of traffic is included.
- Counting evt-103e successes vs NEL error reports for that path yields an actionable failure ratio; the goal is to watch for spikes rather than chase an absolute truth.
- We can create Prometheus dashboards from existing NEL data.
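
As a rough illustration of turning raw NEL reports into per-domain error counts, the sketch below groups reports by host and error type and weights each one by the inverse of its sampling fraction. It assumes reports shaped like the W3C Network Error Logging payload (a top-level `url` plus a `body` with `type` and `sampling_fraction`); the actual w3c.reportingapi.network_error stream schema may differ, and the function name is hypothetical.

```python
from collections import Counter
from typing import Iterable, Mapping
from urllib.parse import urlsplit


def weighted_errors_by_host(reports: Iterable[Mapping]) -> Counter:
    """Estimate error counts per (host, error type), scaled by 1 / sampling_fraction."""
    totals: Counter = Counter()
    for report in reports:
        body = report.get("body") or {}
        error_type = body.get("type", "unknown")
        if error_type == "ok":
            # NEL success reports carry type "ok"; only errors are counted here.
            continue
        host = urlsplit(report.get("url", "")).hostname
        fraction = body.get("sampling_fraction", 1.0)
        if not host or not isinstance(fraction, (int, float)) or fraction <= 0:
            # Skip malformed reports rather than guess a weight.
            continue
        totals[(host, error_type)] += 1.0 / fraction
    return totals


if __name__ == "__main__":
    # Made-up reports; example.org is a placeholder host.
    sample = [
        {"url": "https://example.org/evt-103e",
         "body": {"type": "dns.name_not_resolved", "sampling_fraction": 0.5}},
        {"url": "https://example.org/evt-103e",
         "body": {"type": "ok", "sampling_fraction": 0.01}},
    ]
    for key, estimate in weighted_errors_by_host(sample).items():
        print(key, estimate)
```

The same grouping could run in an Airflow job or be expressed as a Superset query; the point is only that each sampled report stands in for 1 / sampling_fraction real requests.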
Additional Notes
- A JS SDK → Gateway path is needed to get JS metrics into Prometheus.
- This path requires validation and LocalStorage retry/buffering.
- Error stream: w3c.reportingapi.network_error - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&from=now-3h&to=now&timezone=utc&var-service=eventgate-logging-external&var-stream=w3c.reportingapi.network_error&var-kafka_broker=$__all&var-kafka_producer_type=$__all&var-dc=000000026&var-site=$__all&refresh=5m
- Error stream definition: https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/d7558525f2201d2abbf64c7ebf219238ff96d2c1/wmf-config/ext-EventStreamConfig.php#1253
- Relevant VCL code: https://gerrit.wikimedia.org/g/operations/puppet/+/production/modules/varnish/templates/wikimedia-frontend.vcl.erb#169
- NEL logstash dashboard: https://logstash.wikimedia.org/app/dashboards#/view/ee6432c0-82a9-11eb-9d45-739221ba7fb6?_g=h@08f6778&_a=h@2327450
Acceptance Criteria
- Design gateway for JS SDK → Prometheus ingestion
- Implement counters in StatsLib for JavaScript SDK emissions
- Add validation/trust model to prevent garbage data
- Implement LocalStorage buffering/flushing & retry logic


