
Implement debugging for events in the PHP SDK
Closed, Resolved · Public · 3 Estimated Story Points

Description

Add counters and debugging in the PHP SDK to measure total events sent vs. events effectively produced in order to estimate event-loss rates.

  • Emit a new counter called experiment_events_sent_total from the PHP SDK using StatsLib to be stored in Prometheus
  • Add a calculation in Prometheus to derive `events_lost_total` and `events_loss_rate`:
`experiment_events_loss_rate` = ( `experiment_events_sent_total` - event produce rate for `product_metrics.web_base` ) / `experiment_events_sent_total`
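The Prometheus-side arithmetic above can be sketched with made-up numbers (`sent` stands in for `experiment_events_sent_total`, `produced` for the event produce rate of `product_metrics.web_base`; both names are placeholders for illustration only):

```python
# Sketch of the loss-rate calculation described above, with made-up numbers.
# "sent" stands in for experiment_events_sent_total (emitted by the PHP SDK);
# "produced" stands in for the event produce rate observed for
# product_metrics.web_base over the same window.
sent = 10_000
produced = 9_950

events_lost_total = sent - produced           # 50 events never made it through
events_loss_rate = (sent - produced) / sent   # 0.005, i.e. 0.5% loss
```

In practice this would be a PromQL expression over the two metrics rather than application code; the sketch just shows the arithmetic.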

Technical Notes

(attribution to @aaron for articulating the following)

  • Try to minimize both false negatives (missed losses) and false positives (counting valid events as losses)
  • Timeouts & DeferredUpdates failures in MediaWiki
    • Updates may fail due to unrelated factors (request timeouts, corrupted DB handles, slow HTTP flushes, etc.):
      • Timeouts from client/proxy layers
      • Non-definitive responses (e.g., HTTP 202 from EventGate, i.e. a proxy buffers the payload for a storage backend to read later)
      • Slow HTTP response output generation
      • Slow flushing of the output
      • Other hook handlers and DomainEvent emissions might add to the DeferredUpdates queue.
      • Events can be emitted inside/outside of RDBMS transactions
        • A prior update corrupts the session state of a needed RDBMS connection handle
        • If a caller wants to use the PHP SDK to emit an event, it could happen in the middle of a relevant RDBMS transaction round. Some events might not care whether the transaction commits, but many should only be emitted if the transaction commits.
          • MWCallableUpdate helps tie events to transaction commits, though COMMIT timeouts/connection resets remain ambiguous: it treats a COMMIT timeout or a connection loss with TCP reset (rare as they are) as a ROLLBACK, even though the COMMIT may technically have succeeded.
      • The more work that happens between an update being queued and run, the more likely problems become, and we can't know in advance what will run in that window.
  • Need a reliable storage system for loss metrics
    • Prometheus Integration: Use StatsLib/StatsFactory for counters. Assume sending metrics via UDP → statsd-exporter is reliable
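The transaction caveats above can be sketched as a pattern (illustrative Python, not MediaWiki's actual MWCallableUpdate or DeferredUpdates API): events raised inside a transaction are buffered and only submitted once the transaction commits.

```python
# Illustrative sketch (hypothetical class, not MediaWiki's API) of the pattern
# MWCallableUpdate enables: buffer events raised inside a transaction and emit
# them only if the transaction commits; drop them on rollback.
class CommitGatedEmitter:
    def __init__(self, send):
        self.send = send      # callable that actually submits an event
        self.pending = []     # events buffered while a transaction is open
        self.in_txn = False

    def begin(self):
        self.in_txn = True

    def emit(self, event):
        # Inside a transaction, defer; outside, send immediately.
        if self.in_txn:
            self.pending.append(event)
        else:
            self.send(event)

    def commit(self):
        for event in self.pending:
            self.send(event)
        self.pending.clear()
        self.in_txn = False

    def rollback(self):
        # Ambiguity noted above: a COMMIT timeout or connection reset is
        # treated as a rollback even though the COMMIT may have succeeded,
        # so these buffered events are a source of false "losses".
        self.pending.clear()
        self.in_txn = False
```

A commit-gated emitter avoids false positives from rolled-back transactions, at the cost of counting events as lost when an ambiguous COMMIT actually succeeded.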

Acceptance Criteria

  • Decide on core metrics labels:
    • experiment_events_sent_total
    • experiment_events_lost_total
    • experiment_events_loss_rate
  • Prometheus Storage Integration (similar to Javascript - T401705: Implement debugging for events in the Javascript SDK)
    • Confirm Prometheus as storage backend
    • Implement counters in StatsLib for PHP SDK emissions
    • Validate UDP → statsd-exporter reliability

Extra credit:

  • DeferredUpdates & Transaction Handling
    • Audit event emission inside RDBMS transactions
    • Use MWCallableUpdate where events must follow COMMIT success
    • Document risks for COMMIT timeout/connection reset edge cases

Event Timeline

Milimetric triaged this task as Medium priority. Aug 12 2025, 3:42 PM
Milimetric moved this task from Incoming to Backlog on the Test Kitchen board.
JVanderhoop-WMF raised the priority of this task from Medium to High. Aug 19 2025, 3:34 PM

IIRC, these events are submitted to eventgate-analytics-external via the EventBus extension. You might have some of what you need already. Check

You can change the stream names selected accordingly to filter for just ExP streams.

Nice! The metrics very likely already exist!


From the task description:

  • Need a reliable storage system for loss metrics
    • Prometheus Integration: Use StatsLib/StatsFactory for counters. Assume sending metrics via UDP → statsd-exporter is reliable

and then:

Validate UDP → statsd-exporter reliability

These lines are contradictory.

FWIW I agree with @aaron that it's sufficient to assume that sending metrics via UDP to statsd-exporter is reliable. We've made this assumption across all components in the Experiment Platform.

tysm @Ottomata for the links! I'm trying to grok the dashboards and wondering if we do in fact have most of what's needed

EventBus metrics for eventgate_analytics_external EventGate intake service (drilled down to product_metrics.web_base) instances for the last 7 days:
https://grafana.wikimedia.org/goto/3XmXi_9NR?orgId=1

Screenshot 2025-09-09 at 7.27.52 AM.png (1×2 px, 686 KB)

Some Qs:

  • Is it odd that in terms of throughput, accepted is greater than outgoing?
  • that ratio (outgoing/accepted) being almost even means we have a near perfect throughput?

Event produce rate for product_metrics.web_base:
https://grafana.wikimedia.org/goto/ZbW4Zl9HR?orgId=1

As more experiments roll on/off, max/mean numbers will obviously fluctuate. For the core metrics, I'm intuiting that the produce rate would constitute the difference between events_sent_total and events_lost_total. So if we can capture events_sent_total from the PHP SDK, then we can subtract this event produce rate from this to get events_lost_total, with a ratio over time called events_loss_rate:

events_loss_rate = (events_sent_total - events_lost_total) / events_sent_total ?
or
events_loss_rate = event produce rate for product_metrics.web_base / events_sent_total ?
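A quick numeric check of these two candidates against the loss-rate definition in the task description, with made-up numbers:

```python
# Made-up numbers: 10,000 events sent, 9,950 produced, so 50 lost.
sent, produced = 10_000, 9_950
lost = sent - produced                  # events_lost_total = 50

# Definition from the task description: fraction of sent events that were lost.
loss_rate = (sent - produced) / sent    # 0.005

# The first candidate, (sent - lost) / sent, equals produced / sent: that is
# the delivery (success) rate, not the loss rate.
success_rate = (sent - lost) / sent     # 0.995
```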

FWIW, EventBus metrics are emitted here. Docs on these metrics could be better.

Is it odd that in terms of throughput, accepted is greater than outgoing?

If the numbers were way off, I'd say yes. But (IIUC, and I might not!) those numbers are a sum of an instantaneous rate of change over a 5-minute window. The Prometheus query is e.g.:

sum(irate(mediawiki_EventBus_events_outgoing_total{event_service_name=~"$service", event_type=~"$event_type"}[5m]))

Metrics values are emitted and scraped periodically, so even the time at which the values are collected by prometheus for different metrics could matter.

Generally they should be the same-ish. Close numbers I'd assume are just due to Prometheus processing. But way-off numbers would indicate something is wrong. IIRC, outgoing is what EventBus sends, and accepted is the number of events EventGate 'accepted' and will then attempt to produce to Kafka (or did produce to Kafka, depending on the producer type used).

I'm pretty sure that the EventLogging extension (which I think ExP/MP still use?) always uses the hasty producer, so 'accepted' would just mean that EventGate received the POST of events, not that it actually validated or produced the events to Kafka.

that ratio (outgoing/accepted) being almost even means we have a near perfect throughput?

Yes, and that should be the normal case. Regular loss should be quite rare, and probably wouldn't show up in an overview panel like this.

if we can capture events_sent_total from the PHP SDK

Hm, do you think the EventBus send() method needs a metric before outgoing is calculated? Reading the code, it looks like send() might fail before this, e.g. if you passed the events as a string and they could not be JSON-decoded.

In practice though, unless there is something between the ExP PHP SDK's submit(?) and the EventBus send() call you want to account for, I'd expect the EventBus outgoing metric(s) to be equivalent to whatever you give ExP submit(?). Reading the EventBus send() code, it doesn't have any logic relevant to your events before it increments the outgoing counters.

if we can capture events_sent_total from the PHP SDK

But! sure! If you want ExP to track events_sent_total (why not!?)

BTW, if you want to be more precise, the message produce rate of the underlying Kafka topics is closer to reality than EventGate's per-stream produce rate. E.g. https://grafana.wikimedia.org/goto/S6_5ylrHR?orgId=1. But, in practice, EventGate's per-stream produce rate is probably easier to reason with, because you don't have to aggregate over multiple topics :)

then we can subtract this event produce rate from this to get events_lost_total

Sure! But I don't think you could make this an actual prometheus metric, right? Your PHP SDK wouldn't know about events_lost_total, this would just be a calculation you'd make in Prometheus? E.g. ExP exp_events_sent_total - eventgate_produce_rate == exp_events_lost_total, right?

I'd be careful with relying on it as a source of truth for # of events lost because A. you can't ever be 100% sure about exp_events_sent_total and B. the systems emitting the metrics are different, and each are being scraped/sampled at different times.

But ya should work for alerting!

(Apologies if I have errors above, I typed this quickly without much review because AHHH SO BUSY! :)

thanks again @Ottomata for your responses -- ya, this is just a calculation we will make in Prometheus. Understand the perils of relying on this metric as source of truth and appreciate the forewarning.

We're going to go ahead and set this up on our side so we can track it too, with the expectation you mention: that our exp_events_sent_total will have close if not exact parity with EventBus's outgoing metric.

cjming updated the task description. (Show Details)

Change #1188502 had a related patch set uploaded (by Clare Ming; author: Clare Ming):

[mediawiki/extensions/MetricsPlatform@master] Add counter to PHP SDK when sending events.

https://gerrit.wikimedia.org/r/1188502

waiting for some updates on Growth Experiments' tests which are breaking with the introduction of StatsFactory as a constructor parameter for Experiment and ExperimentManager in https://gerrit.wikimedia.org/r/1188502

Change #1194239 had a related patch set uploaded (by Sergio Gimeno; author: Sergio Gimeno):

[mediawiki/extensions/GrowthExperiments@master] tests: temporary disable ExperimentXLabManager tests

https://gerrit.wikimedia.org/r/1194239

Change #1194239 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] tests: temporary disable ExperimentXLabManager tests

https://gerrit.wikimedia.org/r/1194239

Change #1194571 had a related patch set uploaded (by Sergio Gimeno; author: Sergio Gimeno):

[mediawiki/extensions/GrowthExperiments@master] tests(ExperimentXLabManager): update and renable the tests

https://gerrit.wikimedia.org/r/1194571

Change #1188502 merged by jenkins-bot:

[mediawiki/extensions/MetricsPlatform@master] Add counter to PHP SDK when sending events.

https://gerrit.wikimedia.org/r/1188502

Change #1194571 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] tests(ExperimentXLabManager): update and renable the tests

https://gerrit.wikimedia.org/r/1194571

To ensure we have enough data for the mediawiki_MetricsPlatform_experiment_events_sent_total prometheus counter, I updated the dates for Logged-in Synthetic A/A Test (PHP SDK) to run starting now (10/29/25) through 11/6/25, basically a week.

Here are events immediately showing up in Thanos - https://thanos.wikimedia.org/graph?g0.expr=mediawiki_MetricsPlatform_experiment_events_sent_total&g0.tab=1&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=1&g0.store_matches=%5B%5D&g0.engine=prometheus&g0.analyze=0&g0.tenant=:

Screenshot 2025-10-29 at 10.48.36 PM.png (2×3 px, 638 KB)

This proves that data collection is underway and the counter is working. I will let this run for the week, but I think we can call this done or close to done.

The remaining work, after collecting data for a bit, is to calculate over a specified date range the number of events we see in product_metrics.web_base for the PageVisit instrument and page-visited events, which has to be derived via manual queries.

I ran the following query in Hive:

SELECT count(*)
FROM event.product_metrics_web_base
WHERE action = 'page-visited' AND instrument_name = 'PageVisit' AND year = 2025 AND month = 10 AND day = 30;

which returned 8419

Since the experiment just got turned back on, I will run this query again once we have a full day's worth of data (i.e. 10/30/25 or later).

This comment was removed by cjming.

I wrote up a report that hopefully captures some observations, insights, and next steps:
Traffic, Instrumentation, Experiment, and Data Analysis for Event Loss << questions, comments, and challenges welcome (anyone at WMF can comment; I'll open it up for editing before publishing, perhaps on Wikitech)

It felt too long to paste into a Phab comment. I closed the related hypothesis for this body of work based on the findings of the report, which outlines some questions and next steps we can pursue if there is appetite to understand more.

As mentioned, I will also publish this actionable intelligence somewhere so it doesn't just live in a doc

closing this ticket as finally done for now