Description
Add counters and debugging in the PHP SDK to measure total events sent vs. events effectively produced in order to estimate event-loss rates.
- Emit a new counter called `experiment_events_sent_total` from the PHP SDK using StatsLib, to be stored in Prometheus
- Calculate `experiment_events_lost_total` and `experiment_events_loss_rate` in Prometheus, with both terms taken over the same time window:
`experiment_events_lost_total` = `experiment_events_sent_total` - events produced for `product_metrics.web_base`
`experiment_events_loss_rate` = ( `experiment_events_sent_total` - events produced for `product_metrics.web_base` ) / `experiment_events_sent_total`
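As a worked illustration of the calculation above, here is a minimal Python sketch (Python is used only for illustration; the real calculation would live in Prometheus). The function name `loss_metrics`, the clamping of negative differences to zero, and the choice to report a rate of 0 when nothing was sent are all assumptions, not decisions recorded in this task.

```python
def loss_metrics(sent_total: float, produced_total: float) -> tuple:
    """Return (events_lost_total, events_loss_rate) for one time window.

    Both inputs are assumed to be increases over the SAME window.
    """
    if sent_total <= 0:
        # Assumption: with nothing sent, report zero loss instead of dividing by zero.
        return 0.0, 0.0
    # Assumption: clamp at zero, since window misalignment can make
    # the produced count briefly exceed the sent count.
    lost = max(sent_total - produced_total, 0.0)
    return lost, lost / sent_total


if __name__ == "__main__":
    lost, rate = loss_metrics(1000, 980)
    print(f"lost={lost:.0f} rate={rate:.2%}")
```

For example, 1000 events sent and 980 produced in the same window gives 20 lost events and a loss rate of 2%.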
Technical Notes
(attribution to @aaron for articulating the following)
- Try to minimize both false negatives (missed losses) and false positives (counting valid events as losses)
- Timeouts & DeferredUpdates failures in MediaWiki
- Updates may fail due to unrelated factors (request timeouts, corrupted DB handles, slow HTTP flushes, etc.):
- Timeouts from client/proxy layers
- Non-definitive responses (e.g., a 202 from EventGate, i.e. the proxy buffers the event for the storage backend to read later)
- Slow HTTP response output generation
- Slow flushing of the output
- Other hook handlers and DomainEvent emissions might add on to the DeferredUpdates queue.
- Events can be emitted inside/outside of RDBMS transactions
- A prior update corrupts the session state of a needed RDBMS connection handle
- If a caller uses the PHP SDK to emit an event, the emission could happen in the middle of a relevant RDBMS transaction round. Some events might not care whether the transaction commits, but many should be emitted only if it does.
- MWCallableUpdate helps tie events to transaction commits, though COMMIT timeouts and connection resets remain ambiguous: it treats a COMMIT timeout or a connection lost to a TCP reset (rare as they are) as a ROLLBACK, even though the COMMIT may in fact have succeeded
- The more that happens between an update being queued and being run, the more likely problems become, and there is no way to know in advance what will run in that interval
- Need a reliable storage system for loss metrics
- Prometheus Integration: Use StatsLib/StatsFactory for counters. Assume sending metrics via UDP → statsd-exporter is reliable
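To illustrate why the UDP path is an assumption rather than a guarantee, here is a minimal Python sketch of a fire-and-forget counter emission (Python is used only for illustration; the SDK itself would go through StatsLib/StatsFactory). It formats a DogStatsD-style line, which statsd-exporter can parse, and sends it as a single datagram. The helper names `format_counter` and `send_counter` and the `stream` label are hypothetical; port 9125 is statsd-exporter's default UDP listen port, and whether StatsLib produces exactly this line is not confirmed here.

```python
import socket
from typing import Dict, Optional


def format_counter(name: str, value: int = 1,
                   labels: Optional[Dict[str, str]] = None) -> str:
    """Build a DogStatsD-style counter line: name:value|c|#key:val,key:val"""
    line = f"{name}:{value}|c"
    if labels:
        # Sort labels so the output is deterministic.
        tags = ",".join(f"{k}:{v}" for k, v in sorted(labels.items()))
        line += f"|#{tags}"
    return line


def send_counter(line: str, host: str = "127.0.0.1", port: int = 9125) -> int:
    """Send one datagram and return the number of bytes handed to the OS.

    UDP gives no delivery acknowledgement, so a successful return does
    not prove that statsd-exporter received the metric.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(line.encode("utf-8"), (host, port))


if __name__ == "__main__":
    # Hypothetical label; the real label set is still to be decided.
    print(format_counter("experiment_events_sent_total", 1,
                         {"stream": "product_metrics.web_base"}))
```

Because the sender never learns whether a datagram arrived, any loss on this path would silently undercount `experiment_events_sent_total`.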
Acceptance Criteria
- Decide on core metric names:
- experiment_events_sent_total
- experiment_events_lost_total
- experiment_events_loss_rate
- Prometheus Storage Integration (similar to Javascript - T401705: Implement debugging for events in the Javascript SDK)
- Confirm Prometheus as storage backend
- Implement counters in StatsLib for PHP SDK emissions
- Validate UDP → statsd-exporter reliability (presumed reliable)
Extra credit:
- DeferredUpdates & Transaction Handling
- Audit event emission inside RDBMS transactions
- Use MWCallableUpdate where events must follow COMMIT success
- Document risks for COMMIT timeout/connection reset edge cases

