
[Event Platform] Instrument EventBus with prometheus MW Statslib
Open, Needs Triage, Public

Description

There are discussions about occasions where EventBus fails to produce events to EventGate.

etc.

We have some failure metrics from the Envoy mesh local proxies and from EventGate (see the Errors section of the EventGate Grafana dashboard).

However, we do not have these metrics from EventBus itself. We do have failure logs in Logstash.

The client is best placed to know when errors happen, especially 5xx errors.

Now that MediaWiki has Prometheus support (T350592: EPIC: migrate in use metrics and dashboards to statslib), we should instrument EventBus and add metrics around event production and whatever else might be nice to have.

https://www.mediawiki.org/wiki/Manual:Stats has instructions for how to use the MW Stats library to do this.

Doing this will help us quantify when we fail to produce events, which will help us with defining SLOs and documentation for T120242: Eventually Consistent MediaWiki State Change Events.

Done is
  • EventBus emits counters for event production and for failures, with informative labels. Labels should probably include:
      • stream name
      • event service name (eventgate name)
      • maybe $schema, if it isn't hard to get
      • HTTP status code (?)
      • etc.
  • Any other easy/useful/relevant EventBus metrics are emitted.
  • EventBus metrics are shown in a Grafana dashboard, either the existing EventGate one or a new one for EventBus.
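
To make the proposed label set concrete, here is a minimal, language-neutral sketch in Python of what the counters could record. It does not use the MW Stats library (see the Manual:Stats link above for the real API). The metric names, label keys, the `LabeledCounters` class, and the `record_produce` helper are all hypothetical, and treating any non-2xx response as a failure is an assumption that may need refining.

```python
from collections import Counter

# Hypothetical metric names; the real ones would be chosen during implementation.
EVENTS_TOTAL = "eventbus_outgoing_events_total"
FAILURES_TOTAL = "eventbus_events_failed_total"


class LabeledCounters:
    """In-memory stand-in for a labeled counter registry (not the MW Stats library)."""

    def __init__(self) -> None:
        self._values: Counter = Counter()

    def increment(self, name: str, value: int = 1, **labels: str) -> None:
        # Sort labels so the same label set always maps to the same key.
        key = (name, tuple(sorted(labels.items())))
        self._values[key] += value

    def get(self, name: str, **labels: str) -> int:
        # Counter returns 0 for a label set that was never incremented.
        return self._values[(name, tuple(sorted(labels.items())))]


def record_produce(counters: LabeledCounters, stream: str, event_service: str,
                   status_code: int, event_count: int = 1) -> None:
    """Count one produce attempt, and count it as a failure on a non-2xx response."""
    labels = {
        "stream_name": stream,
        "event_service_name": event_service,
        "status_code": str(status_code),
    }
    counters.increment(EVENTS_TOTAL, event_count, **labels)
    # Assumption: anything outside 2xx is a failure.
    if not 200 <= status_code < 300:
        counters.increment(FAILURES_TOTAL, event_count, **labels)


# Example values only: one successful produce and one 503 for the same stream.
counters = LabeledCounters()
record_produce(counters, "mediawiki.page_change.v1", "eventgate-main", 201)
record_produce(counters, "mediawiki.page_change.v1", "eventgate-main", 503)
```

Keeping the status code as a label on both counters would let a dashboard break failures down by response class without a separate metric per code, at the cost of somewhat higher label cardinality.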