[Event Platform] Instrument EventBus with prometheus MW Statslib
Open, Needs Triage, Public, 13 Estimated Story Points

Description

There are discussions about occasions where EventBus fails to produce events to EventGate.

etc.

We have some metrics from envoy mesh local proxies and from eventgate when things fail. (See Errors section of the EventGate grafana dashboard).

However, we do not have these metrics from EventBus itself. We do have failure logs in logstash.

The client knows best when failures happen, especially for 5xx errors.

Now that MediaWiki has prometheus support (T350592: EPIC: migrate in use metrics and dashboards to statslib), we should instrument EventBus and add metrics around event production and whatever else might be nice to have.

https://www.mediawiki.org/wiki/Manual:Stats has instructions for how to use the MW Stats library to do this.

Doing this will help us quantify when we fail to produce events, which will help us with defining SLOs and documentation for T120242: Eventually Consistent MediaWiki State Change Events.

Done is
  • EventBus emits counters for produced events and produce failures, with informative labels. Labels should probably include:
      • stream name
      • event service name (eventgate name)
      • maybe $schema if it isn't hard to get?
      • HTTP status code (?)
      • etc.
  • Any other easy/useful/relevant EventBus metrics are emitted.
  • EventBus metrics are shown in a Grafana dashboard, either in the existing EventGate one, or a new one for EventBus.

Event Timeline

@gmodena I'm considering a change in EventBus for which I'd need to know stream name in the EventBus send() method. I think you said you'd need this too?

We have $events in send(). meta.stream is a pretty strongly enforced convention; I don't think EventBus has ever not supported it. What if we just extracted it from the $events inside of send()?

...Or we could consider changing the method signature to include $streamName ?

@gmodena I'm considering a change in EventBus for which I'd need to know stream name in the EventBus send() method. I think you said you'd need this too?

You beat me to this message :).

We have $events in send(). meta.stream is a pretty strongly enforced convention; I don't think EventBus has ever not supported it. What if we just extracted it from the $events inside of send()?

This was my initial plan of attack.
I've been looking into reporting metrics (number of records) per stream by parsing the message payload and extracting the stream name from each record.
This is fine for the happy path where we pass send() an array and all messages are delivered. We'd record something along the lines of

mediawiki_eventbus_outgoing_events_total{stream_name="mediawiki_page_change_v1"} 3
mediawiki_eventbus_outgoing_events_total{stream_name="mediawiki_revision_create"} 3
mediawiki_eventbus_outgoing_events_total{stream_name="resource_change"} 1
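A minimal sketch of the per-stream counting described above, written in Python purely for illustration (EventBus itself is PHP). It assumes the events arrive as a list of dicts carrying `meta.stream`; the function name and the placeholder value for events without a stream are hypothetical and not part of EventBus.

```python
from collections import Counter

# Hypothetical placeholder for events that carry no meta.stream.
MISSING_STREAM = "__stream_name_missing__"


def count_events_by_stream(events):
    """Count events per meta.stream without serializing or re-parsing them."""
    counts = Counter()
    for event in events:
        stream = event.get("meta", {}).get("stream", MISSING_STREAM)
        counts[stream] += 1
    return counts
```

Each key of the returned counter would become one `stream_name` label value, and each count the amount by which the corresponding counter is incremented.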

If all events in the send() call have the same stream name, we can break down response status codes by stream without having to re-parse the response body. E.g.

mediawiki_eventbus_responses_total{status_code="201", stream_name="mediawiki_page_change_v1"} 3
# malformed test input
mediawiki_eventbus_responses_total{status_code="207", stream_name="malformed_payload"} 5
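The "same stream name for the whole batch" condition can be checked cheaply before labelling the response. The sketch below is illustrative Python, not the EventBus implementation; `batch_stream_label`, `response_labels` and the `__mixed__` placeholder are hypothetical names chosen for this example.

```python
# Hypothetical label value used when a batch has no single stream name.
MIXED_STREAMS = "__mixed__"


def batch_stream_label(events):
    """Return the shared meta.stream of a batch.

    Falls back to a placeholder when the batch is empty, mixes streams,
    or contains events without meta.stream.
    """
    names = {event.get("meta", {}).get("stream") for event in events}
    if len(names) == 1 and None not in names:
        return next(iter(names))
    return MIXED_STREAMS


def response_labels(events, status_code):
    """Build the label set for one HTTP response to a send() call."""
    return {
        "status_code": str(status_code),
        "stream_name": batch_stream_label(events),
    }
```

Using a placeholder value rather than dropping the label keeps the label set identical across all series of the metric, which is generally easier to query.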

...Or we could consider changing the method signature to include $streamName ?

Are we guaranteed that all events in the array passed to send() will have the same stream name?

If we can rely on this guarantee, then keeping track of the stream name should be easy. Passing it as a method argument would be nice to have, but possibly redundant.
I would not break APIs for a "nice to have".

However, I would like to keep the SerDe to a minimum in this code path. If we pass an array of events, we can access meta.stream with $event['meta']['stream']; (xprofile tells me it's cheap).
If we had to parse and convert a string of events (or parse the response body, see above) from JSON, that could be a different story.
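One way to keep serialization out of the hot path is to read stream names only when the events are already structured, and to skip labelling when the caller passed a pre-serialized string. This is an illustrative Python sketch under that assumption; the function name is hypothetical.

```python
def stream_names_without_decoding(events):
    """Return the meta.stream of each event, or None if that would need JSON decoding.

    A str/bytes payload is treated as pre-serialized JSON and left untouched,
    so no decoding happens just to obtain metric labels.
    """
    if isinstance(events, (str, bytes)):
        return None
    if isinstance(events, dict):
        # A single event was passed instead of a list of events.
        events = [events]
    return [event.get("meta", {}).get("stream") for event in events]
```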

I made some progress on this (see also comment above). Here's an idea of how I would like to name and label metrics. I would like to start small and iterate in Beta. Some code paths are difficult to test out locally.

For these metrics, the component would be eventbus. We'd be adding to the default mediawiki prefix.

1. number of calls to send()

Metric name: function_calls
Labels function_name=send. Example:

mediawiki_eventbus_function_calls{function_name="send"} 7

Global counter of how often this code path is invoked.

I don't know yet whether this is actually useful other than for debugging. TBH it's mostly a curiosity, because I don't have a good feeling for how much EventBus is used outside the use cases I'm familiar with.
This also potentially duplicates info recorded in deferred_updates (see below).

2. number of outgoing records

These would be the number of records send() is posting to the intake gateway.

Metric name: outgoing_events_total
Labels: stream_name
Example:

mediawiki_eventbus_outgoing_events_total{stream_name="mediawiki_page_change_v1"}

3. response status codes by stream

Metric name: responses_total
Labels: status_code, stream_name
Example:

mediawiki_eventbus_responses_total{status_code="201", stream_name="mediawiki_page_change_v1"} 3
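The naming scheme in the examples (the `mediawiki` prefix, then the component, then the metric name, followed by labels) can be illustrated with a small formatter. This Python sketch only mirrors the example lines in this comment; it is not how the MW Stats library builds or emits metrics, and both function names are hypothetical.

```python
def full_metric_name(component, name, prefix="mediawiki"):
    """Join prefix, component and metric name with underscores."""
    return "_".join([prefix, component.lower(), name])


def sample_line(component, name, labels, value):
    """Render one sample in Prometheus text exposition style.

    Label values are assumed to contain no quotes, backslashes or newlines,
    so no escaping is applied here.
    """
    rendered = ",".join(f'{key}="{val}"' for key, val in labels.items())
    return f"{full_metric_name(component, name)}{{{rendered}}} {value}"
```

For example, `sample_line("EventBus", "function_calls", {"function_name": "send"}, 7)` yields `mediawiki_eventbus_function_calls{function_name="send"} 7`.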

Other considerations

event service name (eventgate name)

We can parse this from $this->url. Would this label be useful? Does it add much to stream_name? Afaik there is a 1:1 mapping between stream names and gateways.

maybe $schema if it isn't hard to get?

We can extract this by parsing the event payload. I need to validate whether all events passed to a send() call (the events array argument) will have the same $schema.

Any other easy/useful/relevant EventBus metrics are emitted.

We get a bunch of defaults in the deferred_updates component:

mediawiki_deferred_updates_total{http_method="post",type="MediaWiki_Deferred_MWCallableUpdate_MediaWiki_Extension_EventBus_EventBusHooks_sendResourceChangedEvent"} 1
mediawiki_deferred_updates_total{http_method="post",type="MediaWiki_Deferred_MWCallableUpdate_MediaWiki_Extension_EventBus_EventBusHooks_sendRevisionCreateEvent"} 3
mediawiki_deferred_updates_total{http_method="post",type="MediaWiki_Deferred_MWCallableUpdate_MediaWiki_Extension_EventBus_HookHandlers_MediaWiki_PageChangeHooks_sendEvents"} 3

event service name (eventgate name)

We can parse this from $this->url

It won't hurt to add the event service name itself as an EventBus instance property. The url itself might be nice to have too. I've found that it's okay (and sometimes good) if labels are redundant, as it makes it easier to do templating in Grafana. E.g. sometimes we might want to filter by the event_service_name, sometimes we might want to display the endpoint URLs in the graph.

Afaik there is a 1:1 mapping between stream names and gateways.

Yes, but it can change, as it is in stream config. It usually does not, but sometimes it does. E.g. we may want to produce mediawiki.page_content_change.v1 to a multi-DC Kafka cluster one day.


We should brain bounce on some possible EventBus changes too. For T346046 I'm considering getting rid of the EventBus $url param, and instead just injecting $streamConfigs directly. Then EventBus itself can figure out where to POST based on $streamConfigs, rather than doing so based on the EventBus instance created by EventBusFactory. I'm not sure about this, and perhaps it will be too much, but I keep running into awkwardness because EventBus doesn't have an instance of StreamConfigs to use.
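To make the idea concrete, here is a rough Python sketch of resolving the POST destination from stream config instead of from a fixed URL. Everything in it is an assumption for illustration: the config shapes, the `destination_event_service` key, the service names and the example URLs are hypothetical and may not match the real stream config or EventBus settings.

```python
# Hypothetical stream config: stream name -> settings.
STREAM_CONFIGS = {
    "mediawiki.page_change.v1": {"destination_event_service": "eventgate-main"},
    "mediawiki.page_content_change.v1": {"destination_event_service": "eventgate-other"},
}

# Hypothetical event service config: service name -> settings.
EVENT_SERVICES = {
    "eventgate-main": {"url": "http://eventgate-main.example/v1/events"},
    "eventgate-other": {"url": "http://eventgate-other.example/v1/events"},
}


def resolve_destination(stream_name, stream_configs, event_services):
    """Return (event_service_name, url) for a stream, raising KeyError if unknown."""
    service_name = stream_configs[stream_name]["destination_event_service"]
    return service_name, event_services[service_name]["url"]
```

With a lookup like this, changing a stream's destination in config would change both where events are posted and the `event_service_name` label, without touching the caller.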

event service name (eventgate name)

We can parse this from $this->url

It won't hurt to add the event service name itself as an EventBus instance property.

Sure. We def don't want to parse the same thing for every message.
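A sketch of what "resolve once, reuse for every message" could look like, in illustrative Python rather than the extension's PHP. The class and attribute names are hypothetical; the point is only that the service name is stored at construction time and merged into each label set without any per-message URL parsing.

```python
class EventBusSketch:
    """Hypothetical stand-in for an EventBus instance that knows its service name."""

    def __init__(self, event_service_name, url):
        self.event_service_name = event_service_name
        self.url = url
        # Computed once; reused for every metric emitted by this instance.
        self.base_labels = {"event_service_name": event_service_name}

    def labels_for(self, stream_name):
        """Return the label set for one stream, without mutating the base labels."""
        return {**self.base_labels, "stream_name": stream_name}
```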

The url itself might be nice to have too.

This we have already as an instance property, but I don't think we need to label with the full endpoint.
If we ever version bump the API, we could capture that at a later stage.

I've found that it's okay (and sometimes good) if labels are redundant, as it makes it easier to do templating in Grafana. E.g. sometimes we might want to filter by the event_service_name, sometimes we might want to display the endpoint URLs in the graph.

Sounds good! Redundant labels like these won't increase cardinality, since each value maps 1:1 to a label we already have.

Afaik there is a 1:1 mapping between stream names and gateways.

Yes, but it can change, as it is in stream config. It usually does not, but sometimes it does. E.g. we may want to produce mediawiki.page_content_change.v1 to a multi-DC Kafka cluster one day.

We could always do a reverse lookup in ESC (EventStreamConfig), but point taken.

We should brain bounce on some possible EventBus changes too. For T346046 I'm considering getting rid of the EventBus $url param, and instead just injecting $streamConfigs directly. Then EventBus itself can figure out where to POST based on $streamConfigs, rather than doing so based on the EventBus instance created by EventBusFactory.

I got confused by this. But I am not sure if EB is awkward, or ESC is :)

Change #1049831 had a related patch set uploaded (by Gmodena; author: Gmodena):

[mediawiki/extensions/EventBus@master] eventbus: add instrumentation to send() method.

https://gerrit.wikimedia.org/r/1049831

Change #1051709 had a related patch set uploaded (by Gmodena; author: Gmodena):

[operations/mediawiki-config@master] beta: eventbus: enable instrumentation

https://gerrit.wikimedia.org/r/1051709

Change #1049831 merged by jenkins-bot:

[mediawiki/extensions/EventBus@master] eventbus: add instrumentation to send() method.

https://gerrit.wikimedia.org/r/1049831

Change #1051709 merged by jenkins-bot:

[operations/mediawiki-config@master] beta: eventbus: enable instrumentation

https://gerrit.wikimedia.org/r/1051709

Change #1053534 had a related patch set uploaded (by Gmodena; author: Gmodena):

[mediawiki/extensions/EventBus@master] EventBus: label meterics with event type name.

https://gerrit.wikimedia.org/r/1053534

Instrumentation has been enabled in beta. You can test it by modifying a page on https://simple.wikipedia.beta.wmflabs.org.

Metrics are not scraped by Prometheus (I think), but can be accessed from the statsd exporter endpoint on an appserver's localhost.
For example:

$ ssh deployment-mediawiki12.deployment-prep.eqiad1.wikimedia.cloud
$ curl localhost:9112/metrics | grep -i eventbus

This will return a bunch of EventBus metrics.
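For checks that go beyond grep, the exporter output can be filtered with a few lines of Python. This is a minimal sketch that assumes the text was already fetched (for instance with the curl command above) and that sample lines have the form `name{labels} value` without a trailing timestamp; the function name is hypothetical.

```python
def eventbus_samples(metrics_text):
    """Map each EventBus series (name plus labels) to its value as a float."""
    samples = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and # HELP / # TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        if "eventbus" not in line.lower():
            continue
        # The value is the last space-separated token on the line.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples
```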

Change #1053534 merged by jenkins-bot:

[mediawiki/extensions/EventBus@master] EventBus: label metrics with event type name.

https://gerrit.wikimedia.org/r/1053534