
MEP: canary events so we know events are flowing through pipeline
Closed, Duplicate · Public

Description

These alarms would help with two use cases that are almost opposite:

  • Topics with a very low flow of events, maybe not even one per hour, which would trigger unnecessary alarms
  • Topics that see a constant flow of events, where a significant interval with no events indicates an outage

Implementation idea:

Kafka jumbo-eqiad itself has all topics we'd want to ingest and monitor. We can implement a function that gets the list of all Kafka topics, maps them to stream names, and then queries each eventgate-wikimedia instance to check whether it is allowed to produce that stream. If it is, consume the latest message from Kafka for that stream. If the timestamp is not too old (newer than 90 days), then that stream should be both ingested and monitored.

If the stream should be monitored, get the examples from the stream's event schema and POST them as a canary event to that eventgate-wikimedia instance.
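To make this concrete, here's a rough Python sketch of that flow. Everything here is illustrative: the helper functions (topic_to_stream, eventgate_allows_stream, latest_kafka_timestamp) are hypothetical stubs that a real implementation would back with a Kafka client and EventGate's stream config, not existing APIs.

from datetime import datetime, timedelta, timezone

import requests

MAX_AGE = timedelta(days=90)

def topic_to_stream(topic):
    # Assumed mapping: topics are stream names with a datacenter prefix,
    # e.g. 'eqiad.mediawiki.api-request' -> 'mediawiki.api-request'.
    return topic.split('.', 1)[1]

def eventgate_allows_stream(eventgate_url, stream):
    # Hypothetical: would ask the eventgate-wikimedia instance whether it
    # is allowed to produce this stream.
    raise NotImplementedError

def latest_kafka_timestamp(topic):
    # Hypothetical: would consume the latest message for this topic from
    # jumbo-eqiad and return its timestamp (None if the topic is empty).
    raise NotImplementedError

def should_ingest_and_monitor(topic, eventgate_url):
    # Returns the stream name if this topic should be ingested and monitored.
    stream = topic_to_stream(topic)
    if not eventgate_allows_stream(eventgate_url, stream):
        return None
    ts = latest_kafka_timestamp(topic)
    if ts is None or datetime.now(timezone.utc) - ts > MAX_AGE:
        return None  # no recent data: neither ingest nor monitor
    return stream

def post_canary(eventgate_url, stream, schema_example):
    # Build a canary event from one of the schema's examples and POST it.
    event = dict(schema_example)
    event['meta'] = dict(event.get('meta', {}),
                         stream=stream,
                         dt=datetime.now(timezone.utc).isoformat())
    requests.post(eventgate_url + '/v1/events', json=event).raise_for_status()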

Event Timeline

Milimetric renamed this task from MEP: canary alarms so we know events are flowing through pipeline to MEP: canary events so we know events are flowing through pipeline. May 7 2020, 4:12 PM
Milimetric triaged this task as Medium priority.
Milimetric moved this task from Incoming to Event Platform on the Analytics board.

We could, but how would we know whether that hour is absent because of a lack of data or due to loss?

If we emit canary events into all topics, then we know that all hours should have at least one event, and we can definitely do so.

Agreed, agreed. My note to self was rather about whether we needed *a different* method for alarming entirely, but *I think* that a strategy like the one we used for non-refined hours will be sufficient.

Thoughts:

Could we re-use some of the EventGate kubernetes readinessProbe logic for this?

EventGate in k8s is configured with a readinessProbe, which is just a command that k8s uses to determine when a pod is ready to handle traffic and should be pooled. This is currently configured to use the eventgate-wikimedia post-events script. Given a schema URL, this script extracts an event from the JSONSchema examples, possibly sets meta.dt to the current time, and then POSTs the event to the local EventGate service.

Since each eventgate instance (mostly) knows the list of streams it is allowed to produce, we could make this post-events script iterate through those streams and guess the schema URI given the schema_title for the stream. So, in the stream config we'd have e.g.

mediawiki.api-request:
  schema_title: mediawiki/api/request

A good guess at schema URI would be /mediawiki/api/request/latest. Our schema repository CI ensures that this will be the correct schema URI for that schema title.

post-events could then extract one of the JSONSchema examples and modify both meta.dt and meta.stream, and then post that event to the local eventgate instance.
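A hedged sketch of that post-events extension, assuming a stream_config dict shaped like the snippet above; the schema repository base URL and local EventGate port here are illustrative assumptions, not confirmed configuration.

from datetime import datetime, timezone

import requests

# Both URLs are assumptions for illustration.
SCHEMA_BASE = 'https://schema.wikimedia.org/repositories/primary/jsonschema'
LOCAL_EVENTGATE = 'http://localhost:8192/v1/events'

def post_canaries(stream_config):
    for stream, conf in stream_config.items():
        # Guess the schema URI from schema_title; per the text above, schema
        # repository CI ensures that /latest resolves to the right schema.
        schema_uri = '/{}/latest'.format(conf['schema_title'])
        schema = requests.get(SCHEMA_BASE + schema_uri).json()
        event = dict(schema['examples'][0])  # first JSONSchema example
        event['meta'] = dict(event.get('meta', {}),
                             stream=stream,
                             dt=datetime.now(timezone.utc).isoformat())
        requests.post(LOCAL_EVENTGATE, json=event).raise_for_status()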

This should work for most streams. However, in stream config we allow for regex patterns of stream names, e.g. so that all mediawiki\.job\..* streams share the same config, like schema_title. When regexes are used in stream config, there is no way to know the exact stream names that should have canary events posted to them.

This is very similar to the problems described in T251609: Automate ingestion and refinement into Hive of event data from Kafka using stream configs and canary/heartbeat events, namely that we have no centralized way of knowing the complete list of streams that are active.

Alternatively: perhaps instead of relying on stream configs to figure out what canary events to generate, we can manage a list of streams to do this for. We could build this list semi-dynamically by combining streams defined in mediawiki-config + a static list. Or perhaps we can add an HTTP endpoint to each eventgate instance to expose the streams they know about, and use those for the list too. The stream list would also have to specify the eventgate endpoint to which the canary event should be posted. This list would be something like:

# perhaps we can use this list for ingestion too?
streams_to_ingest:
  mediawiki.api-request:
    schema_uri: /mediawiki/api/request/latest
    event_service: https://eventgate-analytics.discovery.wmnet:4592/v1/events
    monitoring_enabled: true # if true, a post-events script could build and post a canary event

Concatenate this with a dynamic list built from the non-regex streams in mediawiki-config, and we'd likely capture all of the EventLogging use cases too.
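A minimal sketch of that merge, assuming wgEventStreams is available as a dict and using a naive heuristic to skip regex entries (both are assumptions, not existing code):

import re

def is_regex_stream(name):
    # Naive assumption: regex stream entries contain regex metacharacters.
    return bool(re.search(r'[\\^$*+?()\[\]|]', name))

def build_stream_list(static_streams, wg_event_streams):
    # Static list wins; non-regex mediawiki-config streams fill in the rest.
    streams = dict(static_streams)
    for name, conf in wg_event_streams.items():
        if is_regex_stream(name):
            continue  # can't enumerate canary targets from a pattern
        streams.setdefault(name, conf)
    return streams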

...Sigh, except for... what about PI's idea for stream CCing in EventLogging? I guess this is the same problem as the regex stream names, as they'd have to use a regex in stream config to produce the event.

Basically if there is not a specific stream name defined in wgEventStreams or in our static list, we cannot generate a canary event for it.


That seems fair. Also, the means by which canary events get produced can be such that "at least one" event gets produced; producing several canary events in an hour should not be a problem. Having the k8s "keep-alive" (the readinessProbe) do it seems like a possible solution, but maybe we can also consider a synthetic client that does this very thing, deployed with EventGate, posting events at random intervals smaller than one hour.
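A minimal sketch of such a synthetic client, assuming a post_canary helper like the one sketched earlier and a precomputed list of monitored streams:

import random
import time

def canary_loop(streams_to_monitor):
    # streams_to_monitor: iterable of (stream, eventgate_url, schema_example).
    while True:
        for stream, eventgate_url, schema_example in streams_to_monitor:
            post_canary(eventgate_url, stream, schema_example)  # sketched above
        # Sleep a random interval well under an hour, so every clock hour
        # is guaranteed to see at least one canary event per stream.
        time.sleep(random.uniform(300, 3000))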

In https://phabricator.wikimedia.org/T251609#6152803 we figured out a mostly clean way to implement this. Am on it!

Since the implementation of this and T251609 are so similar, I'm going to merge this task into T251609 and redescribe that one to mention canary events.