We should get an alarm for partitions that have no data, for topics that have data influx at all times (most of the mediawiki.* topics)
Closed, Resolved (Public)

Event Timeline

Thanks.

It'd be really nice to automate this in some way rather than just setting up specific thresholds for specific topics. This is difficult to do, because sometimes topics legitimately have no data. Right now we have some automated alerting on raw (Camus-imported) data, plus some specific topics & thresholds for Kafka. The Refine alerts we have now only work if there is raw data that has not been refined, or if Camus is lagging in importing from Kafka.

We could try to do some historical throughput anomaly detection for each refined dataset. This sounds pretty difficult to set up, but should be pretty thorough.
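As a rough illustration of the simplest form this could take, here is a minimal sketch that compares the current hourly count for a dataset against the median of its past counts. The function name, the `min_ratio` threshold, and the sample counts are all hypothetical; a real detector would also need to account for seasonality.

```python
from statistics import median


def is_throughput_anomalous(history, current, min_ratio=0.5):
    """Return True if `current` is far below the typical historical count.

    history:   event counts for comparable past hours (hypothetical input).
    current:   event count for the hour being checked.
    min_ratio: hypothetical threshold; flag counts below this fraction
               of the historical median.
    """
    if not history:
        # No baseline yet, so nothing can be called anomalous.
        return False
    baseline = median(history)
    if baseline == 0:
        # A dataset that is usually empty should not alert when empty.
        return False
    return current < min_ratio * baseline


# Made-up hourly counts for one refined dataset.
previous_counts = [1200, 1150, 1300, 1250, 1180]
print(is_throughput_anomalous(previous_counts, 1100))  # False
print(is_throughput_anomalous(previous_counts, 200))   # True
```

The zero-baseline check is there because of the problem mentioned above: some topics really do have no data, and those should not alert.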

Or, perhaps a simpler idea would be to emit heartbeat events, perhaps once per hour, and alert if they don't show up. If we do T242454: Add examples to all event schemas, this might not be too hard to do. We could make a little service that periodically joins together stream config and schemas and uses an example from each schema to generate a heartbeat event. Then, we could at the very least alert if a heartbeat event doesn't show up within an hour for every refined event table in Hive. We could also add extra alerting at the other levels too (Kafka and raw JSON data in Hadoop), but this might take more work to maintain.
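To make the "join stream config and schemas" step concrete, here is a minimal sketch. It assumes stream config is a dict keyed by stream name with a `schema_title` setting, that each schema carries a JSON Schema `examples` list, and that events have a `meta` object with `stream`, `dt`, and `domain` fields. All of these names, and the stream names used, are assumptions made for illustration.

```python
import copy
from datetime import datetime, timezone


def make_heartbeat_events(stream_config, schemas, now=None):
    """Build one heartbeat event per stream from its schema's first example."""
    now = now or datetime.now(timezone.utc)
    events = []
    for stream, settings in stream_config.items():
        schema = schemas.get(settings["schema_title"], {})
        examples = schema.get("examples", [])
        if not examples:
            # A stream whose schema has no example cannot get a heartbeat.
            continue
        # Copy so the schema's own example is never modified.
        event = copy.deepcopy(examples[0])
        meta = event.setdefault("meta", {})
        meta["stream"] = stream
        meta["dt"] = now.isoformat()
        # Hypothetical marker so consumers can filter heartbeats out.
        meta["domain"] = "heartbeat"
        events.append(event)
    return events


# Hypothetical stream config and schemas.
stream_config = {
    "example.stream_a": {"schema_title": "example/a"},
    "example.stream_b": {"schema_title": "example/b"},
}
schemas = {
    "example/a": {"examples": [{"meta": {}, "page_id": 1}]},
    "example/b": {},  # no example yet, so no heartbeat
}

events = make_heartbeat_events(stream_config, schemas)
print([e["meta"]["stream"] for e in events])  # ['example.stream_a']
```

Streams without an example are skipped silently here; a real service would probably want to report them, since they would otherwise have no heartbeat coverage.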

I think this would be a good use case for the data quality alarms.
It would be super easy to set up an alarm for a given data set.
Just a simple query and a couple lines of config.

However, I couldn't think of a way to execute these for all data sets at once, with the current pipeline.
But! We could do it with Airflow!

Follow up from our discussion:

  • We will implement canary alarms for all streams in MEP. This requires a schema per stream with a sample event we can use as a canary (1 event per hour is sufficient).
  • We will implement anomaly detection for the edit stream, as it is a very predictable one: a throughput much smaller than what we normally sustain probably indicates a problem.
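
The canary alarm in the first bullet could be checked along these lines. This is only a sketch: it assumes we can obtain, per stream, the timestamp of the most recent canary event (for example from a query against the refined tables), and the function name, stream names, and `max_age` parameter are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def streams_missing_canary(last_canary_seen, expected_streams, now,
                           max_age=timedelta(hours=1)):
    """Return the sorted streams with no canary event within `max_age`."""
    missing = []
    for stream in expected_streams:
        seen = last_canary_seen.get(stream)
        # Alert if a canary was never seen or the latest one is too old.
        if seen is None or now - seen > max_age:
            missing.append(stream)
    return sorted(missing)


# Hypothetical check time and last-seen timestamps.
now = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)
last_canary_seen = {
    "example.stream_a": now - timedelta(minutes=20),
    "example.stream_b": now - timedelta(hours=3),
}
expected = ["example.stream_a", "example.stream_b", "example.stream_c"]

print(streams_missing_canary(last_canary_seen, expected, now))
# ['example.stream_b', 'example.stream_c']
```

With one canary per hour, a `max_age` of exactly one hour leaves no slack for ingestion delay, so a somewhat larger value may be needed in practice.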
Milimetric moved this task from Incoming to Event Platform on the Analytics board.

canary events + monitoring exist.