Automated event stream throughput alerting for important state change streams
Open, Needs Triage, Public

Description

In T329064, the mediawiki.page-undelete stream was empty for over 3 months. No one noticed.

We should have some kind of automated throughput monitoring for important streams.

Suggestion:

  • Add a field to stream config indicating the approximate expected throughput of a stream.
  • Add a setting for enabling throughput alerting.
  • Write a script that can be executed by AlertManager or Icinga to check throughput of all streams with throughput alerting enabled.
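
The suggestion above could be sketched roughly as follows. This is only an illustration, not a proposed implementation: the field names `expected_events_per_hour` and `alerting_enabled`, the `min_ratio` threshold, and both function names are hypothetical and do not exist in stream config today. How the observed rates are obtained (for example, from existing metrics) is deliberately left out; they are passed in as a plain dict.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamThroughputConfig:
    """Hypothetical per-stream settings; not actual stream config fields."""
    stream: str
    expected_events_per_hour: float
    alerting_enabled: bool = False


def find_low_throughput(configs, observed_per_hour, min_ratio=0.1):
    """Return names of alert-enabled streams whose observed rate is below
    min_ratio * expected rate. Streams absent from observed_per_hour are
    treated as having produced zero events."""
    low = []
    for cfg in configs:
        if not cfg.alerting_enabled:
            continue  # alerting is opt-in per stream
        observed = observed_per_hour.get(cfg.stream, 0.0)
        if observed < cfg.expected_events_per_hour * min_ratio:
            low.append(cfg.stream)
    return low


def exit_code(low_streams):
    """Map the result to Nagios/Icinga plugin exit codes: 0 = OK, 2 = CRITICAL."""
    return 2 if low_streams else 0
```

With a config entry for mediawiki.page-undelete that has alerting enabled and a nonzero expected rate, a check like this would flag the stream as soon as it reports no events, instead of the gap going unnoticed for months. A ratio threshold is used rather than an exact match because the expected throughput is only approximate.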

Event Timeline

Change #1168119 had a related patch set uploaded (by Gmodena; author: Gmodena):

[operations/alerts@master] WIP: eventbus: register with team-data-engineering.

https://gerrit.wikimedia.org/r/1168119

Change #1168119 merged by jenkins-bot:

[operations/alerts@master] eventbus: register with team-data-engineering.

https://gerrit.wikimedia.org/r/1168119