Add alerting to eventbus and eventgate for drastic changes in event rate production.
Closed, Resolved · Public · 2 Estimated Story Points

Description

A MediaWiki configuration change caused analytics event production to stop.

We did not notice the issue until it was reported by @Urbanecm_WMF. Although the problem was not in EventBus or EventGate, these systems should have alerted us when the event production rate dropped to zero.

EventGate currently has alerts for latency and error rate spikes. However, EventBus is not yet registered with AlertManager.

Acceptance criteria (AC):

  • EventGate should alert if produce rate drops to 0.
  • EventBus should be registered with AlertManager.

Event Timeline

Change #1167620 had a related patch set uploaded (by Gmodena; author: Gmodena):

[operations/alerts@master] WIP: eventgate: alert on traffic deviation.

https://gerrit.wikimedia.org/r/1167620

Change #1168119 had a related patch set uploaded (by Gmodena; author: Gmodena):

[operations/alerts@master] WIP: eventbus: register with team-data-engineering.

https://gerrit.wikimedia.org/r/1168119

Here is a summary of a discussion we had on DPE's internal slack.

EventGate

EventGate is registered with AlertManager. Alerts will fire on perf degradation (latency and error rate increases), but not on traffic changes.
Recently, traffic stopped being routed through eventgate-analytics, and we did not notice until a user pinged us (T398187: eventgate-analytics has stopped producing events since 2025-06-25).

I would like to trigger alerts when:

  • A significant produce rate deviation is detected. Based on data analysis, I would define "significant" as the current produce rate deviating by more than 5% from the 1h average (the baseline).
  • There is no produced traffic for more than 5 minutes. The 5-minute interval provides slack during maintenance and planned service restarts (although restarts should happen in a rolling fashion!). I can be easily convinced that we should have a tighter deadline (1 minute?).
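
The two conditions above can be sketched in plain Python to make the thresholds concrete. This is only an illustration of the arithmetic, not the rule that was submitted in the patch: the function names, the per-minute sample lists, and the choice to treat a zero baseline as a special case are all assumptions made for this example.

```python
from statistics import mean


def rate_deviation(current_rate: float, baseline_rate: float) -> float:
    """Relative deviation of the current produce rate from the baseline."""
    if baseline_rate == 0:
        # Assumption: with no baseline traffic, any traffic counts as an
        # unbounded deviation and no traffic counts as no deviation.
        return 0.0 if current_rate == 0 else float("inf")
    return abs(current_rate - baseline_rate) / baseline_rate


def is_anomalous(current_rate: float, last_hour_rates: list[float],
                 threshold: float = 0.05) -> bool:
    """True if the current rate deviates by more than `threshold` (5%)
    from the average of the last hour of samples."""
    baseline = mean(last_hour_rates)
    return rate_deviation(current_rate, baseline) > threshold


def has_stopped(per_minute_rates: list[float], window_minutes: int = 5) -> bool:
    """True if the last `window_minutes` samples are all zero."""
    recent = per_minute_rates[-window_minutes:]
    return len(recent) == window_minutes and all(r == 0 for r in recent)
```

With a flat baseline of 100 events/s, a current rate of 106 would count as anomalous while 104 would not. Tightening the stop window to 1 minute, as floated above, would correspond to `window_minutes=1` in this sketch.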

Note that these alerts are defined at the service level, not per stream. For maintenance, when traffic spikes are expected, we inhibit alerts as described in https://wikitech.wikimedia.org/wiki/Alertmanager#Silences_&_acknowledgements

While these alerts fire globally, important streams (e.g. mediawiki.page_change.v1) can be instrumented separately (requested in T329070: Automated event stream throughput alerting for important state change streams).
When possible, for these use cases, my suggestion would be to instrument and alert upstream of EventGate as well (e.g. in EventBus; see below).

EventBus

EventBus is not registered with AlertManager yet. The linked CR adds alerts for a subset of streams that are supported by the DPE team.

Here is my proposal:

  1. Alert data-engineering only for events of TYPE_EVENT (e.g. mediawiki state changes etc.). I don't think the team could do much about jobqueue. Moreover, TYPE_JOB (and similar) events have a bursty behaviour that requires dedicated alerting rules.
  2. A global alert on drops in outgoing (EventBus -> EventGate) requests. My proposed (initial) definition of a drop is: the current 5m rate is less than 50% of the 1h average. An alert will fire if the condition holds for 5 minutes.
  3. Dedicated alerts on outgoing traffic drops for "high prio" streams (e.g. mediawiki.page_change.v1). Same rules as above, but with a shorter sampling period. An alert will fire if the condition holds for 1 minute.
  4. A global alert when the accepted (2xx by eventgate) / outgoing (requests) ratio goes below 99%.
  5. A global alert on an increase in server-side (eventgate) 5xx errors.
  6. A global alert on an increase in client-side (eventbus) 4xx errors.

For cases 4-6 we don't have per stream visibility (it's an implementation limitation I'd be happy to discuss further). But we can correlate with EventGate's time series. The test suite contains some examples of what these alerts would look like, as well as their triggering conditions.
The thresholds are based on data analysis, but are just meant as a starting point. Ideally, these alerts should be pinned to SLIs.
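
As a rough illustration of proposals 2 and 4, the drop condition, the "holds for N minutes" requirement, and the acceptance ratio check could be expressed as follows. This is a sketch of the logic only, not the alert queries from the CR: the function names and the once-per-minute evaluation model are assumptions, as is the decision not to flag the ratio when there are no outgoing requests.

```python
def drop_condition(rate_5m: float, rate_1h_avg: float,
                   min_fraction: float = 0.5) -> bool:
    """True if the current 5m rate is below 50% of the 1h average."""
    return rate_5m < min_fraction * rate_1h_avg


def should_fire(condition_by_minute: list[bool], hold_minutes: int = 5) -> bool:
    """True only if the condition held for each of the last
    `hold_minutes` evaluations (assumed to run once per minute)."""
    recent = condition_by_minute[-hold_minutes:]
    return len(recent) == hold_minutes and all(recent)


def acceptance_ratio_low(outgoing: int, accepted_2xx: int,
                         min_ratio: float = 0.99) -> bool:
    """True if fewer than 99% of outgoing requests were accepted (2xx)."""
    if outgoing == 0:
        # Assumption: with no outgoing requests there is no ratio to
        # evaluate; a traffic stop is a separate condition.
        return False
    return accepted_2xx / outgoing < min_ratio
```

For the "high prio" streams in proposal 3, the same functions would apply with `hold_minutes=1`. A single recovered evaluation inside the hold window resets the alert, which is what gives short dips room to pass without paging anyone.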

@gmodena +1 on both patches. I did not deeply review the alert queries, but I like the intended alerts! Let's try and see how it goes, and we can adjust if needed.

When these get deployed, let's make sure to send a notice to DPE ops week folks to be on the lookout and to notify us.

I don't remember who needs to review to deploy operations/alerts. Maybe @fgiunchedi can help us find the right reviewer? Thank you!

The submitter's teammates usually review alert changes, as they have the most context. We have focused on self-service and extensive CI for alerts.git, so anyone in WMF can +2 / submit. Puppet deploys the alerts on its next run. HTH!

Change #1168119 merged by jenkins-bot:

[operations/alerts@master] eventbus: register with team-data-engineering.

https://gerrit.wikimedia.org/r/1168119

Change #1167620 merged by jenkins-bot:

[operations/alerts@master] eventgate: alert on traffic deviation.

https://gerrit.wikimedia.org/r/1167620

Change #1172280 had a related patch set uploaded (by Gmodena; author: Gmodena):

[operations/alerts@master] data-engineering: eventbus: increase anomaly detection threshold

https://gerrit.wikimedia.org/r/1172280

Change #1172280 merged by jenkins-bot:

[operations/alerts@master] data-engineering: eventbus: increase anomaly detection threshold

https://gerrit.wikimedia.org/r/1172280

Change #1174012 had a related patch set uploaded (by Gmodena; author: Gmodena):

[operations/alerts@master] data-engineering: eventgate: raise basline for anomalies

https://gerrit.wikimedia.org/r/1174012

Change #1174012 merged by jenkins-bot:

[operations/alerts@master] data-engineering: eventgate: baseline for anomalies

https://gerrit.wikimedia.org/r/1174012

Trying to tweak the new produce rate anomaly alert. I came up with this attempt:

https://grafana.wikimedia.org/goto/1nEdr-QNg?orgId=1

I am guessing a bit, so if anyone has tips please help!

I started preparing a patch, but realized I have a lot to learn. Ran out of time for this week.

Perhaps we should just disable the alert for now; it is too noisy!

Change #1175520 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/alerts@master] Disable EventgateProduceRateAnomaly for eventgate-main

https://gerrit.wikimedia.org/r/1175520

Change #1175520 merged by jenkins-bot:

[operations/alerts@master] Disable EventgateProduceRateAnomaly for eventgate-main

https://gerrit.wikimedia.org/r/1175520

Disabled the produce anomaly alert for eventgate-main for now to reduce alert spam.

Looked at this for a bit with @JAllemandou today. We think the total eventgate produce rate anomaly detection based on total ratio won't really be that useful, especially for eventgate-main, where things are bursty. Comparing ratios per stream might not work well either, as the ratio for small streams may vary a lot, e.g. 10 per sec vs 3 per sec is a 70% difference.
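
To make the small-stream problem concrete, here is a short worked example. The helper name and the 10,000 events/s figure are made up for illustration; only the 10 vs 3 per second case comes from the comment above.

```python
def relative_change(before: float, after: float) -> float:
    """Relative change from `before` to `after` (negative means a drop)."""
    return (after - before) / before


# The same absolute fluctuation of 7 events/s on two streams:
small_stream = relative_change(10, 3)          # about -0.7, a 70% drop
large_stream = relative_change(10_000, 9_993)  # about -0.0007, a 0.07% drop
```

The absolute change is identical, but only the small stream would cross a ratio-based threshold, which is why a per-stream ratio alert is likely to be noisy for low-volume streams.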

I think perhaps the EventgateProduceRateStop will be our best chance to catch the reported problem in the future.

I'll leave the non eventgate-main EventgateProduceRateAnomaly alerts in place, since they aren't spamming, and maybe they will give us some signal if they ever do alert.