
kafka-logging: ensure cluster wide failure mode alerting coverage
Open, Medium, Public

Description

In https://gerrit.wikimedia.org/r/c/operations/puppet/+/887342 we removed the paging "pgrep kafka" style checks because they produced false-positive pages and did not accurately represent broker/cluster health.

While we have other alerts in place that will fire in the event of a kafka-logging outage, let's investigate improvements that monitor and alert on kafka-logging cluster health directly. Let's also consider escalating a kafka-logging outage as a paging alert, either to the o11y team directly, to the broader SRE team, or perhaps through a stepped escalation with timeouts.
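
To make the stepped-escalation option concrete, here is a rough Python sketch of how such a policy could be expressed. It is only an illustration for discussion: the `EscalationStep` type, the `escalation_target` helper, the target names, and the 15 minute timeout are all hypothetical placeholders, not existing tooling or agreed values.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class EscalationStep:
    target: str         # who gets paged at this step (hypothetical name)
    after_minutes: int  # minutes unacknowledged before this step applies


# Hypothetical policy: page o11y immediately, then the broader SRE team
# if the alert is still unacknowledged after a placeholder 15 minutes.
POLICY = (
    EscalationStep(target="o11y", after_minutes=0),
    EscalationStep(target="sre", after_minutes=15),
)


def escalation_target(
    unacked_minutes: int,
    policy: Sequence[EscalationStep] = POLICY,
) -> Optional[str]:
    """Return the most escalated target whose timeout has elapsed, or None."""
    current = None
    # Walk the steps in order of increasing timeout; the last one reached wins.
    for step in sorted(policy, key=lambda s: s.after_minutes):
        if unacked_minutes >= step.after_minutes:
            current = step.target
    return current
```

With this placeholder policy, an alert unacknowledged for 5 minutes would still page only o11y, while one unacknowledged for 20 minutes would page the broader SRE team. In practice this would more likely be expressed in the alerting or paging system's own routing configuration than in custom code.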