Common information
- dashboard: https://grafana.wikimedia.org/d/taff979/prometheus-tsdb-cardinality-monitoring?orgId=1&from=now-14d&to=now&timezone=utc&var-prometheus=k8s-dse&var-site=eqiad
- description: The average samples-per-series ratio on k8s-dse / eqiad has been below 85% of the expected scrape rate for more than 72 hours. A large number of active series are receiving few or no samples, which indicates accumulated zombie series. This is likely the residual effect of a cardinality explosion in which label churn continuously created new series and old ones were never cleaned up. Current ratio: 11.67m (0.01167) samples/s per series.
- runbook: https://wikitech.wikimedia.org/wiki/Prometheus#Runbooks
- summary: Zombie series detected on k8s-dse (eqiad)
- alertname: PrometheusZombieSeriesDetected
- prometheus: k8s-dse
- recorder: thanos-rule@main
- severity: task
- site: eqiad
- source: thanos
- team: o11y
Firing alerts
- dashboard: https://grafana.wikimedia.org/d/taff979/prometheus-tsdb-cardinality-monitoring?orgId=1&from=now-14d&to=now&timezone=utc&var-prometheus=k8s-dse&var-site=eqiad
- description: The average samples-per-series ratio on k8s-dse / eqiad has been below 85% of the expected scrape rate for more than 72 hours. A large number of active series are receiving few or no samples, which indicates accumulated zombie series. This is likely the residual effect of a cardinality explosion in which label churn continuously created new series and old ones were never cleaned up. Current ratio: 11.67m (0.01167) samples/s per series.
- runbook: https://wikitech.wikimedia.org/wiki/Prometheus#Runbooks
- summary: Zombie series detected on k8s-dse (eqiad)
- alertname: PrometheusZombieSeriesDetected
- prometheus: k8s-dse
- recorder: thanos-rule@main
- severity: task
- site: eqiad
- source: thanos
- team: o11y
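To make the threshold in the description concrete, the following minimal sketch compares the reported per-series sample rate against an expected rate. The 60-second scrape interval is an assumption for illustration only (the alert does not state the configured interval), and the function names are hypothetical. The sketch also ignores the 72-hour duration condition.

```python
def samples_per_series_ratio(observed_rate: float, scrape_interval_s: float) -> float:
    """Return the observed per-series sample rate as a fraction of the expected rate.

    observed_rate: samples per second per active series (as reported by the alert).
    scrape_interval_s: assumed scrape interval in seconds (hypothetical value).
    """
    expected_rate = 1.0 / scrape_interval_s  # one sample per series per scrape
    return observed_rate / expected_rate


def is_below_threshold(observed_rate: float, scrape_interval_s: float,
                       threshold: float = 0.85) -> bool:
    """True when the ratio falls under the alert's 85% threshold."""
    return samples_per_series_ratio(observed_rate, scrape_interval_s) < threshold


if __name__ == "__main__":
    observed = 11.67e-3        # "11.67m" in the alert, read as 11.67 milli
    assumed_interval = 60.0    # ASSUMPTION: not stated in the alert
    ratio = samples_per_series_ratio(observed, assumed_interval)
    print(f"ratio to expected rate: {ratio:.2f}")
    print(f"below 85% threshold: {is_below_threshold(observed, assumed_interval)}")
```

Under the assumed 60-second interval, the reported rate works out to roughly 70% of the expected rate, which is consistent with the alert firing. A different configured interval would change that figure.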