This task tracks moving Prometheus-based o11y alerts from Icinga to Alertmanager. The work breaks down into these broad steps:
- Audit all o11y Prometheus-based Icinga alerts.
- For completeness, we should also audit native Icinga alerts specific to o11y; some of those could be eliminated or may no longer be relevant.
- Subdivide by general (sub)system and port each one individually.
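
The audit step could be started with a small script along these lines. This is only a rough sketch: it assumes it is run against a checkout of the Puppet repository and that the checks are declared literally as `monitoring::check_prometheus { '<title>'` in `.pp` files. The function names `find_prometheus_checks` and `audit_tree` are made up for illustration, and checks declared through wrappers or with array titles would not be caught.

```python
import re
from pathlib import Path

# Matches declarations such as:
#   monitoring::check_prometheus { 'statsd_udp_inbound_errors':
# and captures the quoted resource title (single or double quotes).
CHECK_RE = re.compile(r"""monitoring::check_prometheus\s*\{\s*(['"])(.+?)\1""")


def find_prometheus_checks(text):
    """Return the titles of all monitoring::check_prometheus resources in text."""
    return [match.group(2) for match in CHECK_RE.finditer(text)]


def audit_tree(root):
    """Map each .pp file under root to the check titles it declares.

    Hypothetical helper: files without any matching declaration are skipped.
    """
    results = {}
    for path in sorted(Path(root).rglob("*.pp")):
        text = path.read_text(encoding="utf-8", errors="replace")
        titles = find_prometheus_checks(text)
        if titles:
            results[str(path)] = titles
    return results
```

Calling `audit_tree("modules")` from the repository root would then give a per-file list of check titles that can be grouped by (sub)system, as in the lists below.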
Metrics
- modules/profile/manifests/prometheus/ops.pp: monitoring::check_prometheus { 'prometheus_config_reload_fail':
- modules/prometheus/manifests/server.pp "prometheus restarted" alerts
- modules/profile/manifests/statsd.pp: monitoring::check_prometheus { 'statsd_udp_inbound_errors':
- modules/profile/manifests/thanos/alerts.pp
- modules/profile/manifests/prometheus/alerts.pp monitoring::check_prometheus { "node_textfile_stale_${site}" }
- unavailability alerts per-exporter
- job unavailability
Alerting
- modules/profile/manifests/alertmanager/alerts.pp
- modules/profile/manifests/prometheus/alerts.pp monitoring::check_prometheus { "icinga_check_latency_${site}" }
Logging
- modules/monitoring/manifests/alerts/rsyslog.pp
- modules/profile/manifests/logstash/alerts.pp
- modules/profile/manifests/logstash/collector.pp monitoring::check_prometheus { 'logstash-udp-loss-ratio':
- modules/profile/manifests/logstash/collector7.pp monitoring::check_prometheus { 'logstash-udp-loss-ratio':