Move > 60% of observability Prometheus-based checks to Alertmanager
Closed, Resolved, Public

Description

This task tracks moving Prometheus-based o11y alerts from Icinga to Alertmanager. Broad steps of the work involved:

  • Audit all o11y Prometheus-based Icinga alerts.
  • For completeness, we should also audit native Icinga alerts specific to o11y; some of those could be eliminated or may no longer be relevant.
  • Subdivide by general (sub)system and port each individually
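
The audit step above can be sketched as a small script that lists `monitoring::check_prometheus` declarations per manifest. This is only an illustration: the function names (`find_prometheus_checks`, `audit`) and the regular expression are assumptions made for this example, not tooling used in the task, and the pattern only covers quoted resource titles like the ones quoted in this description.

```python
import re
from typing import Dict, List

# Matches declarations such as (hypothetical pattern, quoted titles only):
#   monitoring::check_prometheus { 'statsd_udp_inbound_errors':
#   monitoring::check_prometheus { "node_textfile_stale_${site}" }
CHECK_RE = re.compile(
    r"""monitoring::check_prometheus\s*\{\s*(['"])(?P<title>.+?)\1"""
)


def find_prometheus_checks(manifest_text: str) -> List[str]:
    """Return the titles of all check_prometheus resources in one manifest."""
    return [m.group("title") for m in CHECK_RE.finditer(manifest_text)]


def audit(manifests: Dict[str, str]) -> Dict[str, List[str]]:
    """Map manifest path -> check titles, skipping manifests without checks."""
    result = {}
    for path, text in manifests.items():
        titles = find_prometheus_checks(text)
        if titles:
            result[path] = titles
    return result


if __name__ == "__main__":
    # Inline sample text stands in for file contents read from a checkout.
    sample = {
        "modules/profile/manifests/statsd.pp":
            "monitoring::check_prometheus { 'statsd_udp_inbound_errors':\n}",
        "modules/profile/manifests/other.pp": "class profile::other {}",
    }
    for path, titles in audit(sample).items():
        print(path, titles)
```

In practice the dictionary would be built by reading `*.pp` files from a checkout of operations/puppet; the resulting list can then be grouped by (sub)system, as in the sections below.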

Metrics

  • modules/profile/manifests/prometheus/ops.pp: monitoring::check_prometheus { 'prometheus_config_reload_fail':
  • modules/prometheus/manifests/server.pp "prometheus restarted" alerts
  • modules/profile/manifests/statsd.pp: monitoring::check_prometheus { 'statsd_udp_inbound_errors':
  • modules/profile/manifests/thanos/alerts.pp
  • modules/profile/manifests/prometheus/alerts.pp monitoring::check_prometheus { "node_textfile_stale_${site}" }
  • unavailability alerts per-exporter
  • job unavailability

Alerting

  • modules/profile/manifests/alertmanager/alerts.pp
  • modules/profile/manifests/prometheus/alerts.pp monitoring::check_prometheus { "icinga_check_latency_${site}"}

Logging

  • modules/monitoring/manifests/alerts/rsyslog.pp
  • modules/profile/manifests/logstash/alerts.pp
  • modules/profile/manifests/logstash/collector.pp monitoring::check_prometheus { 'logstash-udp-loss-ratio':
  • modules/profile/manifests/logstash/collector7.pp monitoring::check_prometheus { 'logstash-udp-loss-ratio':

Details

Repo               Branch      Lines +/-
operations/puppet  production  +0 -22
operations/puppet  production  +95 -0
operations/alerts  master      +52 -0
operations/puppet  production  +4 -1
operations/puppet  production  +2 -2
operations/puppet  production  +27 -0
operations/puppet  production  +0 -15
operations/alerts  master      +73 -0
operations/puppet  production  +0 -17
operations/alerts  master      +41 -0
operations/puppet  production  +15 -4
operations/puppet  production  +0 -90
operations/alerts  master      +261 -0
operations/alerts  master      +36 -0
operations/puppet  production  +0 -15
operations/puppet  production  +0 -109
operations/alerts  master      +187 -0
operations/alerts  master      +41 -0
operations/puppet  production  +0 -11
operations/alerts  master      +43 -0
operations/puppet  production  +18 -0
operations/alerts  master      +79 -0
operations/puppet  production  +0 -31
operations/alerts  master      +393 -0
operations/puppet  production  +0 -290
operations/alerts  master      +86 -0
operations/puppet  production  +0 -76

Event Timeline

Change 714372 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] o11y: port thanos-rule alerts from Icinga

https://gerrit.wikimedia.org/r/714372

Change 714373 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] profile: remove thanos-compact alerts, ported to alerts.git

https://gerrit.wikimedia.org/r/714373

Change 714373 merged by Filippo Giunchedi:

[operations/puppet@production] profile: remove thanos-compact alerts, ported to alerts.git

https://gerrit.wikimedia.org/r/714373

Change 714372 merged by Filippo Giunchedi:

[operations/alerts@master] o11y: port thanos-compact alerts from Icinga

https://gerrit.wikimedia.org/r/714372

Change 714541 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] profile: remove thanos alerts, moved to alerts.git

https://gerrit.wikimedia.org/r/714541

Change 714543 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] o11y: add alerts ported from icinga/upstream

https://gerrit.wikimedia.org/r/714543

Change 714541 merged by Filippo Giunchedi:

[operations/puppet@production] profile: remove thanos alerts, moved to alerts.git

https://gerrit.wikimedia.org/r/714541

Change 714543 merged by Filippo Giunchedi:

[operations/alerts@master] o11y: add alerts ported from icinga/upstream

https://gerrit.wikimedia.org/r/714543

Change 715032 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] o11y: add prometheus alerts

https://gerrit.wikimedia.org/r/715032

Change 715033 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: remove alerts moved to AM

https://gerrit.wikimedia.org/r/715033

Change 715033 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: remove alerts moved to AM

https://gerrit.wikimedia.org/r/715033

Change 715032 merged by Filippo Giunchedi:

[operations/alerts@master] o11y: add prometheus alerts

https://gerrit.wikimedia.org/r/715032

Change 719123 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] o11y: add udp receive errors for statsd

https://gerrit.wikimedia.org/r/719123

Change 719124 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] statsd: remove statsd_udp_inbound_errors

https://gerrit.wikimedia.org/r/719124

Change 719126 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: add ThanosSidecarUploadFailure to prometheus/ops

https://gerrit.wikimedia.org/r/719126

Change 719126 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: add ThanosSidecarUploadFailure to prometheus/ops

https://gerrit.wikimedia.org/r/719126

Change 719123 merged by Filippo Giunchedi:

[operations/alerts@master] o11y: add udp receive errors for statsd

https://gerrit.wikimedia.org/r/719123

Change 719124 merged by Filippo Giunchedi:

[operations/puppet@production] statsd: remove statsd_udp_inbound_errors

https://gerrit.wikimedia.org/r/719124

Change 720063 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/alerts@master] o11y: add rsyslog alerts

https://gerrit.wikimedia.org/r/720063

Change 720079 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/alerts@master] o11y: add logstash alerts

https://gerrit.wikimedia.org/r/720079

Change 720093 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] logging: clean up legacy logstash alerts

https://gerrit.wikimedia.org/r/720093

Change 720063 merged by Cwhite:

[operations/alerts@master] o11y: add rsyslog alerts

https://gerrit.wikimedia.org/r/720063

Change 720079 merged by Cwhite:

[operations/alerts@master] o11y: add logstash alerts

https://gerrit.wikimedia.org/r/720079

Change 720093 merged by Cwhite:

[operations/puppet@production] logging: clean up legacy logstash alerts

https://gerrit.wikimedia.org/r/720093

Change 724761 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] o11y: port alertmanager alerts

https://gerrit.wikimedia.org/r/724761

Change 724771 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] icinga: remove alertmanager::alerts

https://gerrit.wikimedia.org/r/724771

Change 725884 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] o11y: port Icinga checks

https://gerrit.wikimedia.org/r/725884

Change 725885 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] alerts: remove icinga overload alert, moved to AM

https://gerrit.wikimedia.org/r/725885

Change 725885 merged by Filippo Giunchedi:

[operations/puppet@production] alerts: remove icinga overload alert, moved to AM

https://gerrit.wikimedia.org/r/725885

Change 725884 merged by Filippo Giunchedi:

[operations/alerts@master] o11y: port Icinga checks

https://gerrit.wikimedia.org/r/725884

Change 724761 merged by Filippo Giunchedi:

[operations/alerts@master] o11y: port alertmanager alerts

https://gerrit.wikimedia.org/r/724761

Change 724771 merged by Filippo Giunchedi:

[operations/puppet@production] icinga: remove alertmanager::alerts

https://gerrit.wikimedia.org/r/724771

Change 730206 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] thanos: deploy global alerts only on hosts running thanos-rule

https://gerrit.wikimedia.org/r/730206

Change 730206 merged by Filippo Giunchedi:

[operations/puppet@production] thanos: deploy global alerts only on hosts running thanos-rule

https://gerrit.wikimedia.org/r/730206

lmata triaged this task as High priority. Oct 29 2021, 5:11 PM

Change 743394 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] team-sre: port node-exporter textfile stale alert

https://gerrit.wikimedia.org/r/743394

Change 743395 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: remove textfile stale alert

https://gerrit.wikimedia.org/r/743395

Change 744033 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] team-sre: port job unavailable alert

https://gerrit.wikimedia.org/r/744033

Change 744035 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: remove job unavailable alert

https://gerrit.wikimedia.org/r/744035

Change 743394 merged by Filippo Giunchedi:

[operations/alerts@master] team-sre: port node-exporter textfile stale alert

https://gerrit.wikimedia.org/r/743394

Change 743395 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: remove textfile stale alert

https://gerrit.wikimedia.org/r/743395

Change 744033 merged by Filippo Giunchedi:

[operations/alerts@master] team-sre: port job unavailable alert

https://gerrit.wikimedia.org/r/744033

Change 744035 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: remove job unavailable alert

https://gerrit.wikimedia.org/r/744035

Change 778259 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] sre: add alerts for exporter-specific unavailability

https://gerrit.wikimedia.org/r/778259

Change 778261 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] thanos: add recording rules for exporter-specific availability

https://gerrit.wikimedia.org/r/778261

Change 778261 merged by Filippo Giunchedi:

[operations/puppet@production] thanos: add recording rules for exporter-specific availability

https://gerrit.wikimedia.org/r/778261

Change 778354 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] thanos: fix yaml error

https://gerrit.wikimedia.org/r/778354

Change 778354 merged by Cwhite:

[operations/puppet@production] thanos: fix yaml error

https://gerrit.wikimedia.org/r/778354

Change 784629 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] thanos: reload rule via http endpoint

https://gerrit.wikimedia.org/r/784629

Change 784629 merged by Filippo Giunchedi:

[operations/puppet@production] thanos: reload rule via http endpoint

https://gerrit.wikimedia.org/r/784629

Change 778259 merged by Filippo Giunchedi:

[operations/alerts@master] sre: add alerts for exporter-specific unavailability

https://gerrit.wikimedia.org/r/778259

Change 784635 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] thanos: aggregate exporter 'up' metrics

https://gerrit.wikimedia.org/r/784635

Change 784636 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: remove per-exporter up checks

https://gerrit.wikimedia.org/r/784636

Change 784635 merged by Filippo Giunchedi:

[operations/puppet@production] thanos: aggregate exporter 'up' metrics

https://gerrit.wikimedia.org/r/784635

Change 784636 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: remove per-exporter up checks

https://gerrit.wikimedia.org/r/784636

All o11y Prometheus-based alerts mentioned in the description have been migrated; resolving!