
Shift frack alerting to use alertmanager instead of icinga
Open, Needs Triage, Public

Description

As icinga is phased out, the frack hosts will need to send their alerts to alertmanager instead. We are looking to have the frack prometheus instance send those alerts.
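
Once the firewall rules allow it, one way to confirm that frmon can reach the alerts hosts is to push a short-lived test alert by hand. The sketch below builds a payload for the Alertmanager v2 API (POST /api/v2/alerts). The alert name, labels, and base URL are placeholders chosen for illustration and do not come from this task. Any authentication the alerts hosts require is not covered.

```python
import json
from datetime import datetime, timedelta, timezone
from urllib import request


def build_test_alert(alertname, labels=None, summary="", duration_minutes=5, now=None):
    """Build a one-element alert list in the shape the Alertmanager v2 API accepts."""
    now = now or datetime.now(timezone.utc)
    alert_labels = {"alertname": alertname}
    alert_labels.update(labels or {})
    return [
        {
            "labels": alert_labels,
            "annotations": {"summary": summary},
            # RFC 3339 timestamps; the alert expires by itself at endsAt.
            "startsAt": now.isoformat(),
            "endsAt": (now + timedelta(minutes=duration_minutes)).isoformat(),
        }
    ]


def post_alerts(base_url, alerts, timeout=10):
    """POST the alerts to <base_url>/api/v2/alerts and return the HTTP status."""
    req = request.Request(
        base_url.rstrip("/") + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=timeout) as resp:
        return resp.status


# Hypothetical usage (placeholder URL and labels, not the real alerts hosts):
# alerts = build_test_alert("FrackConnectivityTest", {"severity": "info"}, "manual test")
# post_alerts("http://alertmanager.example.org:9093", alerts)
```

A connection timeout here would point at the pfw / iptables rules rather than at the prometheus configuration.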

Tasks to accomplish:

  • update pfw / iptables rules for frmon to contact alerts hosts
  • verify which metrics currently in prometheus can be used for alerts
  • set up config in frack prometheus to send alerts to alerts hosts
    • host config
    • user / service account
  • test creating or moving a metric/alert to prometheus
  • check whether the nsca metrics currently reported in /var/spool/prometheus/nagios_nsca.prom on each host are usable for alerting
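
For the last item, a quick inventory of what a host's nagios_nsca.prom file contains may help decide which checks can become alert rules. The following is a minimal sketch that assumes the file uses the standard Prometheus text exposition format. The metric names in the sample are invented for illustration, since the real contents of the file are not shown in this task.

```python
import re
from collections import Counter

# One sample line of the Prometheus text exposition format:
#   metric_name{label="value",...} value [timestamp]
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?P<labels>\{.*\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<ts>-?\d+))?\s*$'
)


def summarize_prom_text(text):
    """Return (samples per metric name, lines that did not parse as samples)."""
    counts = Counter()
    rejected = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # blank lines and HELP/TYPE/comment lines carry no samples
        match = SAMPLE_RE.match(line)
        if match is None:
            rejected.append(line)
            continue
        try:
            float(match.group('value'))  # also accepts NaN, +Inf, -Inf
        except ValueError:
            rejected.append(line)
            continue
        counts[match.group('name')] += 1
    return counts, rejected


# Hypothetical sample; the real file may use different metric names and labels.
SAMPLE = '''\
# TYPE nagios_nsca_status gauge
nagios_nsca_status{service="check_disk"} 0
nagios_nsca_status{service="check raid"} 2
nagios_nsca_last_report 1718000000
this is not a sample
'''

counts, rejected = summarize_prom_text(SAMPLE)
for name, n in sorted(counts.items()):
    print(name, n)
print("unparseable lines:", len(rejected))
```

To run it against a real host, read /var/spool/prometheus/nagios_nsca.prom into a string and pass it to summarize_prom_text in place of SAMPLE.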

Helpful docs / links:
https://wikitech.wikimedia.org/wiki/Alertmanager
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/refs/heads/master
https://prometheus-eqiad.wikimedia.org/ops/config
https://prometheus-eqiad.wikimedia.org/ops/alerts?search
https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org

Event Timeline

In the puppet-private repo:
commit f94d1d4501e54136766722074b6821502213ccc1 staged on T367370_prom_alerts branch for pfw config.
commit 064db8663032904ceeaa6b1a32fc6ab93343dcc4 staged on T367370_prom_alerts branch for iptables config.

Something else I wanted to add: with respect to authoring and deploying alerts, we have essentially centralized all alerts in the operations/alerts.git repository. That repo already contains scaffolding such as CI/test integration, and if you'd like to commit FR alerts there as well, that's no problem. Deployment is straightforward: in production we clone the repo and then selectively deploy alerts based on user-provided directions (#deploy comments at the top of the file). HTH
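
To illustrate the selective deployment idea, here is a rough sketch of how directives in a file's leading comment block could be read. The "# deploy-KEY: value" syntax and the example keys and values are assumptions made for this illustration. The authoritative format is whatever the operations/alerts.git documentation specifies.

```python
import re

# Assumed directive syntax: "# deploy-<key>: <comma-separated values>"
DIRECTIVE_RE = re.compile(r'^#\s*deploy-(?P<key>[a-z_]+):\s*(?P<value>.*)$')


def read_deploy_directives(text):
    """Collect deploy directives from the comment block at the top of a rules file."""
    directives = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if not stripped.startswith('#'):
            break  # only the leading comment block is inspected
        match = DIRECTIVE_RE.match(stripped)
        if match:
            values = [v.strip() for v in match.group('value').split(',') if v.strip()]
            directives.setdefault(match.group('key'), []).extend(values)
    return directives


# Hypothetical rules file header; keys and values are placeholders.
EXAMPLE = '''\
# deploy-tag: frack
# deploy-site: eqiad, codfw
# an ordinary comment
groups:
  # deploy-tag: ignored
'''

print(read_deploy_directives(EXAMPLE))
```

A deploy step could then compare the returned values against the local instance and site to decide whether to install the file.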