Deploy Alertmanager for alerting infrastructure phase 1
Open, Needs Triage, Public

Description

This task tracks the alerting infrastructure roadmap phase 1, namely introducing Prometheus Alertmanager to production in a limited and read-only fashion.

Deliverables:

  • Alertmanager deployed in HA/clustered mode in two sites
  • Alerts dashboard deployed and available behind HTTP authentication
  • IRC bot deployed alongside Alertmanager and sending notifications to a test channel
  • All outstanding Icinga alerts show up in the alerts dashboard
  • Grouping and inhibition of alerts are working as expected

Event Timeline

Restricted Application added a subscriber: Aklapper. (Mon, Jul 27, 1:54 PM)
fgiunchedi moved this task from Inbox to In progress on the observability board. (Mon, Jul 27, 1:55 PM)
fgiunchedi moved this task from Backlog to Doing on the User-fgiunchedi board. (Wed, Jul 29, 3:34 PM)
bd808 added a subscriber: bd808. (Thu, Jul 30, 10:56 PM)

Jason set up Alertmanager in our metricsinfra project. One thing I just found out is that the package shipped for Buster is ancient: v0.15.3, which was tagged upstream on 2018-11-09. Sid, however, has a package of v0.21.0, which was tagged upstream on 2020-06-16. I discovered this version issue while trying to get https://github.com/prymitive/karma running as a GUI to view and silence alerts.

I can confirm I'm running into the same problem with Buster's version, and found that the same solution works (i.e. using the package from testing/unstable).

Change 617688 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] alertmanager: add IRC notifier

https://gerrit.wikimedia.org/r/617688

Change 617689 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] role: add alertmanager::irc to alerting_host

https://gerrit.wikimedia.org/r/617689

Re: showing Icinga alerts as Prometheus/Alertmanager alerts, the plan at the moment looks like this:

  • Export Icinga alerts (in HARD state, i.e. those shown at https://icinga.wikimedia.org/alerts) as Prometheus metrics, in a form like icinga_alert{host="host",status="CRITICAL",service="description of the service",information="output from the plugin"}.
    • Augment prometheus-icinga-exporter with this capability. Technically, I believe it can all be inferred from status.dat. However, Icinga also offers JSON export from the CGI, which we could use instead and massage as needed. What do you think, @colewhite?
  • Decide where to store such metrics, since the metric above has obvious potential for a cardinality explosion. However, this setup is meant to ease the transition to Alertmanager: the expected state down the road is that most alerts will be Prometheus-native and Icinga will have fewer and fewer alerts. Having said that, setting up a small Prometheus instance just for this isn't too much work.
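
To make the proposed metric shape concrete, here is a minimal sketch of rendering Icinga problems in the Prometheus text exposition format. It is not the prometheus-icinga-exporter implementation: the function names, the HELP text, and the choice of a constant sample value of 1 are assumptions made for illustration. Only the metric name and the four labels come from the plan above.

```python
from typing import Dict, Iterable, List

# Label names taken from the metric proposed in the plan above.
LABEL_NAMES = ("host", "status", "service", "information")


def escape_label_value(value: str) -> str:
    """Escape backslash, newline and double quote for the text exposition format."""
    return (
        value.replace("\\", "\\\\")
        .replace("\n", "\\n")
        .replace('"', '\\"')
    )


def render_icinga_alerts(problems: Iterable[Dict[str, str]]) -> str:
    """Render one icinga_alert sample per problem (hypothetical helper).

    Each problem is a dict with the keys listed in LABEL_NAMES. The sample
    value is always 1 (an assumption): the series exists while the problem does.
    """
    lines: List[str] = [
        "# HELP icinga_alert Icinga problem in HARD state (illustrative text)",
        "# TYPE icinga_alert gauge",
    ]
    for problem in problems:
        labels = ",".join(
            f'{name}="{escape_label_value(problem[name])}"'
            for name in LABEL_NAMES
        )
        lines.append(f"icinga_alert{{{labels}}} 1")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_icinga_alerts([
        {
            "host": "host",
            "status": "CRITICAL",
            "service": "description of the service",
            "information": "output from the plugin",
        }
    ]), end="")
```

Because the information label carries free-form plugin output, every distinct output string yields a new time series, which is presumably where the cardinality concern in the plan comes from.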

Change 618284 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: lowercase alerts annotations

https://gerrit.wikimedia.org/r/618284

FYI we already have a script that is able to parse an Icinga status.dat file, it was an effort made by @jbond and me. See modules/icinga/files/icinga_status.py in the Puppet repo if that could be useful for you.

Thank you! I'll take a look at that too in addition to what prometheus-icinga-exporter already has
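
For readers unfamiliar with status.dat, the following is a rough sketch of pulling HARD-state service problems out of it. It is neither modules/icinga/files/icinga_status.py nor the exporter's code. The block layout (`servicestatus { key=value ... }`), the field names, and the numeric state codes are assumptions based on the commonly documented Icinga 1.x status file, and the function names are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Optional

# Assumed Icinga 1.x service state codes.
SERVICE_STATES = {"0": "OK", "1": "WARNING", "2": "CRITICAL", "3": "UNKNOWN"}
# Assumed: state_type=1 means HARD, state_type=0 means SOFT.
HARD_STATE = "1"


def parse_status_blocks(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per 'name { ... }' block; the block name is stored under '_type'."""
    block: Optional[Dict[str, str]] = None
    for raw in lines:
        line = raw.strip()
        if block is None:
            if line.endswith("{"):
                block = {"_type": line[:-1].strip()}
        elif line == "}":
            yield block
            block = None
        elif "=" in line:
            # Split on the first '=' only, since plugin output may contain '='.
            key, _, value = line.partition("=")
            block[key] = value


def hard_service_problems(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Return non-OK services in HARD state as host/status/service/information dicts."""
    problems = []
    for block in parse_status_blocks(lines):
        if block["_type"] != "servicestatus":
            continue
        state = block.get("current_state", "0")
        if block.get("state_type") != HARD_STATE or state == "0":
            continue
        problems.append({
            "host": block.get("host_name", ""),
            "status": SERVICE_STATES.get(state, "UNKNOWN"),
            "service": block.get("service_description", ""),
            "information": block.get("plugin_output", ""),
        })
    return problems
```

The returned dicts use the same four keys as the labels of the proposed icinga_alert metric, so they could be fed to whatever code renders the metrics.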

Change 618284 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: lowercase alerts annotations

https://gerrit.wikimedia.org/r/618284

Change 618504 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/debs/prometheus-icinga-exporter@master] Init retry_count at each collection

https://gerrit.wikimedia.org/r/618504

Change 618505 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/debs/prometheus-icinga-exporter@master] Add support for exposing Icinga problems as metrics

https://gerrit.wikimedia.org/r/618505

Change 618504 merged by Cwhite:
[operations/debs/prometheus-icinga-exporter@master] Init retry_count at each collection

https://gerrit.wikimedia.org/r/618504

Change 618505 merged by Cwhite:
[operations/debs/prometheus-icinga-exporter@master] Add support for exposing Icinga problems as metrics

https://gerrit.wikimedia.org/r/618505

prometheus-icinga-exporter 0.8 deployed

Change 618764 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/debs/karma@master] Add Debian packaging

https://gerrit.wikimedia.org/r/618764

Change 618764 merged by Filippo Giunchedi:
[operations/debs/karma@master] Add Debian packaging

https://gerrit.wikimedia.org/r/618764

Change 617688 merged by Filippo Giunchedi:
[operations/puppet@production] alertmanager: add IRC notifier

https://gerrit.wikimedia.org/r/617688

Change 617689 merged by Filippo Giunchedi:
[operations/puppet@production] role: add alertmanager::irc to alerting_host

https://gerrit.wikimedia.org/r/617689

Change 619295 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: Introduce Alertmanager

https://gerrit.wikimedia.org/r/619295

Change 619295 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: Introduce Alertmanager

https://gerrit.wikimedia.org/r/619295

Change 619712 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] alertmanager: add advertise-address cluster option

https://gerrit.wikimedia.org/r/619712

Change 619712 merged by Filippo Giunchedi:
[operations/puppet@production] alertmanager: add advertise-address cluster option

https://gerrit.wikimedia.org/r/619712

fgiunchedi updated the task description. (Wed, Aug 12, 10:05 AM)

Change 619737 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] alertmanager: allow access from all Prometheis

https://gerrit.wikimedia.org/r/619737

Change 619738 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: add alertmanager jobs

https://gerrit.wikimedia.org/r/619738

Change 619739 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: add alertmanagers configuration

https://gerrit.wikimedia.org/r/619739

Change 619737 merged by Filippo Giunchedi:
[operations/puppet@production] alertmanager: allow access from all Prometheis

https://gerrit.wikimedia.org/r/619737

Change 619752 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/dns@master] templates: add alerts.w.o

https://gerrit.wikimedia.org/r/619752