Improve Alertmanager/LibreNMS notifications
Open, Medium, Public

Description

Some feedback after using Alert Manager as transport for LibreNMS:


When ACKing an alert in LibreNMS, Alert Manager sends an email titled:

[FIRING:1] Access port utilisation over 80% for 1h global (asw2-b-eqiad.mgmt.eqiad.wmnet warning librenms netops)

And only at the very bottom it says:

title = Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port utilisation over 80% for 1h got acknowledged

Same for IRC alerts:

jinxer-wm> (Access port utilisation over 80% for 1h) firing: Access port utilisation over 80% for 1h - https://alerts.wikimedia.org

Which is confusing as it's an ACK.


LibreNMS IRC bot had colors, but not the Alert Manager one, that's a major regression.


In the email body, all the LibreNMS details are crammed into the "Annotations" section, with no new lines, which makes it difficult to parse.
The content is also duplicated. For example:

alertname = Access port utilisation over 80% for 1h
Rule: Access port utilisation over 80% for 1h
summary = Access port utilisation over 80% for 1h
instance = asw2-b-eqiad.mgmt.eqiad.wmnet
Device Name: asw2-b-eqiad.mgmt.eqiad.wmnet


The email title says [FIRING:1], not sure if it's needed or what the :1 means.


[FIRING:1] Inbound interface errors global (asw2-b-eqiad.mgmt.eqiad.wmnet warning librenms netops)

No need for the severity, scope or team in the email title, that's precious real-estate
Something like:

[Alert] asw2-b-eqiad.mgmt.eqiad.wmnet: Inbound interface errors

is more clear.


The "source" link is broken, it links to, for example "http://device/device=175"

Event Timeline

ayounsi triaged this task as Medium priority. Feb 3 2021, 9:41 AM
ayounsi created this task.
fgiunchedi renamed this task from Improve alertmanager notifications to Improve Alertmanager/LibreNMS notifications. Feb 8 2021, 4:13 PM

Another one:

<+jinxer-wm> (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
04:58 (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org

Doesn't say which host it's for.

Thank you for the feedback! Unfortunately, I think addressing some of it will need a librenms patch.

(To that end I've also started to package librenms as a Debian package, since I think it is more practical and we don't really need scap anymore).

To address pressing concerns:

Another one:

<+jinxer-wm> (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
04:58 (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org

Doesn't say which host it's for.

Agreed the host must be in there, I see two ways of achieving this:

  1. patch librenms to always append the hostname to the alert summary
  2. patch librenms to use the alert's title for summary, afaict the title always contains the hostname and can be changed from the web interface. in the example above the title is Alert for device cr1-codfw.wikimedia.org - Primary outbound port utilisation over 80% #page
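The two options above can be sketched as follows. This is only an illustration in Python (the real LibreNMS transport is PHP), assuming the alert is posted in the shape Alertmanager's v2 API accepts; the function name `build_alert`, its parameters, and the rule name used in the example are hypothetical.

```python
import json


def build_alert(rule_name, hostname, title, severity="warning", use_title=True):
    """Build one alert dict in the shape accepted by POST /api/v2/alerts.

    Hypothetical helper: it only illustrates where the hostname would end up.
    """
    if use_title:
        # Option 2: reuse the LibreNMS alert title, which already names the device.
        summary = title
    else:
        # Option 1: always append the hostname to the rule name.
        summary = f"{rule_name} ({hostname})"
    return {
        "labels": {
            "alertname": rule_name,
            "instance": hostname,
            "severity": severity,
        },
        "annotations": {"summary": summary},
    }


alert = build_alert(
    rule_name="Primary outbound port utilisation over 80%",  # assumed rule name
    hostname="cr1-codfw.wikimedia.org",
    title="Alert for device cr1-codfw.wikimedia.org - "
          "Primary outbound port utilisation over 80% #page",
)
# The API takes a JSON list of alerts.
print(json.dumps([alert], indent=2))
```

With either option the hostname reaches the `summary` annotation, which is the field the IRC line is built from in the examples quoted here.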

When ACKing an alert in LibreNMS, Alert Manager sends an email titled:

[FIRING:1] Access port utilisation over 80% for 1h global (asw2-b-eqiad.mgmt.eqiad.wmnet warning librenms netops)

And only at the very bottom it says:

title = Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port utilisation over 80% for 1h got acknowledged

Same for IRC alerts:

jinxer-wm> (Access port utilisation over 80% for 1h) firing: Access port utilisation over 80% for 1h - https://alerts.wikimedia.org

Which is confusing as it's an ACK.

Agreed, the acks should come from alertmanager / alerts.wikimedia.org itself, IOW alertmanager doesn't expect clients to be ack'ing alerts
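For reference, a rough sketch of what an acknowledgement made on the Alertmanager side could look like, assuming it is expressed as a silence (this thread does not say how acks are implemented there). The payload follows the silence shape of Alertmanager's v2 API; `build_silence`, the author name, and the four-hour duration are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def build_silence(alertname, instance, author, comment, hours=4, now=None):
    """Build a silence payload in the shape accepted by POST /api/v2/silences.

    Hypothetical helper; the duration and matcher choice are assumptions.
    """
    now = now or datetime.now(timezone.utc)
    return {
        "matchers": [
            {"name": "alertname", "value": alertname, "isRegex": False},
            {"name": "instance", "value": instance, "isRegex": False},
        ],
        # Timezone-aware isoformat() yields RFC 3339 timestamps.
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": author,
        "comment": comment,
    }


silence = build_silence(
    alertname="Access port utilisation over 80% for 1h",
    instance="asw2-b-eqiad.mgmt.eqiad.wmnet",
    author="example-user",  # placeholder
    comment="ACK: looking into it",
    now=datetime(2021, 2, 3, 9, 41, tzinfo=timezone.utc),
)
print(silence["endsAt"])
```

Handled this way, the alert stays "firing" on the LibreNMS side and the acknowledgement lives only in Alertmanager, so no misleading "firing" notification is sent for an ACK.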


LibreNMS IRC bot had colors, but not the Alert Manager one, that's a major regression.

I think the only way we can possibly address this is by putting back the librenms irc bot, and stop sending irc notifications via AM's irc bot, which I'd rather avoid unless absolutely necessary.


In the email body, all the LibreNMS details are crammed into the "Annotations" section, with no new lines, which makes it difficult to parse.
The content is also duplicated. For example:

alertname = Access port utilisation over 80% for 1h
Rule: Access port utilisation over 80% for 1h
summary = Access port utilisation over 80% for 1h
instance = asw2-b-eqiad.mgmt.eqiad.wmnet
Device Name: asw2-b-eqiad.mgmt.eqiad.wmnet

Yeah, that's quite unreadable. From my understanding and from reading the code, this comes from the formatting of the alert template combined with the way the alertmanager transport formats the message. We'll need to look into it as well.

The email title says [FIRING:1], not sure if it's needed or what the :1 means.

FIRING is the alert state (as opposed to RESOLVED) and :1 is how many alerts have been grouped together
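To make the title's structure concrete, here is a small Python sketch of how such a subject appears to be assembled: state, firing count, the group label values, then the remaining common label values in parentheses. It mirrors my understanding of the default subject template rather than Alertmanager's actual Go template, and the label names `scope`, `source` and `team` are assumptions.

```python
def email_subject(status, firing_count, group_labels, common_labels):
    """Compose a subject like '[FIRING:1] <group values> (<other common values>)'."""
    prefix = status.upper()
    if status == "firing":
        # The number after the colon is how many firing alerts are in the group.
        prefix += f":{firing_count}"
    group_part = " ".join(group_labels[name] for name in sorted(group_labels))
    # Common labels that are not already part of the grouping key.
    extra = {k: v for k, v in common_labels.items() if k not in group_labels}
    subject = f"[{prefix}] {group_part}"
    if extra:
        subject += " (" + " ".join(extra[name] for name in sorted(extra)) + ")"
    return subject


group = {"alertname": "Access port utilisation over 80% for 1h", "scope": "global"}
common = dict(
    group,
    instance="asw2-b-eqiad.mgmt.eqiad.wmnet",
    severity="warning",
    source="librenms",
    team="netops",
)
print(email_subject("firing", 1, group, common))
```

Under these assumptions the sketch reproduces the title quoted in this task, and a resolved notification would start with `[RESOLVED]` without a count.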

[FIRING:1] Inbound interface errors global (asw2-b-eqiad.mgmt.eqiad.wmnet warning librenms netops)

No need for the severity, scope or team in the email title, that's precious real-estate
Something like:

[Alert] asw2-b-eqiad.mgmt.eqiad.wmnet: Inbound interface errors

is more clear.

Agreed, I think the fix for this could be the same as for the IRC message (namely changing the summary to the alert's librenms title, e.g. Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Inbound interface errors).


The "source" link is broken, it links to, for example "http://device/device=175"

Indeed, this is caused by an unset base_url in the librenms configuration, which I'll need to look into.

Change 675075 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/software/librenms@upstream-1.66] Use 'title' as Alertmanager summary

https://gerrit.wikimedia.org/r/675075

Change 675076 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/puppet@production] librenms: add base_url setting

https://gerrit.wikimedia.org/r/675076

Change 675076 merged by Filippo Giunchedi:
[operations/puppet@production] librenms: add base_url setting

https://gerrit.wikimedia.org/r/675076

Note that I also edited the current LibreNMS alert templates to remove useless (or duplicated) info.

One thing I find weird is that the AM emails don't include a timestamp; the only one present comes from LibreNMS.

Change 675075 merged by Filippo Giunchedi:
[operations/software/librenms@upstream-1.66] Use 'title' as Alertmanager summary

https://gerrit.wikimedia.org/r/675075

Change 675097 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/software/librenms@upstream-1.66] Use device's 'alerts' page as Alertmanager 'source' link

https://gerrit.wikimedia.org/r/675097

Change 675097 merged by Filippo Giunchedi:
[operations/software/librenms@upstream-1.66] Use device's 'alerts' page as Alertmanager 'source' link

https://gerrit.wikimedia.org/r/675097

Change 675455 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/software/librenms@upstream-1.66] Add 'timestamp' annotation to AM alerts

https://gerrit.wikimedia.org/r/675455

Change 675455 merged by Filippo Giunchedi:
[operations/software/librenms@upstream-1.66] Add 'timestamp' annotation to AM alerts

https://gerrit.wikimedia.org/r/675455