mr1 port utilization alerts shouldn't mention hash page in their IRC logs
Open, Medium, Public

Description

Hello people,

today some alarms fired for mr1-eqsin:

07:31  <jinxer-wm> (Primary outbound port utilisation over 80%  #page) firing: Primary outbound port utilisation 
                   over 80%  #page - https://alerts.wikimedia.org
07:31  <jinxer-wm> (Primary inbound port utilisation over 80%  #page) firing: Primary inbound port utilisation over 
                   80%  #page - https://alerts.wikimedia.org
07:36 +<jinxer-wm> (Primary outbound port utilisation over 80%  #page) resolved: Primary outbound port utilisation 
                   over 80%  #page - https://alerts.wikimedia.org
07:36 +<jinxer-wm> (Primary inbound port utilisation over 80%  #page) resolved: Primary inbound port utilisation 
                   over 80%  #page - https://alerts.wikimedia.org

IIUC this alarm is not meant to page (and indeed there was no SMS from VictorOps), but the IRC log mentions #page anyway, which may confuse people who check IRC to spot anomalies.

Event Timeline

Legoktm renamed this task from mr1 port utilization alerts shouldn't mention "#page" in their IRC logs to mr1 port utilization alerts shouldn't mention hash page in their IRC logs. Apr 25 2021, 8:41 AM

I agree, we should restrict #page to alerts that actually page people. I'm not sure of an alternative tag, though (or we could remove the tag altogether for now). cc @ayounsi

@CDanis set it up: there is an Icinga check that pulls the LibreNMS API and should page where #page is present, but it should not page for management routers.
@fgiunchedi Maybe that's something now doable directly through Alert Manager instead? (And we can stop using the #page tag?)
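
For illustration only, the intended behaviour described above (page when the tag is present, but never for management routers) could be sketched roughly as follows. This is not the actual Icinga check: the function name, the hostname regex, and the `cr1-eqsin` example host are assumptions made up for this sketch, and the only details taken from the thread are the `#page` tag and the `mr1-eqsin` hostname.

```python
import re

# Tag that marks a LibreNMS alert rule as "should page" (from the thread).
PAGE_TAG = "#page"

# Assumption: management routers are recognisable by an "mr<N>-" hostname
# prefix, as in mr1-eqsin. The real check may identify them differently.
MGMT_ROUTER_RE = re.compile(r"^mr\d+-")


def should_page(rule_name: str, hostname: str) -> bool:
    """Return True only if the rule carries the page tag and the
    device is not a management router."""
    if PAGE_TAG not in rule_name:
        return False
    return MGMT_ROUTER_RE.match(hostname) is None
```

With this sketch, `should_page("Primary outbound port utilisation over 80%  #page", "mr1-eqsin")` is false, which matches the "no SMS" behaviour reported in the description, while the same rule on a hypothetical `cr1-eqsin` would page. Note that this says nothing about what gets echoed to IRC, which is the actual complaint in this task.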

For the record, the #page alerts from management routers should be fixed with T278289

WARNING: don't use the "hash + page" keyword in here since Wikibugs will display it on IRC :D (I already made a mistake when creating the task)

Yes, I think at this point we should page through AM for LibreNMS alerts and ditch the Icinga check; this is now tracked in T281095.

akosiaris triaged this task as Medium priority. Apr 26 2021, 9:59 AM

Moving to AM sounds good to me. But if needed, in the interim we could change the magic string we use in check_librenms to something other than hash page, which I chose for simplicity but which has maybe just caused more confusion.

All we'd have to do is to change the --escalation-pattern flag value and also change the names of the alert rules in LibreNMS.
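
As a rough sketch of why both changes have to happen together: if the escalation pattern is matched against alert rule names, then changing only the flag value would silently stop escalating rules that still carry the old tag. The code below is not check_librenms itself; only the `--escalation-pattern` flag name comes from this thread. The default value, the regex-based matching, and the `!escalate` replacement tag are assumptions for illustration.

```python
import argparse
import re


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the real check's argument parsing.
    parser = argparse.ArgumentParser(description="escalation-pattern sketch")
    parser.add_argument(
        "--escalation-pattern",
        default="#page",  # assumed default, based on the tag used in the thread
        help="pattern that marks an alert rule name as one to escalate",
    )
    return parser


def split_alerts(rule_names, pattern):
    """Split rule names into (escalate, other) by searching for the pattern.

    Assumption: the pattern is treated as a regular expression.
    """
    regex = re.compile(pattern)
    escalate = [name for name in rule_names if regex.search(name)]
    other = [name for name in rule_names if not regex.search(name)]
    return escalate, other
```

For example, parsing `["--escalation-pattern", "!escalate"]` and calling `split_alerts` on a rule still named `Primary inbound port utilisation over 80%  #page` puts it in the non-escalating list until the rule is renamed in LibreNMS to carry the new tag.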

SGTM to change the hashtag and pattern in the meantime. I won't have time to look into it soon, but I'm happy to review patches.