Network port utilization alerts should be paging
Open, Medium, Public

Description

We had a port saturation incident today: https://wikitech.wikimedia.org/wiki/Incident_documentation/20190603-eqiad-port-saturation

Although LibreNMS did alert by sending an email to noc@wikimedia.org (Subject: Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80%), port utilization alerts should be paging.

Event Timeline

ema created this task. Jun 3 2019, 2:44 PM
Restricted Application added a project: Operations. Jun 3 2019, 2:44 PM
Restricted Application added a subscriber: Aklapper.
ema triaged this task as Medium priority. Jun 3 2019, 2:44 PM
Paladox added a subscriber: Paladox. Jun 3 2019, 2:45 PM
ayounsi added a subscriber: ayounsi. Jun 3 2019, 3:00 PM

I agree!

These are all the "transports" LibreNMS alerting can use: https://docs.librenms.org/Alerting/Transports

I'm not familiar with our paging system. If any of the above can be used to send pages, that would be ideal; otherwise we would have to write our own.

ema moved this task from Triage to Network on the Traffic board. Jun 3 2019, 3:09 PM
CDanis added a subscriber: CDanis. Jun 3 2019, 6:30 PM

There is a "Nagios Compatible" transport, but it is underdocumented and seems to also only write to a local filesystem path (which is presumed to be a Nagios external command FIFO).

It seems more like we'd have to write a Nagios check_command that would scrape LibreNMS's alerts API and fire an Icinga alert in that case, much like we have for Grafana alerts.
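For context on the "external command FIFO" mentioned above, here is a minimal sketch of how a passive service check result is usually submitted to Nagios through its command file. The `PROCESS_SERVICE_CHECK_RESULT` line format is the standard Nagios external command. Whether the LibreNMS transport writes exactly this is not confirmed here. The function names, host name and service description are hypothetical.

```python
import time


def passive_result_line(host, service, return_code, output, now=None):
    """Format one PROCESS_SERVICE_CHECK_RESULT external command line."""
    ts = int(time.time() if now is None else now)
    # A newline would terminate the command early, so flatten the output.
    clean = output.replace("\n", " ")
    return f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{return_code};{clean}\n"


def submit(command_file, line):
    """Write the command to the Nagios command file (a named pipe on the Nagios host)."""
    with open(command_file, "w") as fifo:
        fifo.write(line)
```

This only works for a process running on the same host as the command file, which is the limitation described above.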

CDanis added a subscriber: fgiunchedi. Edited Nov 25 2019, 4:52 PM

I've a proposal for doing this:

  • Add some special tag like #NRPE or #page to the names of any LibreNMS alert rules we'd like to make page. For our purpose here this would just be #6 Primary outbound port utilisation over 80% and #25 Primary inbound port utilisation over 80%.
  • In a Python NRPE:

This will prevent turning any LibreNMS critical into a page for the whole team (e.g. the currently-firing "Sensor over limit" for cr3-esams). It will also mean that ACKing alerts within LibreNMS does the right thing. And it makes it fairly straightforward to add/remove alert rules from the set that pages the team.
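The proposal above could be sketched roughly as follows. This is not the check that was eventually deployed, only an illustration under stated assumptions: that the LibreNMS API exposes `rules` and `alerts` endpoints under `/api/v0/` authenticated with an `X-Auth-Token` header, that alert state 1 means "firing" and 2 means "acknowledged", and that the JSON fields are named `id`, `name`, `rule_id`, `device_id` and `state`. The function names and the base URL are hypothetical.

```python
import json
import urllib.request

OK, CRITICAL, UNKNOWN = 0, 2, 3  # standard Nagios plugin exit codes
TAG = "#page"  # tag in the rule name; the exact tag is still under discussion


def fetch(base_url, token, path):
    """GET one LibreNMS API path and decode the JSON body (assumed API shape)."""
    req = urllib.request.Request(
        f"{base_url}/api/v0/{path}", headers={"X-Auth-Token": token}
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def paging_alerts(rules, alerts, tag=TAG):
    """Return sorted (rule name, device id) pairs for firing alerts on tagged rules.

    Acknowledged alerts (state 2) are skipped, so ACKing in LibreNMS clears the page.
    """
    tagged = {int(r["id"]): r["name"] for r in rules if tag in r["name"]}
    return sorted(
        (tagged[int(a["rule_id"])], str(a["device_id"]))
        for a in alerts
        if int(a["rule_id"]) in tagged and int(a["state"]) == 1
    )


def main(base_url, token):
    """Print a Nagios-style status line and return the matching exit code."""
    try:
        rules = fetch(base_url, token, "rules")["rules"]
        alerts = fetch(base_url, token, "alerts?state=1")["alerts"]
    except Exception as exc:  # any API failure should not look like "all clear"
        print(f"UNKNOWN: could not query LibreNMS: {exc}")
        return UNKNOWN
    firing = paging_alerts(rules, alerts)
    if firing:
        print("CRITICAL: " + "; ".join(f"{rule} (device {dev})" for rule, dev in firing))
        return CRITICAL
    print("OK: no tagged LibreNMS alerts firing")
    return OK
```

A wrapper script would call something like `sys.exit(main("https://librenms.example.org", token))`. Untagged rules such as "Sensor over limit" never reach the output because they are filtered out by rule name.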

SGTU?

Sounds great to me! I am assuming that on the Icinga side it'll be only one alert, at least to start with (e.g. for silencing purposes), which I think will work fine for now.

Yeah, I think always one alert on the Icinga side, with the usual technique of varying the output string based on which LibreNMS alerts are firing, so that if one instance gets acknowledged and then another port saturates, it will re-alert in Icinga. (Plus, ACKing an alert in LibreNMS will 'work'.)
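A minimal sketch of the "vary the output string" idea: build the status line deterministically from the set of firing alerts, so the same set always yields the same text and a different set yields different text. The function name and the rule/device labels are hypothetical, and how Icinga treats a changed output on an already-acknowledged problem is not shown here.

```python
def status_line(firing):
    """Build a stable plugin output from (rule name, device) pairs."""
    if not firing:
        return "OK: no paging LibreNMS alerts firing"
    # Sort and de-duplicate so ordering in the API response does not change the text.
    parts = sorted({f"{rule} on {device}" for rule, device in firing})
    return "CRITICAL: " + "; ".join(parts)
```

Because the text depends only on which alerts are firing, a second saturated port produces a new output string rather than repeating the one that was acknowledged.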

@ayounsi does all the above SGTU?

Any preferences or thoughts re: the special tag? Right now I'm leaning towards #page as that seems the most self-explanatory.

+1 on #page

That looks good! We might want to create a specific LibreNMS alert for the transit/peering links only, but can start with the existing ones.

Note that LibreNMS has a 5min granularity, so ideally the Icinga check should be more frequent while not overwhelming LibreNMS.

Makes sense. Should be easy enough to do (I think I even saw how to do it, if I figured out macro definitions). The relevant fragment of the current alert rule gets evaluated like so: ports.ifAlias REGEXP \"^(Cust|Transit|Peering|Core|Transport).*\") = 1

Anyway, would be easy to split the current alert rules into two, and make only one of them have the #page annotation.
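To illustrate the split, here is a rough sketch using Python's `re` as a stand-in for the SQL `REGEXP` in the rule. The first pattern is the one quoted above. The narrower transit/peering pattern is an assumption about what the paging rule might use, and the interface aliases in the usage note are made up. Note that Python matching is case-sensitive, which may differ from how the database evaluates `REGEXP`.

```python
import re

# Pattern from the current alert rule, as quoted above.
CURRENT = re.compile(r"^(Cust|Transit|Peering|Core|Transport).*")
# Hypothetical narrower pattern for the rule that would carry the #page tag.
PAGING = re.compile(r"^(Transit|Peering).*")


def split_aliases(aliases):
    """Split ifAlias values into (paging, non-paging) under the two patterns."""
    paging = [a for a in aliases if PAGING.match(a)]
    non_paging = [a for a in aliases if CURRENT.match(a) and not PAGING.match(a)]
    return paging, non_paging
```

For example, `split_aliases(["Transit: ExampleNet", "Core: cr1 to cr2"])` puts the first alias in the paging list and the second in the non-paging list; aliases matching neither pattern are dropped, as they are by the current rule.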

+1. It's just a few API calls, and they return rather quickly in my testing from my workstation with curl. 1x/minute seems good to me.

fgiunchedi moved this task from Inbox to Up next on the observability board. Dec 9 2019, 11:18 AM
CDanis claimed this task. Dec 19 2019, 3:13 PM