Network port utilization alerts should be paging
Closed, Resolved · Public

Description

We had a port saturation incident today: https://wikitech.wikimedia.org/wiki/Incident_documentation/20190603-eqiad-port-saturation

Although LibreNMS did alert by sending an email to noc@wikimedia.org (Subject: Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80%), port utilization alerts should be paging.

Event Timeline

Restricted Application added a subscriber: Aklapper.
ema triaged this task as Medium priority. Jun 3 2019, 2:44 PM

I agree!

These are all the "transports" LibreNMS alerting can use: https://docs.librenms.org/Alerting/Transports

I'm not familiar with our paging system. If any of the above can be used to send pages that would be ideal, otherwise we would have to write our own.

There is a "Nagios Compatible" transport, but it is underdocumented and also seems to write only to a local filesystem path (presumably a Nagios external command FIFO).

It seems more like we'd have to write a Nagios check_command that would scrape LibreNMS's alerts API and fire an Icinga alert in that case, much like we have for Grafana alerts.

I've a proposal for doing this:

  • Add some special tag like #NRPE or #page to the names of any LibreNMS alert rules we'd like to make page. For our purpose here this would just be #6 Primary outbound port utilisation over 80% and #25 Primary inbound port utilisation over 80%.
  • In a Python NRPE:

This will prevent turning any LibreNMS critical into a page for the whole team (e.g. the currently-firing "Sensor over limit" for cr3-esams). It will also mean that ACKing alerts within LibreNMS does the right thing. And it makes it fairly straightforward to add/remove alert rules from the set that pages the team.

SGTU?
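
A minimal sketch of the selection logic such a check could use, assuming the rule and alert lists have already been fetched from LibreNMS's API and parsed from JSON. The field names (`id`, `name`, `rule_id`, `device_id`, `state`) and the state values (1 for firing, 2 for acknowledged) are assumptions about the API's response shape, and `paging_alerts` is a hypothetical helper, not the code that was eventually merged.

```python
PAGE_TAG = "#page"
STATE_FIRING = 1  # assumption: LibreNMS uses 0=ok, 1=firing, 2=acknowledged


def paging_alerts(rules, alerts, tag=PAGE_TAG):
    """Return sorted (rule_name, device_id) pairs for firing, unacknowledged
    alerts whose rule name contains the tag."""
    # Map rule id -> rule name, keeping only rules that carry the tag.
    tagged = {int(r["id"]): r["name"] for r in rules if tag in r["name"]}
    return sorted(
        (tagged[int(a["rule_id"])], int(a["device_id"]))
        for a in alerts
        # Acknowledged alerts (state 2) are skipped, so an ACK in LibreNMS
        # silences the downstream check as well.
        if int(a["state"]) == STATE_FIRING and int(a["rule_id"]) in tagged
    )


# Illustrative data only: rule ids 6 and 25 come from the proposal above,
# everything else (rule 3's id, device ids) is made up.
example_rules = [
    {"id": 6, "name": "Primary outbound port utilisation over 80% #page"},
    {"id": 25, "name": "Primary inbound port utilisation over 80% #page"},
    {"id": 3, "name": "Sensor over limit"},
]
example_alerts = [
    {"rule_id": 6, "device_id": 2, "state": 1},   # tagged and firing
    {"rule_id": 3, "device_id": 9, "state": 1},   # firing but untagged
    {"rule_id": 25, "device_id": 2, "state": 2},  # tagged but acknowledged
]
example_result = paging_alerts(example_rules, example_alerts)
```

With this data only the outbound utilisation alert is selected: the untagged "Sensor over limit" alert and the acknowledged inbound alert are both dropped.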

Sounds great to me! I'm assuming that on the Icinga side there will be only one alert, at least to start with (e.g. for silencing purposes), which I think will work fine for now.

Yeah, I think always one alert on the Icinga side, with the usual technique of varying the output string based on which LibreNMS alerts are firing, so if one instance gets acknowledged and then another port saturates, it will re-alert in Icinga. (Plus, ACKing an alert in LibreNMS will 'work'.)
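
A rough sketch of the "one alert, varying output string" idea, assuming the check already has a list of (rule name, device) pairs for the alerts that should page. `check_result` is a hypothetical helper and the message wording is illustrative; only the exit codes (0 for OK, 2 for CRITICAL) follow the standard Nagios plugin convention.

```python
OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def check_result(firing):
    """Map (rule_name, device) pairs to a single (exit_code, status_line)."""
    if not firing:
        return OK, "OK: no LibreNMS alerts tagged for paging are firing"
    # Sort so the same set of alerts always yields the same string; the
    # string changes only when the set of firing alerts changes.
    details = "; ".join(f"{rule} on {device}" for rule, device in sorted(firing))
    return CRITICAL, f"CRITICAL: {len(firing)} LibreNMS alert(s) firing: {details}"


# Hypothetical example: a second saturated port changes the status line.
rule = "Primary outbound port utilisation over 80% #page"
one_port = check_result([(rule, "cr2-eqiad")])
two_ports = check_result([(rule, "cr2-eqiad"), (rule, "cr1-eqiad")])
```

A real plugin would print the status line and call `sys.exit()` with the exit code. Here `one_port` and `two_ports` are both CRITICAL but carry different status lines, which is the property the technique relies on.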

@ayounsi does all the above SGTU?

Any preferences or thoughts re: the special tag? Right now I'm leaning towards #page as that seems the most self-explanatory.

+1 on #page

That looks good! We might want to create a specific LibreNMS alert for the transit/peering links only, but can start with the existing ones.

Note that LibreNMS has a 5min granularity, so ideally the Icinga check should be more frequent while not overwhelming LibreNMS.

That looks good! We might want to create a specific LibreNMS alert for the transit/peering links only, but can start with the existing ones.

Makes sense. Should be easy enough to do (I think I even saw how to do it, if I figured out macro definitions). The relevant fragment of the current alert rule gets evaluated like so: ports.ifAlias REGEXP \"^(Cust|Transit|Peering|Core|Transport).*\") = 1

Anyway, would be easy to split the current alert rules into two, and make only one of them have the #page annotation.
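
To illustrate the split, here is a small sketch of how the interface description prefixes could be divided between a paging rule and a non-paging rule. The first pattern mirrors the fragment quoted above; the narrower transit/peering pattern, the `classify` helper, and the sample descriptions are hypothetical. Note that Python's `re` is case-sensitive by default, whereas MySQL's `REGEXP` may not be, depending on collation.

```python
import re

# Prefixes from the existing rule fragment quoted in the thread.
ALL_PRIMARY = re.compile(r"^(Cust|Transit|Peering|Core|Transport)")
# Hypothetical narrower pattern for a rule that would carry #page.
PAGING_ONLY = re.compile(r"^(Transit|Peering)")


def classify(if_alias):
    """Return which of the two hypothetical rules would cover an interface."""
    if PAGING_ONLY.match(if_alias):
        return "page"
    if ALL_PRIMARY.match(if_alias):
        return "no-page"
    return "unmatched"


# Made-up interface descriptions, for illustration only.
samples = ["Transit: ExampleNet", "Peering: ExampleIX", "Core: cr1-eqiad",
           "Transport: ExampleCarrier", "Management"]
classified = {alias: classify(alias) for alias in samples}
```

"Transport" shares a prefix with "Transit" but is not matched by the narrower pattern, so transport links would stay on the non-paging rule in this sketch.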

Note that LibreNMS has a 5min granularity, so ideally the Icinga check should be more frequent while not overwhelming LibreNMS.

+1. It's just a few API calls, and they return rather quickly in my testing from my workstation with curl. 1x/minute seems good to me.

Change 566789 had a related patch set uploaded (by CDanis; owner: CDanis):
[labs/private@master] add API key for scraping of LibreNMS's API by Icinga

https://gerrit.wikimedia.org/r/566789

Change 566789 merged by CDanis:
[labs/private@master] add API key for scraping of LibreNMS's API by Icinga

https://gerrit.wikimedia.org/r/566789

Change 566888 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] Icinga alert for LibreNMS critical alerts

https://gerrit.wikimedia.org/r/566888

Change 566888 merged by CDanis:
[operations/puppet@production] Icinga alert for LibreNMS critical alerts

https://gerrit.wikimedia.org/r/566888

Looks good so far: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=icinga1001&service=Check+alerts+defined+in+LibreNMS

I also modified the inbound/outbound port utilisation alert rules in LibreNMS to include the magic word #page.

Once this has been running for a while I'll make this alert nagios_critical=>1 in Icinga so it does in fact page. (Right now it's just IRC notifications... which will include the keyword #page themselves, so will actually hotword SREs who have opted into that... FYI.)

Change 570371 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] librenms API scrape alert: make critical

https://gerrit.wikimedia.org/r/570371

Change 570371 merged by CDanis:
[operations/puppet@production] librenms API scrape alert: make critical & change name

https://gerrit.wikimedia.org/r/570371