LibreNMS monitoring glitch caused paging
Closed, Resolved (Public)

Authored By

	ayounsi
	May 13 2020, 9:37 AM

Description

The monitoring glitch shown in https://librenms.wikimedia.org/graphs/to=1589328000/id=19265/type=port_bits/from=1589320800/

caused the following faulty alert in LibreNMS:

Alert for device asw2-esams.mgmt.esams.wmnet - Primary outbound port utilisation over 80%  #page
Device Name: asw2-esams.mgmt.esams.wmnet
Severity: critical
Timestamp: 2020-05-12 23:10:55
Rule:  Primary outbound port utilisation over 80%  #page
Physical Interface: ae2
Interface Description: Core: cr2-esams:ae1
Interface Speed: 80 Gbs
Inbound Utilization: 5.43847256
Outbound Utilization: 395288.13516505
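
For scale, the reported figures can be converted back into bit rates. This is a rough sketch that assumes the utilisation percentage is simply the measured rate divided by the interface speed; the function name is illustrative and is not LibreNMS code.

```python
def implied_rate_bps(utilization_pct: float, speed_bps: float) -> float:
    """Convert a utilisation percentage back into a bit rate."""
    return utilization_pct / 100.0 * speed_bps


SPEED_BPS = 80e9  # "Interface Speed: 80 Gbs" from the alert above

outbound = implied_rate_bps(395288.13516505, SPEED_BPS)
inbound = implied_rate_bps(5.43847256, SPEED_BPS)

print(f"outbound: {outbound / 1e12:.1f} Tbps")
print(f"inbound: {inbound / 1e9:.2f} Gbps")
```

Under that assumption the outbound value corresponds to several hundred Tbps on an 80 Gbps link, which is physically impossible, while the inbound value is a plausible few Gbps.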

The alert stayed active for 10 minutes:

t=0: device is polled, alert triggered
t=5: device is polled, no recovery
t=10: device is polled, alert recovers

This was enough to trigger the related Icinga alert.

My guess is a bug in the asw2-esams SNMP daemon (see also all the VCPs).

Given that:

A switch stack upgrade (especially in esams) is a heavy operation (requires a site depool), see T252631
Implementing a safeguard in LibreNMS' code is time-consuming (if doable at all)

We can either live with it until the upgrade (low probability of pages, no operational impact), or find a workaround.
For example:

Split the Primary outbound port utilisation over 80% alert in two, one that pages (for CRs) and one that alerts normally (for everything else)
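
The split could be pictured as follows. This is only a sketch of the routing logic, not of how LibreNMS rules are configured: the function name is hypothetical, and identifying core routers by a "cr" hostname prefix is an assumption made for illustration.

```python
def alert_action(hostname: str, utilization_pct: float, threshold: float = 80.0) -> str:
    """Return "ok", "alert" or "page" for one device/port sample."""
    if utilization_pct <= threshold:
        return "ok"
    # Assumption: core routers (CRs) can be recognised by a "cr" hostname prefix.
    return "page" if hostname.startswith("cr") else "alert"


print(alert_action("cr2-esams", 90.0))
print(alert_action("asw2-esams.mgmt.esams.wmnet", 395288.13516505))
```

With this split, the glitch on asw2-esams would still raise an alert, but it would not page.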

I don't think a LibreNMS upgrade would have helped, but it doesn't hurt. See T251222.

Related Objects

Mentioned In: T252631: Upgrade Junos on asw2-esams
Mentioned Here: T252631: Upgrade Junos on asw2-esams
T251222: Upgrade LibreNMS to 1.63

Event Timeline

ayounsi triaged this task as Medium priority. (May 13 2020, 9:37 AM)

ayounsi created this task.

Restricted Application added a project: SRE. (May 13 2020, 9:37 AM)

Restricted Application added a subscriber: Aklapper.

ayounsi mentioned this in T252631: Upgrade Junos on asw2-esams. (May 13 2020, 9:39 AM)

ayounsi updated the task description.

This happened again just now.

Something else we could do as a temporary mitigation is to accept a longer time-to-page in legitimate incidents and increase retries so that 15 minutes are required before a page. Or (Arzhel's suggestion) configure the LibreNMS alert to also require utilization below 101% before it alerts.
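
To illustrate the retries idea, here is a small simulation. It assumes 5-minute polling (as in the timeline in the description) and models "15 minutes before a page" as the rule matching continuously for 15 minutes; the function and its parameters are hypothetical and do not reflect actual Icinga or LibreNMS settings.

```python
def first_page_time(poll_results, poll_interval_min=5, required_min=15):
    """Return the minute at which a page would fire, or None if it never does.

    poll_results: one boolean per poll, True when the alert rule matches.
    """
    active_since = None
    for i, matched in enumerate(poll_results):
        now = i * poll_interval_min
        if not matched:
            active_since = None  # a recovery resets the clock
            continue
        if active_since is None:
            active_since = now
        if now - active_since >= required_min:
            return now
    return None


glitch = [True, True, False, False]          # t=0 and t=5 match, recovery at t=10
sustained = [True, True, True, True, True]   # a legitimate, lasting saturation

print(first_page_time(glitch))
print(first_page_time(sustained))
```

Under these assumptions the glitch never pages, at the cost of a real incident paging 15 minutes after it starts.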

I added an extra condition to the inbound and outbound alerts: usage also needs to be < 150% for them to trigger, which is way below the crazy percentages we saw when the bug happens.

This should mitigate this specific issue until the switch stack gets upgraded.
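
The resulting condition amounts to a bounded range check. A minimal sketch of the logic, with a hypothetical function name (the real rule lives in the LibreNMS alert configuration, not in code like this):

```python
def should_alert(utilization_pct: float, lower: float = 80.0, upper: float = 150.0) -> bool:
    """True only when utilisation is above the threshold and below the sanity cap."""
    return lower < utilization_pct < upper


print(should_alert(395288.13516505))  # the glitch value from the description
print(should_alert(95.0))             # a genuinely saturated link
print(should_alert(5.43847256))       # normal traffic
```

The glitch value falls outside the range and is ignored, while a genuinely saturated link still alerts.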

fgiunchedi moved this task from Inbox to Backlog on the observability board. (Jul 20 2020, 1:10 PM)

With the mitigation and the task to upgrade the router it's fine to close that one.
