
2021-08-26 Primary inbound port utilisation over 80% page for mr1-esams.wikimedia.org
Closed, ResolvedPublic

Description

At 23:07 a page went off for "Primary inbound port utilisation over 80%" for mr1-esams.wikimedia.org. It resolved itself in 5 minutes.

I think the past two times this page went off, @ayounsi said it was a false positive and adjusted the paging rules as a result. Filing this in case it needs similar treatment.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Commenting because I think @ayounsi was not CCed on the original Phab report, so that he can triage.

I had a look at this in the morning (I didn't catch the page when it fired, and it cleared quickly, as you say). It seems to be OK, but I have made a note to discuss it with @ayounsi on his return.

It may well be legitimate management traffic from some server. The port that alerted, ge-0/0/3, connects to msw-oe15-esams, but that switch is unmanaged, so we can't be sure of the source or destination of the traffic.

CPU usage may also be a factor. Those mr devices have SSH open on the OOB circuit (which we wish to change, see T277438), and the CPU can sometimes spike due to failed attempts. There are gaps in the graphs, which may be caused by polls failing during CPU spikes. The average bandwidth shown in LibreNMS that triggered the alert exceeds the physical link capacity, which is odd and may be related.
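To illustrate how missed polls could, in principle, push a computed average above line rate, here is a minimal sketch. It is not LibreNMS code, and I have not confirmed that LibreNMS behaves this way. The function name, the 300-second polling interval, the 1 Gbit/s link speed, and the 400 Mbit/s traffic level are all assumptions made for the example.

```python
# Hypothetical sketch: utilisation computed from two SNMP-style octet counter samples.
# All names and numbers below are illustrative assumptions, not values from the alert.

def utilisation_pct(octets_prev, octets_now, seconds, link_bps, counter_bits=64):
    """Return percent utilisation over an interval from two octet counter readings."""
    # The modulo handles a single counter wrap between the two samples.
    delta_octets = (octets_now - octets_prev) % (2 ** counter_bits)
    return delta_octets * 8 * 100 / (seconds * link_bps)

LINK_BPS = 1_000_000_000       # assumed 1 Gbit/s port
NOMINAL_INTERVAL = 300         # assumed polling interval in seconds
TRAFFIC_BPS = 400_000_000      # assumed steady 400 Mbit/s of real traffic

# Two polls are missed, so 900 s actually elapse between successful samples.
actual_elapsed = 3 * NOMINAL_INTERVAL
octets_seen = TRAFFIC_BPS // 8 * actual_elapsed

# Dividing by the real elapsed time gives the true utilisation.
correct = utilisation_pct(0, octets_seen, actual_elapsed, LINK_BPS)

# Dividing by the nominal interval instead overstates the rate threefold.
skewed = utilisation_pct(0, octets_seen, NOMINAL_INTERVAL, LINK_BPS)

print(f"correct: {correct:.1f}%  skewed: {skewed:.1f}%")
```

In this sketch the skewed figure is well above 100% even though the real traffic is under half the link speed, which would be enough to cross an 80% threshold. Whether this is what happened here would need checking against the raw counter samples.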

I also see a higher number of flows on the router at the time.

image.png (592×1 px, 104 KB)

I had removed the management routers from the wrong alert, which is why we got paged again. It's now fixed, so it won't page while we investigate.

Marostegui assigned this task to ayounsi.
Marostegui subscribed.

Fixed per the above comment