
Setup flow monitoring of *internal* network traffic
Closed, ResolvedPublic

Description

The TSO/GRO problems we had over the last few weeks could have been spotted much
earlier if we had sflow/netflow monitoring of our *internal* network for things
like an unusually high number of ICMP errors, TCP retransmits, etc.
--
Mark Bergsma <mark at wikimedia>
Operations Engineering Program Manager
Wikimedia Foundation

Details

Reference
rt1308

Related Objects

Status      Subtype      Assigned      Task
Resolved                 ayounsi

Event Timeline

rtimport raised the priority of this task from to Medium. (Dec 18 2014, 12:55 AM)
rtimport added a project: netops.
rtimport set Reference to rt1308.

Dependency by ticket #6775 added by gage

fgiunchedi changed the visibility from "WMF-NDA (Project)" to "Public (No Login Required)". (Dec 2 2015, 3:19 PM)
fgiunchedi changed the edit policy from "WMF-NDA (Project)" to "All Users".
ayounsi subscribed.

Prometheus (which didn't exist in 2011), combined with netstat metrics, provides better visibility into problematic frames/segments/datagrams/packets entering and leaving the servers.
I created two dashboards (both still drafts):
https://grafana.wikimedia.org/dashboard/db/network-performances-global
and
https://grafana.wikimedia.org/dashboard/db/network-performances

After investigating the out-of-the-ordinary patterns, we will be able to add alerting on those graphs so that we are notified when something needs our attention.
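
As a rough illustration of the kind of check such alerting could be built on, here is a minimal Python sketch that computes a TCP retransmit ratio from two samples of netstat-style counters in the /proc/net/snmp layout (a header line followed by a value line per protocol). It is not the actual Prometheus or Grafana configuration behind the dashboards above; the function names and the 1% threshold are assumptions made up for this example.

```python
def parse_proc_net_snmp(text):
    """Parse /proc/net/snmp-style text into {protocol: {counter: value}}."""
    lines = [line.split() for line in text.strip().splitlines() if line.strip()]
    counters = {}
    # Lines come in pairs: a header line and a value line, both prefixed "Proto:".
    for header, values in zip(lines[0::2], lines[1::2]):
        proto = header[0].rstrip(":")
        counters[proto] = {name: int(v) for name, v in zip(header[1:], values[1:])}
    return counters


def retransmit_ratio(before, after):
    """Fraction of TCP segments sent between two samples that were retransmits."""
    sent = after["Tcp"]["OutSegs"] - before["Tcp"]["OutSegs"]
    retrans = after["Tcp"]["RetransSegs"] - before["Tcp"]["RetransSegs"]
    return retrans / sent if sent > 0 else 0.0


# Hypothetical threshold: flag a host when more than 1% of sent segments
# in the sampling interval were retransmits. Tune after studying the graphs.
RETRANSMIT_THRESHOLD = 0.01


def needs_attention(before, after, threshold=RETRANSMIT_THRESHOLD):
    """Return True when the retransmit ratio exceeds the threshold."""
    return retransmit_ratio(before, after) > threshold


if __name__ == "__main__":
    # Made-up sample counters, trimmed to the three fields used here.
    sample_before = "Tcp: InSegs OutSegs RetransSegs\nTcp: 1000 2000 10\n"
    sample_after = "Tcp: InSegs OutSegs RetransSegs\nTcp: 1500 3000 40\n"
    before = parse_proc_net_snmp(sample_before)
    after = parse_proc_net_snmp(sample_after)
    print(retransmit_ratio(before, after), needs_attention(before, after))
```

Working on deltas between two samples rather than on the raw counters matters because the counters are cumulative since boot; a host with a long uptime would otherwise hide a recent spike.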

Alerts have been added to the dashboard (they are not tied to Nagios, but they show up in the "single pane of glass" dashboard in LibreNMS).
I think that satisfies the initial request.
More graphs/alerts will be added when needed.