The TSO/GRO problems we had over the last few weeks could have been spotted much
earlier if we had sflow/netflow monitoring of our *internal* network for things
like an unusually high number of ICMP errors, TCP retransmits, etc.
--
Mark Bergsma <mark at wikimedia>
Operations Engineering Program Manager
Wikimedia Foundation
Description
Details
- Reference
- rt1308
Status | Subtype | Assigned | Task
---|---|---|---
Restricted Task | | |
Resolved | | ayounsi | T79755 Setup flow monitoring of *internal* network traffic
Event Timeline
Prometheus (which didn't exist in 2011), with its netstat metrics, gives better visibility into problematic frames/segments/datagrams/packets going in and out of the servers.
I created two dashboards (both still drafts):
https://grafana.wikimedia.org/dashboard/db/network-performances-global
and
https://grafana.wikimedia.org/dashboard/db/network-performances
Once we have investigated the out-of-the-ordinary patterns, we will be able to add alerting on those graphs so that we are notified when something needs our attention.
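To make the kind of signal discussed here concrete, below is a rough Python sketch of a TCP retransmit-ratio check built from two samples of Linux `/proc/net/snmp`-style counters. It is not how the Grafana dashboards or alerts were built. The function names, the sample counter values, and the 1% threshold are all hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: the names, sample data and threshold are assumptions,
# not taken from the dashboards referenced in this task.

def parse_proc_net_snmp(text):
    """Parse /proc/net/snmp-style text into {protocol: {counter: int}}."""
    stats = {}
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    # Counters come in line pairs: a header line of names, then a line of values.
    for header, values in zip(lines[0::2], lines[1::2]):
        proto = header[0].rstrip(":")
        stats[proto] = {name: int(val) for name, val in zip(header[1:], values[1:])}
    return stats


def retransmit_ratio(before, after):
    """Fraction of TCP segments sent between two samples that were retransmits."""
    sent = after["Tcp"]["OutSegs"] - before["Tcp"]["OutSegs"]
    retrans = after["Tcp"]["RetransSegs"] - before["Tcp"]["RetransSegs"]
    return retrans / sent if sent > 0 else 0.0


def needs_attention(ratio, threshold=0.01):
    """Hypothetical alert rule: flag when more than 1% of segments are retransmits."""
    return ratio > threshold


# Made-up samples, trimmed to three counters for readability.
SAMPLE_BEFORE = "Tcp: InSegs OutSegs RetransSegs\nTcp: 1000 2000 10\n"
SAMPLE_AFTER = "Tcp: InSegs OutSegs RetransSegs\nTcp: 1500 3000 40\n"

if __name__ == "__main__":
    ratio = retransmit_ratio(
        parse_proc_net_snmp(SAMPLE_BEFORE), parse_proc_net_snmp(SAMPLE_AFTER)
    )
    print(f"retransmit ratio: {ratio:.3f}, needs attention: {needs_attention(ratio)}")
```

On a real host the two samples would be reads of `/proc/net/snmp` taken some interval apart. A sensible threshold depends on what the graphs show as normal, which is why the baseline investigation comes first.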
Alerts have been added to the dashboard (they are not tied to Nagios, but they show up in the "single pane of glass" dashboard in LibreNMS).
I think that satisfies the initial request.
More graphs/alerts will be added when needed.