With the introduction of QoS profiling for some traffic flows across the network, we now have packets transmitted in our "low priority" class. The biggest use of this currently is within the cloud network for their Ceph traffic, but this will likely also be replicated for production analytics and other bulk-data flows (for instance T381389).
In this scenario we might want to reconsider how we alert on link usage, as we can now max out a link's bandwidth without it mattering as much, because no "important" traffic is being dropped. Paging everyone if we see high usage for a short time (perhaps due to a large analytics job or a Ceph cluster rebalance) might not make sense in that scenario.
Overall it's a tricky balance. What I think we can possibly do for now is:
- Leave the LibreNMS alert in place
- Introduce a new alertmanager rule, based on our drop counters, which pages if we have significant drops in our "normal" or "high" categories
  - This is indicative of link saturation, but in some ways a more accurate measure of issues
  - We need to decide whether to look at just tail drops or also include RED drops
- Introduce a new alertmanager rule, based on the overall link usage, that alerts if we are saturating a link for an extended period
  - Hard to say what the right window for this is
  - But basically even if there is mostly "low" profile traffic we don't want links that are permanently running hot
  - Unsure if this should page or not
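
To make the two proposed conditions concrete, here is a rough Python sketch of the decision logic only. It is not the alertmanager rule itself, and everything in it is an assumption for illustration: the names (`LinkSample`, `evaluate`, the alert labels), the thresholds, and the 12-sample window are all hypothetical placeholders, and whether `drops_pps` counts only tail drops or also RED drops is left open, as above.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Sequence

# Illustrative placeholders only; none of these values have been agreed.
DROP_RATE_THRESHOLD_PPS = 100.0        # hypothetical "significant drops" level
SATURATION_FRACTION = 0.9              # hypothetical "running hot" level
PROTECTED_CLASSES = ("normal", "high")  # classes whose drops we care about


@dataclass
class LinkSample:
    """One polling interval for a single link (hypothetical structure)."""
    utilisation: float            # fraction of link capacity in use, 0.0 to 1.0
    drops_pps: dict[str, float]   # dropped packets per second, keyed by QoS class


def protected_drops(sample: LinkSample) -> bool:
    """True if any protected class is dropping above the threshold."""
    return any(
        sample.drops_pps.get(qos_class, 0.0) > DROP_RATE_THRESHOLD_PPS
        for qos_class in PROTECTED_CLASSES
    )


def sustained_saturation(samples: Sequence[LinkSample], min_samples: int) -> bool:
    """True if the most recent `min_samples` samples are all at or above saturation."""
    if min_samples < 1 or len(samples) < min_samples:
        return False
    return all(s.utilisation >= SATURATION_FRACTION for s in samples[-min_samples:])


def evaluate(samples: Sequence[LinkSample], min_samples: int = 12) -> list[str]:
    """Return the names of the conditions that currently hold for this link."""
    alerts = []
    if samples and protected_drops(samples[-1]):
        alerts.append("protected_class_drops")
    if sustained_saturation(samples, min_samples):
        alerts.append("sustained_saturation")
    return alerts


if __name__ == "__main__":
    # A short burst of low-priority traffic filling the link: nothing fires.
    burst = [LinkSample(1.0, {"low": 5000.0})] * 3
    print(evaluate(burst))
```

The point of the split is that a short low-priority burst triggers neither condition, drops in "normal" or "high" trigger the first one regardless of utilisation, and only a link that stays hot for the whole window triggers the second.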
Once we get some confidence in how the new rules are working we can change how we process the LibreNMS rule and not have it page. Interested to hear other views here too.