Improve port-utilisation alerting to take QoS into account
Closed, Resolved (Public)

Description

With the introduction of QoS profiling to some traffic flows across the network we now have packets transmitted in our "low priority" class. The biggest use of this currently is within the cloud network for their ceph traffic, but this will likely also be replicated for production analytics and other bulk-data flows (for instance T381389).

In this scenario we might want to reconsider how we alert on link usage. We can now max out a link's bandwidth without it mattering much, because no "important" traffic is being dropped. Paging everyone when we see high usage for a short time (perhaps due to a large analytics job or a ceph cluster rebalance) might not make sense in that scenario.

Overall it's a tricky balance. What I think we can possibly do for now is:

  • Leave the LibreNMS alert in place
  • Introduce a new alertmanager rule, based on our drop counters, which pages if we have significant drops in our "normal" or "high" categories
    • This is indicative of link saturation, but in some ways a more accurate measure of issues
    • We need to decide whether to look at just tail drops or also include RED drops
  • Introduce a new alertmanager rule, based on the overall link usage, that pages if we are saturating a link for an extended period
    • Hard to say what the right window for this is
    • But basically even if there is mostly "low" profile traffic we don't want links that are permanently running hot
    • Unsure if this should page or not

Once we get some confidence in how the new rules are working we can change how we process the LibreNMS rule and not have it page. Interested to hear other views here too.

Event Timeline

cmooney triaged this task as Medium priority.
cmooney renamed this task from Migrate port utilisation alert from LibreNMS to alertmanage to Migrate port utilisation alert from LibreNMS to alertmanager. Jan 17 2025, 5:39 PM

Looks all good to me!

First start with non-paging, and revisit later on.

I'm wondering if we could re-write the "instance" in Prometheus to match the server name (for example from the interface description). This would help pinpoint more rapidly where the saturation is coming from (if internal).

SGTM too. Re: extracting the hostname from the interface description, we could do it via regexp if the pattern is stable enough. Even better, of course, if we can get the (LLDP?) peer from some other metric that we could then join on the interface itself.

This is somewhat similar to the discussion in T384731#10511648. Tbh I think the "instance" should remain the device providing the metrics in all cases. It gets very confusing if we redefine that.

We can return the interface description in the alert text, which will give that info to users.

Change #1128429 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/alerts@master] Add transit/peering in/out port saturation alert

https://gerrit.wikimedia.org/r/1128429

Change #1128429 merged by jenkins-bot:

[operations/alerts@master] Add transit/peering in/out port saturation alert

https://gerrit.wikimedia.org/r/1128429

Change #1130625 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/alerts@master] Add transit/peering in/out port saturation alert - try 2

https://gerrit.wikimedia.org/r/1130625

Change #1130632 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/alerts@master] Promote some network alerts from warning to critical

https://gerrit.wikimedia.org/r/1130632

Change #1130965 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/alerts@master] Add "scope: network" to network related alerts

https://gerrit.wikimedia.org/r/1130965

Change #1130632 merged by jenkins-bot:

[operations/alerts@master] Promote some network alerts from warning to critical

https://gerrit.wikimedia.org/r/1130632

Change #1130965 merged by jenkins-bot:

[operations/alerts@master] Add "scope: network" to network related alerts

https://gerrit.wikimedia.org/r/1130965

Change #1130625 merged by jenkins-bot:

[operations/alerts@master] Add transit/peering in/out port saturation alert - try 2

https://gerrit.wikimedia.org/r/1130625

cmooney renamed this task from Migrate port utilisation alert from LibreNMS to alertmanager to Improve port-utilisation alerting to take QoS into account. Apr 3 2025, 10:45 AM

This has come up again in terms of the pages we have been getting of late, and we may take some action to change our QoS profiling across the WAN as a result.

Very basically what we might want to do is something like:

gnmi_interfaces_interface_state_counters_out_queue_tail_drop_pkts{out_queue_queue_number!="1"} / 
gnmi_interfaces_interface_state_counters_out_queue_pkts{out_queue_queue_number!="1"}

This effectively gives us the ratio of packets the switch dropped versus transmitted, ignoring "queue 1", i.e. the low priority stuff. The idea being that we'd only alert/page if saturation got so bad we had large numbers of drops in the higher-priority queues.
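To make the arithmetic behind that ratio concrete, here is a minimal Python sketch. It is not the rule itself: the `QueueSample` type and `drop_ratio` function are hypothetical names for illustration, and it assumes the two metrics are cumulative counters, so it works on the difference between two scrapes (in PromQL that would normally mean wrapping each side in rate()). It also sums across the non-low queues, whereas the expression above yields one ratio per queue.

```python
from dataclasses import dataclass

# Queue number the ticket treats as "low priority" (from the expression above).
LOW_PRIORITY_QUEUE = "1"


@dataclass(frozen=True)
class QueueSample:
    """One scrape of a single output queue (hypothetical type, for illustration)."""
    queue_number: str
    tail_drop_pkts: int  # cumulative tail-drop counter
    tx_pkts: int         # cumulative transmitted-packet counter


def drop_ratio(before, after):
    """Dropped / transmitted packets between two scrapes, ignoring the low queue."""
    previous = {s.queue_number: s for s in before}
    dropped = 0
    sent = 0
    for current in after:
        if current.queue_number == LOW_PRIORITY_QUEUE:
            continue  # drops in the low-priority queue are expected, skip them
        old = previous.get(current.queue_number)
        if old is None:
            continue  # no earlier sample for this queue, so no delta to compute
        dropped += current.tail_drop_pkts - old.tail_drop_pkts
        sent += current.tx_pkts - old.tx_pkts
    # Avoid dividing by zero when nothing was transmitted in the interval.
    return dropped / sent if sent > 0 else 0.0


if __name__ == "__main__":
    before = [QueueSample("1", 0, 0), QueueSample("2", 0, 0)]
    after = [QueueSample("1", 500_000, 1_000_000), QueueSample("2", 20_000, 800_000)]
    print(f"{drop_ratio(before, after):.2%}")
```

The heavy drops in queue 1 do not move the result at all, which is the point: only drops in the higher-priority queues count towards the alert.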

What an appropriate threshold is here is the question, however. If I go to the dashboard below and zoom in to the time we had significant drops, the "drop percent" (effectively a more complex version of the above) is around 2.51% at max (~20kpps dropped out of ~800kpps sent, i.e. 20/800 ≈ 2.5%).

https://grafana.wikimedia.org/goto/YOk1qBMDg

In terms of implementing something like this in Alertmanager, it might be good if we could run any potential rule against historical data and work out when in the past it would have fired.

We can set the rule now as non-paging to start collecting data and test it, so we can gain trust in it before flipping it to paging.

If you already have all the history for the expression's metrics, then evaluating it on, say, https://prometheus-eqiad.wikimedia.org/ops will give you a preview: whenever there is data, that's when the alert would have fired (modulo the "for" clause). This is in addition to what @ayounsi said re: starting as non-paging and then graduating to paging.
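The "modulo the for clause" part can be approximated offline. The sketch below is a rough model, not how Prometheus itself evaluates rules: it assumes you have exported the timestamps at which the thresholded expression returned data from a range query with a fixed step, and the `firing_intervals` name is made up for illustration. It treats a gap larger than one step as the condition clearing, and reports only the portion of each run after the hold time has elapsed.

```python
def firing_intervals(true_timestamps, step, for_seconds):
    """Approximate when an alert with a hold time would have been firing.

    true_timestamps: Unix timestamps (seconds) where the expression returned data.
    step:            spacing between range-query samples, in seconds.
    for_seconds:     how long the condition must hold before the alert fires.

    Returns a list of (fire_start, fire_end) tuples.
    """
    intervals = []
    run_start = None
    previous = None
    for ts in sorted(true_timestamps):
        if previous is None or ts - previous > step:
            # A gap means the condition cleared: close the previous run, if any.
            if run_start is not None and previous - run_start >= for_seconds:
                intervals.append((run_start + for_seconds, previous))
            run_start = ts
        previous = ts
    # Close the final run.
    if run_start is not None and previous - run_start >= for_seconds:
        intervals.append((run_start + for_seconds, previous))
    return intervals


if __name__ == "__main__":
    # Eleven minutes of continuous data, then a two-sample blip much later.
    samples = list(range(0, 661, 60)) + [1200, 1260]
    print(firing_intervals(samples, step=60, for_seconds=300))
```

Short blips drop out of the result entirely, so counting the returned intervals gives a rough idea of how many times the rule would have fired over the period queried.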

Yeah normally that is definitely the way to do it.

Thanks, yeah, maybe I should create some rules and start trying to test them there. I may need some pointers. As the current ask is about how we can reduce overall paging alerts over the break, if we did decide to do something we'd need to assess whether it made sense based on historical data.

Change #1219852 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/alerts@master] team-netops: add rule for packet drops in higher-priority queues

https://gerrit.wikimedia.org/r/1219852

Change #1219852 merged by jenkins-bot:

[operations/alerts@master] team-netops: add rule for packet drops in higher-priority queues

https://gerrit.wikimedia.org/r/1219852

Change #1296520 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/alerts@master] netops: set CR packet drop alert to paging and up timer on saturation

https://gerrit.wikimedia.org/r/1296520

Change #1296520 merged by jenkins-bot:

[operations/alerts@master] netops: set CR packet drop alert to paging and up timer on saturation

https://gerrit.wikimedia.org/r/1296520

cmooney claimed this task.

Gonna close this one. The alert is in place and fires when we hit 2% drops (versus transmitted packets) in queues other than "low". It fires after 5 mins; the regular link saturation alert will still kick in after 12 mins.

Based on the past few months this would have fired three times in total, and in all cases the link saturation alert would also have fired. We can tweak the thresholds and delays as we move forward (as well as adjust which traffic flows we move to 'low' priority and are thus ignored by this).