Transport link saturation not alerting
Closed, Resolved, Public

Description

After some internal discussions with @cmooney and @Vgutierrez, we are looking into why there were no alerts for the recent transport link saturation between magru and eqiad.

The transport link between magru and eqiad (Telxius) is a 10G link, and we were clearly saturating it during a recent incident. (Grafana link, provided by Valentin)

[Image: 2025-11-05-130231_1893x809.png]

In discussion with Cathal, it seems like we are alerting for transit and peering but not for the transport links themselves, as per https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/refs/heads/master/team-netops/interfaces.yaml#49.

expr: |
  (
  irate(gnmi_interfaces_interface_state_counters_out_octets{instance=~"cr.*", interface_description=~"(Transit|Peering).*"}[5m])
  /
  (gnmi_interfaces_interface_state_high_speed{instance=~"cr.*", interface_description=~"(Transit|Peering).*"}/8*1000000)
  ) > 0.9

We should expand this alerting to cover the transport links as well, and it should be a paging alert, like the rest of the rule. Without such an alert, we only learn of saturation indirectly, through purged lag alerts or other traffic patterns, and that may not be ideal.
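To make the threshold concrete, the ratio computed by the rule above can be worked through by hand for a 10G link. This is only a sketch of the arithmetic, not the alerting code: the function names are hypothetical, and it assumes `high_speed` is reported in Mbit/s, as the `/8*1000000` conversion in the rule implies.

```python
# Sketch of the arithmetic behind the saturation rule.
# Assumption: high_speed is in Mbit/s; all function names are illustrative.

def capacity_bytes_per_sec(high_speed_mbps: float) -> float:
    """Mirror the rule's denominator: high_speed / 8 * 1000000."""
    return high_speed_mbps / 8 * 1_000_000


def utilization(out_octets_per_sec: float, high_speed_mbps: float) -> float:
    """Outbound byte rate divided by link capacity in bytes per second."""
    return out_octets_per_sec / capacity_bytes_per_sec(high_speed_mbps)


def is_saturated(out_octets_per_sec: float, high_speed_mbps: float,
                 threshold: float = 0.9) -> bool:
    """True when utilization exceeds the threshold (0.9 in the rule)."""
    return utilization(out_octets_per_sec, high_speed_mbps) > threshold


if __name__ == "__main__":
    link_mbps = 10_000  # a 10G link
    print(capacity_bytes_per_sec(link_mbps))      # 1250000000.0
    print(utilization(1_200_000_000, link_mbps))  # 0.96
    print(is_saturated(1_200_000_000, link_mbps)) # True
```

Under these assumptions a 10G link carries at most 1.25 GB/s, so the 0.9 threshold corresponds to roughly 1.125 GB/s (9 Gbit/s) outbound.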

Event Timeline

Restricted Application added a subscriber: Aklapper.
ssingh triaged this task as High priority. (Nov 5 2025, 6:12 PM)

Thanks for the task, @ssingh!

I agree this is definitely a major gap. As for the Alertmanager rule you list, it does make sense to add another one (or expand it) to also cover transport / private WAN circuits. So we can absolutely do that.

Historically these alerts were triggered for us by LibreNMS. Checking there it's pretty obvious why those are no longer firing - they are turned off!

[Image: image.png]

It's not at all clear to me when or why those were disabled. But in any event I have re-enabled them now, which I think should make alerting work again. This may also explain why my attempts to get LibreNMS to alert at lower-than-line-rate for eqsin didn't work - the damn things were disabled completely!

Ultimately we can move these to Alertmanager, and we can work on what those alerts should look like. We could also potentially base them on dropped outbound packets, taking QoS priority into account (T384052).

Thanks for the update and the fix in LibreNMS! I think that gives us some notifications for now, and given that it is paging, it's a decent stop-gap solution until we move these to Alertmanager.

ayounsi subscribed.

My bad! I turned them off after adding the transit/peering saturation alerts, forgetting about the transport and core links. I'll take care of them.

Change #1206849 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/alerts@master] Outbound saturation: add transport interfaces

https://gerrit.wikimedia.org/r/1206849

Change #1206855 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/alerts@master] Add alerting for core link saturation

https://gerrit.wikimedia.org/r/1206855

Change #1206849 merged by jenkins-bot:

[operations/alerts@master] Outbound saturation: add transport interfaces

https://gerrit.wikimedia.org/r/1206849

Change #1206855 merged by jenkins-bot:

[operations/alerts@master] Add alerting for core link saturation

https://gerrit.wikimedia.org/r/1206855

Paging alerts added. I won't disable the LibreNMS one for now; I'll do that later, once we're sure the new one works fine.