Create an alert for output discards on network devices
Closed, Resolved (Public)

Description

While investigating a performance issue for backups between eqiad and codfw (T274234), it was discovered that there were output drops on ToR switches in eqiad, as can be seen here:

cmooney@asw2-c-eqiad> show interfaces xe-2/0/46 detail         
Physical interface: xe-2/0/46, Enabled, Physical link is Up
  Interface index: 918, SNMP ifIndex: 602, Generation: 609
  Description: Core: cr2-eqiad:xe-3/0/2 {#3464}
  Link-level type: Ethernet, MTU: 9192, MRU: 0, Speed: 10Gbps, BPDU Error: None, MAC-REWRITE Error: None, Loopback: Disabled, Source filtering: Disabled, Flow control: Disabled, Media type: Fiber
  Device flags   : Present Running
  Interface flags: SNMP-Traps Internal: 0x4000
  Link flags     : None
  CoS queues     : 12 supported, 12 maximum usable queues
  Hold-times     : Up 0 ms, Down 0 ms
  Current address: 4c:16:fc:fb:9d:72, Hardware address: 4c:16:fc:fb:9c:b1
  Last flapped   : 2020-09-02 09:59:19 UTC (39w6d 08:53 ago)
  Statistics last cleared: 2021-06-08 17:54:19 UTC (00:58:35 ago)
  Traffic statistics:
   Input  bytes  :         352431957036            645678192 bps
   Output bytes  :        1455185482573           3198397656 bps
   Input  packets:            523378678               138276 pps
   Output packets:           1682927702               465949 pps
   IPv6 transit statistics:
    Input  bytes  :                   0
    Output bytes  :                   0
    Input  packets:                   0
    Output packets:                   0
  Egress queues: 12 supported, 5 in use
  Queue counters:       Queued packets  Transmitted packets      Dropped packets
    0                                0           1683660591               335031
    3                                0                    0                    0
    4                                0                    0                    0
    7                                0                 4870                    0
    8                                0                20654                    0
  Queue number:         Mapped forwarding classes
    0                   best-effort
    3                   fcoe
    4                   no-loss
    7                   network-control
    8                   mcast

These manifest in the SNMP metrics as Output Discards:

https://librenms.wikimedia.org/graphs/to=1623178200/id=15215/type=port_errors/from=1620499800/
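To put the queue 0 counters above in perspective, here is a rough back-of-the-envelope calculation. It assumes that offered packets equal transmitted plus dropped, that drops were spread evenly over the window since the statistics were last cleared, and that SNMP polls run every 5 minutes; the function names are illustrative only.

```python
# Queue 0 (best-effort) counters from the xe-2/0/46 output above.
transmitted_packets = 1_683_660_591
dropped_packets = 335_031

# "Statistics last cleared ... (00:58:35 ago)" gives the observation window.
window_seconds = 58 * 60 + 35

POLL_INTERVAL_SECONDS = 300  # assumption: 5-minute SNMP polling interval


def drop_fraction(dropped, transmitted):
    """Share of offered packets that were dropped (offered = sent + dropped)."""
    return dropped / (dropped + transmitted)


def average_drops_per_poll(dropped, window, poll_interval=POLL_INTERVAL_SECONDS):
    """Average number of drops that would fall into one polling interval."""
    return dropped / window * poll_interval


fraction = drop_fraction(dropped_packets, transmitted_packets)
per_poll = average_drops_per_poll(dropped_packets, window_seconds)
print(f"{fraction:.4%} of queue 0 packets dropped")
print(f"~{per_poll:.0f} drops per {POLL_INTERVAL_SECONDS}s poll on average")
```

The averaged figure hides bursts, so individual polling intervals may see far more or far fewer discards than this.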

We should probably alert if we see a lot of these. Creating this task to track progress.

Event Timeline

LibreNMS doesn't expose ifOutDiscards in its alert criteria so I had to write a custom SQL alert.

SELECT DISTINCT devices.hostname
FROM devices
JOIN ports ON ports.device_id = devices.device_id
JOIN ports_statistics ON ports_statistics.port_id = ports.port_id
WHERE ports.port_descr_type = 'core'
  AND ports_statistics.ifOutDiscards_delta != 0;

Returns:

cloudsw1-c8-eqiad
asw2-c-eqiad
asw2-d-eqiad
asw2-a-eqiad
asw2-b-eqiad

I disabled the LibreNMS alert for now until we need it.

The condition ports_statistics.ifOutDiscards_delta != 0 (the counter increment between two SNMP polls, usually 5 minutes apart) will also need to be tuned. The threshold should be high enough not to alert on exceptional, brief discards, yet low enough to catch problematic links, and chosen so that the alert doesn't flap.
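As a minimal sketch of what such a delta and threshold check could look like outside LibreNMS: ifOutDiscards is a Counter32 in IF-MIB, so the increment is taken modulo 2^32 to tolerate a single wrap. Requiring several consecutive polls above the threshold is one possible way to reduce flapping; it is an assumption for illustration, not how the LibreNMS alert described here works, and both function names are hypothetical.

```python
COUNTER32_MODULUS = 2 ** 32  # ifOutDiscards is a Counter32 in IF-MIB


def discards_delta(previous, current, modulus=COUNTER32_MODULUS):
    """Increment between two polls, tolerating a single counter wrap."""
    return (current - previous) % modulus


def breaches(deltas, threshold, consecutive=2):
    """True if the last `consecutive` deltas all exceed `threshold`.

    Hypothetical anti-flap rule: one brief burst of discards is ignored,
    while a link that keeps discarding over several polls is reported.
    """
    if len(deltas) < consecutive:
        return False
    return all(delta > threshold for delta in deltas[-consecutive:])
```

A larger `consecutive` value trades detection delay for stability: the alert fires later but is less likely to toggle on short bursts.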

Current values for ifOutDiscards_delta:

'asw2-b-eqiad.mgmt.eqiad.wmnet', 'xe-7/0/41', '509'
'asw2-b-eqiad.mgmt.eqiad.wmnet', 'xe-2/0/41', '1601'
'cloudsw1-c8-eqiad.mgmt.eqiad.wmnet', 'xe-0/0/1', '113563'
'cloudsw1-c8-eqiad.mgmt.eqiad.wmnet', 'xe-0/0/2', '88593'

asw2-b-eqiad has the proper buffer values, while cloudsw1-c8-eqiad doesn't yet.

So I'm setting the threshold to 10000 for the asw* devices, and will monitor it for a bit.
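For illustration, a variation of the query above with the 10000 threshold restricted to asw* hosts can be tried against an in-memory SQLite database. The table definitions are a minimal stand-in modelling only the columns the query touches, the sample ports are assumed to be of type "core", and `alerting_hosts` is a hypothetical helper; the real alert runs inside LibreNMS against its own database.

```python
import sqlite3

# Minimal stand-in for the three LibreNMS tables used by the query;
# only the columns the query needs are modelled (an assumption).
SCHEMA = """
CREATE TABLE devices (device_id INTEGER PRIMARY KEY, hostname TEXT);
CREATE TABLE ports (port_id INTEGER PRIMARY KEY, device_id INTEGER,
                    ifName TEXT, port_descr_type TEXT);
CREATE TABLE ports_statistics (port_id INTEGER PRIMARY KEY,
                               ifOutDiscards_delta INTEGER);
"""

# ifOutDiscards_delta values quoted in this task.
SAMPLES = [
    ("asw2-b-eqiad.mgmt.eqiad.wmnet", "xe-7/0/41", 509),
    ("asw2-b-eqiad.mgmt.eqiad.wmnet", "xe-2/0/41", 1601),
    ("cloudsw1-c8-eqiad.mgmt.eqiad.wmnet", "xe-0/0/1", 113563),
    ("cloudsw1-c8-eqiad.mgmt.eqiad.wmnet", "xe-0/0/2", 88593),
]

ALERT_QUERY = """
SELECT DISTINCT devices.hostname
FROM devices
JOIN ports ON ports.device_id = devices.device_id
JOIN ports_statistics ON ports_statistics.port_id = ports.port_id
WHERE ports.port_descr_type = 'core'
  AND devices.hostname LIKE 'asw%'
  AND ports_statistics.ifOutDiscards_delta > ?
"""


def alerting_hosts(rows, threshold=10000):
    """Load sample rows into SQLite and return asw* hosts above threshold."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    device_ids = {}
    for port_id, (hostname, ifname, delta) in enumerate(rows, start=1):
        if hostname not in device_ids:
            device_ids[hostname] = len(device_ids) + 1
            conn.execute("INSERT INTO devices VALUES (?, ?)",
                         (device_ids[hostname], hostname))
        # Assumption: every sample port is tagged as a "core" port.
        conn.execute("INSERT INTO ports VALUES (?, ?, ?, 'core')",
                     (port_id, device_ids[hostname], ifname))
        conn.execute("INSERT INTO ports_statistics VALUES (?, ?)",
                     (port_id, delta))
    hosts = sorted(row[0] for row in conn.execute(ALERT_QUERY, (threshold,)))
    conn.close()
    return hosts


print(alerting_hosts(SAMPLES))
```

With the sample values, neither asw2-b-eqiad port exceeds 10000, and the cloudsw1-c8-eqiad ports are excluded by the hostname filter regardless of their deltas.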

cmooney triaged this task as Medium priority. Aug 27 2021, 7:38 AM
joanna_borun changed the task status from Open to In Progress. Sep 21 2021, 4:00 PM

This is now set to alert the NOC through Alertmanager.

Added a quick mention in https://wikitech.wikimedia.org/wiki/Network_monitoring#Outbound_discards as well.

We can lower the thresholds once we reduce the number of discards.