Slight packet loss observed on the network starting Nov 2016
Closed, Resolved (Public)

Description

Smokeping has alerted about increased packet loss lately for a selection of hosts/devices, e.g. bast3001 / cr1-eqdfw / asw-a-codfw.
The loss is minimal and infrequent, but sometimes enough to trigger alerts. It is also evident on the yearly graphs from smokeping.

From a quick look it seems ulsfo isn't affected but eqdfw / codfw / esams are. I took a closer look at codfw and, smokeping-wise, the core routers are not experiencing loss. However, the access switches asw-a-codfw and asw-c-codfw show up as lossy, while asw-b-codfw and asw-d-codfw do not.

Event Timeline

Ottomata triaged this task as Medium priority. Mar 6 2017, 6:43 PM
14:49  <elukey> not sure if this makes any sense but I did the following
14:49  <elukey> mtr tegmen.wikimedia.org from netmon1001
14:50  <elukey> (one of the targets of smokeping showing loss)
14:50  <elukey> followed the path on cr2 and checked the phy interface statistics
14:50  <elukey> first ae2, then xe-3/2/3
14:51  <elukey> that shows something like
14:51  <elukey>   Queue counters:       Queued packets  Transmitted packets      Dropped packets
14:51  <elukey>     0                    2247327783933        2247327741023                42910
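
To put the queue 0 counters pasted above in perspective, here is a rough sketch that parses that line and works out the drop fraction. The helper names (`QueueCounters`, `parse_queue_line`, `drop_fraction`) are made up for illustration, and it assumes the counters are cumulative since they were last cleared, so the result is a long-run average and says nothing about how bursty the drops were.

```python
from typing import NamedTuple


class QueueCounters(NamedTuple):
    """One row of the 'Queue counters' table (hypothetical helper type)."""
    queue: int
    queued: int
    transmitted: int
    dropped: int


def parse_queue_line(line: str) -> QueueCounters:
    # The row is whitespace-separated: queue number, queued, transmitted, dropped.
    queue, queued, transmitted, dropped = (int(field) for field in line.split())
    return QueueCounters(queue, queued, transmitted, dropped)


def drop_fraction(counters: QueueCounters) -> float:
    # Fraction of queued packets that were dropped over the counter lifetime.
    return counters.dropped / counters.queued


# Values copied from the queue 0 row quoted in the chat log.
row = parse_queue_line(
    "    0                    2247327783933        2247327741023                42910"
)

# Sanity check: queued minus transmitted should equal dropped.
assert row.queued - row.transmitted == row.dropped

print(f"queue {row.queue}: drop fraction {drop_fraction(row):.2e}")
```

For this row the fraction is on the order of 2 in 100 million packets, so taken as a lifetime average these drops look very small.
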
ayounsi claimed this task.

XioNoX> I'm secretly hoping that T154507 was caused by T162199, it's on the path, and the LACP hashing algorithm would explain why only some destinations were affected
paravoid> XioNoX: that's a pretty plausible explanation!
and the timeline matches as well
matches pretty accurately too
https://librenms.wikimedia.org/graphs/to=1491845700/id=1572/type=port_errors/from=1460309700/
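
The hashing argument above can be illustrated with a toy model. This is only a sketch: real devices use vendor-specific hash functions over more header fields, and the CRC32 hash, the member names, and the 192.0.2.x addresses here are all hypothetical stand-ins. The point is just that each flow is pinned to one member link of the bundle, so a single bad member would affect only the destinations whose hash lands on it.

```python
import zlib


def pick_member(src: str, dst: str, members: list[str]) -> str:
    # Toy per-flow hash: the same (src, dst) pair always maps to the same member.
    key = f"{src}|{dst}".encode()
    return members[zlib.crc32(key) % len(members)]


def affected_destinations(
    src: str, dsts: list[str], members: list[str], faulty: str
) -> list[str]:
    # Destinations whose traffic from src is hashed onto the faulty member link.
    return [dst for dst in dsts if pick_member(src, dst, members) == faulty]


# Hypothetical bundle with four member links, one of them assumed to be erroring.
members = ["member-0", "member-1", "member-2", "member-3"]
faulty = "member-2"

# Hypothetical monitoring source and targets (documentation address range).
src = "192.0.2.1"
dsts = [f"192.0.2.{i}" for i in range(10, 26)]

lossy = affected_destinations(src, dsts, members, faulty)
print(f"{len(lossy)} of {len(dsts)} destinations cross {faulty}: {lossy}")
```

In this model a destination either always crosses the faulty member or never does, which would fit the pattern of some smokeping targets showing loss while others on the same path stay clean.
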

The fix is still recent, but so far there is no more packet loss in smokeping.
Closing the ticket; don't hesitate to reopen it if the symptoms are still there.

This is great to see and a very good catch. Nice work @ayounsi!

Indeed, thanks a lot @ayounsi for fixing this long-standing issue!