
Slight packet loss observed on the network starting Nov 2016
Closed, Resolved, Public


Smokeping has alerted about increased packet loss lately for a selection of hosts/devices, e.g. bast3001 / cr1-eqdfw / asw-a-codfw.
The loss is minimal and infrequent, but sometimes enough to trigger alerts. It is also evident on the yearly graphs from smokeping.

From a quick look it seems ulsfo isn't affected, but eqdfw / codfw / esams are. I took a closer look at codfw and, smokeping-wise, the core routers are not experiencing loss. However, access switches asw-a-codfw and asw-c-codfw show up as lossy, while asw-b-codfw and asw-d-codfw do not.

Event Timeline

Ottomata triaged this task as Medium priority. Mar 6 2017, 6:43 PM
14:49  <elukey> not sure if this makes any sense but I did the following
14:49  <elukey> mtr from netmon1001
14:50  <elukey> (one of the targets of smokeping showing loss)
14:50  <elukey> followed the path on cr2 and checked the phy interface statistics
14:50  <elukey> first ae2, then xe-3/2/3
14:51  <elukey> that shows something like
14:51  <elukey>   Queue counters:       Queued packets  Transmitted packets      Dropped packets
14:51  <elukey>     0                    2247327783933        2247327741023                42910
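
To put the queue counters pasted above in perspective, the drop fraction can be computed directly from the three numbers. This is only a minimal sketch in Python; the helper name `drop_ratio` is made up for illustration and is not part of any tool mentioned in this task.

```python
def drop_ratio(queued: int, dropped: int) -> float:
    """Return the fraction of queued packets that were dropped."""
    if queued <= 0:
        raise ValueError("queued must be positive")
    return dropped / queued


# Values copied from the queue 0 counters pasted above.
queued = 2247327783933
transmitted = 2247327741023
dropped = 42910

# Sanity check: the counters are self-consistent.
assert queued - transmitted == dropped

ratio = drop_ratio(queued, dropped)
print(f"drop ratio for queue 0: {ratio:.3e}")
```

For these counters the ratio is on the order of 2 dropped packets per 100 million queued, counted since the counters were last cleared. On its own that says nothing about when the drops happened, so it cannot confirm or rule out this interface as the source of the smokeping loss.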
ayounsi claimed this task.

XioNoX> I'm secretly hoping that T154507 was caused by T162199, it's on the path, and the LACP hashing algorithm would explain why only some destinations were affected
paravoid> XioNoX: that's a pretty plausible explanation!
and the timeline matches as well
matches pretty accurately too
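
The hashing argument above can be illustrated with a small sketch: if traffic is spread across the member links of an aggregated bundle by hashing packet fields, then a single misbehaving member only affects the flows that happen to hash onto it. Everything below is an assumption for illustration: the hash (SHA-256 over source and destination address), the number of links, and the addresses (documentation ranges) are hypothetical and do not reflect the actual algorithm or topology on the routers discussed here.

```python
import hashlib
import ipaddress


def member_link(src: str, dst: str, n_links: int) -> int:
    """Pick a member link index for a flow (illustrative hash, not a vendor's)."""
    key = ipaddress.ip_address(src).packed + ipaddress.ip_address(dst).packed
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_links


def affected_destinations(src: str, dsts: list[str], n_links: int, bad_link: int) -> list[str]:
    """Return the destinations whose flows from src land on the bad member link."""
    return [d for d in dsts if member_link(src, d, n_links) == bad_link]


# Hypothetical monitoring source and targets (RFC 5737 documentation addresses).
source = "198.51.100.1"
targets = [f"192.0.2.{i}" for i in range(1, 41)]

# Assume a 4-member bundle where link 2 is the one dropping packets.
lossy = affected_destinations(source, targets, n_links=4, bad_link=2)
print(f"{len(lossy)} of {len(targets)} targets cross the bad link")
```

Because the mapping is deterministic per flow, the same subset of targets would show loss every time, which is consistent with smokeping flagging some devices and never others.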

It's still recent, but so far no more packet loss in smokeping.
Closing the ticket; don't hesitate to reopen it if the symptoms are still there.

This is great to see and a very good catch. Nice work @ayounsi!

Indeed, thanks a lot @ayounsi for fixing this long-standing issue!