
IPv6 packet loss registered by the Ripe Atlas anchor in eqsin
Closed, Resolved · Public · 0 Story Points

Description

Fork of T227967, in which we are investigating a problem with the mr1-oob interface. Many probes (currently 95 of 437) are registering packet loss for the Atlas anchor in eqsin.

Icinga reported the issue at around 2019-07-13T07:30 (UTC).

Anchor port: https://librenms.wikimedia.org/device/device=163/tab=port/port=15447/
Anchor link: https://atlas.ripe.net/measurements/11645088/#!general

Event Timeline

elukey triaged this task as High priority. Jul 15 2019, 8:10 AM
elukey created this task.
Restricted Application added a subscriber: Aklapper. Jul 15 2019, 8:10 AM
elukey updated the task description. Jul 15 2019, 8:13 AM

From my home IPv6 address (first hops removed):

[..]
  6. AS6939   100ge9-2.core1.par2.he.net              0.0%    10   46.0  49.6  40.9  67.5   9.2
  7. AS6939   100ge6-1.core1.mrs1.he.net              0.0%    10   46.3  53.0  46.3  89.6  13.4
  8. AS6939   100ge14-2.core1.sin1.he.net             0.0%    10  184.5 184.2 183.0 184.8   0.6
  9. AS???    14907.sgw.equinix.com                   0.0%    10  184.3 184.4 183.8 185.2   0.5
 10. AS14907  ripe-atlas-eqsin.wikimedia.org         50.0%    10  184.3 184.7 183.9 185.2   0.6

Then if I try to do the reverse from bast5001:

HOST: bast5001                                       Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. AS14907  2001:df2:e500:1:fe00::1                 0.0%    10    0.3   0.2   0.2   0.3   0.0
  2. AS???    6939.sgw.equinix.com                   80.0%    10    0.3   0.2   0.2   0.3   0.0
  3. AS6939   100ge11-1.core1.mrs1.he.net            70.0%    10  146.0 140.5 137.7 146.0   4.7
  4. AS6939   100ge6-1.core1.mil2.he.net             30.0%    10  144.2 144.2 144.2 144.4   0.0
  5. AS6939   100ge5-2.core1.zrh2.he.net             60.0%    10  148.3 148.4 148.3 148.5   0.0
  6. AS6939   100ge15-1.core1.fra1.he.net            80.0%    10  154.0 154.0 154.0 154.1   0.0
[..]

I checked mtr from all bastions to the anchor and they don't show anything weird. I am wondering if there is a problem in Equinix Singapore?
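For anyone who wants to scan reports like the two above automatically, here is a minimal sketch in Python. It assumes the column layout shown in this task (hop number, AS, host, Loss%, Snt, ...); the regex and the `lossy_hops` helper are illustrative names, not part of mtr or any existing tooling. Keep in mind that loss on an intermediate hop alone can just be ICMP rate limiting; it matters most when it persists through to the later hops or the destination.

```python
import re

# One hop line of an mtr report with AS lookup enabled, e.g.
# " 10. AS14907  ripe-atlas-eqsin.wikimedia.org  50.0%  10  184.3 ..."
# (layout assumed from the output pasted in this task)
HOP_RE = re.compile(
    r"^\s*(?P<hop>\d+)\.\s+(?P<asn>AS\S+)\s+(?P<host>\S+)\s+"
    r"(?P<loss>\d+(?:\.\d+)?)%\s+(?P<sent>\d+)"
)


def lossy_hops(report, threshold=0.0):
    """Return (hop, asn, host, loss%) for each hop whose loss exceeds threshold."""
    hops = []
    for line in report.splitlines():
        match = HOP_RE.match(line)
        if match is None:
            continue  # header lines, "[..]" markers, blank lines
        loss = float(match.group("loss"))
        if loss > threshold:
            hops.append(
                (int(match.group("hop")), match.group("asn"),
                 match.group("host"), loss)
            )
    return hops


# Last three hops of the forward trace, copied from the comment above.
forward = """\
  8. AS6939   100ge14-2.core1.sin1.he.net             0.0%    10  184.5 184.2 183.0 184.8   0.6
  9. AS???    14907.sgw.equinix.com                   0.0%    10  184.3 184.4 183.8 185.2   0.5
 10. AS14907  ripe-atlas-eqsin.wikimedia.org         50.0%    10  184.3 184.7 183.9 185.2   0.6
"""

for hop in lossy_hops(forward):
    print(hop)
```

On the forward trace only the final hop (the anchor itself) is flagged, which matches what the pasted output shows.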

Mentioned in SAL (#wikimedia-operations) [2019-07-15T20:30:12Z] <XioNoX> deactivate HE peering in eqsin - T228015

Mentioned in SAL (#wikimedia-operations) [2019-07-15T20:47:49Z] <XioNoX> add as-path HE ".* 6939 .*" to AVOID-PATH in eqsin - T228015

Seems like HE in eqsin is having a bad time.
I depref'd all AS paths that go through HE and the packet loss stopped.
Emailed HE's NOC.
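To illustrate what the `".* 6939 .*"` as-path match is meant to select, here is a rough Python sketch. It is not the router configuration: as far as I understand, the router-side regex works on whole AS numbers rather than characters, so the sketch splits the path into AS numbers and checks membership instead of doing a substring search. The `traverses` helper and the example paths are hypothetical.

```python
HE_ASN = 6939  # Hurricane Electric, as seen in the traces above


def traverses(as_path, asn=HE_ASN):
    """Return True if asn appears anywhere in a space-separated AS path."""
    return str(asn) in as_path.split()


# Hypothetical AS paths, for illustration only.
example_paths = [
    "6939 1299 3356",   # HE is the first hop
    "1299 6939",        # HE is the origin
    "2914 16939 3356",  # contains the digits "6939" but not AS6939
]

for path in example_paths:
    action = "depref" if traverses(path) else "keep"
    print(f"{path!r}: {action}")
```

Comparing whole AS numbers avoids a false match on an unrelated AS whose number merely contains the digits 6939.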

Mentioned in SAL (#wikimedia-operations) [2019-07-15T21:06:44Z] <XioNoX> rollback as-path HE ".* 6939 .*" to AVOID-PATH in eqsin - T228015

ayounsi closed this task as Resolved. Jul 15 2019, 9:09 PM
ayounsi claimed this task.

HE's NOC was very quick to reply and fix the issue.

RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK