
mr1-eqsin.oob IPv6 connectivity flapping
Closed, Resolved | Public | 0 Estimated Story Points

Description

Icinga has been reporting the following alarm flapping since around Jul 13th:

PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%

At the same time, the ripe-atlas-eqsin IPv6 alert has fired too (not flapping though).

From icinga1001 and bast4002, IPv6 pings do show very heavy packet loss to mr1's oob IP, but not to the RIPE anchor's IP.
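The comparison above (same source hosts, two different IPv6 targets) can be repeated with a small script. This is only a sketch, assuming a Linux host with an iputils-style `ping` that accepts `-6` and prints the usual "% packet loss" summary line; the helper names `parse_packet_loss` and `probe` are illustrative, and the targets are taken from the command line rather than hard-coded.

```python
import re
import subprocess
import sys

# Matches the summary line printed by iputils ping, e.g.
# "10 packets transmitted, 0 received, 100% packet loss, time 9208ms"
LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")


def parse_packet_loss(ping_output):
    """Return the packet loss percentage as a float, or None if no summary is found."""
    match = LOSS_RE.search(ping_output)
    return float(match.group(1)) if match else None


def probe(target, count=10):
    """Ping `target` over IPv6 and return the reported packet loss (assumes `ping -6`)."""
    proc = subprocess.run(
        ["ping", "-6", "-c", str(count), target],
        capture_output=True,
        text=True,
        check=False,  # ping exits non-zero on 100% loss; that is not an error here
    )
    return parse_packet_loss(proc.stdout)


if __name__ == "__main__":
    # Usage: python3 probe.py <target> [<target> ...]
    for target in sys.argv[1:]:
        print(f"{target}: {probe(target)}% packet loss")
```

Running it from both icinga1001 and bast4002 against the two addresses would show whether the loss is specific to one target or shared by both.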

Event Timeline

elukey triaged this task as High priority. Jul 14 2019, 8:05 AM

Mentioned in SAL (#wikimedia-operations) [2019-07-14T13:18:39Z] <godog> silence mr1-eqsin.oob IPv6 until tomorrow 8 UTC - T227967

Thanks, email sent to Equinix NOC.
So far I don't think there is a link between the ripe alerts and the oob alerts.

Mentioned in SAL (#wikimedia-operations) [2019-07-15T17:34:18Z] <cdanis> downtime mr1-eqsin.oob IPv6 for 20h T227967

> So far I don't think there is a link between the ripe alerts and the oob alerts.

Well, it seems they are linked after all: the return path from mr1 -> icinga1001 goes through HE. There is nothing we can do there though, as Equinix controls that path.

/cc T228015

Seems like fixing T228015 fixed that issue as well.

Marostegui subscribed.

This has been flapping overnight (times in UTC+2):

[02:46:49]  <+icinga-wm>	PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[03:45:47]  <+icinga-wm>	RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING WARNING - Packet loss = 93%, RTA = 214.58 ms
[03:53:35]  <+icinga-wm>	PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[04:05:19]  <+icinga-wm>	RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING WARNING - Packet loss = 93%, RTA = 214.50 ms
[04:13:07]  <+icinga-wm>	PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[04:19:01]  <+icinga-wm>	RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING WARNING - Packet loss = 80%, RTA = 214.67 ms
[04:26:47]  <+icinga-wm>	PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[05:13:55]  <+icinga-wm>	RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING WARNING - Packet loss = 93%, RTA = 217.54 ms
[05:21:41]  <+icinga-wm>	PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[06:14:43]  <+icinga-wm>	RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING WARNING - Packet loss = 86%, RTA = 215.02 ms
[06:22:33]  <+icinga-wm>	PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
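To put numbers on the flapping, the log lines above can be paired into DOWN windows. The following is a minimal sketch, assuming every line follows the `[HH:MM:SS]  <+icinga-wm>` format shown here and that all timestamps fall on the same day (no midnight wrap); `down_windows` is an illustrative helper, not an existing tool.

```python
import re
from datetime import datetime

# Matches lines such as:
# [02:46:49]  <+icinga-wm>\tPROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - ...
LINE_RE = re.compile(
    r"^\[(\d{2}:\d{2}:\d{2})\]\s+<\+?icinga-wm>\s+(PROBLEM|RECOVERY) - Host (.+?) is (?:DOWN|UP):"
)

# First three lines of the log above, used as sample input.
SAMPLE = [
    "[02:46:49]  <+icinga-wm>\tPROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%",
    "[03:45:47]  <+icinga-wm>\tRECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING WARNING - Packet loss = 93%, RTA = 214.58 ms",
    "[03:53:35]  <+icinga-wm>\tPROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%",
]


def down_windows(lines):
    """Pair each PROBLEM with the next RECOVERY.

    Returns (windows, still_down): a list of (start, end) datetimes for closed
    windows, and the start of an unclosed PROBLEM (or None).
    """
    windows = []
    down_since = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # ignore anything that is not an icinga-wm host alert
        stamp = datetime.strptime(match.group(1), "%H:%M:%S")
        kind = match.group(2)
        if kind == "PROBLEM" and down_since is None:
            down_since = stamp
        elif kind == "RECOVERY" and down_since is not None:
            windows.append((down_since, stamp))
            down_since = None
    return windows, down_since


if __name__ == "__main__":
    windows, still_down = down_windows(SAMPLE)
    total = sum((end - start).total_seconds() for start, end in windows)
    print(f"{len(windows)} closed window(s), {int(total)} s DOWN in total")
    if still_down is not None:
        print(f"still DOWN since {still_down:%H:%M:%S}")
```

Run over all eleven lines above, this should report five closed windows totalling 10606 seconds (about 2 h 57 min), with the 06:22:33 PROBLEM still open. Note that each RECOVERY still shows 80-93% packet loss, so the host was barely reachable even while considered UP.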

The RIPE Atlas IPv6 alert is in CRITICAL state again as well.

Related to an HE issue, see T228015.