mr1-eqsin down since ~01:50 UTC
Closed, Resolved, Public

Description

At about 01:50 UTC, mr1-eqsin appears to have gone down: it became unreachable from Icinga and LibreNMS, and OSPF status alerts fired for it on cr1-eqsin and cr2-eqsin.
There are some messages from SSH in its syslog right before it disappeared, but they appear to be routine logspam.

01:51:29        <+icinga-wm>    PROBLEM - Host cp5010.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
01:51:37        <+icinga-wm>    PROBLEM - Host asw1-eqsin is DOWN: PING CRITICAL - Packet loss = 100%
01:51:39        <+icinga-wm>    PROBLEM - Host cp5006.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
01:51:39        <+icinga-wm>    PROBLEM - Host mr1-eqsin is DOWN: PING CRITICAL - Packet loss = 100%
01:51:39        <+icinga-wm>    PROBLEM - Host cp5009.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
01:51:39        <+icinga-wm>    PROBLEM - Host cp5005.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
01:51:39        <+icinga-wm>    PROBLEM - Host cp5004.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
01:51:39        <+icinga-wm>    PROBLEM - Host cp5012.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
01:51:45        <+icinga-wm>    PROBLEM - Host cp5011.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
01:51:49        <+icinga-wm>    PROBLEM - Host dns5001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
01:51:49        <+icinga-wm>    PROBLEM - Host lvs5002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
01:51:49        <+icinga-wm>    PROBLEM - Host mr1-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
01:51:49        <+icinga-wm>    PROBLEM - Host lvs5001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
01:51:49        <+icinga-wm>    PROBLEM - Host dns5002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
01:54:13        <+icinga-wm>    PROBLEM - Host cp5008.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
01:54:13        <+icinga-wm>    PROBLEM - Host cp5007.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
01:54:15        <+icinga-wm>    PROBLEM - OSPF status on cr1-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
01:54:17        <+icinga-wm>    PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
01:54:47        <+icinga-wm>    PROBLEM - Host bast5001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
01:54:47        <+icinga-wm>    PROBLEM - Host lvs5003.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
01:56:17        <+icinga-wm>    PROBLEM - Host cp5003.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
01:56:17        <+icinga-wm>    PROBLEM - Host cp5001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
01:56:17        <+icinga-wm>    PROBLEM - Host cp5002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
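To see at a glance which hosts the alerts above cover, the pasted lines can be parsed mechanically. The following is a minimal sketch, assuming the IRC relay format shown in this task; the `HOST_ALERT` pattern and the `parse_host_alerts` helper are hypothetical names for illustration, not part of Icinga or any existing tooling.

```python
import re

# Hypothetical pattern for the IRC-relayed Icinga host alerts quoted in this task, e.g.
# "01:51:39  <+icinga-wm>  PROBLEM - Host mr1-eqsin is DOWN: PING CRITICAL - ..."
# Timestamps may or may not be wrapped in square brackets.
HOST_ALERT = re.compile(
    r"^\[?(?P<time>\d{2}:\d{2}:\d{2})\]?\s+<\+icinga-wm>\s+"
    r"(?P<kind>PROBLEM|RECOVERY) - Host (?P<host>.+?) is (?P<state>DOWN|UP):"
)


def parse_host_alerts(log_text):
    """Return (time, kind, host, state) tuples for host alerts; other lines are skipped."""
    alerts = []
    for line in log_text.splitlines():
        match = HOST_ALERT.match(line.strip())
        if match:
            alerts.append(
                (match["time"], match["kind"], match["host"], match["state"])
            )
    return alerts


# A few lines copied from the alert paste above (URL on the OSPF line omitted).
sample = """\
01:51:37        <+icinga-wm>    PROBLEM - Host asw1-eqsin is DOWN: PING CRITICAL - Packet loss = 100%
01:51:39        <+icinga-wm>    PROBLEM - Host mr1-eqsin is DOWN: PING CRITICAL - Packet loss = 100%
01:51:49        <+icinga-wm>    PROBLEM - Host mr1-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
01:54:15        <+icinga-wm>    PROBLEM - OSPF status on cr1-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP
"""

alerts = parse_host_alerts(sample)
# Service alerts such as the OSPF check do not match the host pattern and are ignored.
down_hosts = sorted({host for _, kind, host, _ in alerts if kind == "PROBLEM"})
print(down_hosts)
```

Run against the full paste, the set of down hosts is the management interfaces plus asw1-eqsin and mr1-eqsin itself, which is consistent with a single failure on the management network rather than many independent host failures.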

Event Timeline

@CDanis - I just checked with our third-party contractor, and he says it shouldn't have been affected by the work he was doing. However, he was working in the racks from 1:45 to 4:00 UTC, so if it only alerted for a few minutes, something may have accidentally been bumped while he was installing the 3 servers. It's no longer alerting, right?

Thanks,
Willy

Still alerting, unfortunately.

Alright, I'm asking him to go back to the datacenter to check all the connections on mr1-eqsin.

wiki_willy added a project: ops-eqsin.

The cable between mr1-eqsin p4 <---> asw-0603-eqsin p23 looks like it was accidentally bumped by the contractor during the server install. I called him back and he was able to resolve the issue by reseating the cables. The link has been stable for the past 15 minutes now. Resolving task.

RECOVERY - Host cp5005.mgmt is UP: PING OK - Packet loss = 16%, RTA = 231.87 ms
10:23 PM RECOVERY - Host cp5010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 232.23 ms
10:23 PM RECOVERY - Host dns5002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 249.35 ms
10:23 PM RECOVERY - Host mr1-eqsin is UP: PING OK - Packet loss = 0%, RTA = 231.89 ms
10:23 PM RECOVERY - Host asw1-eqsin is UP: PING OK - Packet loss = 0%, RTA = 232.13 ms
10:23 PM RECOVERY - Host cp5009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 232.04 ms
10:23 PM RECOVERY - Host cp5006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 232.18 ms
10:23 PM RECOVERY - Host cp5012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 232.09 ms
10:23 PM RECOVERY - Host cp5004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 232.11 ms
10:23 PM RECOVERY - Host lvs5002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 232.10 ms
10:23 PM RECOVERY - Host lvs5001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 232.22 ms
10:23 PM RECOVERY - Host dns5001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 232.82 ms
10:23 PM RECOVERY - Host mr1-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 233.61 ms
10:23 PM RECOVERY - Host cp5011.mgmt is UP: PING OK - Packet loss = 0%, RTA = 316.17 ms

Thanks,
Willy

We just got all the recoveries:

[07:23:15]  <+icinga-wm>	RECOVERY - Host cp5005.mgmt is UP: PING OK - Packet loss = 16%, RTA = 231.87 ms
[07:23:17]  <+icinga-wm>	RECOVERY - Host cp5010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 232.23 ms
[07:23:21]  <+icinga-wm>	RECOVERY - Host dns5002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 249.35 ms
[07:23:27]  <+icinga-wm>	RECOVERY - Host mr1-eqsin is UP: PING OK - Packet loss = 0%, RTA = 231.89 ms
[07:23:27]  <+icinga-wm>	RECOVERY - Host asw1-eqsin is UP: PING OK - Packet loss = 0%, RTA = 232.13 ms
[07:23:29]  <+icinga-wm>	RECOVERY - Host cp5009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 232.04 ms
[07:23:29]  <+icinga-wm>	RECOVERY - Host cp5006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 232.18 ms
[07:23:29]  <+icinga-wm>	RECOVERY - Host cp5012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 232.09 ms
[07:23:29]  <+icinga-wm>	RECOVERY - Host cp5004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 232.11 ms
[07:23:43]  <+icinga-wm>	RECOVERY - Host lvs5002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 232.10 ms
[07:23:43]  <+icinga-wm>	RECOVERY - Host lvs5001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 232.22 ms
[07:23:43]  <+icinga-wm>	RECOVERY - Host dns5001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 232.82 ms
[07:23:43]  <+icinga-wm>	RECOVERY - Host mr1-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 233.61 ms
[07:23:43]  <+icinga-wm>	RECOVERY - Host cp5011.mgmt is UP: PING OK - Packet loss = 0%, RTA = 316.17 ms
[07:24:11]  <+icinga-wm>	RECOVERY - OSPF status on cr1-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:24:11]  <+icinga-wm>	RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:24:13]  <+icinga-wm>	RECOVERY - Host lvs5003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 231.79 ms
[07:24:39]  <+icinga-wm>	RECOVERY - Host cp5003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 231.94 ms

(Times are UTC+2)
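Because the recovery paste is in UTC+2 while the original alerts are in UTC, a quick conversion helps line the two up. The sketch below uses the mr1-eqsin DOWN and UP timestamps from this task and assumes both fall on the same calendar day; the task does not give a date, so the date used here is a placeholder.

```python
from datetime import datetime, timedelta, timezone

# Placeholder date: the task gives no date, only times of day (assumption).
DAY = (2000, 1, 1)

utc_plus_2 = timezone(timedelta(hours=2))

# "01:51:39 PROBLEM - Host mr1-eqsin is DOWN" (already in UTC)
down_utc = datetime(*DAY, 1, 51, 39, tzinfo=timezone.utc)

# "[07:23:27] RECOVERY - Host mr1-eqsin is UP" (UTC+2, per the note above)
up_local = datetime(*DAY, 7, 23, 27, tzinfo=utc_plus_2)
up_utc = up_local.astimezone(timezone.utc)

outage = up_utc - down_utc

print(up_utc.strftime("%H:%M:%S"))  # 05:23:27
print(outage)                       # 3:31:48
```

Under that same-day assumption, mr1-eqsin recovered at 05:23:27 UTC, roughly three and a half hours after the first DOWN alert.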