
Eqiad: C6 mgmt switch down
Closed, Resolved · Public

Description

It looks like all the hosts in rack C6 have their mgmt interfaces down.

Times are in UTC +2:

[07:42:18]  <+icinga-wm>	PROBLEM - Host mw1320.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[07:42:18]  <+icinga-wm>	PROBLEM - Host mw1322.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[07:42:20]  <+icinga-wm>	PROBLEM - Host mw1323.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[07:42:20]  <+icinga-wm>	PROBLEM - Host mw1321.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[07:42:26]  <+icinga-wm>	PROBLEM - Host mw1324.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[07:42:44]  <+icinga-wm>	PROBLEM - Host bast1002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[07:42:44]  <+icinga-wm>	PROBLEM - Host ps1-c6-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[07:43:18]  <+icinga-wm>	PROBLEM - Host db1134.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[07:46:56]  <+icinga-wm>	PROBLEM - Host mw1326.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[07:46:56]  <+icinga-wm>	PROBLEM - Host mw1327.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[07:46:56]  <+icinga-wm>	PROBLEM - Host mw1330.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[07:46:56]  <+icinga-wm>	PROBLEM - Host mw1329.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[07:46:56]  <+icinga-wm>	PROBLEM - Host mw1334.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[07:46:58]  <+icinga-wm>	PROBLEM - Host mw1328.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[07:46:58]  <+icinga-wm>	PROBLEM - Host mw1331.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[07:47:00]  <+icinga-wm>	PROBLEM - Host mw1337.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[07:47:00]  <+icinga-wm>	PROBLEM - Host mw1336.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[07:47:00]  <+icinga-wm>	PROBLEM - Host mw1332.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[07:47:00]  <+icinga-wm>	PROBLEM - Host mw1340.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[07:47:02]  <+icinga-wm>	PROBLEM - Host mw1344.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[07:47:02]  <+icinga-wm>	PROBLEM - Host mw1341.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[07:47:02]  <+icinga-wm>	PROBLEM - Host mw1333.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[07:47:02]  <+icinga-wm>	PROBLEM - Host mw1335.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[07:47:06]  <+icinga-wm>	PROBLEM - Host mw1338.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[07:47:06]  <+icinga-wm>	PROBLEM - Host mw1347.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[07:47:06]  <+icinga-wm>	PROBLEM - Host mw1339.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[07:47:06]  <+icinga-wm>	PROBLEM - Host mw1342.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[07:47:06]  <+icinga-wm>	PROBLEM - Host mw1345.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[07:47:06]  <+icinga-wm>	PROBLEM - Host mw1348.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[07:47:11]  <+icinga-wm>	PROBLEM - Host mw1343.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[07:47:11]  <+icinga-wm>	PROBLEM - Host mw1346.mgmt is DOWN: PING CRITICAL - Packet loss = 100%

So far this looks limited to the mgmt network.
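
For reference, a minimal sketch of how the scope can be re-checked from a bastion by pinging each affected mgmt interface. The .mgmt.eqiad.wmnet suffix and the mw1320–mw1348 range are assumptions based on the alerts above, not verified details:

#!/usr/bin/env python3
# Quick sweep of the mgmt interfaces flagged above, e.g. from a bastion.
# Assumptions: hosts resolve as <name>.mgmt.eqiad.wmnet from where this runs,
# and mw1320-mw1348 covers the mw range seen in the alerts.
import subprocess

SUFFIX = ".mgmt.eqiad.wmnet"
HOSTS = ["bast1002", "db1134"] + [f"mw{n}" for n in range(1320, 1349)]

for host in HOSTS:
    fqdn = host + SUFFIX
    # One ping with a 2-second reply timeout; exit code 0 means it answered.
    rc = subprocess.run(
        ["ping", "-c", "1", "-W", "2", fqdn],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    ).returncode
    print(f"{fqdn}: {'up' if rc == 0 else 'DOWN'}")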

Event Timeline

On msw1 I see a series of events like the following, starting at 05:40 UTC:

Apr  3 05:40:24  msw1-eqiad chassism[1399]: ifd_process_flaps IFD: ge-0/0/23, sent flap msg to RE, Downstate
Apr  3 05:40:24  msw1-eqiad chassism[1399]: 	 Link status change event: ifd ge-0/0/23 MAC ctrl reg0 :: 0x8BE5, MAC port status reg0 :: 0x6802, MAC auto-neg reg :: 0xB1F4
Apr  3 05:40:24  msw1-eqiad chassism[1399]: 	Link status change event: ifd ge-0/0/23 PHY Link Status: DOWN,LP-AN capable: NO
Apr  3 05:40:24  msw1-eqiad chassism[1399]: 	Link status change event: ifd ge-0/0/23 AN Status: Pending, Speed: 1000 Mbps, Duplex: HALF DUPLEX,Remote Link Fault: NO

And then:

ge-0/0/23       up    down Core: msw-c6-eqiad [1Gbps Cu]
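
To double-check that ge-0/0/23 (the downlink to msw-c6-eqiad) is the only port bouncing, a rough sketch that tallies the chassism flap events per interface from a saved copy of the msw1 syslog (the log file name is a placeholder):

#!/usr/bin/env python3
# Count chassism link-flap events per interface from a saved copy of the
# msw1-eqiad syslog; the file name below is a placeholder.
import re
from collections import Counter

FLAP = re.compile(r"ifd_process_flaps IFD: (\S+), sent flap msg to RE, (\w+)")

counts = Counter()
with open("msw1-eqiad-messages.log") as log:
    for line in log:
        m = FLAP.search(line)
        if m:
            counts[m.groups()] += 1  # key is (interface, state)

for (ifd, state), n in counts.most_common():
    print(f"{ifd}\t{state}\t{n}")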

Interestingly, ganeti1011's mgmt interface recovered, but the others did not. Adding dcops to see if we can schedule a check of msw-c6-eqiad in the coming days/weeks.

ayounsi triaged this task as High priority. Apr 3 2020, 6:39 AM
  • Check msw-c6-eqiad's status
  • Check msw-c6-eqiad cabling to msw1-eqiad

Replace either cable or switch depending on what's faulty.

ayounsi renamed this task from Eqiad: C6 mgmt switch glitch to Eqiad: C6 mgmt switch down. Apr 3 2020, 6:43 AM

Assigning to @Cmjohnson, since he'll be on-site today.

@XioNoX the Netgear switch does not have any power to it. I tried replacing the power cable and used a different power outlet, and still nothing. These do not have redundant power and we do not have a spare on-site. @RobH or @wiki_willy, we need to order a replacement.

wiki_willy added a subtask: Unknown Object (Task). Apr 3 2020, 5:34 PM

Replaced the management switch today and updated Netbox with the new information, keeping the same name. Changed the old one to msw-c6-eqiad-old and set its status to decommissioning.
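
For reference, the Netbox side of that rename/status change could be scripted with pynetbox along these lines; the URL and token are placeholders, and this is only a sketch of the change described above, not necessarily how it was done:

#!/usr/bin/env python3
# Sketch: rename the failed switch and mark it for decommissioning in Netbox.
import pynetbox

nb = pynetbox.api("https://netbox.example.org", token="REDACTED")  # placeholders

old = nb.dcim.devices.get(name="msw-c6-eqiad")
old.name = "msw-c6-eqiad-old"    # free the original name for the replacement
old.status = "decommissioning"   # mark the failed Netgear for decom
old.save()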

Jclark-ctr closed subtask Unknown Object (Task) as Resolved. May 8 2020, 7:34 PM