msw-c6-codfw offline
Closed, ResolvedPublic

Description

msw-c6-codfw appears to be offline, causing the mgmt interfaces of everything in that rack to also go offline. The owning groups for each affected host have been tagged on this task, since the outage can affect their maintenance and use of those systems. (They can remove their project tags if they don't wish them to remain.)

@Papaul,

This should be a fairly quick fix. Hopefully the Netgear isn't bad; if it is, set it aside (it has a lifetime warranty) and use a spare EX4200 in its place for now. Detailed directions below.

Please note that icinga alerted about all the hosts in the rack losing mgmt interface connectivity:

14:23 < icinga-wm>  :  PROBLEM - Host ps1-c6-codfw is DOWN: PING CRITICAL - Packet loss = 100%
14:24 < icinga-wm>  :  PROBLEM - Host ms-be2015.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
14:24 < icinga-wm>  :  PROBLEM - Host db2043.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
14:26 < icinga-wm>  :  PROBLEM - Host db2039.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
14:27 < icinga-wm>  :  PROBLEM - Host db2033.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
14:27 < icinga-wm>  :  PROBLEM - Host db2035.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
14:27 < icinga-wm>  :  PROBLEM - Host db2037.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
14:27 < icinga-wm>  :  PROBLEM - Host db2036.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
14:27 < icinga-wm>  :  PROBLEM - Host db2044.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
14:27 < icinga-wm>  :  PROBLEM - Host db2038.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
14:27 < icinga-wm>  :  PROBLEM - Host db2041.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
14:27 < icinga-wm>  :  PROBLEM - Host db2042.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
14:27 < icinga-wm>  :  PROBLEM - Host db2040.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
14:27 < icinga-wm>  :  PROBLEM - Host db2047.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
14:27 < icinga-wm>  :  PROBLEM - Host db2048.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
14:27 < icinga-wm>  :  PROBLEM - Host db2046.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
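
Once the switch is powered back up, a quick way to confirm recovery is to ping each affected mgmt interface again. This is only a rough, standalone Python sketch (hostnames copied from the alerts above, standard Linux iputils ping flags), not part of any existing tooling:

import subprocess

# Hosts taken from the icinga alerts above; adjust as needed.
MGMT_HOSTS = [
    "ps1-c6-codfw", "ms-be2015.mgmt",
    "db2033.mgmt", "db2035.mgmt", "db2036.mgmt", "db2037.mgmt",
    "db2038.mgmt", "db2039.mgmt", "db2040.mgmt", "db2041.mgmt",
    "db2042.mgmt", "db2043.mgmt", "db2044.mgmt", "db2046.mgmt",
    "db2047.mgmt", "db2048.mgmt",
]

def reachable(host):
    # Send a single ICMP echo with a 2-second timeout; treat exit code 0 as "up".
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

for host in MGMT_HOSTS:
    print("{}: {}".format(host, "UP" if reachable(host) else "DOWN"))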

Checklist for repair:

  • mgmt Netgears are simple devices; it could simply be that the single power supply has come unplugged, so please check this first
  • attempt to power cycle the Netgear (remove the power cable and plug it back in)
  • rule out a bad power cable and a bad power port (try another power cable, try another power plug/port)
  • if the Netgear is bad, the spares tracking sheet shows two spare EX4200s on site, serials BP0212064074 & BP0212234923. We can wipe the config on one of these (see the sketch below the checklist) and use it as a non-managed switch in place of msw-c6-codfw.
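
If the fallback is needed, wiping a spare EX4200 back to factory defaults is normally done from its console in Junos operational mode. This is a minimal sketch assuming a standard Junos install on the spare; note that zeroize also erases log files and reboots the switch, so only run it on the spare, not on a production device:

request system zeroize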

Event Timeline

RobH triaged this task as High priority. Mar 30 2018, 10:25 PM
RobH created this task.
Restricted Application added a subscriber: Aklapper.

I've agreed with @RobH on IRC that this is not UBN for now for the DBA part.

While assessing the situation, I discovered that the rack distribution is far from optimal: we have 6 core masters and 3 misc masters in this rack. I'll make sure to follow up with the DBA folks separately, as I'm not sure whether there are already plans to fix this and/or it's tracked somewhere (I didn't find anything with a quick search on Phab, but it's late for me, so I might have missed something obvious).

Thanks @Volans - we wanted to take care of that while doing all the switchovers to replace the old hosts, but it looks like we never did it :-)
I will create a task for that

Thanks again!

I have created: T191193 to track the masters movement

Removed power for 2 minutes and plugged it back in. Leaving this task open for now to monitor the switch.

Papaul lowered the priority of this task from High to Low. Apr 2 2018, 2:50 PM

The servers are reporting the recoveries already :-)
Thanks!

A bad switch state is the easiest thing to recover from, so that is nice.

I think we can consider this resolved.
Thanks guys!