
can't SSH to elastic2050.mgmt
Closed, Resolved (Public)

Description

elastic2050.mgmt is down. Maybe a restart should fix this?

History of repair attempts

  • @Mathew.onipe sent a BMC reset via the OS; no effect.
  • @RobH cannot ping the mgmt interface.
  • The interface likely needs a full power removal/reset.

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2019-08-16T14:39:06Z] <onimisionipe> run bmc-device --cold-reset; echo $? in elastic2050 hoping it resets mgmt interface -T230597

Please note this mgmt interface is still down:

robh@cumin2001:~$ ping elastic2050.mgmt.codfw.wmnet
PING elastic2050.mgmt.codfw.wmnet (10.193.3.56) 56(84) bytes of data.

No ping replies come back.

First step: check the cable (@Papaul will have to do this).
If that doesn't fix it, resetting the DRAC at this point requires fully removing power from the system.

When can this system be taken out of service and powered off for a few minutes?
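
For anyone re-checking the interface later, the ping test above could be scripted roughly as follows. This is only a sketch: the helper names `build_ping_command` and `is_reachable` are made up for illustration, and the flags assume the Linux iputils `ping` (`-c` for count, `-W` for per-reply timeout in seconds).

```python
import subprocess


def build_ping_command(host, count=3, timeout_s=2):
    # Assumes Linux iputils ping: -c = number of echo requests,
    # -W = seconds to wait for each reply.
    return ["ping", "-c", str(count), "-W", str(timeout_s), host]


def is_reachable(host, run=subprocess.run):
    """Return True if ping exits with status 0 (at least one reply received).

    `run` is injectable so the logic can be exercised without a network.
    """
    result = run(
        build_ping_command(host),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


# Example (hostname taken from this task):
#   is_reachable("elastic2050.mgmt.codfw.wmnet")
```

A non-zero exit status only says that no reply arrived; it does not distinguish a bad cable from a hung DRAC, so the cable check still comes first.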

RobH added a project: ops-codfw.

IRC sync: Chatted with @Mathew.onipe, who let me know they had synced with @Papaul to take this offline on Monday to reset the power/bmc.

@Mathew.onipe, any reason why this is set to high priority?

Mathew.onipe lowered the priority of this task from High to Medium. (Aug 16 2019, 3:47 PM)

@Papaul On second thought, we have other servers and losing one elastic node is OK, so this should be set to normal.

Mentioned in SAL (#wikimedia-operations) [2019-08-19T07:59:46Z] <onimisionipe> shutdown elastic2050 to prepare for mgmt reset - T230597

Upgraded the firmware as well.
Before:
BIOS Version 1.5.6
iDRAC Firmware Version 3.21.21.21

After:
BIOS Version 2.2.11
iDRAC Firmware Version 3.34.34.34
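
As a quick sanity check on the version numbers above, here is a minimal sketch that compares the before/after strings numerically. The helper names `parse_version` and `upgraded` are hypothetical. A plain string comparison would be wrong here, since "2.2.11" sorts before "2.2.9" as text.

```python
def parse_version(text):
    # "3.21.21.21" -> (3, 21, 21, 21), so comparison is numeric per component.
    return tuple(int(part) for part in text.split("."))


def upgraded(before, after):
    """Map each component name to True if its 'after' version is newer."""
    return {
        name: parse_version(after[name]) > parse_version(before[name])
        for name in before
    }


# Versions as reported in this task.
before = {"BIOS": "1.5.6", "iDRAC": "3.21.21.21"}
after = {"BIOS": "2.2.11", "iDRAC": "3.34.34.34"}

print(upgraded(before, after))
```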

Server is back up. Resolving this.

Mentioned in SAL (#wikimedia-operations) [2019-08-19T16:45:22Z] <onimisionipe> pool elastic2050. mgmt issue has been resolved - T230597