
can't SSH to elastic2050.mgmt
Closed, Resolved · Public

Description

elastic2050.mgmt is down. Maybe a restart would fix this?

History of repair attempts

  • @Mathew.onipe sent a BMC reset via the OS; no effect.
  • @RobH cannot ping the mgmt interface.
  • The interface likely needs a full power removal/reset.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Aug 16 2019, 8:05 AM
Mathew.onipe updated the task description. · Aug 16 2019, 8:08 AM
Mathew.onipe triaged this task as High priority. · Aug 16 2019, 10:12 AM
Mathew.onipe added a project: DC-Ops.
Mathew.onipe added subscribers: Papaul, Cmjohnson.

Mentioned in SAL (#wikimedia-operations) [2019-08-16T14:39:06Z] <onimisionipe> run bmc-device --cold-reset; echo $? in elastic2050 hoping it resets mgmt interface -T230597

RobH added a subscriber: RobH. · Aug 16 2019, 2:47 PM

Please note this mgmt interface is still down:

robh@cumin2001:~$ ping elastic2050.mgmt.codfw.wmnet
PING elastic2050.mgmt.codfw.wmnet (10.193.3.56) 56(84) bytes of data.

no ping returns.
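
The plain `ping` above runs until interrupted. A bounded version of the same check could look like the sketch below. It assumes Linux iputils `ping` (`-c` for probe count, `-W` for the per-reply timeout in seconds), and `check_mgmt` is a hypothetical helper name, not an existing tool.

```shell
# Hypothetical helper: returns 0 if the host answers a ping, 1 otherwise.
# Assumes Linux iputils ping: -c 3 sends three probes, -W 2 waits up to
# two seconds for each reply.
check_mgmt() {
    host="$1"
    if ping -c 3 -W 2 "$host" >/dev/null 2>&1; then
        echo "$host: reachable"
        return 0
    fi
    echo "$host: no ping returns"
    return 1
}

# Example: check_mgmt elastic2050.mgmt.codfw.wmnet
```

The non-zero exit status makes it easy to re-run the same check after a reset and confirm whether the interface came back.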

First step: check the cable (@Papaul will have to do this).
If that doesn't fix it, resetting the DRAC at this point requires a full system power loss/removal.

When can this system be taken out of use and have its power removed for a few minutes?

RobH assigned this task to Papaul. · Aug 16 2019, 3:21 PM
RobH added a project: ops-codfw.

IRC sync: Chatted with @Mathew.onipe, who let me know they had synced with @Papaul to take this offline on Monday to reset the power/bmc.

RobH updated the task description.

@Mathew.onipe, any reason why this is set to High priority?

Mathew.onipe lowered the priority of this task from High to Normal. · Aug 16 2019, 3:47 PM

@Papaul On second thought, we have other servers, and losing one elastic node is OK, so this should be set to Normal.

Mentioned in SAL (#wikimedia-operations) [2019-08-19T07:59:46Z] <onimisionipe> shutdown elastic2050 to prepare for mgmt reset - T230597

Papaul closed this task as Resolved. · Mon, Aug 19, 3:04 PM

Upgraded the firmware as well.

Before:
BIOS Version 1.5.6
iDRAC Firmware Version 3.21.21.21

After:
BIOS Version 2.2.11
iDRAC Firmware Version 3.34.34.34

Server is back up. Resolving this.

Mentioned in SAL (#wikimedia-operations) [2019-08-19T16:45:22Z] <onimisionipe> pool elastic2050. mgmt issue has been resolved - T230597