analytics1046/analytics1057 stuck in booting
Closed, Resolved (Public)

Description

Hi everybody,

analytics1046 and analytics1057 are old Hadoop worker nodes that we have just refreshed, and they are probably already out of warranty (OOW). Before giving them back, the Analytics team needs them for one last duty: being part of a temporary Hadoop cluster to support our upcoming Hadoop migration to a new distribution (I have already asked Faidon whether it is OK to keep the nodes for this last use case).

The hosts cannot get past the very first boot steps, even if I wait hours or days:

PowerEdge Expandable RAID Controller BIOS
Copyright(c) 2015 Avago Technologies
Press <Ctrl><R> to Run Configuration Utility
F/W Initializing Devices 0%

I don't want to take up a lot of people's time debugging this, but after a chat in dcops with Willy there seem to be some steps worth attempting to see whether the hosts can come back to life (such as upgrading the firmware, BIOS, iDRAC, etc.). If anybody has time, I'd appreciate some feedback about how to proceed (which could even be: let's not spend time on this, please).

The hosts are not running any Puppet production role, so they can be taken down, rebooted, etc. at any time.

Thanks in advance!

Event Timeline

elukey renamed this task from analytics1046 stuck in booting to analytics1046/analytics1057 stuck in booting. Nov 6 2020, 2:24 PM
elukey updated the task description.

Both servers are stuck at the same spot during POST. I tried rebooting an-1046 but it still gets stuck. One of the power supplies is bad, and I replaced it with one from a spare, but there seems to be more of a problem. I am trying to update the BIOS and iDRAC now to see if that helps. The hardware log doesn't show anything wrong. These are both well out of warranty, and if this doesn't fix the issue we need to have them decommissioned.

@elukey @razzi @wiki_willy The servers are stuck and I cannot update the BIOS or firmware. Please decommission.

Thanks for checking @Cmjohnson, will do :)