
analytics1050 host + mgmt down
Closed, Resolved · Public

Description

Hi everybody,

analytics1050 was reported down earlier on, so I tried to check its mgmt serial console. Right after typing console com2 in the idrac I noticed that everything was unresponsive, and after a couple of minutes icinga also reported the mgmt interface down for the same host.

I am not sure how to debug this issue further, can you help? The host is not critical; it is a regular Hadoop worker.

Event Timeline

elukey created this task. Jul 20 2020, 6:26 AM
Restricted Application added a project: Operations. Jul 20 2020, 6:26 AM

Right after creating this task (of course) the mgmt idrac became available again. getsel shows as its last event an entry from well before the outage:

Record:      35
Date/Time:   03/05/2020 15:45:28
Source:      system
Severity:    Ok
Description: The power supplies are redundant.

I re-tried console com2 and it got stuck again :(

I also tried a racadm racreset soft, but the issue is the same.
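
For anyone who wants to script the "is the last SEL entry recent?" check instead of eyeballing it, here is a minimal sketch. It only parses text in the shape pasted above; it does not talk to the idrac. The function names (`parse_sel_records`, `latest_event_time`) are hypothetical, and reading the date as MM/DD/YYYY is an assumption that should be verified against the idrac's actual output.

```python
from datetime import datetime

# Sample text copied from the getsel output quoted in this task.
SAMPLE = """\
Record:      35
Date/Time:   03/05/2020 15:45:28
Source:      system
Severity:    Ok
Description: The power supplies are redundant.
"""


def parse_sel_records(text):
    """Split getsel-style 'Key: value' output into one dict per record."""
    records, current = [], {}
    for line in text.splitlines():
        # Skip blank lines and separators that carry no 'Key: value' pair.
        if ":" not in line:
            continue
        # Split on the first colon only, so the time value stays intact.
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        # A new 'Record' key starts the next entry.
        if key == "Record" and current:
            records.append(current)
            current = {}
        current[key] = value
    if current:
        records.append(current)
    return records


def latest_event_time(records, fmt="%m/%d/%Y %H:%M:%S"):
    """Return the newest Date/Time among the records.

    ASSUMPTION: dates are MM/DD/YYYY; pass a different fmt if not.
    """
    return max(datetime.strptime(r["Date/Time"], fmt) for r in records)


if __name__ == "__main__":
    records = parse_sel_records(SAMPLE)
    print(latest_event_time(records))
```

Comparing the returned timestamp with the time of the outage shows quickly whether the SEL recorded anything related to it; here it did not.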

elukey triaged this task as Medium priority. Jul 21 2020, 11:25 AM
wiki_willy added a project: DC-Ops.

@elukey This may need a hard power reset. Can I take it down?

@Cmjohnson yep do anything that you need!

Mentioned in SAL (#wikimedia-operations) [2020-08-04T14:41:04Z] <cmjohnson1> powercycling analytics1050 T258370

Cmjohnson closed this task as Resolved. Aug 4 2020, 2:48 PM

@elukey the power reset cleared the issue; I was able to log in to the idrac and reach console com2. Resolving the task.