mw1379 - down after reboot attempt and DRAC can't powercycle
Closed, Resolved (Public)

Description

mw1379 was reimaged and everything went normally until the very end: after the final reboot triggered by the reimaging script, it never came back and stayed down.

I connected to mgmt in an attempt to powercycle it and found this unusual DRAC behaviour:

racadm serveraction powerstatus - ON
racadm serveraction powerup - already up
racadm serveraction powerdown - can't execute command
racadm serveraction powercycle - can't execute command

Then I ran "racadm racreset" and was able to send those commands again.
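
The workaround above (retry the power command after resetting the DRAC) could be scripted roughly as follows. This is only a sketch: the `run` callable that delivers a racadm command to the mgmt interface and reports success, and the 120-second wait after `racreset`, are assumptions for illustration, not something taken from this task.

```python
import time
from typing import Callable, Tuple

# Hypothetical transport: sends one racadm command to the mgmt interface
# and returns (succeeded, output). How it connects (e.g. SSH) is assumed.
Runner = Callable[[str], Tuple[bool, str]]


def powercycle_with_racreset(
    run: Runner,
    wait_seconds: float = 120.0,  # assumed time for the DRAC to come back
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try a powercycle; if the DRAC refuses, reset the DRAC and retry once."""
    ok, _ = run("racadm serveraction powercycle")
    if ok:
        return True

    # The DRAC could not execute the command: reset the controller itself.
    ok, _ = run("racadm racreset")
    if not ok:
        return False

    # Give the controller time to restart before sending commands again.
    sleep(wait_seconds)
    ok, _ = run("racadm serveraction powercycle")
    return ok
```

Injecting `run` and `sleep` keeps the retry logic testable without touching a real host.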

Event Timeline

Dzahn claimed this task.

I looked at the wmf-auto-reimage .out logs of 3 hosts and found no indication of this issue. I then looked at the iDRAC logs
of the 3 hosts that are having this issue (mw1377, mw1378 and mw1379); all of them have an error in the iDRAC log, see below.

Log Sequence Number: 118
Detailed Description: The operating system or an application failed to communicate to the baseboard management controller (BMC) within the timeout period. The system was reset per the configured setting.
Recommended Action: Check the operating system, application, hardware, and system event log for exception events.

Log Sequence Number: 119
Detailed Description: The system was reset due to a timeout from the watchdog timer.
Recommended Action: Check the System Event Log (SEL) or crash dumps from Operating System to identify the source that caused the watchdog timer reset. Update the firmware or driver for the identified device.

mw1381 did complete the auto-reimage process without any issue and doesn't have the error.
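
To compare several hosts quickly, the pasted iDRAC log entries could be parsed and filtered for watchdog resets. The sketch below assumes the text layout shown in this comment (the three field labels, with the value either on the same line or on the following line); the function names are made up for illustration, and how the log text is exported from the iDRAC is not covered.

```python
from typing import Dict, List

# Field labels as they appear in the log excerpt pasted above.
LABELS = ("Log Sequence Number", "Detailed Description", "Recommended Action")


def parse_entries(text: str) -> List[Dict[str, str]]:
    """Split pasted log text into one dict per entry, keyed by field label."""
    entries: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    label = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        matched = next((l for l in LABELS if line.startswith(l + ":")), None)
        if matched:
            # A new sequence number starts a new entry.
            if matched == LABELS[0] and current:
                entries.append(current)
                current = {}
            label = matched
            current[label] = line[len(matched) + 1:].strip()
        elif label is not None:
            # Value continues on the line(s) after its label.
            current[label] = (current[label] + " " + line).strip()
    if current:
        entries.append(current)
    return entries


def watchdog_resets(entries: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return entries whose description mentions the watchdog timer."""
    return [
        e for e in entries
        if "watchdog" in e.get("Detailed Description", "").lower()
    ]
```

Running `watchdog_resets(parse_entries(log_text))` on each host's log text would show which hosts logged a watchdog reset.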