I found icinga1001 crashed (no ping, no ssh, black screen at the console). I forced a reboot and it seems to have come back up fine. I only had to restart ircecho, as it didn't connect properly the first time (icinga-wm was not in the operations channel).
We were lucky that I happened to be awake, although it was very late, and noticed the email from our external monitoring. Although that check has produced a few false positives in recent months, it should probably be promoted to a paging alert.
As for the diagnostic, racadm getsel reported:
-------------------------------------------------------------------------------
Record: 2
Date/Time: 01/26/2019 10:35:33
Source: system
Severity: Ok
Description: A problem was detected during Power-On Self-Test (POST).
-------------------------------------------------------------------------------
Record: 3
Date/Time: 01/26/2019 10:35:33
Source: system
Severity: Critical
Description: The watchdog timer reset the system.
-------------------------------------------------------------------------------
while racadm getraclog reported:
--------------------------------------------------------------------------------
SeqNumber = 210
Message ID = SYS1003
Category = Audit
AgentID = DE
Severity = Information
Timestamp = 2019-01-26 10:34:05
Message = System CPU Resetting.
FQDD = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber = 209
Message ID = SYS1000
Category = Audit
AgentID = DE
Severity = Information
Timestamp = 2019-01-26 10:33:54
Message = System is turning on.
FQDD = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber = 208
Message ID = LOG007
Category = Audit
AgentID = DE
Severity = Information
Timestamp = 2019-01-26 10:33:54
Message = The previous log entry was repeated 1 times.
Message Arg 1 = 1
--------------------------------------------------------------------------------
SeqNumber = 206
Message ID = SYS1001
Category = Audit
AgentID = DE
Severity = Information
Timestamp = 2019-01-26 10:33:45
Message = System is turning off.
FQDD = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber = 205
Message ID = SYS1003
Category = Audit
AgentID = DE
Severity = Information
Timestamp = 2019-01-26 10:33:45
Message = System CPU Resetting.
FQDD = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber = 203
Message ID = SYS1000
Category = Audit
AgentID = DE
Severity = Information
Timestamp = 2019-01-26 10:33:06
Message = System is turning on.
FQDD = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber = 202
Message ID = SYS1001
Category = Audit
AgentID = DE
Severity = Information
Timestamp = 2019-01-26 10:32:57
Message = System is turning off.
FQDD = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber = 201
Message ID = SYS1003
Category = Audit
AgentID = DE
Severity = Information
Timestamp = 2019-01-26 10:32:57
Message = System CPU Resetting.
FQDD = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber = 199
Message ID = RAC0703
Category = Audit
AgentID = RACLOG
Severity = Information
Timestamp = 2019-01-26 10:32:41
Message = Requested system hardreset.
FQDD = iDRAC.Embedded.1
--------------------------------------------------------------------------------
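Since getraclog prints newest entries first, it can be easier to read the sequence as a chronological timeline. Below is a minimal sketch of how that could be done, assuming the output has one "Key = value" field per line with records separated by dashed lines. The function names parse_raclog and timeline are my own and not part of racadm; the sample is just the last three records from the log above.

```python
import re
from datetime import datetime

# Last three records of the getraclog output above, one field per line.
SAMPLE = """\
--------------------------------------------------------------------------------
SeqNumber = 202
Message ID = SYS1001
Category = Audit
AgentID = DE
Severity = Information
Timestamp = 2019-01-26 10:32:57
Message = System is turning off.
FQDD = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber = 201
Message ID = SYS1003
Category = Audit
AgentID = DE
Severity = Information
Timestamp = 2019-01-26 10:32:57
Message = System CPU Resetting.
FQDD = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber = 199
Message ID = RAC0703
Category = Audit
AgentID = RACLOG
Severity = Information
Timestamp = 2019-01-26 10:32:41
Message = Requested system hardreset.
FQDD = iDRAC.Embedded.1
--------------------------------------------------------------------------------
"""


def parse_raclog(text):
    """Split the log text into one dict per record (hypothetical helper)."""
    records = []
    current = {}
    for raw in text.splitlines():
        line = raw.strip()
        # A line made only of dashes closes the record collected so far.
        if re.fullmatch(r"-{5,}", line):
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        key, sep, value = line.partition("=")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def timeline(records):
    """Return (timestamp, sequence number, message) tuples, oldest first."""
    events = [
        (
            datetime.strptime(r["Timestamp"], "%Y-%m-%d %H:%M:%S"),
            int(r["SeqNumber"]),
            r["Message"],
        )
        for r in records
    ]
    # Entries sharing a timestamp are ordered by sequence number.
    return sorted(events)


if __name__ == "__main__":
    for when, seq, message in timeline(parse_raclog(SAMPLE)):
        print(f"{when:%H:%M:%S}  #{seq}  {message}")
```

Read this way, the log starts with the hard reset I requested at 10:32:41, followed by the power-off and power-on cycles.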