
icinga1001 mysterious reboots
Closed, Duplicate · Public

Description

During provisioning of icinga1001, we have experienced two mysterious reboots.

First noted was 2018-10-02 at 21:47Z
Second noted was 2018-11-19 at 17:55Z

  • These reboots have not come with corresponding SEL entries
  • Debian Stretch was reinstalled several times between these two reboots
  • In syslog, each reboot shows up as a run of null bytes ("^@") followed immediately by the kernel startup logs
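
A run of null bytes in syslog usually means the file was being written when the host went down uncleanly. A minimal sketch of how such runs could be located follows. The function names and the `min_len` threshold are illustrative assumptions, not something used in this task.

```python
import re

# One or more consecutive NUL bytes; many viewers render each one as "^@".
NUL_RUN = re.compile(rb"\x00+")


def find_nul_runs(data: bytes, min_len: int = 1):
    """Return (offset, length) for every run of at least min_len NUL bytes."""
    return [
        (match.start(), len(match.group()))
        for match in NUL_RUN.finditer(data)
        if len(match.group()) >= min_len
    ]


def scan_file(path: str, min_len: int = 16):
    """Read a log file in binary mode and report its NUL runs.

    The default threshold of 16 is an arbitrary assumption meant to skip
    stray single NUL bytes.
    """
    with open(path, "rb") as handle:
        return find_nul_runs(handle.read(), min_len)
```

For example, `scan_file("/var/log/syslog")` (path assumed) would return the byte offsets where the null padding starts, which can then be compared with the timestamps of the surrounding log lines.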

This is our primary alerting host at the moment and we can transition to codfw if there is anything you want to check.

Related Objects

Status      Assigned    Task
Duplicate   None
Resolved    Volans

Event Timeline

Restricted Application added a project: Operations. Nov 21 2018, 8:23 PM

I do not see anything in the logs that would tell me where to start. The host appears to be working correctly.

Cmjohnson moved this task from Backlog to Not urgent on the ops-eqiad board. Nov 26 2018, 5:28 PM
jijiki triaged this task as Low priority. Dec 3 2018, 1:26 PM
jijiki added a subscriber: jijiki.

@Cmjohnson @cwhite @Dzahn Has the host rebooted mysteriously again? If not, do you think we should close it?

Volans added a subscriber: Volans. Feb 10 2019, 7:39 PM

I'm merging this with T214760, as these are now clearly just two different manifestations of the same issue (stuck and reboot), and we have the same entries in getraclog:

--------------------------------------------------------------------------------
SeqNumber       = 167
Message ID      = SYS1003
Category        = Audit
AgentID         = DE
Severity        = Information
Timestamp       = 2018-11-19 17:53:30
Message         = System CPU Resetting.
FQDD            = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber       = 166
Message ID      = RAC0703
Category        = Audit
AgentID         = RACLOG
Severity        = Information
Timestamp       = 2018-11-19 17:53:14
Message         = Requested system hardreset.
FQDD            = iDRAC.Embedded.1
--------------------------------------------------------------------------------


[...SNIP...]


--------------------------------------------------------------------------------
SeqNumber       = 90
Message ID      = SYS1003
Category        = Audit
AgentID         = DE
Severity        = Information
Timestamp       = 2018-10-02 21:48:38
Message         = System CPU Resetting.
FQDD            = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber       = 89
Message ID      = RAC0703
Category        = Audit
AgentID         = RACLOG
Severity        = Information
Timestamp       = 2018-10-02 21:48:22
Message         = Requested system hardreset.
FQDD            = iDRAC.Embedded.1
--------------------------------------------------------------------------------
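
The entries above follow a simple "Key = Value" layout separated by dashed lines, so the hard-reset timestamps can be pulled out mechanically. The following is a rough sketch that assumes the output keeps exactly this layout. The helper names `parse_raclog` and `hard_reset_times` are hypothetical.

```python
def parse_raclog(text: str):
    """Split getraclog-style text into one dict per entry.

    Assumes entries are delimited by lines made only of dashes and that
    each field is written as "Key = Value". Other lines are ignored.
    """
    records, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if line and set(line) == {"-"}:
            # A dashed separator closes the entry collected so far.
            if current:
                records.append(current)
                current = {}
        elif "=" in line:
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def hard_reset_times(records):
    """Return timestamps of RAC0703 entries, skipping repeated SeqNumbers."""
    seen, times = set(), []
    for record in records:
        seq = record.get("SeqNumber")
        if record.get("Message ID") == "RAC0703" and seq not in seen:
            seen.add(seq)
            times.append(record.get("Timestamp"))
    return times
```

Run against the excerpt in this comment, the RAC0703 entries fall at 2018-11-19 17:53:14 and 2018-10-02 21:48:22, close to the two reboot times in the description.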
Volans closed this task as a duplicate of T214760: icinga1001 crashed. Feb 10 2019, 7:39 PM
RobH changed the status of subtask T214760: icinga1001 crashed from Open to Stalled. Mar 12 2019, 6:40 PM