
graphite2001 crashed
Closed, Declined (Public)

Description

From the logs, it seems that a processor failed at 2018-06-24T16:24:58 (UTC), crashing the system and requiring a forced restart:

$ ipmi-sel
...
10  | Jun-24-2018 | 16:24:58 | CPU Machine Chk  | Processor                | transition to Non-recoverable ; OEM Event Data2 code = 01h ; OEM Event Data3 code = 00h
11  | Jun-24-2018 | 16:24:58 | MSR Info Log     | OEM Reserved             | OEM Event Offset = 09h ; OEM Event Data2 code = 04h ; OEM Event Data3 code = 00h
12  | Jun-24-2018 | 16:24:58 | MSR Info Log     | OEM Reserved             | OEM Event Offset = 00h
13  | Jun-24-2018 | 16:24:58 | MSR Info Log     | OEM Reserved             | OEM Event Offset = 0Ch
14  | Jun-24-2018 | 16:24:58 | MSR Info Log     | OEM Reserved             | OEM Event Offset = 00h
15  | Jun-24-2018 | 16:24:58 | MSR Info Log     | OEM Reserved             | OEM Event Offset = 0Ah ; OEM Event Data2 code = 04h ; OEM Event Data3 code = 00h
16  | Jun-24-2018 | 16:24:58 | MSR Info Log     | OEM Reserved             | OEM Event Offset = 00h
17  | Jun-24-2018 | 16:24:58 | MSR Info Log     | OEM Reserved             | OEM Event Offset = 08h ; OEM Event Data2 code = E7h ; OEM Event Data3 code = 0Eh
18  | Jun-24-2018 | 16:24:58 | MSR Info Log     | OEM Reserved             | OEM Event Offset = 00h
19  | Jun-24-2018 | 16:26:03 | CPU Machine Chk  | Processor                | transition to Non-recoverable ; OEM Event Data2 code = 01h ; OEM Event Data3 code = 00h
20  | Jun-24-2018 | 16:26:03 | MSR Info Log     | OEM Reserved             | OEM Event Offset = 09h ; OEM Event Data2 code = 04h ; OEM Event Data3 code = 00h
21  | Jun-24-2018 | 16:26:03 | MSR Info Log     | OEM Reserved             | OEM Event Offset = 00h
22  | Jun-24-2018 | 16:26:03 | MSR Info Log     | OEM Reserved             | OEM Event Offset = 0Ch
23  | Jun-24-2018 | 16:26:03 | MSR Info Log     | OEM Reserved             | OEM Event Offset = 00h
24  | Jun-24-2018 | 16:26:03 | Sensor #9        | Processor                | IERR ; OEM Event Data2 code = 01h ; OEM Event Data3 code = 00h
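For anyone triaging similar crashes, here is a minimal sketch of filtering the output above down to the processor-related entries. It assumes the six-column, pipe-delimited layout shown in this paste; the function names `parse_sel` and `processor_faults` are hypothetical, and the sample is three lines copied from the log above.

```python
from datetime import datetime

# Three lines copied verbatim from the ipmi-sel output in this task.
SAMPLE = """\
10  | Jun-24-2018 | 16:24:58 | CPU Machine Chk  | Processor                | transition to Non-recoverable ; OEM Event Data2 code = 01h ; OEM Event Data3 code = 00h
11  | Jun-24-2018 | 16:24:58 | MSR Info Log     | OEM Reserved             | OEM Event Offset = 09h ; OEM Event Data2 code = 04h ; OEM Event Data3 code = 00h
24  | Jun-24-2018 | 16:26:03 | Sensor #9        | Processor                | IERR ; OEM Event Data2 code = 01h ; OEM Event Data3 code = 00h
"""


def parse_sel(text):
    """Parse pipe-delimited SEL lines into dicts (hypothetical helper).

    Assumes the column layout seen in this task: record id, date, time,
    sensor name, sensor type, event description.
    """
    entries = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 6 or not fields[0].isdigit():
            continue  # header row, "..." elision, or blank line
        rec_id, date, time, name, sensor_type, event = fields
        entries.append({
            "id": int(rec_id),
            # Naive timestamp; the task description treats these as UTC.
            "when": datetime.strptime(f"{date} {time}", "%b-%d-%Y %H:%M:%S"),
            "name": name,
            "type": sensor_type,
            "event": event,
        })
    return entries


def processor_faults(entries):
    """Keep only entries whose sensor type column is 'Processor'."""
    return [e for e in entries if e["type"] == "Processor"]


if __name__ == "__main__":
    for e in processor_faults(parse_sel(SAMPLE)):
        print(e["id"], e["when"].isoformat(), e["name"], "->", e["event"])
```

Run against the full log, this would leave the two machine-check transitions and the IERR entry, and drop the MSR Info Log rows.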

Adding monitoring, not because this is related to monitoring, but because I don't know who would be a good owner and someone needs to decide what to do next.

Related Objects

Status    Subtype  Assigned    Task
Resolved           fgiunchedi
Declined           None

Event Timeline

fgiunchedi subscribed.

Thanks @jcrespo! We're replacing this machine soon in T196483: rack/setup/install graphite2003, so I'll triage this as low for now and set that task as its parent.

JJMC89 renamed this task from 6daaaaaaaa to graphite2001 crashed. (Jul 1 2018, 3:26 AM)
JJMC89 lowered the priority of this task from High to Low.
JJMC89 updated the task description.
JJMC89 added a subscriber: Aklapper.

The host is going to be decommissioned, so declining.