hw troubleshooting: CPU alerts for parse1002.eqiad.wmnet
Closed, Resolved, Public, Request

Description

  • Provide FQDN of system.
  • If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide time frame for us to take the machine down.
  • Put system into a failed state in Netbox.
  • Provide urgency of request, along with justification (redundancy, dependencies, etc).
  • Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help.)
  • Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.

FQDN: parse1002.eqiad.wmnet
Failure: CPU
Urgency: Low
Warranty status: OK

Event Timeline

Host rebooted spontaneously:

09:30 <+icinga-wm> PROBLEM - Host parse1002 is DOWN: PING CRITICAL - Packet loss = 100%
09:31 <claime> ^ checking
09:31 <+icinga-wm> RECOVERY - Host parse1002 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms

Further investigation reveals a CPU issue in ipmi-sel:

cgoubert@parse1002:~$ sudo ipmi-sel | grep CPU
4   | Aug-30-2022 | 06:22:15 | CPU Machine Chk  | Processor                   | transition to Non-recoverable ; OEM Event Data2 code = 02h ; OEM Event Data3 code = 20h
18  | Aug-30-2022 | 06:24:49 | CPU Machine Chk  | Processor                   | transition to Non-recoverable ; OEM Event Data2 code = 02h ; OEM Event Data3 code = 20h
31  | Oct-15-2022 | 13:41:00 | CPU Machine Chk  | Processor                   | transition to Non-recoverable ; OEM Event Data2 code = 02h ; OEM Event Data3 code = 20h
45  | Oct-15-2022 | 13:43:07 | CPU Machine Chk  | Processor                   | transition to Non-recoverable ; OEM Event Data2 code = 02h ; OEM Event Data3 code = 20h
58  | Dec-12-2022 | 08:27:38 | CPU Machine Chk  | Processor                   | transition to Non-recoverable ; OEM Event Data2 code = 02h ; OEM Event Data3 code = 20h
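
To see at a glance how many distinct incidents these entries represent, the output can be grouped by date. The following is a minimal sketch, not something that was run for this ticket; the function name `mce_per_date` is hypothetical, and it assumes the pipe-separated `ID | Date | Time | Name | Type | Event` layout shown above.

```shell
# Count "CPU Machine Chk" SEL entries per date, reading ipmi-sel output on stdin.
# Assumes the pipe-separated layout: ID | Date | Time | Name | Type | Event
mce_per_date() {
    awk -F'|' '
        $4 ~ /CPU Machine Chk/ {
            gsub(/^ +| +$/, "", $2)                 # trim padding around the date field
            if (!($2 in count)) order[++n] = $2     # remember first-seen order of dates
            count[$2]++
        }
        END { for (i = 1; i <= n; i++) print order[i], count[order[i]] }
    '
}

# Usage on the host (hypothetical):
#   sudo ipmi-sel | mce_per_date
```

Applied to the five lines above, this would report two events on Aug-30-2022, two on Oct-15-2022 and one on Dec-12-2022, i.e. three separate incidents.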

racadm getsel says the same:

racadm>>racadm getsel
...
Record:      4
Date/Time:   08/30/2022 06:22:15
Source:      system
Severity:    Critical
Description: CPU 2 machine check error detected.
[...]
-------------------------------------------------------------------------------
Record:      18
Date/Time:   08/30/2022 06:24:49
Source:      system
Severity:    Critical
Description: CPU 2 machine check error detected.
[...]
-------------------------------------------------------------------------------
Record:      31
Date/Time:   10/15/2022 13:41:00
Source:      system
Severity:    Critical
Description: CPU 2 machine check error detected.
[...]
-------------------------------------------------------------------------------
Record:      45
Date/Time:   10/15/2022 13:43:07
Source:      system
Severity:    Critical
Description: CPU 2 machine check error detected.
[...]
-------------------------------------------------------------------------------
Record:      58
Date/Time:   12/12/2022 08:27:38
Source:      system
Severity:    Critical
Description: CPU 2 machine check error detected.
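
The excerpt above was trimmed by hand. A rough sketch of how the same filtering could be done on a saved copy of the `racadm getsel` output follows; `critical_records` and the file name `sel.txt` are hypothetical, and the code assumes the `Record:` / `Date/Time:` / `Severity:` / `Description:` field layout shown in this ticket.

```shell
# Print one line per SEL record whose severity is "Critical".
# Reads racadm getsel output on stdin; assumes the field labels shown above.
critical_records() {
    awk '
        $1 == "Record:"      { rec = $2 }
        $1 == "Date/Time:"   { when = $2 " " $3 }
        $1 == "Severity:"    { sev = $2 }
        $1 == "Description:" {
            if (sev == "Critical") {
                sub(/^Description: */, "")   # keep only the description text
                print rec, when, $0
            }
        }
    '
}

# Usage (hypothetical file holding the saved log):
#   critical_records < sel.txt
```

Each record ends with its `Description:` line, so the sketch prints at that point using the values collected from the preceding lines of the same record.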

Full logs attached below

racadm>>racadm getsel
Record:      1
Date/Time:   01/24/2022 17:43:06
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   01/24/2022 17:48:42
Source:      system
Severity:    Ok
Description: C: boot completed.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   01/24/2022 17:48:42
Source:      system
Severity:    Ok
Description: OEM software event.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   08/30/2022 06:22:15
Source:      system
Severity:    Critical
Description: CPU 2 machine check error detected.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   08/30/2022 06:22:15
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   08/30/2022 06:22:15
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   08/30/2022 06:22:15
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      8
Date/Time:   08/30/2022 06:22:15
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      9
Date/Time:   08/30/2022 06:22:15
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      10
Date/Time:   08/30/2022 06:22:15
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      11
Date/Time:   08/30/2022 06:22:16
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      12
Date/Time:   08/30/2022 06:22:16
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      13
Date/Time:   08/30/2022 06:22:16
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      14
Date/Time:   08/30/2022 06:22:16
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      15
Date/Time:   08/30/2022 06:22:16
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      16
Date/Time:   08/30/2022 06:22:16
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      17
Date/Time:   08/30/2022 06:24:49
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      18
Date/Time:   08/30/2022 06:24:49
Source:      system
Severity:    Critical
Description: CPU 2 machine check error detected.
-------------------------------------------------------------------------------
Record:      19
Date/Time:   08/30/2022 06:24:49
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      20
Date/Time:   08/30/2022 06:24:49
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      21
Date/Time:   08/30/2022 06:24:49
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      22
Date/Time:   08/30/2022 06:24:49
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      23
Date/Time:   08/30/2022 06:24:50
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      24
Date/Time:   08/30/2022 06:24:50
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      25
Date/Time:   08/30/2022 06:24:50
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      26
Date/Time:   08/30/2022 06:24:50
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      27
Date/Time:   08/30/2022 06:24:50
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      28
Date/Time:   08/30/2022 06:24:50
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      29
Date/Time:   08/30/2022 06:24:50
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      30
Date/Time:   08/30/2022 06:24:50
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      31
Date/Time:   10/15/2022 13:41:00
Source:      system
Severity:    Critical
Description: CPU 2 machine check error detected.
-------------------------------------------------------------------------------
Record:      32
Date/Time:   10/15/2022 13:41:00
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      33
Date/Time:   10/15/2022 13:41:00
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      34
Date/Time:   10/15/2022 13:41:00
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      35
Date/Time:   10/15/2022 13:41:00
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      36
Date/Time:   10/15/2022 13:41:01
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      37
Date/Time:   10/15/2022 13:41:01
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      38
Date/Time:   10/15/2022 13:41:01
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      39
Date/Time:   10/15/2022 13:41:01
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      40
Date/Time:   10/15/2022 13:41:01
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      41
Date/Time:   10/15/2022 13:41:01
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      42
Date/Time:   10/15/2022 13:41:01
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      43
Date/Time:   10/15/2022 13:41:01
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      44
Date/Time:   10/15/2022 13:43:07
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record:      45
Date/Time:   10/15/2022 13:43:07
Source:      system
Severity:    Critical
Description: CPU 2 machine check error detected.
-------------------------------------------------------------------------------
Record:      46
Date/Time:   10/15/2022 13:43:07
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      47
Date/Time:   10/15/2022 13:43:07
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      48
Date/Time:   10/15/2022 13:43:07
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      49
Date/Time:   10/15/2022 13:43:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      50
Date/Time:   10/15/2022 13:43:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      51
Date/Time:   10/15/2022 13:43:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      52
Date/Time:   10/15/2022 13:43:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      53
Date/Time:   10/15/2022 13:43:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      54
Date/Time:   10/15/2022 13:43:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      55
Date/Time:   10/15/2022 13:43:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      56
Date/Time:   10/15/2022 13:43:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      57
Date/Time:   10/15/2022 13:43:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      58
Date/Time:   12/12/2022 08:27:38
Source:      system
Severity:    Critical
Description: CPU 2 machine check error detected.
-------------------------------------------------------------------------------
Record:      59
Date/Time:   12/12/2022 08:27:38
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      60
Date/Time:   12/12/2022 08:27:38
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      61
Date/Time:   12/12/2022 08:27:39
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      62
Date/Time:   12/12/2022 08:27:39
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      63
Date/Time:   12/12/2022 08:27:39
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      64
Date/Time:   12/12/2022 08:27:39
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      65
Date/Time:   12/12/2022 08:27:39
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      66
Date/Time:   12/12/2022 08:27:39
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      67
Date/Time:   12/12/2022 08:27:39
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      68
Date/Time:   12/12/2022 08:27:39
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      69
Date/Time:   12/12/2022 08:27:39
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      70
Date/Time:   12/12/2022 08:27:40
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
cgoubert@parse1002:~$ sudo ipmi-sel
ID  | Date        | Time     | Name             | Type                        | Event
1   | Jan-24-2022 | 17:43:06 | SEL              | Event Logging Disabled      | Log Area Reset/Cleared
2   | Jan-24-2022 | 17:48:42 | Sensor #0        | OS Boot                     | C: boot completed
3   | Jan-24-2022 | 17:48:42 | N/A              | N/A                         | OEM defined = 00h 74h E6h EEh 61h 00h
4   | Aug-30-2022 | 06:22:15 | CPU Machine Chk  | Processor                   | transition to Non-recoverable ; OEM Event Data2 code = 02h ; OEM Event Data3 code = 20h
5   | Aug-30-2022 | 06:22:15 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 05h ; OEM Event Data3 code = 00h
6   | Aug-30-2022 | 06:22:15 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
7   | Aug-30-2022 | 06:22:15 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 0Bh
8   | Aug-30-2022 | 06:22:15 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
9   | Aug-30-2022 | 06:22:15 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 06h ; OEM Event Data3 code = 00h
10  | Aug-30-2022 | 06:22:15 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
11  | Aug-30-2022 | 06:22:16 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 05h ; OEM Event Data2 code = 23h ; OEM Event Data3 code = 20h
12  | Aug-30-2022 | 06:22:16 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
13  | Aug-30-2022 | 06:22:16 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 07h ; OEM Event Data3 code = 00h
14  | Aug-30-2022 | 06:22:16 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
15  | Aug-30-2022 | 06:22:16 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 09h ; OEM Event Data2 code = 80h ; OEM Event Data3 code = A0h
16  | Aug-30-2022 | 06:22:16 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 0Ah ; OEM Event Data2 code = 64h
17  | Aug-30-2022 | 06:24:49 | Additional Info  | OEM Reserved                | OEM Event Offset = 02h ; OEM Event Data2 code = 02h ; OEM Event Data3 code = 00h
18  | Aug-30-2022 | 06:24:49 | CPU Machine Chk  | Processor                   | transition to Non-recoverable ; OEM Event Data2 code = 02h ; OEM Event Data3 code = 20h
19  | Aug-30-2022 | 06:24:49 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 05h ; OEM Event Data3 code = 00h
20  | Aug-30-2022 | 06:24:49 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
21  | Aug-30-2022 | 06:24:49 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 0Bh
22  | Aug-30-2022 | 06:24:49 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
23  | Aug-30-2022 | 06:24:50 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 06h ; OEM Event Data3 code = 00h
24  | Aug-30-2022 | 06:24:50 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
25  | Aug-30-2022 | 06:24:50 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 05h ; OEM Event Data2 code = 23h ; OEM Event Data3 code = 20h
26  | Aug-30-2022 | 06:24:50 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
27  | Aug-30-2022 | 06:24:50 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 07h ; OEM Event Data3 code = 00h
28  | Aug-30-2022 | 06:24:50 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
29  | Aug-30-2022 | 06:24:50 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 09h ; OEM Event Data2 code = 80h ; OEM Event Data3 code = A0h
30  | Aug-30-2022 | 06:24:50 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 0Ah ; OEM Event Data2 code = 64h
31  | Oct-15-2022 | 13:41:00 | CPU Machine Chk  | Processor                   | transition to Non-recoverable ; OEM Event Data2 code = 02h ; OEM Event Data3 code = 20h
32  | Oct-15-2022 | 13:41:00 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 05h ; OEM Event Data3 code = 00h
33  | Oct-15-2022 | 13:41:00 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
34  | Oct-15-2022 | 13:41:00 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 0Bh
35  | Oct-15-2022 | 13:41:00 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
36  | Oct-15-2022 | 13:41:01 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 06h ; OEM Event Data3 code = 00h
37  | Oct-15-2022 | 13:41:01 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
38  | Oct-15-2022 | 13:41:01 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 06h ; OEM Event Data2 code = 3Fh ; OEM Event Data3 code = 20h
39  | Oct-15-2022 | 13:41:01 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
40  | Oct-15-2022 | 13:41:01 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 07h ; OEM Event Data3 code = 00h
41  | Oct-15-2022 | 13:41:01 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
42  | Oct-15-2022 | 13:41:01 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 08h ; OEM Event Data2 code = 80h ; OEM Event Data3 code = 20h
43  | Oct-15-2022 | 13:41:01 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 01h ; OEM Event Data2 code = A4h ; OEM Event Data3 code = 00h
44  | Oct-15-2022 | 13:43:07 | Additional Info  | OEM Reserved                | OEM Event Offset = 02h ; OEM Event Data2 code = 02h ; OEM Event Data3 code = 00h
45  | Oct-15-2022 | 13:43:07 | CPU Machine Chk  | Processor                   | transition to Non-recoverable ; OEM Event Data2 code = 02h ; OEM Event Data3 code = 20h
46  | Oct-15-2022 | 13:43:07 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 05h ; OEM Event Data3 code = 00h
47  | Oct-15-2022 | 13:43:07 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
48  | Oct-15-2022 | 13:43:07 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 0Bh
49  | Oct-15-2022 | 13:43:08 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
50  | Oct-15-2022 | 13:43:08 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 06h ; OEM Event Data3 code = 00h
51  | Oct-15-2022 | 13:43:08 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
52  | Oct-15-2022 | 13:43:08 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 06h ; OEM Event Data2 code = 3Fh ; OEM Event Data3 code = 20h
53  | Oct-15-2022 | 13:43:08 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
54  | Oct-15-2022 | 13:43:08 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 07h ; OEM Event Data3 code = 00h
55  | Oct-15-2022 | 13:43:08 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
56  | Oct-15-2022 | 13:43:08 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 08h ; OEM Event Data2 code = 80h ; OEM Event Data3 code = 20h
57  | Oct-15-2022 | 13:43:08 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 01h ; OEM Event Data2 code = A4h ; OEM Event Data3 code = 00h
58  | Dec-12-2022 | 08:27:38 | CPU Machine Chk  | Processor                   | transition to Non-recoverable ; OEM Event Data2 code = 02h ; OEM Event Data3 code = 20h
59  | Dec-12-2022 | 08:27:38 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 09h ; OEM Event Data3 code = 00h
60  | Dec-12-2022 | 08:27:38 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
61  | Dec-12-2022 | 08:27:39 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 0Bh
62  | Dec-12-2022 | 08:27:39 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
63  | Dec-12-2022 | 08:27:39 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 0Ah ; OEM Event Data3 code = 00h
64  | Dec-12-2022 | 08:27:39 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
65  | Dec-12-2022 | 08:27:39 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 0Eh ; OEM Event Data2 code = B8h ; OEM Event Data3 code = 1Bh
66  | Dec-12-2022 | 08:27:39 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
67  | Dec-12-2022 | 08:27:39 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 0Bh ; OEM Event Data3 code = 00h
68  | Dec-12-2022 | 08:27:39 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 00h
69  | Dec-12-2022 | 08:27:39 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 08h ; OEM Event Data2 code = 80h ; OEM Event Data3 code = A0h
70  | Dec-12-2022 | 08:27:40 | MSR Info Log     | OEM Reserved                | OEM Event Offset = 0Ah ; OEM Event Data2 code = A4h ; OEM Event Data3 code = 00h

Host depooled:

cgoubert@cumin1001:~$ sudo confctl  select 'name=parse1002.eqiad.wmnet' set/pooled=inactive
The selector you chose has selected the following objects:
{"/eqiad/parsoid/canary": ["parse1002.eqiad.wmnet"], "/eqiad/parsoid/parsoid-php": ["parse1002.eqiad.wmnet"]}
Ok to continue? [y/N]
confctl>y
eqiad/parsoid/canary/parse1002.eqiad.wmnet: pooled changed yes => inactive
eqiad/parsoid/parsoid-php/parse1002.eqiad.wmnet: pooled changed yes => inactive
WARNING:conftool.announce:conftool action : set/pooled=inactive; selector: name=parse1002.eqiad.wmnet
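
The request template asks to confirm that the host has actually been depooled. One way this could be checked is sketched below. It is an illustration only: the helper name `all_depooled` is hypothetical, and it assumes that `confctl select ... get` prints one JSON object per line containing a `"pooled"` key, which is not shown in this ticket.

```shell
# Succeed only if at least one object was seen and none is still pooled.
# ASSUMPTION: stdin holds one JSON object per line with a "pooled" key,
# as expected from `confctl select ... get`.
all_depooled() {
    awk '
        /"pooled": *"/    { seen = 1 }   # at least one object was returned
        /"pooled": *"yes"/ { bad = 1 }   # this object still receives traffic
        END { exit (seen && !bad) ? 0 : 1 }
    '
}

# Usage (assumed invocation):
#   sudo confctl select 'name=parse1002.eqiad.wmnet' get | all_depooled && echo "depooled"
```

Requiring at least one matching object means that an empty result, for example from a mistyped selector, is not mistaken for a successful depool.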

Mentioned in SAL (#wikimedia-operations) [2022-12-12T10:17:25Z] <claime> depooled parse1002.eqiad.wmnet for hw failure - T324949

Change 867123 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] scap/conftool: Switch parsoid canary to parse1003

https://gerrit.wikimedia.org/r/867123

Change 867123 merged by Clément Goubert:

[operations/puppet@production] scap/conftool: Switch parsoid canary to parse1003

https://gerrit.wikimedia.org/r/867123

Mentioned in SAL (#wikimedia-operations) [2022-12-12T10:52:55Z] <claime> Switched parse1002 to parse1003 in parsoid-canary - T324949

Icinga downtime and Alertmanager silence (ID=cafc663b-25d8-4e28-8aea-f704dec7742e) set by cgoubert@cumin1001 for 14 days, 0:00:00 on 1 host(s) and their services with reason: Bad CPU

parse1002.eqiad.wmnet
Jclark-ctr added subscribers: Cmjohnson, Jclark-ctr.

Opened Dell support ticket. Confirmed: Service Request 158148016 was successfully submitted.

@Clement_Goubert Dell has requested firmware updates.

Updated BIOS and iDRAC firmware to the latest versions, as the BIOS firmware contains updated processor and memory reference codes.
BIOS: https://dl.dell.com/FOLDER08909837M/1/BIOS_CKFTD_WN64_2.16.1.EXE
iDRAC: https://dl.dell.com/FOLDER09050918M/1/iDRAC-with-Lifecycle-Controller_Firmware_D92HF_WN64_6.00.30.00_A00.EXE

Server can be put back in service

Mentioned in SAL (#wikimedia-operations) [2022-12-13T09:07:51Z] <claime> Repooled parse1002.eqiad.wmnet in parsoid service - T324949

The state of parse1002 was manually changed in Netbox from "active" to "failed", but there was no sync / cookbook run afterwards.

As a result, the next unrelated decom cookbook run on another host showed an unexpected diff: the state of parse1002 changing from "active" to "failed".

That now contradicts this ticket, which says the host is active (pooled) again.

Mentioned in SAL (#wikimedia-operations) [2022-12-13T22:40:15Z] <mutante> netbox: set parse1002 status: failed -> active in web UI; ran cookbook 'sre.puppet.sync-netbox-hiera' to get data in sync - T324949

Added documentation to avoid forgetting this step. DC-Ops, feel free to revert or ask me to move it elsewhere if you feel it shouldn't be there.

Thanks for adding docs! That's the perfect reaction. I just wanted to create awareness originally. Your edit https://wikitech.wikimedia.org/w/index.php?title=SRE%2FDc-operations%2FHardware_Troubleshooting_Runbook&diff=2040654&oldid=1985613 looks good to me :)