
hw troubleshooting: DIMM_B2 for mc2040.codfw.wmnet
Closed, Resolved · Public · Request

Description

  • Provide FQDN of system.
  • If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide time frame for us to take the machine down.
  • Put system into a failed state in Netbox.
  • Provide urgency of request, along with justification (redundancy, dependencies, etc.)
  • Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help)
  • Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.

FQDN: mc2040.codfw.wmnet
Issue: Bad DIMM_B2
Urgency: Medium

Event Timeline

racadm>>racadm getsel
Record: 1
Date/Time: 11/19/2021 21:38:45
Source: system
Severity: Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record: 2
Date/Time: 08/02/2022 16:52:09
Source: system
Severity: Critical
Description: The power input for power supply 1 is lost.
-------------------------------------------------------------------------------
Record: 3
Date/Time: 08/02/2022 16:52:49
Source: system
Severity: Critical
Description: Power supply redundancy is lost.
-------------------------------------------------------------------------------
Record: 4
Date/Time: 08/02/2022 16:57:43
Source: system
Severity: Ok
Description: The input power for power supply 1 has been restored.
-------------------------------------------------------------------------------
Record: 5
Date/Time: 08/02/2022 16:57:44
Source: system
Severity: Ok
Description: The power supplies are redundant.
-------------------------------------------------------------------------------
Record: 6
Date/Time: 12/14/2022 02:04:52
Source: system
Severity: Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B2.
-------------------------------------------------------------------------------
Record: 7
Date/Time: 12/14/2022 02:04:52
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 8
Date/Time: 12/14/2022 02:04:52
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 9
Date/Time: 12/14/2022 02:04:52
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 10
Date/Time: 12/14/2022 02:32:11
Source: system
Severity: Non-Critical
Description: The memory health monitor feature has detected a degradation in the DIMM installed in DIMM_B2. Reboot system to initiate self-heal process.
-------------------------------------------------------------------------------
Record: 11
Date/Time: 01/09/2023 18:34:18
Source: system
Severity: Ok
Description: A problem was detected during Power-On Self-Test (POST).
-------------------------------------------------------------------------------
Record: 12
Date/Time: 01/09/2023 18:34:18
Source: system
Severity: Ok
Description: The self-heal operation successfully completed at DIMM DIMM_B2.
-------------------------------------------------------------------------------
Record: 13
Date/Time: 01/09/2023 18:34:18
Source: system
Severity: Ok
Description: The self-heal operation successfully completed at DIMM DIMM_B2.
-------------------------------------------------------------------------------
Record: 14
Date/Time: 01/12/2023 13:48:14
Source: system
Severity: Non-Critical
Description: The memory health monitor feature has detected a degradation in the DIMM installed in DIMM_B2. Reboot system to initiate self-heal process.
-------------------------------------------------------------------------------
Record: 15
Date/Time: 01/12/2023 13:48:25
Source: system
Severity: Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
-------------------------------------------------------------------------------
Record: 16
Date/Time: 01/12/2023 13:50:29
Source: system
Severity: Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
Record: 17
Date/Time: 01/12/2023 13:50:29
Source: system
Severity: Ok
Description: The self-heal operation successfully completed at DIMM DIMM_B2.
-------------------------------------------------------------------------------
Record: 18
Date/Time: 01/12/2023 13:50:29
Source: system
Severity: Ok
Description: The self-heal operation successfully completed at DIMM DIMM_B2.
-------------------------------------------------------------------------------
Record: 19
Date/Time: 01/12/2023 13:50:29
Source: system
Severity: Ok
Description: The self-heal operation successfully completed at DIMM DIMM_B2.
-------------------------------------------------------------------------------
Record: 20
Date/Time: 01/12/2023 13:50:29
Source: system
Severity: Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
-------------------------------------------------------------------------------
racadm>>
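
For anyone who wants to summarize a log like this rather than read it record by record, here is a minimal sketch that parses saved `racadm getsel` text and counts Critical records per DIMM slot. The function names (`parse_sel`, `critical_dimm_counts`) and the `DIMM_<letter><number>` slot pattern are assumptions made for illustration, based only on the output format pasted above; this is not an official tool.

```python
import re
from collections import Counter

# Two records copied from the log above, used as sample input.
SAMPLE = """\
Record: 14
Date/Time: 01/12/2023 13:48:14
Source: system
Severity: Non-Critical
Description: The memory health monitor feature has detected a degradation in the DIMM installed in DIMM_B2. Reboot system to initiate self-heal process.
-------------------------------------------------------------------------------
Record: 15
Date/Time: 01/12/2023 13:48:25
Source: system
Severity: Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
-------------------------------------------------------------------------------
"""


def parse_sel(text):
    """Split 'Key: value' lines into one dict per record (hypothetical helper)."""
    records = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("---"):
            # A dashed line closes the current record.
            if current:
                records.append(current)
            current = {}
            continue
        key, sep, value = line.partition(": ")
        if sep:  # lines without 'Key: value' (e.g. the prompt) are skipped
            current[key] = value
    if current:
        records.append(current)
    return records


def critical_dimm_counts(records):
    """Count Critical records per DIMM slot named in the description."""
    counts = Counter()
    for rec in records:
        if rec.get("Severity") != "Critical":
            continue
        # Assumed slot naming: DIMM_ followed by a letter and a number.
        for slot in set(re.findall(r"DIMM_[A-Z]\d+", rec.get("Description", ""))):
            counts[slot] += 1
    return counts


if __name__ == "__main__":
    print(critical_dimm_counts(parse_sel(SAMPLE)))
```

Run against the full log, a count like this makes it easier to see whether Critical memory events keep landing on the same slot.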

cgoubert@mc2040:~$ sudo ipmi-sel | grep Jan-12-2023
14  | Jan-12-2023 | 13:48:14 | Mem ECC Warning  | Memory                      | Monitor
15  | Jan-12-2023 | 13:48:25 | ECC Uncorr Err   | Memory                      | Uncorrectable memory error
16  | Jan-12-2023 | 13:50:29 | Additional Info  | OEM Reserved                | OEM Event Offset = 02h ; OEM Event Data2 code = 02h ; OEM Event Data3 code = 00h
17  | Jan-12-2023 | 13:50:29 | POST Pkg Repair  | Memory                      | Redundancy Degraded from Fully Redundant
18  | Jan-12-2023 | 13:50:29 | POST Pkg Repair  | Memory                      | Redundancy Degraded from Fully Redundant
19  | Jan-12-2023 | 13:50:29 | POST Pkg Repair  | Memory                      | Redundancy Degraded from Fully Redundant
20  | Jan-12-2023 | 13:50:29 | ECC Uncorr Err   | Memory                      | Uncorrectable memory error
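
The same filtering can be done without `grep` by parsing the pipe-separated columns. This is a rough sketch assuming the six-column layout shown in the output above; the field names (`id`, `name`, `sensor_type`, `event`, `when`) are labels chosen here for illustration, not names defined by `ipmi-sel`.

```python
from datetime import datetime

# Two lines copied from the ipmi-sel output above, used as sample input.
SAMPLE = """\
14  | Jan-12-2023 | 13:48:14 | Mem ECC Warning  | Memory                      | Monitor
15  | Jan-12-2023 | 13:48:25 | ECC Uncorr Err   | Memory                      | Uncorrectable memory error
"""


def parse_ipmi_sel(text):
    """Turn pipe-separated SEL lines into dicts (hypothetical helper)."""
    events = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        # Skip headers, blank lines, and anything not matching six columns.
        if len(parts) != 6 or not parts[0].isdigit():
            continue
        rec_id, date, time, name, sensor_type, event = parts
        events.append({
            "id": int(rec_id),
            # Assumes English month abbreviations, as in the output above.
            "when": datetime.strptime(f"{date} {time}", "%b-%d-%Y %H:%M:%S"),
            "name": name,
            "sensor_type": sensor_type,
            "event": event,
        })
    return events


if __name__ == "__main__":
    for e in parse_ipmi_sel(SAMPLE):
        if "Uncorrectable" in e["event"]:
            print(e["id"], e["when"].isoformat(), e["name"])
```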
Dzahn updated the task description.

From what I understand, you can work on it any time, and we don't need to depool it. We may want to downtime it before y'all work on it.

@Clement_Goubert can you downtime the server? Please let me know when I can work on the server.

Icinga downtime and Alertmanager silence (ID=4016a17a-817d-4d48-be1d-b36713ff2632) set by cgoubert@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services with reason: hardware troubleshooting

mc2040.codfw.wmnet
Jhancock.wm claimed this task.

@Clement_Goubert Thank you.

We powered down and swapped the A2 and B2 DIMMs to see if the error carries over. As of right now we are not seeing any errors, and the log has been cleared. We can close the task for now. If the error happens again, this task can be reopened.

@Jhancock.wm Thanks for the super quick turnaround. That was fast, wow.

Someone needs to follow up: for example, do we set the status back to active in Netbox, and does the host have to be put back into production? We should not forget that even though the ticket is closed.