
mw1239 memory errors
Closed, Resolved, Public

Description

mw1239 is experiencing correctable memory errors. Should we run a memory test?

[Fri Jul 12 10:37:31 2019] mce: [Hardware Error]: Machine check events logged
[Fri Jul 12 10:37:31 2019] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Fri Jul 12 10:37:31 2019] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 10: 8c000048000800c1
[Fri Jul 12 10:37:31 2019] EDAC sbridge MC0: TSC 0
[Fri Jul 12 10:37:31 2019] EDAC sbridge MC0: ADDR 70148000
[Fri Jul 12 10:37:31 2019] EDAC sbridge MC0: MISC 90008000800108c
[Fri Jul 12 10:37:31 2019] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1562927847 SOCKET 0 APIC 0
[Fri Jul 12 10:37:31 2019] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x70148 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:2 rank:0)
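For anyone triaging a similar report, the EDAC line above can be cross-checked against the ADDR and TIME fields of the machine check. The following is only a minimal sketch: the helper names `parse_edac_fields` and `physical_address` are hypothetical, and the 4 KiB page size is an assumption, not something stated in the log.

```python
import re
from datetime import datetime, timezone

# Line copied from the task description (kernel timestamp prefix dropped).
EDAC_LINE = (
    "EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 "
    "(channel:1 slot:0 page:0x70148 offset:0x0 grain:32 syndrome:0x0 -  "
    "area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:2 rank:0)"
)

PAGE_SHIFT = 12  # assumption: 4 KiB pages


def parse_edac_fields(line):
    """Hypothetical helper: return the key:value pairs inside the parentheses."""
    inner = line[line.index("(") + 1 : line.rindex(")")]
    return dict(re.findall(r"(\w+):(\S+)", inner))


def physical_address(fields):
    """Hypothetical helper: rebuild the physical address from page and offset."""
    return (int(fields["page"], 16) << PAGE_SHIFT) + int(fields["offset"], 16)


fields = parse_edac_fields(EDAC_LINE)
addr = physical_address(fields)

# TIME value from the PROCESSOR line, interpreted as Unix epoch seconds.
when = datetime.fromtimestamp(1562927847, tz=timezone.utc)

print(fields["channel"], fields["slot"], hex(addr), when.isoformat())
```

Under those assumptions the rebuilt address is 0x70148000, which agrees with the ADDR 70148000 line, and the TIME field decodes to 2019-07-12 10:37:27 UTC, a few seconds before the kernel log timestamps if those are in UTC. The error is a single corrected (CE) event on socket 0, channel 1, slot 0.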

Event Timeline

jijiki triaged this task as Normal priority. Jul 12 2019, 11:07 AM
jijiki created this task.
jijiki updated the task description.
Cmjohnson added a subscriber: Cmjohnson. Edited Jul 12 2019, 6:43 PM

This server is out of warranty. I can reseat the DIMM, but the server will need to be powered down. If the error persists, the server will need to be decommissioned.

wiki_willy assigned this task to jijiki. Jul 15 2019, 6:51 PM
wiki_willy added a subscriber: wiki_willy.

Assigning to @jijiki for now. Hi Effie - let us know when it would be ok to take this server down to reseat the DIMM, and then assign the task back to @Cmjohnson when ready.

Thanks,
Willy

Mentioned in SAL (#wikimedia-operations) [2019-07-15T21:55:58Z] <jijiki> Depool mw1239 for maintenance - T227867

jijiki reassigned this task from jijiki to Cmjohnson. Jul 15 2019, 9:56 PM

Thank you!

Last log paste before clearing the log:

Record: 4
Date/Time: 11/08/2018 00:18:01
Source: system
Severity: Non-Critical

Description: Correctable memory error rate exceeded for DIMM_A1.

Record: 5
Date/Time: 12/11/2018 12:56:19
Source: system
Severity: Critical

Description: Correctable memory error rate exceeded for DIMM_A1.

I swapped all the DIMMs from side A to side B, cleared the log, and powered the server back up. Please put the server back in service and let's see if the reseating worked.

Cmjohnson closed this task as Resolved. Jul 16 2019, 3:27 PM

I am resolving this ticket. Please re-open it and ping me if the problem returns.

Mentioned in SAL (#wikimedia-operations) [2019-07-17T08:17:30Z] <jijiki> Pool mw1239 - T227867