
(OoW) wtp2013 memory correctable errors
Closed, Resolved, Public

Description

Correctable memory errors are being logged on wtp2013; this might be a faulty DIMM.

[4691841.895860] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xd5599c offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c2 socket:1 ha:0 channel_mask:2 rank:1)
[4691841.895861] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[4691841.895862] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 11: 8c00004d000800c2
[4691841.895862] EDAC sbridge MC1: TSC 0 
[4691841.895863] EDAC sbridge MC1: ADDR d5599c000 
[4691841.895863] EDAC sbridge MC1: MISC 90840030001da8c 
[4691841.895864] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1524596867 SOCKET 1 APIC 20
[4691841.895869] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xd5599c offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c2 socket:1 ha:0 channel_mask:2 rank:1)
[4691841.895870] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[4691841.895871] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 11: 8c00004d000800c2
[4691841.895871] EDAC sbridge MC1: TSC 0 
[4691841.895872] EDAC sbridge MC1: ADDR d5599c000 
[4691841.895872] EDAC sbridge MC1: MISC 908400180019a8c 
[4691841.895873] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1524596867 SOCKET 1 APIC 20
[4691841.895878] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xd5599c offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c2 socket:1 ha:0 channel_mask:2 rank:1)
[4691841.895878] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[4691841.895879] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 11: 8c00004d000800c2
[4691841.895880] EDAC sbridge MC1: TSC 0 
[4691841.895880] EDAC sbridge MC1: ADDR d5599c000 
[4691841.895881] EDAC sbridge MC1: MISC 908400340035a8c 
[4691841.895882] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1524596867 SOCKET 1 APIC 20
[4691841.895887] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xd5599c offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c2 socket:1 ha:0 channel_mask:2 rank:1)
[4691841.895887] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[4691841.895888] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 11: 8c00004d000800c2
[4691841.895888] EDAC sbridge MC1: TSC 0 
[4691841.895889] EDAC sbridge MC1: ADDR d5599c000 
[4691841.895889] EDAC sbridge MC1: MISC 908400140015a8c 
[4691841.895890] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1524596867 SOCKET 1 APIC 20
[4691841.895895] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xd5599c offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c2 socket:1 ha:0 channel_mask:2 rank:1)
[4691841.895896] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[4691841.895897] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 11: 8c00004d000800c2
[4691841.895897] EDAC sbridge MC1: TSC 0 
[4691841.895898] EDAC sbridge MC1: ADDR d5599c000 
[4691841.895898] EDAC sbridge MC1: MISC 908400080009a8c 
[4691841.895899] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1524596867 SOCKET 1 APIC 20
[4691841.895904] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xd5599c offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c2 socket:1 ha:0 channel_mask:2 rank:1)
[4691841.895905] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[4691841.895905] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 11: 8c00004d000800c2
[4691841.895906] EDAC sbridge MC1: TSC 0 
[4691841.895906] EDAC sbridge MC1: ADDR d5599c000 
[4691841.895907] EDAC sbridge MC1: MISC 908400200021a8c 
[4691841.895908] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1524596867 SOCKET 1 APIC 20
[4691841.895913] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xd5599c offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c2 socket:1 ha:0 channel_mask:2 rank:1)
[4691841.895914] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[4691841.895914] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 11: 8c00004d000800c2
[4691841.895915] EDAC sbridge MC1: TSC 0 
[4691841.895916] EDAC sbridge MC1: ADDR d5599c000 
[4691841.895916] EDAC sbridge MC1: MISC 9084003c003da8c 
[4691841.895917] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1524596867 SOCKET 1 APIC 20
[4691841.895922] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xd5599c offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c2 socket:1 ha:0 channel_mask:2 rank:1)
[4691841.895923] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[4691841.895923] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 11: 8c00004d000800c2
[4691841.895924] EDAC sbridge MC1: TSC 0 
[4691841.895924] EDAC sbridge MC1: ADDR d5599c000 
[4691841.895925] EDAC sbridge MC1: MISC 9084000c000da8c 
[4691841.895926] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1524596867 SOCKET 1 APIC 20
[4691841.895931] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xd5599c offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c2 socket:1 ha:0 channel_mask:2 rank:1)
[4691841.895931] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[4691841.895932] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 11: 8c00004d000800c2
[4691841.895932] EDAC sbridge MC1: TSC 0 
[4691841.895933] EDAC sbridge MC1: ADDR d5599c000 
[4691841.895933] EDAC sbridge MC1: MISC 90840004002da8c 
[4691841.895934] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1524596867 SOCKET 1 APIC 20
[4691841.895939] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xd5599c offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c2 socket:1 ha:0 channel_mask:2 rank:1)
[4691841.895941] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[4691841.895942] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 11: 8c00004d000800c2
[4691841.895942] EDAC sbridge MC1: TSC 0 
[4691841.895943] EDAC sbridge MC1: ADDR d5599c000 
[4691841.895943] EDAC sbridge MC1: MISC 908400080009a8c 
[4691841.895944] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1524596867 SOCKET 1 APIC 20
[4691841.895949] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xd5599c offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c2 socket:1 ha:0 channel_mask:2 rank:1)
[4691842.917010] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[4691842.917017] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 7: 8c00004000010092
[4691842.917018] EDAC sbridge MC1: TSC 0 
[4691842.917021] EDAC sbridge MC1: ADDR dd599c040 
[4691842.917023] EDAC sbridge MC1: MISC 424e3600 
[4691842.917026] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1524596869 SOCKET 1 APIC 20
[4691842.917053] EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0xdd599c offset:0x40 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0092 socket:1 ha:0 channel_mask:4 rank:1)
[4691842.917055] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[4691842.917058] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 11: c800188d00800092
[4691842.917059] EDAC sbridge MC1: TSC 0 
[4691842.917061] EDAC sbridge MC1: ADDR 0 
[4691842.917062] EDAC sbridge MC1: MISC c908400180035a8c 
[4691842.917065] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1524596869 SOCKET 1 APIC 20
[4692143.685779] CMCI storm subsided: switching to interrupt mode
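To see which DIMM labels the corrected errors cluster on, the EDAC summary lines in a log like the one above can be tallied. The following is only a minimal sketch: the regular expression is an assumption based on the line format shown in this task, and `count_ce_by_dimm` is a hypothetical helper name, not part of any existing tool.

```python
import re
from collections import Counter

# Assumed format, taken from the log lines in this task:
#   EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (...)
# The "EDAC sbridge MC1: ..." detail lines do not match and are skipped.
EDAC_CE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) CE (?P<kind>.+?) on (?P<label>\S+) \("
)


def count_ce_by_dimm(lines):
    """Return a Counter mapping (memory controller, DIMM label) to total CE count."""
    totals = Counter()
    for line in lines:
        match = EDAC_CE.search(line)
        if match:
            key = (int(match.group("mc")), match.group("label"))
            totals[key] += int(match.group("count"))
    return totals


# Three lines copied from the log above: two CE summaries and one detail line.
sample = [
    "[4691841.895860] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xd5599c offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c2 socket:1 ha:0 channel_mask:2 rank:1)",
    "[4691841.895861] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR",
    "[4691842.917053] EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0xdd599c offset:0x40 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0092 socket:1 ha:0 channel_mask:4 rank:1)",
]

if __name__ == "__main__":
    for (mc, label), total in count_ce_by_dimm(sample).most_common():
        print(f"MC{mc} {label}: {total} corrected error(s)")
```

The EDAC labels name a socket, channel, and slot rather than a silkscreen name such as DIMM_B2, so mapping a label to a physical slot still depends on the platform documentation.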

Event Timeline

wtp2013:~$ sudo ipmi-sel
ID  | Date        | Time     | Name             | Type                     | Event
1   | Jan-15-2015 | 23:04:45 | SEL              | Event Logging Disabled   | Log Area Reset/Cleared
2   | Dec-21-2016 | 01:41:38 | Mem ECC Warning  | Memory                   | transition to Non-Critical from OK ; OEM Event Data2 code = 90h ; OEM Event Data3 code = 80h
3   | Dec-21-2016 | 01:41:39 | Mem ECC Warning  | Memory                   | transition to Critical from less severe ; OEM Event Data2 code = 90h ; OEM Event Data3 code = 80h
4   | Dec-14-2017 | 07:21:25 | Mem ECC Warning  | Memory                   | transition to Non-Critical from OK ; OEM Event Data2 code = 90h ; OEM Event Data3 code = 80h
5   | Dec-14-2017 | 07:21:25 | Mem ECC Warning  | Memory                   | transition to Critical from less severe ; OEM Event Data2 code = 90h ; OEM Event Data3 code = 80h
6   | Feb-22-2018 | 01:27:39 | Mem ECC Warning  | Memory                   | transition to Non-Critical from OK ; OEM Event Data2 code = 90h ; OEM Event Data3 code = 80h
7   | Feb-22-2018 | 03:08:43 | Mem ECC Warning  | Memory                   | transition to Critical from less severe ; OEM Event Data2 code = 90h ; OEM Event Data3 code = 80h
8   | Apr-24-2018 | 17:56:42 | Mem ECC Warning  | Memory                   | transition to Non-Critical from OK ; OEM Event Data2 code = 90h ; OEM Event Data3 code = 80h
9   | Apr-24-2018 | 20:07:26 | Mem ECC Warning  | Memory                   | transition to Critical from less severe ; OEM Event Data2 code = 90h ; OEM Event Data3 code = 80h
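The SEL output above shows the same warning-then-critical pattern recurring on several dates. As a rough sketch, the pipe-delimited rows could be grouped by date to make each episode easier to read. The column layout is assumed from the output shown here, and `parse_sel` and `ecc_episodes` are hypothetical helper names.

```python
from collections import defaultdict

COLUMNS = ("id", "date", "time", "name", "type", "event")


def parse_sel(text):
    """Parse pipe-delimited `ipmi-sel` output into a list of dicts."""
    rows = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        # Skip the header row and anything that is not a six-column record.
        if len(fields) != len(COLUMNS) or not fields[0].isdigit():
            continue
        rows.append(dict(zip(COLUMNS, fields)))
    return rows


def ecc_episodes(rows):
    """Group 'Mem ECC Warning' state transitions by date."""
    by_date = defaultdict(list)
    for row in rows:
        if row["name"] == "Mem ECC Warning":
            # Keep only the state transition, dropping the OEM data codes.
            by_date[row["date"]].append(row["event"].split(" ; ")[0])
    return dict(by_date)


# Header plus three records copied from the output above.
sample_sel = """\
ID  | Date        | Time     | Name             | Type                     | Event
1   | Jan-15-2015 | 23:04:45 | SEL              | Event Logging Disabled   | Log Area Reset/Cleared
8   | Apr-24-2018 | 17:56:42 | Mem ECC Warning  | Memory                   | transition to Non-Critical from OK ; OEM Event Data2 code = 90h ; OEM Event Data3 code = 80h
9   | Apr-24-2018 | 20:07:26 | Mem ECC Warning  | Memory                   | transition to Critical from less severe ; OEM Event Data2 code = 90h ; OEM Event Data3 code = 80h
"""

if __name__ == "__main__":
    for date, transitions in ecc_episodes(parse_sel(sample_sel)).items():
        print(date, "->", "; ".join(transitions))
```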
Dzahn triaged this task as Medium priority. May 10 2018, 3:03 AM
Vvjjkkii renamed this task from wtp2013 memory correctable errors to lddaaaaaaa. Jul 1 2018, 1:11 AM
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.

Mentioned in SAL (#wikimedia-operations) [2019-04-02T16:02:39Z] <mutante> T194174 - bump. started alerting again 2 days ago

wiki_willy renamed this task from wtp2013 memory correctable errors to (OoW) wtp2013 memory correctable errors. Jul 15 2019, 8:53 PM
wiki_willy assigned this task to Papaul.

The log is showing a lot of "Correctable memory error rate exceeded for DIMM_B2" entries. We will have to take the system down and swap the memory in DIMM B2 with a DIMM from one of the decommissioned servers onsite for now.

  • Replace DIMM B2
  • Clear log
  • Upgrade BIOS from 2.3 to 2.6
  • Upgrade iDRAC from 1.57 to 2.61

All looks good now. Resolving this task.