
cp1068 memory correctable errors
Closed, Declined (Public)

Description

Some correctable errors showed up in kernel logs:

kern.log:May 14 02:26:11 cp1068 kernel: [3434689.135364] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
kern.log:May 14 02:26:11 cp1068 kernel: [3434689.135366] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 5: 8c00004000010091
kern.log:May 14 02:26:11 cp1068 kernel: [3434689.135367] EDAC sbridge MC1: TSC 0 
kern.log:May 14 02:26:11 cp1068 kernel: [3434689.135368] EDAC sbridge MC1: ADDR 2bfd4524c0 
kern.log:May 14 02:26:11 cp1068 kernel: [3434689.135369] EDAC sbridge MC1: MISC 244052d286 
kern.log:May 14 02:26:11 cp1068 kernel: [3434689.135371] EDAC sbridge MC1: PROCESSOR 0:206d7 TIME 1526264771 SOCKET 1 APIC 20
kern.log:May 14 02:26:11 cp1068 kernel: [3434689.135388] EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x2bfd452 offset:0x4c0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0091 socket:1 ha:0 channel_mask:2 rank:0)
kern.log:May 14 02:26:11 cp1068 kernel: [3434689.135389] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
kern.log:May 14 02:26:11 cp1068 kernel: [3434689.135391] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 9: 8800004200800091
kern.log:May 14 02:26:11 cp1068 kernel: [3434689.135392] EDAC sbridge MC1: TSC 0 
kern.log:May 14 02:26:11 cp1068 kernel: [3434689.135393] EDAC sbridge MC1: ADDR 0 
kern.log:May 14 02:26:11 cp1068 kernel: [3434689.135393] EDAC sbridge MC1: MISC 490010001000048c 
kern.log:May 14 02:26:11 cp1068 kernel: [3434689.135395] EDAC sbridge MC1: PROCESSOR 0:206d7 TIME 1526264771 SOCKET 1 APIC 20
kern.log:May 14 15:40:02 cp1068 kernel: [3482320.523289] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
kern.log:May 14 15:40:02 cp1068 kernel: [3482320.523291] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 9: 8c000042000800c1
kern.log:May 14 15:40:02 cp1068 kernel: [3482320.523295] EDAC sbridge MC1: TSC 0 
kern.log:May 14 15:40:02 cp1068 kernel: [3482320.523297] EDAC sbridge MC1: ADDR 2bfd452000 
kern.log:May 14 15:40:02 cp1068 kernel: [3482320.523298] EDAC sbridge MC1: MISC 90010001000048c 
kern.log:May 14 15:40:02 cp1068 kernel: [3482320.523299] EDAC sbridge MC1: PROCESSOR 0:206d7 TIME 1526312402 SOCKET 1 APIC 20
kern.log:May 14 15:40:02 cp1068 kernel: [3482320.523317] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x2bfd452 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:1 rank:0)
kern.log:May 15 08:44:08 cp1068 kernel: [3543766.787370] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
kern.log:May 15 08:44:08 cp1068 kernel: [3543766.787374] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 9: 8c000042000800c1
kern.log:May 15 08:44:08 cp1068 kernel: [3543766.787374] EDAC sbridge MC1: TSC 0 
kern.log:May 15 08:44:08 cp1068 kernel: [3543766.787376] EDAC sbridge MC1: ADDR 2bfd452000 
kern.log:May 15 08:44:08 cp1068 kernel: [3543766.787376] EDAC sbridge MC1: MISC 90010001000048c 
kern.log:May 15 08:44:08 cp1068 kernel: [3543766.787378] EDAC sbridge MC1: PROCESSOR 0:206d7 TIME 1526373848 SOCKET 1 APIC 20
kern.log:May 15 08:44:08 cp1068 kernel: [3543766.787395] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x2bfd452 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:1 rank:0)
kern.log.1:May  9 10:26:50 cp1068 kernel: [3031524.953853] EDAC MC: Removed device 0 for sbridge_edac.c Sandy Bridge Socket#0: DEV 0000:3f:0e.0
kern.log.1:May  9 10:26:50 cp1068 kernel: [3031524.997914] EDAC MC: Removed device 1 for sbridge_edac.c Sandy Bridge Socket#1: DEV 0000:7f:0e.0
kern.log.1:May  9 10:26:53 cp1068 kernel: [3031528.369918] EDAC MC: Ver: 3.0.0
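
For anyone triaging similar output, here is a minimal sketch (not part of the original report) of how the "CE" summary lines above could be tallied per DIMM label and cross-checked against the raw MCE fields. The names `parse_ce_lines`, `decode_status` and `PAGE_SHIFT` are made up for this example. It assumes 4 KiB pages, which is consistent with the ADDR values logged above, and the usual IA32_MCi_STATUS bit layout (bit 63 valid, bit 61 uncorrected, bit 58 address valid, bits 31:16 model-specific code, bits 15:0 MCA code).

```python
import re
from collections import Counter

# CE summary lines copied from the kern.log excerpt above (syslog prefix trimmed).
SAMPLE = """\
EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x2bfd452 offset:0x4c0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0091 socket:1 ha:0 channel_mask:2 rank:0)
EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x2bfd452 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:1 rank:0)
EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x2bfd452 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:1 rank:0)
"""

# Matches only the fields used below; the rest of the line is ignored.
CE_RE = re.compile(
    r"EDAC (?P<mc>MC\d+): (?P<count>\d+) CE (?P<kind>.+?) on (?P<label>\S+) "
    r"\(channel:(?P<channel>\d+) slot:(?P<slot>\d+) "
    r"page:(?P<page>0x[0-9a-f]+) offset:(?P<offset>0x[0-9a-f]+)"
)

PAGE_SHIFT = 12  # assumption: 4 KiB pages


def parse_ce_lines(text):
    """Return one dict per correctable-error summary line found in text."""
    events = []
    for line in text.splitlines():
        m = CE_RE.search(line)
        if m is None:
            continue
        page = int(m["page"], 16)
        offset = int(m["offset"], 16)
        events.append({
            "mc": m["mc"],
            "count": int(m["count"]),
            "kind": m["kind"],
            "label": m["label"],
            "channel": int(m["channel"]),
            "slot": int(m["slot"]),
            # Physical address rebuilt from page number and in-page offset.
            "address": (page << PAGE_SHIFT) | offset,
        })
    return events


def decode_status(status):
    """Pull a few fields out of a raw machine-check status value."""
    return {
        "valid": bool(status >> 63 & 1),
        "uncorrected": bool(status >> 61 & 1),
        "addr_valid": bool(status >> 58 & 1),
        "mscod": status >> 16 & 0xFFFF,  # model-specific error code
        "mcacod": status & 0xFFFF,       # architectural MCA error code
    }


events = parse_ce_lines(SAMPLE)
per_dimm = Counter()
for ev in events:
    per_dimm[ev["label"]] += ev["count"]

for label, total in sorted(per_dimm.items()):
    print(label, total)
print(hex(events[0]["address"]))
print(decode_status(0x8C00004000010091))
```

On the sample above, the rebuilt address of the first event equals the logged `ADDR 2bfd4524c0`, and the decoded `mscod`/`mcacod` pair of each status word with a CE summary line matches the `err_code` field on that line. All three events land on page 0x2bfd452.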

Event Timeline

ema triaged this task as Medium priority. May 16 2018, 8:40 AM
ema moved this task from Backlog to Hardware on the Traffic board.
Vvjjkkii renamed this task from cp1068 memory correctable errors to excaaaaaaa. Jul 1 2018, 1:10 AM
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.

The server will need to be powered down to reseat the DIMM. Please schedule a day and time with me.

BBlack subscribed.

Let's just skip this. It's one of the servers we'll be decommissioning once cp1075-90 are rolled into service.