cp5006 correctable mem errors
Closed, Resolved (Public)

Description

Front panel status LED is blinking amber.

dmesg:

[Thu Jan 10 20:49:53 2019] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Thu Jan 10 20:49:53 2019] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[Thu Jan 10 20:49:53 2019] {1}[Hardware Error]: event severity: corrected
[Thu Jan 10 20:49:53 2019] {1}[Hardware Error]:  Error 0, type: corrected
[Thu Jan 10 20:49:53 2019] {1}[Hardware Error]:  fru_text: A3
[Thu Jan 10 20:49:53 2019] {1}[Hardware Error]:   section_type: memory error
[Thu Jan 10 20:49:53 2019] {1}[Hardware Error]:   error_status: 0x0000000000000400
[Thu Jan 10 20:49:53 2019] {1}[Hardware Error]:   physical_address: 0x0000000e3fe06e80
[Thu Jan 10 20:49:53 2019] {1}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 0 bank: 0 row: 27640 column: 440 
[Thu Jan 10 20:49:53 2019] {1}[Hardware Error]:   error_type: 2, single-bit ECC
[Thu Jan 10 20:49:53 2019] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Thu Jan 10 20:49:53 2019] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[Thu Jan 10 20:49:53 2019] EDAC sbridge MC0: TSC 6d6a8e930625d9 
[Thu Jan 10 20:49:53 2019] EDAC sbridge MC0: ADDR e3fe06e80 
[Thu Jan 10 20:49:53 2019] EDAC sbridge MC0: MISC 0 
[Thu Jan 10 20:49:53 2019] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1547152980 SOCKET 0 APIC 0
[Thu Jan 10 20:49:53 2019] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#1_Chan#0_DIMM#0 (channel:4 slot:0 page:0xe3fe06 offset:0xe80 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:1 channel_mask:1 rank:0)
[Thu Jan 10 20:50:55 2019] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Thu Jan 10 20:50:55 2019] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[Thu Jan 10 20:50:55 2019] {2}[Hardware Error]: event severity: corrected
[Thu Jan 10 20:50:55 2019] {2}[Hardware Error]:  Error 0, type: corrected
[Thu Jan 10 20:50:55 2019] {2}[Hardware Error]:  fru_text: A3
[Thu Jan 10 20:50:55 2019] {2}[Hardware Error]:   section_type: memory error
[Thu Jan 10 20:50:55 2019] {2}[Hardware Error]:   error_status: 0x0000000000000400
[Thu Jan 10 20:50:55 2019] {2}[Hardware Error]:   physical_address: 0x0000003e3fe45880
[Thu Jan 10 20:50:55 2019] {2}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 0 bank: 0 row: 60409 column: 352 
[Thu Jan 10 20:50:55 2019] {2}[Hardware Error]:   error_type: 2, single-bit ECC
[Thu Jan 10 20:50:55 2019] mce: [Hardware Error]: Machine check events logged
[Thu Jan 10 20:50:55 2019] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Thu Jan 10 20:50:55 2019] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[Thu Jan 10 20:50:55 2019] EDAC sbridge MC0: TSC 6d6aae1c3c37a5 
[Thu Jan 10 20:50:55 2019] EDAC sbridge MC0: ADDR 3e3fe45880 
[Thu Jan 10 20:50:55 2019] EDAC sbridge MC0: MISC 0 
[Thu Jan 10 20:50:55 2019] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1547153042 SOCKET 0 APIC 0
[Thu Jan 10 20:50:55 2019] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#1_Chan#0_DIMM#0 (channel:4 slot:0 page:0x3e3fe45 offset:0x880 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:1 channel_mask:1 rank:0)
[Thu Jan 10 20:53:53 2019] mce: [Hardware Error]: Machine check events logged
[Thu Jan 10 20:57:15 2019] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Thu Jan 10 20:57:15 2019] {3}[Hardware Error]: It has been corrected by h/w and requires no further action
[Thu Jan 10 20:57:15 2019] {3}[Hardware Error]: event severity: corrected
[Thu Jan 10 20:57:15 2019] {3}[Hardware Error]:  Error 0, type: corrected
[Thu Jan 10 20:57:15 2019] {3}[Hardware Error]:  fru_text: A3
[Thu Jan 10 20:57:15 2019] {3}[Hardware Error]:   section_type: memory error
[Thu Jan 10 20:57:15 2019] {3}[Hardware Error]:   error_status: 0x0000000000000400
[Thu Jan 10 20:57:15 2019] {3}[Hardware Error]:   physical_address: 0x000000163fe67800
[Thu Jan 10 20:57:15 2019] {3}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 0 bank: 0 row: 44025 column: 992 
[Thu Jan 10 20:57:15 2019] {3}[Hardware Error]:   error_type: 2, single-bit ECC
[Thu Jan 10 20:57:15 2019] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Thu Jan 10 20:57:15 2019] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[Thu Jan 10 20:57:15 2019] EDAC sbridge MC0: TSC 6d6b70cb567a1d 
[Thu Jan 10 20:57:15 2019] EDAC sbridge MC0: ADDR 163fe67800 
[Thu Jan 10 20:57:15 2019] EDAC sbridge MC0: MISC 0 
[Thu Jan 10 20:57:15 2019] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1547153422 SOCKET 0 APIC 0
[Thu Jan 10 20:57:15 2019] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#1_Chan#0_DIMM#0 (channel:4 slot:0 page:0x163fe67 offset:0x800 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:1 channel_mask:1 rank:0)
[Thu Jan 10 20:58:03 2019] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Thu Jan 10 20:58:03 2019] {4}[Hardware Error]: It has been corrected by h/w and requires no further action
[Thu Jan 10 20:58:03 2019] {4}[Hardware Error]: event severity: corrected
[Thu Jan 10 20:58:03 2019] {4}[Hardware Error]:  Error 0, type: corrected
[Thu Jan 10 20:58:03 2019] {4}[Hardware Error]:  fru_text: A3
[Thu Jan 10 20:58:03 2019] {4}[Hardware Error]:   section_type: memory error
[Thu Jan 10 20:58:03 2019] {4}[Hardware Error]:   error_status: 0x0000000000000400
[Thu Jan 10 20:58:03 2019] {4}[Hardware Error]:   physical_address: 0x0000003e3fe02780
[Thu Jan 10 20:58:03 2019] {4}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 0 bank: 0 row: 60408 column: 152 
[Thu Jan 10 20:58:03 2019] {4}[Hardware Error]:   error_type: 2, single-bit ECC
[Thu Jan 10 20:58:03 2019] mce: [Hardware Error]: Machine check events logged
[Thu Jan 10 20:58:03 2019] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Thu Jan 10 20:58:03 2019] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[Thu Jan 10 20:58:03 2019] EDAC sbridge MC0: TSC 6d6b894ad357c3 
[Thu Jan 10 20:58:03 2019] EDAC sbridge MC0: ADDR 3e3fe02780 
[Thu Jan 10 20:58:03 2019] EDAC sbridge MC0: MISC 0 
[Thu Jan 10 20:58:03 2019] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1547153470 SOCKET 0 APIC 0
[Thu Jan 10 20:58:03 2019] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#1_Chan#0_DIMM#0 (channel:4 slot:0 page:0x3e3fe02 offset:0x780 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:1 channel_mask:1 rank:0)
[Thu Jan 10 20:59:04 2019] mce: [Hardware Error]: Machine check events logged
[Thu Jan 10 20:59:17 2019] {5}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Thu Jan 10 20:59:17 2019] {5}[Hardware Error]: It has been corrected by h/w and requires no further action
[Thu Jan 10 20:59:17 2019] {5}[Hardware Error]: event severity: corrected
[Thu Jan 10 20:59:17 2019] {5}[Hardware Error]:  Error 0, type: corrected
[Thu Jan 10 20:59:17 2019] {5}[Hardware Error]:  fru_text: A3
[Thu Jan 10 20:59:17 2019] {5}[Hardware Error]:   section_type: memory error
[Thu Jan 10 20:59:17 2019] {5}[Hardware Error]:   error_status: 0x0000000000000400
[Thu Jan 10 20:59:17 2019] {5}[Hardware Error]:   physical_address: 0x000000263fe67600
[Thu Jan 10 20:59:17 2019] {5}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 0 bank: 0 row: 11257 column: 984 
[Thu Jan 10 20:59:17 2019] {5}[Hardware Error]:   error_type: 2, single-bit ECC
[Thu Jan 10 20:59:17 2019] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Thu Jan 10 20:59:17 2019] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[Thu Jan 10 20:59:17 2019] EDAC sbridge MC0: TSC 6d6baf1fc52de3 
[Thu Jan 10 20:59:17 2019] EDAC sbridge MC0: ADDR 263fe67600 
[Thu Jan 10 20:59:17 2019] EDAC sbridge MC0: MISC 0 
[Thu Jan 10 20:59:17 2019] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1547153544 SOCKET 0 APIC 0
[Thu Jan 10 20:59:17 2019] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#1_Chan#0_DIMM#0 (channel:4 slot:0 page:0x263fe67 offset:0x600 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:1 channel_mask:1 rank:0)
[Thu Jan 10 21:00:57 2019] {6}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Thu Jan 10 21:00:57 2019] {6}[Hardware Error]: It has been corrected by h/w and requires no further action
[Thu Jan 10 21:00:57 2019] {6}[Hardware Error]: event severity: corrected
[Thu Jan 10 21:00:57 2019] {6}[Hardware Error]:  Error 0, type: corrected
[Thu Jan 10 21:00:57 2019] {6}[Hardware Error]:  fru_text: A3
[Thu Jan 10 21:00:57 2019] {6}[Hardware Error]:   section_type: memory error
[Thu Jan 10 21:00:57 2019] {6}[Hardware Error]:   error_status: 0x0000000000000400
[Thu Jan 10 21:00:57 2019] {6}[Hardware Error]:   physical_address: 0x0000002e3fe22b00
[Thu Jan 10 21:00:57 2019] {6}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 0 bank: 0 row: 27640 column: 680 
[Thu Jan 10 21:00:57 2019] {6}[Hardware Error]:   error_type: 2, single-bit ECC
[Thu Jan 10 21:00:57 2019] mce: [Hardware Error]: Machine check events logged
[Thu Jan 10 21:00:57 2019] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Thu Jan 10 21:00:57 2019] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[Thu Jan 10 21:00:57 2019] EDAC sbridge MC0: TSC 6d6be2268cd187 
[Thu Jan 10 21:00:57 2019] EDAC sbridge MC0: ADDR 2e3fe22b00 
[Thu Jan 10 21:00:57 2019] EDAC sbridge MC0: MISC 0 
[Thu Jan 10 21:00:57 2019] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1547153644 SOCKET 0 APIC 0
[Thu Jan 10 21:00:57 2019] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#1_Chan#0_DIMM#0 (channel:4 slot:0 page:0x2e3fe22 offset:0xb00 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:1 channel_mask:1 rank:0)
[Thu Jan 10 21:01:48 2019] mce: [Hardware Error]: Machine check events logged
[Thu Jan 10 21:03:06 2019] {7}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Thu Jan 10 21:03:06 2019] {7}[Hardware Error]: It has been corrected by h/w and requires no further action
[Thu Jan 10 21:03:06 2019] {7}[Hardware Error]: event severity: corrected
[Thu Jan 10 21:03:06 2019] {7}[Hardware Error]:  Error 0, type: corrected
[Thu Jan 10 21:03:06 2019] {7}[Hardware Error]:  fru_text: A3
[Thu Jan 10 21:03:06 2019] {7}[Hardware Error]:   section_type: memory error
[Thu Jan 10 21:03:06 2019] {7}[Hardware Error]:   error_status: 0x0000000000000400
[Thu Jan 10 21:03:06 2019] {7}[Hardware Error]:   physical_address: 0x0000002e3fe27100
[Thu Jan 10 21:03:06 2019] {7}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 0 bank: 0 row: 27640 column: 960 
[Thu Jan 10 21:03:06 2019] {7}[Hardware Error]:   error_type: 2, single-bit ECC
[Thu Jan 10 21:03:06 2019] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Thu Jan 10 21:03:06 2019] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[Thu Jan 10 21:03:06 2019] EDAC sbridge MC0: TSC 6d6c243fe28164 
[Thu Jan 10 21:03:06 2019] EDAC sbridge MC0: ADDR 2e3fe27100 
[Thu Jan 10 21:03:06 2019] EDAC sbridge MC0: MISC 0 
[Thu Jan 10 21:03:06 2019] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1547153773 SOCKET 0 APIC 0
[Thu Jan 10 21:03:06 2019] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#1_Chan#0_DIMM#0 (channel:4 slot:0 page:0x2e3fe27 offset:0x100 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:1 channel_mask:1 rank:0)
[Thu Jan 10 21:04:16 2019] mce: [Hardware Error]: Machine check events logged
[Thu Jan 10 21:11:27 2019] {8}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Thu Jan 10 21:11:27 2019] {8}[Hardware Error]: It has been corrected by h/w and requires no further action
[Thu Jan 10 21:11:27 2019] {8}[Hardware Error]: event severity: corrected
[Thu Jan 10 21:11:27 2019] {8}[Hardware Error]:  Error 0, type: corrected
[Thu Jan 10 21:11:27 2019] {8}[Hardware Error]:  fru_text: A3
[Thu Jan 10 21:11:27 2019] {8}[Hardware Error]:   section_type: memory error
[Thu Jan 10 21:11:27 2019] {8}[Hardware Error]:   error_status: 0x0000000000000400
[Thu Jan 10 21:11:27 2019] {8}[Hardware Error]:   physical_address: 0x000000263fe44980
[Thu Jan 10 21:11:27 2019] {8}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 0 bank: 0 row: 11257 column: 288 
[Thu Jan 10 21:11:27 2019] {8}[Hardware Error]:   error_type: 2, single-bit ECC
[Thu Jan 10 21:11:27 2019] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Thu Jan 10 21:11:27 2019] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[Thu Jan 10 21:11:27 2019] EDAC sbridge MC0: TSC 6d6d24503a628c 
[Thu Jan 10 21:11:27 2019] EDAC sbridge MC0: ADDR 263fe44980 
[Thu Jan 10 21:11:27 2019] EDAC sbridge MC0: MISC 0 
[Thu Jan 10 21:11:27 2019] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1547154273 SOCKET 0 APIC 0
[Thu Jan 10 21:11:27 2019] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#1_Chan#0_DIMM#0 (channel:4 slot:0 page:0x263fe44 offset:0x980 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:1 channel_mask:1 rank:0)
[Thu Jan 10 21:12:11 2019] mce: [Hardware Error]: Machine check events logged
[Thu Jan 10 21:12:57 2019] {9}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Thu Jan 10 21:12:57 2019] {9}[Hardware Error]: It has been corrected by h/w and requires no further action
[Thu Jan 10 21:12:57 2019] {9}[Hardware Error]: event severity: corrected
[Thu Jan 10 21:12:57 2019] {9}[Hardware Error]:  Error 0, type: corrected
[Thu Jan 10 21:12:57 2019] {9}[Hardware Error]:  fru_text: A3
[Thu Jan 10 21:12:57 2019] {9}[Hardware Error]:   section_type: memory error
[Thu Jan 10 21:12:57 2019] {9}[Hardware Error]:   error_status: 0x0000000000000400
[Thu Jan 10 21:12:57 2019] {9}[Hardware Error]:   physical_address: 0x000000063fe67a00
[Thu Jan 10 21:12:57 2019] {9}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 0 bank: 0 row: 11257 column: 1000 
[Thu Jan 10 21:12:57 2019] {9}[Hardware Error]:   error_type: 2, single-bit ECC
[Thu Jan 10 21:12:57 2019] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Thu Jan 10 21:12:57 2019] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[Thu Jan 10 21:12:57 2019] EDAC sbridge MC0: TSC 6d6d52697072d2 
[Thu Jan 10 21:12:57 2019] EDAC sbridge MC0: ADDR 63fe67a00 
[Thu Jan 10 21:12:57 2019] EDAC sbridge MC0: MISC 0 
[Thu Jan 10 21:12:57 2019] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1547154363 SOCKET 0 APIC 0
[Thu Jan 10 21:12:57 2019] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#1_Chan#0_DIMM#0 (channel:4 slot:0 page:0x63fe67 offset:0xa00 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:1 channel_mask:1 rank:0)
[Thu Jan 10 21:14:38 2019] mce: [Hardware Error]: Machine check events logged
[Thu Jan 10 21:26:33 2019] {10}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Thu Jan 10 21:26:33 2019] {10}[Hardware Error]: It has been corrected by h/w and requires no further action
[Thu Jan 10 21:26:33 2019] {10}[Hardware Error]: event severity: corrected
[Thu Jan 10 21:26:33 2019] {10}[Hardware Error]:  Error 0, type: corrected
[Thu Jan 10 21:26:33 2019] {10}[Hardware Error]:  fru_text: A3
[Thu Jan 10 21:26:33 2019] {10}[Hardware Error]:   section_type: memory error
[Thu Jan 10 21:26:33 2019] {10}[Hardware Error]:   error_status: 0x0000000000000400
[Thu Jan 10 21:26:33 2019] {10}[Hardware Error]:   physical_address: 0x000000163fe00180
[Thu Jan 10 21:26:33 2019] {10}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 0 bank: 0 row: 44024 column: 0 
[Thu Jan 10 21:26:33 2019] {10}[Hardware Error]:   error_type: 2, single-bit ECC
[Thu Jan 10 21:26:33 2019] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Thu Jan 10 21:26:33 2019] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[Thu Jan 10 21:26:33 2019] EDAC sbridge MC0: TSC 6d6ef4156bedcb 
[Thu Jan 10 21:26:33 2019] EDAC sbridge MC0: ADDR 163fe00180 
[Thu Jan 10 21:26:33 2019] EDAC sbridge MC0: MISC 0 
[Thu Jan 10 21:26:33 2019] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1547155180 SOCKET 0 APIC 0
[Thu Jan 10 21:26:33 2019] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#1_Chan#0_DIMM#0 (channel:4 slot:0 page:0x163fe00 offset:0x180 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:1 channel_mask:1 rank:0)
[Thu Jan 10 21:27:45 2019] mce: [Hardware Error]: Machine check events logged
[Thu Jan 10 21:35:27 2019] {11}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Thu Jan 10 21:35:27 2019] {11}[Hardware Error]: It has been corrected by h/w and requires no further action
[Thu Jan 10 21:35:27 2019] {11}[Hardware Error]: event severity: corrected
[Thu Jan 10 21:35:27 2019] {11}[Hardware Error]:  Error 0, type: corrected
[Thu Jan 10 21:35:27 2019] {11}[Hardware Error]:  fru_text: A3
[Thu Jan 10 21:35:27 2019] {11}[Hardware Error]:   section_type: memory error
[Thu Jan 10 21:35:27 2019] {11}[Hardware Error]:   error_status: 0x0000000000000400
[Thu Jan 10 21:35:27 2019] {11}[Hardware Error]:   physical_address: 0x000000163fe00180
[Thu Jan 10 21:35:27 2019] {11}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 0 bank: 0 row: 44024 column: 0 
[Thu Jan 10 21:35:27 2019] {11}[Hardware Error]:   error_type: 2, single-bit ECC
[Thu Jan 10 21:35:27 2019] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Thu Jan 10 21:35:27 2019] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[Thu Jan 10 21:35:27 2019] EDAC sbridge MC0: TSC 6d7004fc891eef 
[Thu Jan 10 21:35:27 2019] EDAC sbridge MC0: ADDR 163fe00180 
[Thu Jan 10 21:35:27 2019] EDAC sbridge MC0: MISC 0 
[Thu Jan 10 21:35:27 2019] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1547155713 SOCKET 0 APIC 0
[Thu Jan 10 21:35:27 2019] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#1_Chan#0_DIMM#0 (channel:4 slot:0 page:0x163fe00 offset:0x180 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:1 channel_mask:1 rank:0)
[Thu Jan 10 21:35:40 2019] mce: [Hardware Error]: Machine check events logged
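
Every EDAC report above names the same DIMM label, and each `page`/`offset` pair is the `ADDR` value split at a 4 KiB page boundary. The following is a minimal Python sketch for tallying corrected errors per DIMM label from saved dmesg text. The function names `count_ce_by_dimm` and `split_addr` are hypothetical, and the regular expression is an assumption derived only from the line format shown in this task.

```python
import re
from collections import Counter

# Pattern assumed from the "EDAC MC0: 0 CE memory read error on <label> (" lines above.
CE_RE = re.compile(r"EDAC MC\d+: \d+ CE memory read error on (\S+) \(")

def count_ce_by_dimm(dmesg_text):
    """Count corrected-error (CE) reports per EDAC DIMM label."""
    return Counter(m.group(1) for m in CE_RE.finditer(dmesg_text))

def split_addr(addr):
    """Split a physical address into (page, offset), assuming 4 KiB pages."""
    return addr >> 12, addr & 0xFFF

# One line copied from the dmesg output in this task.
sample = (
    "[Thu Jan 10 20:49:53 2019] EDAC MC0: 0 CE memory read error on "
    "CPU_SrcID#0_Ha#1_Chan#0_DIMM#0 (channel:4 slot:0 page:0xe3fe06 "
    "offset:0xe80 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f "
    "socket:0 ha:1 channel_mask:1 rank:0)\n"
)

counts = count_ce_by_dimm(sample)
page, offset = split_addr(0xE3FE06E80)  # ADDR from the same event
print(counts)
print(hex(page), hex(offset))
```

Feeding the full dmesg text to `count_ce_by_dimm` would give one count per DIMM label, which makes it easy to see whether the errors are confined to a single module.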

SEL:

-------------------------------------------------------------------------------
Record:      6
Date/Time:   01/10/2019 20:19:09
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_A3.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   01/10/2019 20:23:50
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_A3.
-------------------------------------------------------------------------------

Event Timeline

BBlack created this task. Feb 21 2019, 2:21 PM
Restricted Application added a project: Operations. Feb 21 2019, 2:21 PM
Restricted Application added a subscriber: Aklapper.
RobH added a subscriber: RobH. Feb 21 2019, 4:52 PM

Please note that Dell support typically requires the following steps to be taken for any memory replacement:

  • Update the BIOS firmware on the host to the latest revision
    • latest available version is 2.9.1, installed version is 2.7.1
  • Move the memory to a different DIMM slot; if the error follows the memory it is a bad DIMM, if it stays in the slot it's a bad mainboard
    • Hitting F11 during POST allows the running of system hardware tests.
  • Clear the SEL of all errors BEFORE running Dell diagnostics (diagnostics will fail if ANY errors are in the SEL)
  • Generate the Dell support log export AFTER the DIMM swap and firmware update, as those actions will be visible in the report.

Drivers link: https://www.dell.com/support/home/us/en/19/product-support/servicetag/2b9p9m2/drivers

Please note that cp5006 has a repair history: its mainboard was swapped out on T187157.

Mentioned in SAL (#wikimedia-operations) [2019-02-21T17:10:24Z] <robh> rebooting cp5006 to flash bios in memory troubleshooting steps via T216717

RobH added a comment. Feb 21 2019, 5:40 PM

I've updated the BIOS to the latest revision, 2.9.1.

POST shows no errors, but I'm going to wipe the SEL and run Dell's hardware test suite.

RobH added a comment. Feb 21 2019, 5:40 PM
/admin1-> racadm getsel
Record:      1
Date/Time:   07/25/2018 16:19:36
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   10/12/2018 14:10:03
Source:      system
Severity:    Critical
Description: The power input for power supply 1 is lost.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   10/12/2018 20:37:23
Source:      system
Severity:    Ok
Description: The input power for power supply 1 has been restored.
-------------------------------------------------------------------------------
Record:      4
Date/Time:   10/14/2018 14:09:03
Source:      system
Severity:    Critical
Description: The power input for power supply 2 is lost.
-------------------------------------------------------------------------------
Record:      5
Date/Time:   10/14/2018 19:01:23
Source:      system
Severity:    Ok
Description: The input power for power supply 2 has been restored.
-------------------------------------------------------------------------------
Record:      6
Date/Time:   01/10/2019 20:19:09
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_A3.
-------------------------------------------------------------------------------
Record:      7
Date/Time:   01/10/2019 20:23:50
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_A3.
-------------------------------------------------------------------------------
/admin1->
RobH added a comment. Feb 21 2019, 5:53 PM

OK, a running log of the steps taken:

  • updated BIOS
  • rebooted into hardware tests
  • POST shows no memory errors:
Testing Memory...
Testing Memory... 10% Complete
Testing Memory... 20% Complete
Testing Memory... 30% Complete
Testing Memory... 40% Complete
Testing Memory... 50% Complete
Testing Memory... 60% Complete
Testing Memory... 70% Complete
Testing Memory... 80% Complete
Testing Memory... 90% Complete
Testing Memory... Done [No Errors]
  • hardware testing run
RobH added a subscriber: ayounsi. Feb 21 2019, 6:14 PM

Please note the hardware testing is still running on this system. I'm monitoring its serial output, but @ayounsi shouldn't modify the system until I update (or unless he attaches a crash cart and confirms it's no longer running the test).

RobH assigned this task to ayounsi. Feb 21 2019, 9:45 PM

This passed all of Dell's in-depth hardware test utilities, and no further errors have been logged since I cleared the log and ran the hardware tests.

Since our onsite time is limited, I'd recommend Arzhel move the A3 DIMM to another slot, and note it on the task. That way, if the DIMM error shows up on the new slot, we know it's a bad DIMM. If the error shows up again on A3, we know it's the mainboard.
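
The swap test described above can be written down as a small decision function. This is only an illustrative Python sketch: the function name `diagnose_after_swap`, its parameters, and the returned strings are hypothetical, and the slot names in the usage line are examples.

```python
def diagnose_after_swap(original_slot, new_slot, error_slot):
    """Interpret where correctable errors reappear after moving a DIMM.

    original_slot: slot that first reported errors
    new_slot:      slot the suspect DIMM was moved to
    error_slot:    slot reporting errors after the move, or None if none yet
    """
    if error_slot is None:
        return "no errors yet; keep monitoring"
    if error_slot == new_slot:
        # The fault moved with the module.
        return "error followed the DIMM: suspect bad DIMM"
    if error_slot == original_slot:
        # The fault stayed with the slot even though the module changed.
        return "error stayed in the slot: suspect mainboard"
    return "error in an unrelated slot: investigate separately"

# Example: suspect DIMM moved from A3 to a hypothetical new slot A4.
print(diagnose_after_swap("A3", "A4", None))
```

The extra "no errors yet" and "unrelated slot" branches are additions for completeness; the comment above only covers the two outcomes where the error reappears.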

Mentioned in SAL (#wikimedia-operations) [2019-02-22T01:40:37Z] <XioNoX> power-down cp5006 - T216717

Swapped A3 and A4

ayounsi reassigned this task from ayounsi to RobH. Feb 22 2019, 2:14 AM

Mentioned in SAL (#wikimedia-operations) [2019-02-22T17:13:06Z] <bblack> cp5006: repooling into service - T216717

RobH added a comment. Feb 25 2019, 7:12 PM

As of 2019-02-25 @ 19:12, no memory errors have been logged since the DIMM slot swap.

ema moved this task from Triage to Hardware on the Traffic board. Mar 6 2019, 9:56 AM
ema triaged this task as Normal priority. Mar 6 2019, 10:11 AM
ema added a subscriber: ema.

Anything else to be done here?

RobH changed the task status from Open to Stalled. Mar 6 2019, 5:05 PM
RobH lowered the priority of this task from Normal to Low.
RobH moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-eqsin board.

I'm keeping this open for a month after the swap. If no further errors are logged (need to manually check the SEL) by March 25th, this can be resolved.

It is assigned to me though, so it shouldn't be blocking anyone else.
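
The manual SEL check mentioned above could be partly scripted once the `racadm getsel` output has been saved as text. The Python sketch below assumes the record layout shown in this task (key/value lines separated by dashed rules and MM/DD/YYYY timestamps). The names `parse_sel` and `memory_events_since` are hypothetical, and the sample combines two records copied from different SEL dumps in this task purely for illustration.

```python
from datetime import datetime

def parse_sel(text):
    """Parse saved `racadm getsel` text into a list of dicts (layout assumed)."""
    records, current = [], {}
    for line in text.splitlines():
        if line.startswith("-----"):
            # A dashed rule closes the current record.
            if current:
                records.append(current)
            current = {}
        elif ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records

def memory_events_since(records, since):
    """Return records mentioning memory or DIMM, timestamped at or after `since`."""
    hits = []
    for rec in records:
        if "Date/Time" not in rec:
            continue  # skip stray lines such as prompts
        when = datetime.strptime(rec["Date/Time"], "%m/%d/%Y %H:%M:%S")
        desc = rec.get("Description", "").lower()
        if when >= since and ("memory" in desc or "dimm" in desc):
            hits.append(rec)
    return hits

sample = """Record:      6
Date/Time:   01/10/2019 20:19:09
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_A3.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   02/22/2019 01:51:23
Source:      system
Severity:    Critical
Description: The chassis is open while the power is off.
-------------------------------------------------------------------------------
"""

records = parse_sel(sample)
# Cutoff chosen as the day of the slot swap.
print(memory_events_since(records, datetime(2019, 2, 22)))
```

With this sample and cutoff, the only memory record predates the swap, so the filtered list is empty. The keyword match on "memory" and "dimm" is a simple heuristic, not a complete classification of SEL events.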

ayounsi removed a subscriber: ayounsi. Mar 6 2019, 5:33 PM
RobH closed this task as Resolved. Jul 3 2019, 6:22 PM
/admin1-> racadm getsel
Record:      1
Date/Time:   02/21/2019 17:41:01
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record:      2
Date/Time:   02/22/2019 01:51:23
Source:      system
Severity:    Critical
Description: The chassis is open while the power is off.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   02/22/2019 01:51:28
Source:      system
Severity:    Ok
Description: The chassis is closed while the power is off.
-------------------------------------------------------------------------------

No errors detected months later, resolving.

Restricted Application added a subscriber: Liuxinyu970226. Jul 3 2019, 6:22 PM