db1107 (eventlogging db master) possibly memory issues
Closed, Resolved (Public)

Description

Reported on icinga:

[06:21:11]  <+icinga-wm>	PROBLEM - EDAC syslog messages on db1107 is CRITICAL: 16.36 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1107&var-datasource=eqiad+prometheus/ops
[8765665.177485] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[8765665.177487] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[8765665.177489] {1}[Hardware Error]: event severity: corrected
[8765665.177491] {1}[Hardware Error]:  Error 0, type: corrected
[8765665.177492] {1}[Hardware Error]:  fru_text: A3
[8765665.177494] {1}[Hardware Error]:   section_type: memory error
[8765665.177495] {1}[Hardware Error]:   error_status: 0x0000000000000400
[8765665.177496] {1}[Hardware Error]:   physical_address: 0x0000003007d50140
[8765665.177500] {1}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 30845 column: 0
[8765665.177502] {1}[Hardware Error]:   error_type: 2, single-bit ECC
[8765665.177521] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[8765665.177523] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[8765665.177525] EDAC sbridge MC0: TSC 6cfdf37293c04f
[8765665.177526] EDAC sbridge MC0: ADDR 3007d50140
[8765665.177527] EDAC sbridge MC0: MISC 0
[8765665.177530] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509734 SOCKET 0 APIC 0
[8765665.177550] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x3007d50 offset:0x140 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
[8765666.600256] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[8765666.600259] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[8765666.600262] {2}[Hardware Error]: event severity: corrected
[8765666.600264] {2}[Hardware Error]:  Error 0, type: corrected
[8765666.600266] {2}[Hardware Error]:  fru_text: A3
[8765666.600268] {2}[Hardware Error]:   section_type: memory error
[8765666.600271] {2}[Hardware Error]:   error_status: 0x0000000000000400
[8765666.600273] {2}[Hardware Error]:   physical_address: 0x0000003064e10180
[8765666.600279] {2}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 32336 column: 0
[8765666.600281] {2}[Hardware Error]:   error_type: 2, single-bit ECC
[8765666.600295] mce: [Hardware Error]: Machine check events logged
[8765666.600310] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[8765666.600314] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[8765666.600316] EDAC sbridge MC0: TSC 6cfdf49b6021f1
[8765666.600319] EDAC sbridge MC0: ADDR 3064e10180
[8765666.600320] EDAC sbridge MC0: MISC 0
[8765666.600324] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509735 SOCKET 0 APIC 0
[8765666.600359] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x3064e10 offset:0x180 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
[8765667.162473] mce: [Hardware Error]: Machine check events logged
[8765667.162486] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[8765667.162490] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[8765667.162492] EDAC sbridge MC0: TSC 6cfdf510a618ee
[8765667.162494] EDAC sbridge MC0: ADDR 2c68250100
[8765667.162495] EDAC sbridge MC0: MISC 0
[8765667.162498] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509736 SOCKET 0 APIC 0
[8765667.162517] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2c68250 offset:0x100 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
[8765668.471777] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[8765668.471781] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[8765668.471783] EDAC sbridge MC0: TSC 6cfdf621c6534a
[8765668.471784] EDAC sbridge MC0: ADDR 2b1d6501c0
[8765668.471785] EDAC sbridge MC0: MISC 0
[8765668.471787] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509737 SOCKET 0 APIC 0
[8765668.471804] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2b1d650 offset:0x1c0 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
[8765669.021046] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[8765669.021051] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[8765669.021053] EDAC sbridge MC0: TSC 6cfdf6945ac12f
[8765669.021054] EDAC sbridge MC0: ADDR 2b6d7d0040
[8765669.021056] EDAC sbridge MC0: MISC 0
[8765669.021058] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509738 SOCKET 0 APIC 0
[8765669.021078] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2b6d7d0 offset:0x40 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
[8765671.646384] ghes_print_estatus: 3 callbacks suppressed
[8765671.646387] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[8765671.646389] {3}[Hardware Error]: It has been corrected by h/w and requires no further action
[8765671.646390] {3}[Hardware Error]: event severity: corrected
[8765671.646393] {3}[Hardware Error]:  Error 0, type: corrected
[8765671.646395] {3}[Hardware Error]:  fru_text: A3
[8765671.646396] {3}[Hardware Error]:   section_type: memory error
[8765671.646398] {3}[Hardware Error]:   error_status: 0x0000000000000400
[8765671.646399] {3}[Hardware Error]:   physical_address: 0x000000303a1d0080
[8765671.646403] {3}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 31671 column: 0
[8765671.646404] {3}[Hardware Error]:   error_type: 2, single-bit ECC
[8765671.646431] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[8765671.646433] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[8765671.646435] EDAC sbridge MC0: TSC 6cfdf8b8052f87
[8765671.646437] EDAC sbridge MC0: ADDR 303a1d0080
[8765671.646442] EDAC sbridge MC0: MISC 0
[8765671.646448] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509740 SOCKET 0 APIC 0
[8765671.646469] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x303a1d0 offset:0x80 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
[8765672.579945] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[8765672.579948] {4}[Hardware Error]: It has been corrected by h/w and requires no further action
[8765672.579950] {4}[Hardware Error]: event severity: corrected
[8765672.579952] {4}[Hardware Error]:  Error 0, type: corrected
[8765672.579953] {4}[Hardware Error]:  fru_text: A3
[8765672.579955] {4}[Hardware Error]:   section_type: memory error
[8765672.579956] {4}[Hardware Error]:   error_status: 0x0000000000000400
[8765672.579958] {4}[Hardware Error]:   physical_address: 0x0000002c7e450100
[8765672.579962] {4}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 24561 column: 0
[8765672.579963] {4}[Hardware Error]:   error_type: 2, single-bit ECC
[8765672.579989] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[8765672.579994] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[8765672.580001] EDAC sbridge MC0: TSC 6cfdf97ac3d24a
[8765672.580005] EDAC sbridge MC0: ADDR 2c7e450100
[8765672.580011] EDAC sbridge MC0: MISC 0
[8765672.580016] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509741 SOCKET 0 APIC 0
[8765672.580038] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2c7e450 offset:0x100 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
[8765675.862421] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[8765675.862426] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[8765675.862428] EDAC sbridge MC0: TSC 6cfdfc277f1f35
[8765675.862430] EDAC sbridge MC0: ADDR 2f6a090080
[8765675.862432] EDAC sbridge MC0: MISC 0
[8765675.862435] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509745 SOCKET 0 APIC 0
[8765675.862459] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2f6a090 offset:0x80 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
[8765705.618856] ghes_print_estatus: 1 callbacks suppressed
[8765705.618857] {5}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[8765705.618859] {5}[Hardware Error]: It has been corrected by h/w and requires no further action
[8765705.618860] {5}[Hardware Error]: event severity: corrected
[8765705.618862] {5}[Hardware Error]:  Error 0, type: corrected
[8765705.618863] {5}[Hardware Error]:  fru_text: A3
[8765705.618863] {5}[Hardware Error]:   section_type: memory error
[8765705.618864] {5}[Hardware Error]:   error_status: 0x0000000000000400
[8765705.618865] {5}[Hardware Error]:   physical_address: 0x0000002f54f10100
[8765705.618867] {5}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 32068 column: 0
[8765705.618868] {5}[Hardware Error]:   error_type: 2, single-bit ECC
[8765705.618889] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[8765705.618890] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[8765705.618890] EDAC sbridge MC0: TSC 6cfe1466d55bd8
[8765705.618892] EDAC sbridge MC0: ADDR 2f54f10100
[8765705.618892] EDAC sbridge MC0: MISC 0
[8765705.618893] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509774 SOCKET 0 APIC 0
[8765705.618912] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2f54f10 offset:0x100 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
[8765713.976579] {6}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[8765713.976581] {6}[Hardware Error]: It has been corrected by h/w and requires no further action
[8765713.976582] {6}[Hardware Error]: event severity: corrected
[8765713.976583] {6}[Hardware Error]:  Error 0, type: corrected
[8765713.976584] {6}[Hardware Error]:  fru_text: A3
[8765713.976585] {6}[Hardware Error]:   section_type: memory error
[8765713.976586] {6}[Hardware Error]:   error_status: 0x0000000000000400
[8765713.976586] {6}[Hardware Error]:   physical_address: 0x0000002b44fd0080
[8765713.976589] {6}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 23623 column: 0
[8765713.976589] {6}[Hardware Error]:   error_type: 2, single-bit ECC
[8765713.976608] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[8765713.976609] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[8765713.976610] EDAC sbridge MC0: TSC 6cfe1b364afc97
[8765713.976611] EDAC sbridge MC0: ADDR 2b44fd0080
[8765713.976611] EDAC sbridge MC0: MISC 0
[8765713.976613] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509783 SOCKET 0 APIC 0
[8765713.976630] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2b44fd0 offset:0x80 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
[8765720.214132] {7}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[8765720.214134] {7}[Hardware Error]: It has been corrected by h/w and requires no further action
[8765720.214136] {7}[Hardware Error]: event severity: corrected
[8765720.214138] {7}[Hardware Error]:  Error 0, type: corrected
[8765720.214139] {7}[Hardware Error]:  fru_text: A3
[8765720.214140] {7}[Hardware Error]:   section_type: memory error
[8765720.214141] {7}[Hardware Error]:   error_status: 0x0000000000000400
[8765720.214142] {7}[Hardware Error]:   physical_address: 0x0000002c63810140
[8765720.214145] {7}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 24120 column: 0
[8765720.214146] {7}[Hardware Error]:   error_type: 2, single-bit ECC
[8765720.214169] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[8765720.214170] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[8765720.214175] EDAC sbridge MC0: TSC 6cfe204b79eae5
[8765720.214180] EDAC sbridge MC0: ADDR 2c63810140
[8765720.214183] EDAC sbridge MC0: MISC 0
[8765720.214186] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509789 SOCKET 0 APIC 0
[8765720.214204] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2c63810 offset:0x140 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
[8765743.843363] {8}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[8765743.843366] {8}[Hardware Error]: It has been corrected by h/w and requires no further action
[8765743.843368] {8}[Hardware Error]: event severity: corrected
[8765743.843370] {8}[Hardware Error]:  Error 0, type: corrected
[8765743.843372] {8}[Hardware Error]:  fru_text: A3
[8765743.843373] {8}[Hardware Error]:   section_type: memory error
[8765743.843375] {8}[Hardware Error]:   error_status: 0x0000000000000400
[8765743.843376] {8}[Hardware Error]:   physical_address: 0x0000002f61b900c0
[8765743.843380] {8}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 32270 column: 0
[8765743.843382] {8}[Hardware Error]:   error_type: 2, single-bit ECC
[8765743.843394] mce_notify_irq: 8 callbacks suppressed
[8765743.843395] mce: [Hardware Error]: Machine check events logged
[8765743.843413] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[8765743.843417] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[8765743.843419] EDAC sbridge MC0: TSC 6cfe338ca4f6e8
[8765743.843421] EDAC sbridge MC0: ADDR 2f61b900c0
[8765743.843422] EDAC sbridge MC0: MISC 0
[8765743.843426] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509813 SOCKET 0 APIC 0
[8765743.843453] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2f61b90 offset:0xc0 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
[8765772.895625] {9}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[8765772.895628] {9}[Hardware Error]: It has been corrected by h/w and requires no further action
[8765772.895629] {9}[Hardware Error]: event severity: corrected
[8765772.895631] {9}[Hardware Error]:  Error 0, type: corrected
[8765772.895632] {9}[Hardware Error]:  fru_text: A3
[8765772.895634] {9}[Hardware Error]:   section_type: memory error
[8765772.895635] {9}[Hardware Error]:   error_status: 0x0000000000000400
[8765772.895636] {9}[Hardware Error]:   physical_address: 0x0000002b438d00c0
[8765772.895640] {9}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 23595 column: 0
[8765772.895641] {9}[Hardware Error]:   error_type: 2, single-bit ECC
[8765772.895652] mce: [Hardware Error]: Machine check events logged
[8765772.895664] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[8765772.895667] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[8765772.895668] EDAC sbridge MC0: TSC 6cfe4b39148a1f
[8765772.895675] EDAC sbridge MC0: ADDR 2b438d00c0
[8765772.895680] EDAC sbridge MC0: MISC 0
[8765772.895687] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509842 SOCKET 0 APIC 0
[8765772.895707] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2b438d0 offset:0xc0 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
[8765824.674769] mce: [Hardware Error]: Machine check events logged
[8765845.214136] {10}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[8765845.214139] {10}[Hardware Error]: It has been corrected by h/w and requires no further action
[8765845.214141] {10}[Hardware Error]: event severity: corrected
[8765845.214144] {10}[Hardware Error]:  Error 0, type: corrected
[8765845.214145] {10}[Hardware Error]:  fru_text: A3
[8765845.214147] {10}[Hardware Error]:   section_type: memory error
[8765845.214149] {10}[Hardware Error]:   error_status: 0x0000000000000400
[8765845.214150] {10}[Hardware Error]:   physical_address: 0x0000002f7b1101c0
[8765845.214154] {10}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 32684 column: 0
[8765845.214156] {10}[Hardware Error]:   error_type: 2, single-bit ECC
[8765845.214179] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[8765845.214182] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[8765845.214183] EDAC sbridge MC0: TSC 6cfe86270ec28e
[8765845.214185] EDAC sbridge MC0: ADDR 2f7b1101c0
[8765845.214187] EDAC sbridge MC0: MISC 0
[8765845.214189] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509914 SOCKET 0 APIC 0
[8765845.214212] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2f7b110 offset:0x1c0 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
[8765907.151119] {11}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[8765907.151123] {11}[Hardware Error]: It has been corrected by h/w and requires no further action
[8765907.151125] {11}[Hardware Error]: event severity: corrected
[8765907.151128] {11}[Hardware Error]:  Error 0, type: corrected
[8765907.151130] {11}[Hardware Error]:  fru_text: A3
[8765907.151133] {11}[Hardware Error]:   section_type: memory error
[8765907.151135] {11}[Hardware Error]:   error_status: 0x0000000000000400
[8765907.151138] {11}[Hardware Error]:   physical_address: 0x0000002c35f50140
[8765907.151144] {11}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 23389 column: 0
[8765907.151146] {11}[Hardware Error]:   error_type: 2, single-bit ECC
[8765907.151164] mce: [Hardware Error]: Machine check events logged
[8765907.151185] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[8765907.151194] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[8765907.151202] EDAC sbridge MC0: TSC 6cfeb89f66b2fd
[8765907.151210] EDAC sbridge MC0: ADDR 2c35f50140
[8765907.151216] EDAC sbridge MC0: MISC 0
[8765907.151225] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509976 SOCKET 0 APIC 0
[8765907.151259] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2c35f50 offset:0x140 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
[8765921.522035] {12}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[8765921.522038] {12}[Hardware Error]: It has been corrected by h/w and requires no further action
[8765921.522039] {12}[Hardware Error]: event severity: corrected
[8765921.522041] {12}[Hardware Error]:  Error 0, type: corrected
[8765921.522042] {12}[Hardware Error]:  fru_text: A3
[8765921.522044] {12}[Hardware Error]:   section_type: memory error
[8765921.522045] {12}[Hardware Error]:   error_status: 0x0000000000000400
[8765921.522046] {12}[Hardware Error]:   physical_address: 0x0000002b40a90080
[8765921.522049] {12}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 23554 column: 0
[8765921.522050] {12}[Hardware Error]:   error_type: 2, single-bit ECC
[8765921.522062] mce: [Hardware Error]: Machine check events logged
[8765921.522079] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[8765921.522081] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[8765921.522087] EDAC sbridge MC0: TSC 6cfec4553ce679
[8765921.522092] EDAC sbridge MC0: ADDR 2b40a90080
[8765921.522096] EDAC sbridge MC0: MISC 0
[8765921.522101] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509990 SOCKET 0 APIC 0
[8765921.522123] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2b40a90 offset:0x80 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
[8765921.758822] {13}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[8765921.758824] {13}[Hardware Error]: It has been corrected by h/w and requires no further action
[8765921.758825] {13}[Hardware Error]: event severity: corrected
[8765921.758827] {13}[Hardware Error]:  Error 0, type: corrected
[8765921.758828] {13}[Hardware Error]:  fru_text: A3
[8765921.758829] {13}[Hardware Error]:   section_type: memory error
[8765921.758830] {13}[Hardware Error]:   error_status: 0x0000000000000400
[8765921.758831] {13}[Hardware Error]:   physical_address: 0x0000002c583d0080
[8765921.758834] {13}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 23959 column: 0
[8765921.758835] {13}[Hardware Error]:   error_type: 2, single-bit ECC
[8765921.758858] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[8765921.758859] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[8765921.758862] EDAC sbridge MC0: TSC 6cfec486a1d4d7
[8765921.758868] EDAC sbridge MC0: ADDR 2c583d0080
[8765921.758871] EDAC sbridge MC0: MISC 0
[8765921.758875] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509990 SOCKET 0 APIC 0
[8765921.758893] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2c583d0 offset:0x80 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
[8765988.521314] mce_notify_irq: 1 callbacks suppressed
[8765988.521332] mce: [Hardware Error]: Machine check events logged
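
The `TIME` value on the `PROCESSOR` lines is a Unix epoch in seconds, so the burst can be placed in wall-clock time. A minimal Python sketch, assuming only the first and last `TIME` values from the dump above (the variable names are illustrative):

```python
from datetime import datetime, timezone

# TIME values copied from the first and last "PROCESSOR ... TIME" lines above.
first_time = 1556509734
last_time = 1556509990

# Interpret both as seconds since the Unix epoch, in UTC.
first_utc = datetime.fromtimestamp(first_time, tz=timezone.utc)
last_utc = datetime.fromtimestamp(last_time, tz=timezone.utc)

# Length of the burst of corrected errors, in seconds.
span = (last_utc - first_utc).total_seconds()

print(first_utc.isoformat(), last_utc.isoformat(), span)
```

With these two values the logged events span 256 seconds, starting at 03:48:54 UTC on 2019-04-29.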

Looks like:

[8765665.177550] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x3007d50 offset:0x140 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
[8765666.600279] {2}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 32336 column: 0
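
The `page` and `offset` in the EDAC line are the physical address split at the 4 KiB page boundary, and the rest of the parenthesised part is a list of `key:value` fields. A minimal Python sketch, assuming the dmesg format shown above (the regex and the helper names `parse_ce_line` and `split_address` are illustrative, not part of any EDAC tooling):

```python
import re

# Hypothetical pattern for "... CE memory read error on <label> (<fields>)".
CE_RE = re.compile(r"CE memory read error on (?P<label>\S+) \((?P<fields>[^)]*)\)")

def parse_ce_line(line):
    """Return the DIMM label and key:value fields of a CE line, or None."""
    m = CE_RE.search(line)
    if m is None:
        return None
    # Split each token on the first ':' only, so err_code:0000:009f survives.
    fields = dict(
        tok.split(":", 1) for tok in m.group("fields").split() if ":" in tok
    )
    return {"label": m.group("label"), **fields}

def split_address(addr, page_shift=12):
    """Split a physical address into (page number, offset within the page)."""
    return addr >> page_shift, addr & ((1 << page_shift) - 1)

line = (
    "[8765665.177550] EDAC MC0: 0 CE memory read error on "
    "CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x3007d50 "
    "offset:0x140 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f "
    "socket:0 ha:0 channel_mask:4 rank:1)"
)
event = parse_ce_line(line)

# ADDR 3007d50140 from the matching "EDAC sbridge MC0: ADDR" line.
page, offset = split_address(0x3007D50140)

print(event["label"], event["channel"], event["rank"], hex(page), hex(offset))
```

Every CE line in the dump carries the same label, channel and rank, which is why the errors all point at a single DIMM.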

There is also this message:

[8765666.600256] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[8765666.600259] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[8765666.600262] {2}[Hardware Error]: event severity: corrected
[8765666.600264] {2}[Hardware Error]:  Error 0, type: corrected

There is also nothing in the HW logs, so I guess nothing needs to be done for now?
The icinga alert has not cleared yet.

@Cmjohnson if this host needs a maintenance window, it would need to be coordinated with the service owners @elukey and @Ottomata

Event Timeline

Restricted Application added projects: Operations, Analytics. Apr 29 2019, 5:10 AM
Restricted Application added a subscriber: Liuxinyu970226.
Marostegui triaged this task as Normal priority.

I've acked the alert for now.

Thanks a lot @Marostegui! We can shut down the host without any problem, it only needs a ~10m heads up to properly stop eventlogging data and mariadb (and replication).

Thanks - not sure how to proceed as the dmesg entries show that the issue is fixed but icinga is still reporting errors (I just forced the recheck).
Let's see what @Cmjohnson suggests

@elukey @Ottomata what do you guys want to do with this?

elukey added a comment. May 6 2019, 4:45 PM

@Marostegui sorry, I was under the impression that we needed to wait for feedback from Chris/Rob about how to proceed. Is there anything pending on the Analytics side?

@elukey sorry, I realised that I didn't send the first sentence:
"The errors corrected themselves and Icinga is now all green, which is something we have seen in the past with correctable errors". Hence, I am not sure whether we should close this task for now (and re-open it if it happens again) or whether you prefer to leave it open.

I now have h/w log entries. I will need the server to be taken offline so I can relocate the DIMM and check whether the error follows it. Unfortunately there are steps that need to be done before Dell will replace it.

Record: 84
Date/Time: 04/29/2019 09:35:48
Source: system
Severity: Non-Critical

Description: Correctable memory error rate exceeded for DIMM_A3.

Record: 85
Date/Time: 04/29/2019 09:35:51
Source: system
Severity: Critical
Description: Correctable memory error rate exceeded for DIMM_A3.

@elukey can you coordinate with Chris? ^

[18:50:55]  <cmjohnson1>	marostegui i am confused over db1007...is there an issue or not an issue?  There is a h/w log entry from 4/29 but nothing since 
[18:52:01]  <marostegui>	cmjohnson1: yeah, that is my confusion too, it was a correctable error, which got corrected itself
[18:52:08]  <marostegui>	so not sure if we should go for the dimm exchange or not
[18:52:13]  <marostegui>	cmjohnson1: what do you advise?
[18:52:50]  <cmjohnson1>	let's move the dimm from A3 to B3 and clear the log...if the error returns I will know what to do next. 
[18:53:01]  <marostegui>	sounds good
[18:53:14]  <marostegui>	I will paste that on the ticket so luca can coordinate (as he needs to stop the service)

cc @elukey

elukey added a comment. May 9 2019, 6:19 AM

@Cmjohnson I'd need a heads up of ~15 mins before the maintenance to shut down the host properly, but we can do it anytime!

Marostegui closed this task as Resolved. Mon, Jun 17, 5:04 AM

Closing this for now as it self-recovered and never showed up again.