Reported on icinga:
[06:21:11] <+icinga-wm> PROBLEM - EDAC syslog messages on db1107 is CRITICAL: 16.36 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1107&var-datasource=eqiad+prometheus/ops
[8765665.177485] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 [8765665.177487] {1}[Hardware Error]: It has been corrected by h/w and requires no further action [8765665.177489] {1}[Hardware Error]: event severity: corrected [8765665.177491] {1}[Hardware Error]: Error 0, type: corrected [8765665.177492] {1}[Hardware Error]: fru_text: A3 [8765665.177494] {1}[Hardware Error]: section_type: memory error [8765665.177495] {1}[Hardware Error]: error_status: 0x0000000000000400 [8765665.177496] {1}[Hardware Error]: physical_address: 0x0000003007d50140 [8765665.177500] {1}[Hardware Error]: node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 30845 column: 0 [8765665.177502] {1}[Hardware Error]: error_type: 2, single-bit ECC [8765665.177521] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR [8765665.177523] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f [8765665.177525] EDAC sbridge MC0: TSC 6cfdf37293c04f [8765665.177526] EDAC sbridge MC0: ADDR 3007d50140 [8765665.177527] EDAC sbridge MC0: MISC 0 [8765665.177530] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509734 SOCKET 0 APIC 0 [8765665.177550] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x3007d50 offset:0x140 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1) [8765666.600256] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 [8765666.600259] {2}[Hardware Error]: It has been corrected by h/w and requires no further action [8765666.600262] {2}[Hardware Error]: event severity: corrected [8765666.600264] {2}[Hardware Error]: Error 0, type: corrected [8765666.600266] {2}[Hardware Error]: fru_text: A3 [8765666.600268] {2}[Hardware Error]: section_type: memory error [8765666.600271] {2}[Hardware Error]: error_status: 0x0000000000000400 [8765666.600273] {2}[Hardware Error]: physical_address: 0x0000003064e10180 [8765666.600279] {2}[Hardware Error]: node: 0 card: 2 
module: 0 rank: 1 bank: 2 row: 32336 column: 0 [8765666.600281] {2}[Hardware Error]: error_type: 2, single-bit ECC [8765666.600295] mce: [Hardware Error]: Machine check events logged [8765666.600310] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR [8765666.600314] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f [8765666.600316] EDAC sbridge MC0: TSC 6cfdf49b6021f1 [8765666.600319] EDAC sbridge MC0: ADDR 3064e10180 [8765666.600320] EDAC sbridge MC0: MISC 0 [8765666.600324] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509735 SOCKET 0 APIC 0 [8765666.600359] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x3064e10 offset:0x180 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1) [8765667.162473] mce: [Hardware Error]: Machine check events logged [8765667.162486] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR [8765667.162490] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f [8765667.162492] EDAC sbridge MC0: TSC 6cfdf510a618ee [8765667.162494] EDAC sbridge MC0: ADDR 2c68250100 [8765667.162495] EDAC sbridge MC0: MISC 0 [8765667.162498] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509736 SOCKET 0 APIC 0 [8765667.162517] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2c68250 offset:0x100 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1) [8765668.471777] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR [8765668.471781] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f [8765668.471783] EDAC sbridge MC0: TSC 6cfdf621c6534a [8765668.471784] EDAC sbridge MC0: ADDR 2b1d6501c0 [8765668.471785] EDAC sbridge MC0: MISC 0 [8765668.471787] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509737 SOCKET 0 APIC 0 [8765668.471804] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2b1d650 offset:0x1c0 grain:32 syndrome:0x0 - 
area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1) [8765669.021046] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR [8765669.021051] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f [8765669.021053] EDAC sbridge MC0: TSC 6cfdf6945ac12f [8765669.021054] EDAC sbridge MC0: ADDR 2b6d7d0040 [8765669.021056] EDAC sbridge MC0: MISC 0 [8765669.021058] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509738 SOCKET 0 APIC 0 [8765669.021078] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2b6d7d0 offset:0x40 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1) [8765671.646384] ghes_print_estatus: 3 callbacks suppressed [8765671.646387] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 [8765671.646389] {3}[Hardware Error]: It has been corrected by h/w and requires no further action [8765671.646390] {3}[Hardware Error]: event severity: corrected [8765671.646393] {3}[Hardware Error]: Error 0, type: corrected [8765671.646395] {3}[Hardware Error]: fru_text: A3 [8765671.646396] {3}[Hardware Error]: section_type: memory error [8765671.646398] {3}[Hardware Error]: error_status: 0x0000000000000400 [8765671.646399] {3}[Hardware Error]: physical_address: 0x000000303a1d0080 [8765671.646403] {3}[Hardware Error]: node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 31671 column: 0 [8765671.646404] {3}[Hardware Error]: error_type: 2, single-bit ECC [8765671.646431] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR [8765671.646433] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f [8765671.646435] EDAC sbridge MC0: TSC 6cfdf8b8052f87 [8765671.646437] EDAC sbridge MC0: ADDR 303a1d0080 [8765671.646442] EDAC sbridge MC0: MISC 0 [8765671.646448] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509740 SOCKET 0 APIC 0 [8765671.646469] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x303a1d0 offset:0x80 
grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1) [8765672.579945] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 [8765672.579948] {4}[Hardware Error]: It has been corrected by h/w and requires no further action [8765672.579950] {4}[Hardware Error]: event severity: corrected [8765672.579952] {4}[Hardware Error]: Error 0, type: corrected [8765672.579953] {4}[Hardware Error]: fru_text: A3 [8765672.579955] {4}[Hardware Error]: section_type: memory error [8765672.579956] {4}[Hardware Error]: error_status: 0x0000000000000400 [8765672.579958] {4}[Hardware Error]: physical_address: 0x0000002c7e450100 [8765672.579962] {4}[Hardware Error]: node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 24561 column: 0 [8765672.579963] {4}[Hardware Error]: error_type: 2, single-bit ECC [8765672.579989] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR [8765672.579994] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f [8765672.580001] EDAC sbridge MC0: TSC 6cfdf97ac3d24a [8765672.580005] EDAC sbridge MC0: ADDR 2c7e450100 [8765672.580011] EDAC sbridge MC0: MISC 0 [8765672.580016] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509741 SOCKET 0 APIC 0 [8765672.580038] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2c7e450 offset:0x100 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1) [8765675.862421] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR [8765675.862426] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f [8765675.862428] EDAC sbridge MC0: TSC 6cfdfc277f1f35 [8765675.862430] EDAC sbridge MC0: ADDR 2f6a090080 [8765675.862432] EDAC sbridge MC0: MISC 0 [8765675.862435] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509745 SOCKET 0 APIC 0 [8765675.862459] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2f6a090 offset:0x80 grain:32 syndrome:0x0 - area:DRAM 
err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1) [8765705.618856] ghes_print_estatus: 1 callbacks suppressed [8765705.618857] {5}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 [8765705.618859] {5}[Hardware Error]: It has been corrected by h/w and requires no further action [8765705.618860] {5}[Hardware Error]: event severity: corrected [8765705.618862] {5}[Hardware Error]: Error 0, type: corrected [8765705.618863] {5}[Hardware Error]: fru_text: A3 [8765705.618863] {5}[Hardware Error]: section_type: memory error [8765705.618864] {5}[Hardware Error]: error_status: 0x0000000000000400 [8765705.618865] {5}[Hardware Error]: physical_address: 0x0000002f54f10100 [8765705.618867] {5}[Hardware Error]: node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 32068 column: 0 [8765705.618868] {5}[Hardware Error]: error_type: 2, single-bit ECC [8765705.618889] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR [8765705.618890] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f [8765705.618890] EDAC sbridge MC0: TSC 6cfe1466d55bd8 [8765705.618892] EDAC sbridge MC0: ADDR 2f54f10100 [8765705.618892] EDAC sbridge MC0: MISC 0 [8765705.618893] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509774 SOCKET 0 APIC 0 [8765705.618912] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2f54f10 offset:0x100 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1) [8765713.976579] {6}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 [8765713.976581] {6}[Hardware Error]: It has been corrected by h/w and requires no further action [8765713.976582] {6}[Hardware Error]: event severity: corrected [8765713.976583] {6}[Hardware Error]: Error 0, type: corrected [8765713.976584] {6}[Hardware Error]: fru_text: A3 [8765713.976585] {6}[Hardware Error]: section_type: memory error [8765713.976586] {6}[Hardware Error]: error_status: 0x0000000000000400 [8765713.976586] 
{6}[Hardware Error]: physical_address: 0x0000002b44fd0080 [8765713.976589] {6}[Hardware Error]: node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 23623 column: 0 [8765713.976589] {6}[Hardware Error]: error_type: 2, single-bit ECC [8765713.976608] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR [8765713.976609] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f [8765713.976610] EDAC sbridge MC0: TSC 6cfe1b364afc97 [8765713.976611] EDAC sbridge MC0: ADDR 2b44fd0080 [8765713.976611] EDAC sbridge MC0: MISC 0 [8765713.976613] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509783 SOCKET 0 APIC 0 [8765713.976630] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2b44fd0 offset:0x80 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1) [8765720.214132] {7}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 [8765720.214134] {7}[Hardware Error]: It has been corrected by h/w and requires no further action [8765720.214136] {7}[Hardware Error]: event severity: corrected [8765720.214138] {7}[Hardware Error]: Error 0, type: corrected [8765720.214139] {7}[Hardware Error]: fru_text: A3 [8765720.214140] {7}[Hardware Error]: section_type: memory error [8765720.214141] {7}[Hardware Error]: error_status: 0x0000000000000400 [8765720.214142] {7}[Hardware Error]: physical_address: 0x0000002c63810140 [8765720.214145] {7}[Hardware Error]: node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 24120 column: 0 [8765720.214146] {7}[Hardware Error]: error_type: 2, single-bit ECC [8765720.214169] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR [8765720.214170] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f [8765720.214175] EDAC sbridge MC0: TSC 6cfe204b79eae5 [8765720.214180] EDAC sbridge MC0: ADDR 2c63810140 [8765720.214183] EDAC sbridge MC0: MISC 0 [8765720.214186] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509789 SOCKET 0 APIC 0 [8765720.214204] EDAC MC0: 0 CE 
memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2c63810 offset:0x140 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1) [8765743.843363] {8}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 [8765743.843366] {8}[Hardware Error]: It has been corrected by h/w and requires no further action [8765743.843368] {8}[Hardware Error]: event severity: corrected [8765743.843370] {8}[Hardware Error]: Error 0, type: corrected [8765743.843372] {8}[Hardware Error]: fru_text: A3 [8765743.843373] {8}[Hardware Error]: section_type: memory error [8765743.843375] {8}[Hardware Error]: error_status: 0x0000000000000400 [8765743.843376] {8}[Hardware Error]: physical_address: 0x0000002f61b900c0 [8765743.843380] {8}[Hardware Error]: node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 32270 column: 0 [8765743.843382] {8}[Hardware Error]: error_type: 2, single-bit ECC [8765743.843394] mce_notify_irq: 8 callbacks suppressed [8765743.843395] mce: [Hardware Error]: Machine check events logged [8765743.843413] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR [8765743.843417] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f [8765743.843419] EDAC sbridge MC0: TSC 6cfe338ca4f6e8 [8765743.843421] EDAC sbridge MC0: ADDR 2f61b900c0 [8765743.843422] EDAC sbridge MC0: MISC 0 [8765743.843426] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509813 SOCKET 0 APIC 0 [8765743.843453] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2f61b90 offset:0xc0 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1) [8765772.895625] {9}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 [8765772.895628] {9}[Hardware Error]: It has been corrected by h/w and requires no further action [8765772.895629] {9}[Hardware Error]: event severity: corrected [8765772.895631] {9}[Hardware Error]: Error 0, type: corrected 
[8765772.895632] {9}[Hardware Error]: fru_text: A3 [8765772.895634] {9}[Hardware Error]: section_type: memory error [8765772.895635] {9}[Hardware Error]: error_status: 0x0000000000000400 [8765772.895636] {9}[Hardware Error]: physical_address: 0x0000002b438d00c0 [8765772.895640] {9}[Hardware Error]: node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 23595 column: 0 [8765772.895641] {9}[Hardware Error]: error_type: 2, single-bit ECC [8765772.895652] mce: [Hardware Error]: Machine check events logged [8765772.895664] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR [8765772.895667] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f [8765772.895668] EDAC sbridge MC0: TSC 6cfe4b39148a1f [8765772.895675] EDAC sbridge MC0: ADDR 2b438d00c0 [8765772.895680] EDAC sbridge MC0: MISC 0 [8765772.895687] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509842 SOCKET 0 APIC 0 [8765772.895707] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2b438d0 offset:0xc0 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1) [8765824.674769] mce: [Hardware Error]: Machine check events logged [8765845.214136] {10}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 [8765845.214139] {10}[Hardware Error]: It has been corrected by h/w and requires no further action [8765845.214141] {10}[Hardware Error]: event severity: corrected [8765845.214144] {10}[Hardware Error]: Error 0, type: corrected [8765845.214145] {10}[Hardware Error]: fru_text: A3 [8765845.214147] {10}[Hardware Error]: section_type: memory error [8765845.214149] {10}[Hardware Error]: error_status: 0x0000000000000400 [8765845.214150] {10}[Hardware Error]: physical_address: 0x0000002f7b1101c0 [8765845.214154] {10}[Hardware Error]: node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 32684 column: 0 [8765845.214156] {10}[Hardware Error]: error_type: 2, single-bit ECC [8765845.214179] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR 
[8765845.214182] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f [8765845.214183] EDAC sbridge MC0: TSC 6cfe86270ec28e [8765845.214185] EDAC sbridge MC0: ADDR 2f7b1101c0 [8765845.214187] EDAC sbridge MC0: MISC 0 [8765845.214189] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509914 SOCKET 0 APIC 0 [8765845.214212] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2f7b110 offset:0x1c0 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1) [8765907.151119] {11}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 [8765907.151123] {11}[Hardware Error]: It has been corrected by h/w and requires no further action [8765907.151125] {11}[Hardware Error]: event severity: corrected [8765907.151128] {11}[Hardware Error]: Error 0, type: corrected [8765907.151130] {11}[Hardware Error]: fru_text: A3 [8765907.151133] {11}[Hardware Error]: section_type: memory error [8765907.151135] {11}[Hardware Error]: error_status: 0x0000000000000400 [8765907.151138] {11}[Hardware Error]: physical_address: 0x0000002c35f50140 [8765907.151144] {11}[Hardware Error]: node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 23389 column: 0 [8765907.151146] {11}[Hardware Error]: error_type: 2, single-bit ECC [8765907.151164] mce: [Hardware Error]: Machine check events logged [8765907.151185] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR [8765907.151194] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f [8765907.151202] EDAC sbridge MC0: TSC 6cfeb89f66b2fd [8765907.151210] EDAC sbridge MC0: ADDR 2c35f50140 [8765907.151216] EDAC sbridge MC0: MISC 0 [8765907.151225] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509976 SOCKET 0 APIC 0 [8765907.151259] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2c35f50 offset:0x140 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1) [8765921.522035] 
{12}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 [8765921.522038] {12}[Hardware Error]: It has been corrected by h/w and requires no further action [8765921.522039] {12}[Hardware Error]: event severity: corrected [8765921.522041] {12}[Hardware Error]: Error 0, type: corrected [8765921.522042] {12}[Hardware Error]: fru_text: A3 [8765921.522044] {12}[Hardware Error]: section_type: memory error [8765921.522045] {12}[Hardware Error]: error_status: 0x0000000000000400 [8765921.522046] {12}[Hardware Error]: physical_address: 0x0000002b40a90080 [8765921.522049] {12}[Hardware Error]: node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 23554 column: 0 [8765921.522050] {12}[Hardware Error]: error_type: 2, single-bit ECC [8765921.522062] mce: [Hardware Error]: Machine check events logged [8765921.522079] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR [8765921.522081] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f [8765921.522087] EDAC sbridge MC0: TSC 6cfec4553ce679 [8765921.522092] EDAC sbridge MC0: ADDR 2b40a90080 [8765921.522096] EDAC sbridge MC0: MISC 0 [8765921.522101] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509990 SOCKET 0 APIC 0 [8765921.522123] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2b40a90 offset:0x80 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1) [8765921.758822] {13}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4 [8765921.758824] {13}[Hardware Error]: It has been corrected by h/w and requires no further action [8765921.758825] {13}[Hardware Error]: event severity: corrected [8765921.758827] {13}[Hardware Error]: Error 0, type: corrected [8765921.758828] {13}[Hardware Error]: fru_text: A3 [8765921.758829] {13}[Hardware Error]: section_type: memory error [8765921.758830] {13}[Hardware Error]: error_status: 0x0000000000000400 [8765921.758831] {13}[Hardware Error]: physical_address: 
0x0000002c583d0080 [8765921.758834] {13}[Hardware Error]: node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 23959 column: 0 [8765921.758835] {13}[Hardware Error]: error_type: 2, single-bit ECC [8765921.758858] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR [8765921.758859] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f [8765921.758862] EDAC sbridge MC0: TSC 6cfec486a1d4d7 [8765921.758868] EDAC sbridge MC0: ADDR 2c583d0080 [8765921.758871] EDAC sbridge MC0: MISC 0 [8765921.758875] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509990 SOCKET 0 APIC 0 [8765921.758893] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2c583d0 offset:0x80 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1) [8765988.521314] mce_notify_irq: 1 callbacks suppressed [8765988.521332] mce: [Hardware Error]: Machine check events logged
Looks like the affected DIMM is A3 (channel 2, slot 0):
[8765665.177550] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x3007d50 offset:0x140 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
[8765666.600279] {2}[Hardware Error]: node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 32336 column: 0
There is also this message:
[8765666.600256] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[8765666.600259] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[8765666.600262] {2}[Hardware Error]: event severity: corrected
[8765666.600264] {2}[Hardware Error]: Error 0, type: corrected
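To double-check that every corrected error in the dump above points at the same DIMM, the EDAC summary lines can be tallied per DIMM label. This is only a rough sketch: the function name `tally_edac_errors` and the regular expression are my own assumptions, written against the message format seen in this log, and may not match other kernel versions or EDAC drivers.

```python
import re
from collections import Counter

# Matches the EDAC summary line as it appears in this log, e.g.
# "EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 ..."
# Whitespace is matched loosely because the pasted log wraps mid-message.
EDAC_RE = re.compile(
    r"EDAC MC\d+:\s+\d+\s+(CE|UE)\s+.*?\s+on\s+(\S+)\s+\(channel:(\d+)\s+slot:(\d+)",
    re.DOTALL,
)


def tally_edac_errors(log_text):
    """Count EDAC events per (kind, DIMM label, channel, slot)."""
    counts = Counter()
    for kind, label, channel, slot in EDAC_RE.findall(log_text):
        counts[(kind, label, int(channel), int(slot))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): dmesg | python3 tally_edac.py
    for key, n in tally_edac_errors(sys.stdin.read()).most_common():
        print(n, *key)
```

If all events land on a single key with kind `CE`, that supports the reading that one DIMM is producing correctable single-bit errors.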
There is also nothing in the HW logs, so I guess nothing needs to be done for now?
The icinga alert has not cleared yet.
@Cmjohnson if this host needs a maintenance window, it would need to be coordinated with the service owners @elukey and @Ottomata