
db1107 (eventlogging db master) possibly memory issues
Closed, Resolved (Public)

Description

Reported on icinga:

[06:21:11]  <+icinga-wm>	PROBLEM - EDAC syslog messages on db1107 is CRITICAL: 16.36 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1107&var-datasource=eqiad+prometheus/ops
[8765665.177485] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[8765665.177487] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[8765665.177489] {1}[Hardware Error]: event severity: corrected
[8765665.177491] {1}[Hardware Error]:  Error 0, type: corrected
[8765665.177492] {1}[Hardware Error]:  fru_text: A3
[8765665.177494] {1}[Hardware Error]:   section_type: memory error
[8765665.177495] {1}[Hardware Error]:   error_status: 0x0000000000000400
[8765665.177496] {1}[Hardware Error]:   physical_address: 0x0000003007d50140
[8765665.177500] {1}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 30845 column: 0
[8765665.177502] {1}[Hardware Error]:   error_type: 2, single-bit ECC
[8765665.177521] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[8765665.177523] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[8765665.177525] EDAC sbridge MC0: TSC 6cfdf37293c04f
[8765665.177526] EDAC sbridge MC0: ADDR 3007d50140
[8765665.177527] EDAC sbridge MC0: MISC 0
[8765665.177530] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509734 SOCKET 0 APIC 0
[8765665.177550] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x3007d50 offset:0x140 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
[8765666.600256] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[8765666.600259] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[8765666.600262] {2}[Hardware Error]: event severity: corrected
[8765666.600264] {2}[Hardware Error]:  Error 0, type: corrected
[8765666.600266] {2}[Hardware Error]:  fru_text: A3
[8765666.600268] {2}[Hardware Error]:   section_type: memory error
[8765666.600271] {2}[Hardware Error]:   error_status: 0x0000000000000400
[8765666.600273] {2}[Hardware Error]:   physical_address: 0x0000003064e10180
[8765666.600279] {2}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 32336 column: 0
[8765666.600281] {2}[Hardware Error]:   error_type: 2, single-bit ECC
[8765666.600295] mce: [Hardware Error]: Machine check events logged
[8765666.600310] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[8765666.600314] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[8765666.600316] EDAC sbridge MC0: TSC 6cfdf49b6021f1
[8765666.600319] EDAC sbridge MC0: ADDR 3064e10180
[8765666.600320] EDAC sbridge MC0: MISC 0
[8765666.600324] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509735 SOCKET 0 APIC 0
[8765666.600359] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x3064e10 offset:0x180 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
[8765667.162473] mce: [Hardware Error]: Machine check events logged
[8765667.162486] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[8765667.162490] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[8765667.162492] EDAC sbridge MC0: TSC 6cfdf510a618ee
[8765667.162494] EDAC sbridge MC0: ADDR 2c68250100
[8765667.162495] EDAC sbridge MC0: MISC 0
[8765667.162498] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509736 SOCKET 0 APIC 0
[8765667.162517] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2c68250 offset:0x100 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
[8765668.471777] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[8765668.471781] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[8765668.471783] EDAC sbridge MC0: TSC 6cfdf621c6534a
[8765668.471784] EDAC sbridge MC0: ADDR 2b1d6501c0
[8765668.471785] EDAC sbridge MC0: MISC 0
[8765668.471787] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509737 SOCKET 0 APIC 0
[8765668.471804] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2b1d650 offset:0x1c0 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
[8765669.021046] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[8765669.021051] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[8765669.021053] EDAC sbridge MC0: TSC 6cfdf6945ac12f
[8765669.021054] EDAC sbridge MC0: ADDR 2b6d7d0040
[8765669.021056] EDAC sbridge MC0: MISC 0
[8765669.021058] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509738 SOCKET 0 APIC 0
[8765669.021078] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2b6d7d0 offset:0x40 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
[8765671.646384] ghes_print_estatus: 3 callbacks suppressed
[8765671.646387] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[8765671.646389] {3}[Hardware Error]: It has been corrected by h/w and requires no further action
[8765671.646390] {3}[Hardware Error]: event severity: corrected
[8765671.646393] {3}[Hardware Error]:  Error 0, type: corrected
[8765671.646395] {3}[Hardware Error]:  fru_text: A3
[8765671.646396] {3}[Hardware Error]:   section_type: memory error
[8765671.646398] {3}[Hardware Error]:   error_status: 0x0000000000000400
[8765671.646399] {3}[Hardware Error]:   physical_address: 0x000000303a1d0080
[8765671.646403] {3}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 31671 column: 0
[8765671.646404] {3}[Hardware Error]:   error_type: 2, single-bit ECC
[8765671.646431] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[8765671.646433] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[8765671.646435] EDAC sbridge MC0: TSC 6cfdf8b8052f87
[8765671.646437] EDAC sbridge MC0: ADDR 303a1d0080
[8765671.646442] EDAC sbridge MC0: MISC 0
[8765671.646448] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509740 SOCKET 0 APIC 0
[8765671.646469] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x303a1d0 offset:0x80 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
[8765672.579945] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[8765672.579948] {4}[Hardware Error]: It has been corrected by h/w and requires no further action
[8765672.579950] {4}[Hardware Error]: event severity: corrected
[8765672.579952] {4}[Hardware Error]:  Error 0, type: corrected
[8765672.579953] {4}[Hardware Error]:  fru_text: A3
[8765672.579955] {4}[Hardware Error]:   section_type: memory error
[8765672.579956] {4}[Hardware Error]:   error_status: 0x0000000000000400
[8765672.579958] {4}[Hardware Error]:   physical_address: 0x0000002c7e450100
[8765672.579962] {4}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 24561 column: 0
[8765672.579963] {4}[Hardware Error]:   error_type: 2, single-bit ECC
[8765672.579989] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[8765672.579994] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[8765672.580001] EDAC sbridge MC0: TSC 6cfdf97ac3d24a
[8765672.580005] EDAC sbridge MC0: ADDR 2c7e450100
[8765672.580011] EDAC sbridge MC0: MISC 0
[8765672.580016] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509741 SOCKET 0 APIC 0
[8765672.580038] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2c7e450 offset:0x100 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
[8765675.862421] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[8765675.862426] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[8765675.862428] EDAC sbridge MC0: TSC 6cfdfc277f1f35
[8765675.862430] EDAC sbridge MC0: ADDR 2f6a090080
[8765675.862432] EDAC sbridge MC0: MISC 0
[8765675.862435] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509745 SOCKET 0 APIC 0
[8765675.862459] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2f6a090 offset:0x80 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
[8765705.618856] ghes_print_estatus: 1 callbacks suppressed
[8765705.618857] {5}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[8765705.618859] {5}[Hardware Error]: It has been corrected by h/w and requires no further action
[8765705.618860] {5}[Hardware Error]: event severity: corrected
[8765705.618862] {5}[Hardware Error]:  Error 0, type: corrected
[8765705.618863] {5}[Hardware Error]:  fru_text: A3
[8765705.618863] {5}[Hardware Error]:   section_type: memory error
[8765705.618864] {5}[Hardware Error]:   error_status: 0x0000000000000400
[8765705.618865] {5}[Hardware Error]:   physical_address: 0x0000002f54f10100
[8765705.618867] {5}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 32068 column: 0
[8765705.618868] {5}[Hardware Error]:   error_type: 2, single-bit ECC
[8765705.618889] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[8765705.618890] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[8765705.618890] EDAC sbridge MC0: TSC 6cfe1466d55bd8
[8765705.618892] EDAC sbridge MC0: ADDR 2f54f10100
[8765705.618892] EDAC sbridge MC0: MISC 0
[8765705.618893] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509774 SOCKET 0 APIC 0
[8765705.618912] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2f54f10 offset:0x100 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
[8765713.976579] {6}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[8765713.976581] {6}[Hardware Error]: It has been corrected by h/w and requires no further action
[8765713.976582] {6}[Hardware Error]: event severity: corrected
[8765713.976583] {6}[Hardware Error]:  Error 0, type: corrected
[8765713.976584] {6}[Hardware Error]:  fru_text: A3
[8765713.976585] {6}[Hardware Error]:   section_type: memory error
[8765713.976586] {6}[Hardware Error]:   error_status: 0x0000000000000400
[8765713.976586] {6}[Hardware Error]:   physical_address: 0x0000002b44fd0080
[8765713.976589] {6}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 23623 column: 0
[8765713.976589] {6}[Hardware Error]:   error_type: 2, single-bit ECC
[8765713.976608] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[8765713.976609] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[8765713.976610] EDAC sbridge MC0: TSC 6cfe1b364afc97
[8765713.976611] EDAC sbridge MC0: ADDR 2b44fd0080
[8765713.976611] EDAC sbridge MC0: MISC 0
[8765713.976613] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509783 SOCKET 0 APIC 0
[8765713.976630] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2b44fd0 offset:0x80 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
[8765720.214132] {7}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[8765720.214134] {7}[Hardware Error]: It has been corrected by h/w and requires no further action
[8765720.214136] {7}[Hardware Error]: event severity: corrected
[8765720.214138] {7}[Hardware Error]:  Error 0, type: corrected
[8765720.214139] {7}[Hardware Error]:  fru_text: A3
[8765720.214140] {7}[Hardware Error]:   section_type: memory error
[8765720.214141] {7}[Hardware Error]:   error_status: 0x0000000000000400
[8765720.214142] {7}[Hardware Error]:   physical_address: 0x0000002c63810140
[8765720.214145] {7}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 24120 column: 0
[8765720.214146] {7}[Hardware Error]:   error_type: 2, single-bit ECC
[8765720.214169] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[8765720.214170] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[8765720.214175] EDAC sbridge MC0: TSC 6cfe204b79eae5
[8765720.214180] EDAC sbridge MC0: ADDR 2c63810140
[8765720.214183] EDAC sbridge MC0: MISC 0
[8765720.214186] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509789 SOCKET 0 APIC 0
[8765720.214204] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2c63810 offset:0x140 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
[8765743.843363] {8}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[8765743.843366] {8}[Hardware Error]: It has been corrected by h/w and requires no further action
[8765743.843368] {8}[Hardware Error]: event severity: corrected
[8765743.843370] {8}[Hardware Error]:  Error 0, type: corrected
[8765743.843372] {8}[Hardware Error]:  fru_text: A3
[8765743.843373] {8}[Hardware Error]:   section_type: memory error
[8765743.843375] {8}[Hardware Error]:   error_status: 0x0000000000000400
[8765743.843376] {8}[Hardware Error]:   physical_address: 0x0000002f61b900c0
[8765743.843380] {8}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 32270 column: 0
[8765743.843382] {8}[Hardware Error]:   error_type: 2, single-bit ECC
[8765743.843394] mce_notify_irq: 8 callbacks suppressed
[8765743.843395] mce: [Hardware Error]: Machine check events logged
[8765743.843413] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[8765743.843417] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[8765743.843419] EDAC sbridge MC0: TSC 6cfe338ca4f6e8
[8765743.843421] EDAC sbridge MC0: ADDR 2f61b900c0
[8765743.843422] EDAC sbridge MC0: MISC 0
[8765743.843426] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509813 SOCKET 0 APIC 0
[8765743.843453] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2f61b90 offset:0xc0 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
[8765772.895625] {9}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[8765772.895628] {9}[Hardware Error]: It has been corrected by h/w and requires no further action
[8765772.895629] {9}[Hardware Error]: event severity: corrected
[8765772.895631] {9}[Hardware Error]:  Error 0, type: corrected
[8765772.895632] {9}[Hardware Error]:  fru_text: A3
[8765772.895634] {9}[Hardware Error]:   section_type: memory error
[8765772.895635] {9}[Hardware Error]:   error_status: 0x0000000000000400
[8765772.895636] {9}[Hardware Error]:   physical_address: 0x0000002b438d00c0
[8765772.895640] {9}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 23595 column: 0
[8765772.895641] {9}[Hardware Error]:   error_type: 2, single-bit ECC
[8765772.895652] mce: [Hardware Error]: Machine check events logged
[8765772.895664] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[8765772.895667] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[8765772.895668] EDAC sbridge MC0: TSC 6cfe4b39148a1f
[8765772.895675] EDAC sbridge MC0: ADDR 2b438d00c0
[8765772.895680] EDAC sbridge MC0: MISC 0
[8765772.895687] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509842 SOCKET 0 APIC 0
[8765772.895707] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2b438d0 offset:0xc0 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
[8765824.674769] mce: [Hardware Error]: Machine check events logged
[8765845.214136] {10}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[8765845.214139] {10}[Hardware Error]: It has been corrected by h/w and requires no further action
[8765845.214141] {10}[Hardware Error]: event severity: corrected
[8765845.214144] {10}[Hardware Error]:  Error 0, type: corrected
[8765845.214145] {10}[Hardware Error]:  fru_text: A3
[8765845.214147] {10}[Hardware Error]:   section_type: memory error
[8765845.214149] {10}[Hardware Error]:   error_status: 0x0000000000000400
[8765845.214150] {10}[Hardware Error]:   physical_address: 0x0000002f7b1101c0
[8765845.214154] {10}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 32684 column: 0
[8765845.214156] {10}[Hardware Error]:   error_type: 2, single-bit ECC
[8765845.214179] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[8765845.214182] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[8765845.214183] EDAC sbridge MC0: TSC 6cfe86270ec28e
[8765845.214185] EDAC sbridge MC0: ADDR 2f7b1101c0
[8765845.214187] EDAC sbridge MC0: MISC 0
[8765845.214189] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509914 SOCKET 0 APIC 0
[8765845.214212] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2f7b110 offset:0x1c0 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
[8765907.151119] {11}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[8765907.151123] {11}[Hardware Error]: It has been corrected by h/w and requires no further action
[8765907.151125] {11}[Hardware Error]: event severity: corrected
[8765907.151128] {11}[Hardware Error]:  Error 0, type: corrected
[8765907.151130] {11}[Hardware Error]:  fru_text: A3
[8765907.151133] {11}[Hardware Error]:   section_type: memory error
[8765907.151135] {11}[Hardware Error]:   error_status: 0x0000000000000400
[8765907.151138] {11}[Hardware Error]:   physical_address: 0x0000002c35f50140
[8765907.151144] {11}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 23389 column: 0
[8765907.151146] {11}[Hardware Error]:   error_type: 2, single-bit ECC
[8765907.151164] mce: [Hardware Error]: Machine check events logged
[8765907.151185] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[8765907.151194] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[8765907.151202] EDAC sbridge MC0: TSC 6cfeb89f66b2fd
[8765907.151210] EDAC sbridge MC0: ADDR 2c35f50140
[8765907.151216] EDAC sbridge MC0: MISC 0
[8765907.151225] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509976 SOCKET 0 APIC 0
[8765907.151259] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2c35f50 offset:0x140 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
[8765921.522035] {12}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[8765921.522038] {12}[Hardware Error]: It has been corrected by h/w and requires no further action
[8765921.522039] {12}[Hardware Error]: event severity: corrected
[8765921.522041] {12}[Hardware Error]:  Error 0, type: corrected
[8765921.522042] {12}[Hardware Error]:  fru_text: A3
[8765921.522044] {12}[Hardware Error]:   section_type: memory error
[8765921.522045] {12}[Hardware Error]:   error_status: 0x0000000000000400
[8765921.522046] {12}[Hardware Error]:   physical_address: 0x0000002b40a90080
[8765921.522049] {12}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 23554 column: 0
[8765921.522050] {12}[Hardware Error]:   error_type: 2, single-bit ECC
[8765921.522062] mce: [Hardware Error]: Machine check events logged
[8765921.522079] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[8765921.522081] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[8765921.522087] EDAC sbridge MC0: TSC 6cfec4553ce679
[8765921.522092] EDAC sbridge MC0: ADDR 2b40a90080
[8765921.522096] EDAC sbridge MC0: MISC 0
[8765921.522101] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509990 SOCKET 0 APIC 0
[8765921.522123] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2b40a90 offset:0x80 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
[8765921.758822] {13}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[8765921.758824] {13}[Hardware Error]: It has been corrected by h/w and requires no further action
[8765921.758825] {13}[Hardware Error]: event severity: corrected
[8765921.758827] {13}[Hardware Error]:  Error 0, type: corrected
[8765921.758828] {13}[Hardware Error]:  fru_text: A3
[8765921.758829] {13}[Hardware Error]:   section_type: memory error
[8765921.758830] {13}[Hardware Error]:   error_status: 0x0000000000000400
[8765921.758831] {13}[Hardware Error]:   physical_address: 0x0000002c583d0080
[8765921.758834] {13}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 23959 column: 0
[8765921.758835] {13}[Hardware Error]:   error_type: 2, single-bit ECC
[8765921.758858] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[8765921.758859] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 255: 940000000000009f
[8765921.758862] EDAC sbridge MC0: TSC 6cfec486a1d4d7
[8765921.758868] EDAC sbridge MC0: ADDR 2c583d0080
[8765921.758871] EDAC sbridge MC0: MISC 0
[8765921.758875] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1556509990 SOCKET 0 APIC 0
[8765921.758893] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x2c583d0 offset:0x80 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
[8765988.521314] mce_notify_irq: 1 callbacks suppressed
[8765988.521332] mce: [Hardware Error]: Machine check events logged

Looks like:

[8765665.177550] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x3007d50 offset:0x140 grain:32 syndrome:0x0 -  area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:4 rank:1)
[8765666.600279] {2}[Hardware Error]:   node: 0 card: 2 module: 0 rank: 1 bank: 2 row: 32336 column: 0
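To see at a glance whether every corrected error points at the same DIMM, the `EDAC MC0: ... CE` summary lines can be tallied per label. The following is a minimal sketch, not an existing tool: the function names `parse_ce` and `tally` are made up for illustration, the regex only targets the line layout shown in this paste, and the address reconstruction assumes 4 KiB pages. With that assumption, `page:0x3007d50 offset:0x140` reproduces the `ADDR 3007d50140` printed by the sbridge driver.

```python
import re
from collections import Counter

# Matches the per-error summary line as it appears in the dmesg paste above,
# e.g. "EDAC MC0: 0 CE memory read error on <label> (channel:2 slot:0 page:0x... offset:0x...".
CE_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): \d+ CE .*? on (?P<label>\S+) "
    r"\(channel:(?P<channel>\d+) slot:(?P<slot>\d+) "
    r"page:0x(?P<page>[0-9a-f]+) offset:0x(?P<offset>[0-9a-f]+)"
)

PAGE_SHIFT = 12  # assumption: 4 KiB pages


def parse_ce(line):
    """Return (dimm_label, physical_address) for a CE line, or None otherwise."""
    m = CE_RE.search(line)
    if m is None:
        return None
    addr = (int(m["page"], 16) << PAGE_SHIFT) | int(m["offset"], 16)
    return m["label"], addr


def tally(lines):
    """Count corrected errors per DIMM label."""
    counts = Counter()
    for line in lines:
        parsed = parse_ce(line)
        if parsed is not None:
            counts[parsed[0]] += 1
    return counts
```

Feeding it the output of `dmesg` (one line per list element) should give a single key here, `CPU_SrcID#0_Ha#0_Chan#2_DIMM#0`, which matches `fru_text: A3` in the APEI records.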

There is also this message:

[8765666.600256] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[8765666.600259] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[8765666.600262] {2}[Hardware Error]: event severity: corrected
[8765666.600264] {2}[Hardware Error]:  Error 0, type: corrected

There is also nothing in the HW logs, so I guess nothing needs to be done for now?
The icinga alert has not yet cleared up.

@Cmjohnson if this host needs a maintenance window, it would need to be coordinated with the service owners @elukey and @Ottomata

Event Timeline

Restricted Application added a subscriber: Liuxinyu970226.
Marostegui moved this task from Triage to Blocked external/Not db team on the DBA board.

I've acked the alert for now.

Thanks a lot @Marostegui! We can shut down the host without any problem; it only needs a ~10 min heads-up to properly stop eventlogging data and mariadb (and replication).

Thanks - not sure how to proceed as the dmesg entries show that the issue is fixed but icinga is still reporting errors (I just forced the recheck).
Let's see what @Cmjohnson suggests

@Marostegui sorry, I was under the impression that we needed to wait for feedback from Chris/Rob about how to proceed. Is there anything pending on the Analytics side?

@elukey sorry, I realised that I didn't send the first sentence:
"The errors corrected themselves and Icinga is now all green, which is something we have seen in the past with correctable errors". Hence, I am not sure whether we should close this task for now (and re-open it if it happens again) or whether you prefer to leave it open.

I now have h/w log entries. I will need the server to be taken offline so I can relocate the DIMM and check whether the error follows it. Unfortunately, there are steps that need to be done before Dell will replace it.

Record: 84
Date/Time: 04/29/2019 09:35:48
Source: system
Severity: Non-Critical

Description: Correctable memory error rate exceeded for DIMM_A3.

Record: 85
Date/Time: 04/29/2019 09:35:51
Source: system
Severity: Critical
Description: Correctable memory error rate exceeded for DIMM_A3.

[18:50:55]  <cmjohnson1>	marostegui i am confused over db1007...is there an issue or not an issue?  There is a h/w log entry from 4/29 but nothing since 
[18:52:01]  <marostegui>	cmjohnson1: yeah, that is my confusion too, it was a correctable error, which got corrected itself
[18:52:08]  <marostegui>	so not sure if we should go for the dimm exchange or not
[18:52:13]  <marostegui>	cmjohnson1: what do you advise?
[18:52:50]  <cmjohnson1>	let's move the dimm from A3 to B3 and clear the log...if the error returns I will know what to do next. 
[18:53:01]  <marostegui>	sounds good
[18:53:14]  <marostegui>	I will paste that on the ticket so luca can coordinate (as he needs to stop the service)

cc @elukey

@Cmjohnson I'd need a heads-up of ~15 mins before the maintenance to shut down the host properly, but we can do it anytime!

Closing this for now as it self-recovered and never showed up again.

jcrespo subscribed.

UEFI0107: One or more memory errors have occurred on memory slot: A3.
Remove input power to the system, reseat the DIMM module and restart the
system. If the issues persist, replace the faulty memory module identified in
the message.

UEFI0081: Memory configuration has changed from the last time the system was
started.
If the change is expected, no action is necessary. Otherwise, check the DIMM
population inside the system and memory settings in System Setup.

UEFI0058: Uncorrectable Memory Error has occurred because a Dual Inline Memory
Module (DIMM) is not functioning.
Check the System Event Log (SEL) to identify the non-functioning DIMM, and then
replace it.


Available Actions:
F1 to Continue and Retry Boot Order
F2 for System Setup (BIOS)
F10 for LifeCycle Controller
- Enable/Configure iDRAC
- Update or Backup/Restore Server Firmware
- Help Install an Operating System
F11 for Boot Manager

@Cmjohnson given the error @jcrespo pasted above, do you think that is enough to get Dell to send a new DIMM?

I still need to move the DIMM around, so I need the server taken down. If this needs to be scheduled, please let me know when you can have the server down.

Chris

Chris, you will need to coordinate with @elukey principally, as he is the person in touch directly with users affected to agree on a date.

We can do it anytime with 10/15 mins of heads up Chris (I need to stop replication and traffic to db1107 before you can operate). Ping me on IRC! :)

One last paste of the iDRAC log before I clear it:

Record: 84
Date/Time: 04/29/2019 09:35:48
Source: system
Severity: Non-Critical

Description: Correctable memory error rate exceeded for DIMM_A3.

Record: 85
Date/Time: 04/29/2019 09:35:51
Source: system
Severity: Critical

Description: Correctable memory error rate exceeded for DIMM_A3.

Record: 86
Date/Time: 06/24/2019 08:20:24
Source: system
Severity: Ok

Description: A problem was detected in Memory Reference Code (MRC).

Record: 87
Date/Time: 06/24/2019 08:20:24
Source: system
Severity: Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A3.

Swapped DIMM A3 with DIMM B3. Now we have to power the server back on and let it run for a few days to see whether the error returns and, if so, in which slot.
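For anyone reading this later, the reasoning behind the swap can be written down as a small decision helper. This is only a sketch of the usual "does the error follow the module?" logic, not a Dell or iDRAC procedure; `interpret_swap` and its return strings are hypothetical names chosen for illustration.

```python
def interpret_swap(original_slot, new_slot, error_slot_after_swap):
    """Interpret where a memory error shows up after moving the suspect DIMM.

    original_slot: slot that reported errors before the swap (here "A3")
    new_slot: slot now holding the suspect DIMM (here "B3")
    error_slot_after_swap: slot named in new log entries, or None if none yet
    """
    if error_slot_after_swap is None:
        return "no recurrence yet: keep monitoring"
    if error_slot_after_swap == new_slot:
        return "error followed the DIMM: the module is the likely fault"
    if error_slot_after_swap == original_slot:
        return "error stayed with the slot: suspect the slot, channel, or board"
    return "error in an unrelated slot: investigate separately"


# Example with the slots from this task:
print(interpret_swap("A3", "B3", "B3"))
```

In other words, a new entry for DIMM_B3 would point at the module itself, while a new entry for DIMM_A3 would point away from it.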

Resolving this task for now; if the error returns, please re-open and ping me.