db2140's mysql crashed due to HW memory errors:
mysql log trace:
Jan 03 00:32:04 db2140 mysqld[1320]: 210103 0:32:04 [ERROR] mysqld got signal 7 ;
Jan 03 00:32:04 db2140 mysqld[1320]: This could be because you hit a bug. It is also possible that this binary
Jan 03 00:32:04 db2140 mysqld[1320]: or one of the libraries it was linked against is corrupt, improperly built,
Jan 03 00:32:04 db2140 mysqld[1320]: or misconfigured. This error can also be caused by malfunctioning hardware.
Jan 03 00:32:04 db2140 mysqld[1320]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
Jan 03 00:32:04 db2140 mysqld[1320]: We will try our best to scrape up some info that will hopefully help
Jan 03 00:32:04 db2140 mysqld[1320]: diagnose the problem, but since we have already crashed,
Jan 03 00:32:04 db2140 mysqld[1320]: something is definitely wrong and this may fail.
Jan 03 00:32:04 db2140 mysqld[1320]: Server version: 10.4.13-MariaDB-log
Jan 03 00:32:04 db2140 mysqld[1320]: key_buffer_size=134217728
Jan 03 00:32:04 db2140 mysqld[1320]: read_buffer_size=131072
Jan 03 00:32:04 db2140 mysqld[1320]: max_used_connections=35
And the HW memory errors on the A7 memory DIMM:
[5222445.069567] mce: Uncorrected hardware memory error in user-access at 4432f124c0
[5222445.069584] mce: [Hardware Error]: Machine check events logged
[5222445.069757] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[5222445.070251] Memory failure: 0x4432f12: Killing mysqld:1320 due to hardware memory corruption
[5222445.070255] Memory failure: 0x4432f12: recovery action for dirty LRU page: Recovered
[5222445.093612] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[5222445.093616] {2}[Hardware Error]: event severity: corrected
[5222445.093618] {2}[Hardware Error]: Error 0, type: corrected
[5222445.093619] {2}[Hardware Error]: fru_text: A7
[5222445.093620] {2}[Hardware Error]: section_type: memory error
[5222445.093623] {2}[Hardware Error]: error_status: 0x0000000000000400
[5222445.093626] {2}[Hardware Error]: physical_address: 0x0000004432f124c0
[5222445.093629] {2}[Hardware Error]: node: 0 card: 0 module: 1 rank: 1 bank: 2 device: 0 row: 38062 column: 848
[5222445.093630] {2}[Hardware Error]: error_type: 3, multi-bit ECC
[5222445.093631] {2}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000
[5222445.093655] mce: [Hardware Error]: Machine check events logged
[5222447.552327] MCE: Killing mysqld:1386 due to hardware memory corruption fault at 7f4a729124c0
[5227685.337025] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[5227685.337026] {3}[Hardware Error]: It has been corrected by h/w and requires no further action
[5227685.337027] {3}[Hardware Error]: event severity: corrected
[5227685.337028] {3}[Hardware Error]: Error 0, type: corrected
[5227685.337029] {3}[Hardware Error]: fru_text: A7
[5227685.337029] {3}[Hardware Error]: section_type: memory error
[5227685.337030] {3}[Hardware Error]: error_status: 0x0000000000000400
[5227685.337031] {3}[Hardware Error]: physical_address: 0x00000043927524c0
[5227685.337032] {3}[Hardware Error]: node: 0 card: 0 module: 1 rank: 1 bank: 2 device: 0 row: 38062 column: 200
[5227685.337033] {3}[Hardware Error]: error_type: 14, scrub uncorrected error
[5227685.337034] {3}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000
[5227685.337056] mce: [Hardware Error]: Machine check events logged
[5227786.150666] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[5227786.150668] {4}[Hardware Error]: It has been corrected by h/w and requires no further action
[5227786.150669] {4}[Hardware Error]: event severity: corrected
[5227786.150670] {4}[Hardware Error]: Error 0, type: corrected
[5227786.150670] {4}[Hardware Error]: fru_text: A7
[5227786.150672] {4}[Hardware Error]: section_type: memory error
[5227786.150673] {4}[Hardware Error]: error_status: 0x0000000000000400
[5227786.150673] {4}[Hardware Error]: physical_address: 0x00000043aa7524c0
[5227786.150675] {4}[Hardware Error]: node: 0 card: 0 module: 1 rank: 1 bank: 2 device: 0 row: 38062 column: 296
[5227786.150676] {4}[Hardware Error]: error_type: 13, scrub corrected error
[5227786.150677] {4}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000
[5227786.150702] mce: [Hardware Error]: Machine check events logged
[5227954.553286] {5}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[5227954.553288] {5}[Hardware Error]: It has been corrected by h/w and requires no further action
[5227954.553289] {5}[Hardware Error]: event severity: corrected
[5227954.553290] {5}[Hardware Error]: Error 0, type: corrected
[5227954.553291] {5}[Hardware Error]: fru_text: A7
[5227954.553292] {5}[Hardware Error]: section_type: memory error
[5227954.553293] {5}[Hardware Error]: error_status: 0x0000000000000400
[5227954.553294] {5}[Hardware Error]: physical_address: 0x00000043d27524c0
[5227954.553296] {5}[Hardware Error]: node: 0 card: 0 module: 1 rank: 1 bank: 2 device: 0 row: 38062 column: 456
[5227954.553297] {5}[Hardware Error]: error_type: 14, scrub uncorrected error
[5227954.553299] {5}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000
[5227954.553324] mce: [Hardware Error]: Machine check events logged
[5228023.908558] {6}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[5228023.908559] {6}[Hardware Error]: It has been corrected by h/w and requires no further action
[5228023.908560] {6}[Hardware Error]: event severity: corrected
[5228023.908562] {6}[Hardware Error]: Error 0, type: corrected
[5228023.908562] {6}[Hardware Error]: fru_text: A7
[5228023.908564] {6}[Hardware Error]: section_type: memory error
[5228023.908565] {6}[Hardware Error]: error_status: 0x0000000000000400
[5228023.908565] {6}[Hardware Error]: physical_address: 0x00000043e2f524c0
[5228023.908567] {6}[Hardware Error]: node: 0 card: 0 module: 1 rank: 1 bank: 2 device: 0 row: 38062 column: 536
[5228023.908568] {6}[Hardware Error]: error_type: 14, scrub uncorrected error
[5228023.908569] {6}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000
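To make it easier to see that every record points at the same DIMM, here is a rough sketch that tallies the `fru_text` / `error_type` pairs from a pasted kernel log. The function name `summarize_ghes` and the regex are my own, not from any existing tool, and it assumes the `{N}` record ids are unique within the pasted text (i.e. a single boot).

```python
import re
from collections import Counter

# Matches the fru_text and error_type fields of an APEI GHES record, e.g.
# "[5222445.093619] {2}[Hardware Error]: fru_text: A7"
# The value stops at the next "[" or newline, so this works on both
# one-entry-per-line logs and logs flattened onto a single line.
FIELD_RE = re.compile(
    r"\{(?P<rec>\d+)\}\[Hardware Error\]:\s*"
    r"(?P<key>fru_text|error_type):\s*(?P<val>[^\[\n]*)"
)


def summarize_ghes(dmesg_text):
    """Count (fru_text, error_type) pairs across GHES records.

    Hypothetical helper: assumes each {N} record id appears once per paste.
    """
    records = {}
    for m in FIELD_RE.finditer(dmesg_text):
        fields = records.setdefault(m.group("rec"), {})
        fields[m.group("key")] = m.group("val").strip()
    return Counter(
        (r.get("fru_text", "unknown"), r.get("error_type", "unknown"))
        for r in records.values()
    )


if __name__ == "__main__":
    import sys

    # Usage: dmesg | python3 this_script.py
    for (fru, err), count in summarize_ghes(sys.stdin.read()).most_common():
        print(f"{fru}\t{err}\t{count}")
```

Run against the log above, this should group all five records under `A7`, split by error type.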
And these are the Dell HW logs, which show the errors have been happening since December:
-------------------------------------------------------------------------------
Record: 2
Date/Time: 12/06/2020 05:50:15
Source: system
Severity: Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_A7.
-------------------------------------------------------------------------------
Record: 23
Date/Time: 01/03/2021 02:05:11
Source: system
Severity: Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_A7.
-------------------------------------------------------------------------------
Record: 6
Date/Time: 01/03/2021 00:32:12
Source: system
Severity: Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_A7.
-------------------------------------------------------------------------------
@Papaul should we exchange that DIMM with another one and wait for another failure?