Looks like pc2010 had a memory issue:
[Mon Jul 8 15:20:13 2019] mce: [Hardware Error]: Machine check events logged
[Mon Jul 8 15:20:13 2019] mce: Uncorrected hardware memory error in user-access at 3b134a4880
[Mon Jul 8 15:20:13 2019] {1}Hardware error detected on CPU11
[Mon Jul 8 15:20:13 2019] {1}event severity: recoverable
[Mon Jul 8 15:20:13 2019] {1} Error 0, type: recoverable
[Mon Jul 8 15:20:13 2019] {1} fru_text: B1
[Mon Jul 8 15:20:13 2019] {1} section_type: memory error
[Mon Jul 8 15:20:13 2019] {1} error_status: 0x0000000000000400
[Mon Jul 8 15:20:13 2019] {1} physical_address: 0x0000003b134a4880
[Mon Jul 8 15:20:13 2019] {1} node: 2 card: 0 module: 0 rank: 0 bank: 1 row: 41698 column: 152
[Mon Jul 8 15:20:13 2019] {1} DIMM location: not present. DMI handle: 0x0000
[Mon Jul 8 15:20:13 2019] Memory failure: 0x3b134a4: Killing mysqld:3693 due to hardware memory corruption
[Mon Jul 8 15:20:13 2019] Memory failure: 0x3b134a4: recovery action for dirty LRU page: Recovered
[Mon Jul 8 15:20:42 2019] MCE: Killing mysqld:3760 due to hardware memory corruption fault at 7f0534f68880
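For reference, here is a rough sketch of how the numbers in that log relate to each other. It assumes 4 KiB pages (the usual x86_64 base page size; `PAGE_SHIFT` is my assumption, not something the log states). Under that assumption, the `0x3b134a4` in the "Memory failure" lines is the page frame number of the reported physical address, and the virtual address in the second kill falls at the same offset within its page:

```python
# Assumption: 4 KiB base pages, so the low 12 bits are the offset within a page.
PAGE_SHIFT = 12
PAGE_MASK = (1 << PAGE_SHIFT) - 1

phys_addr = 0x3b134a4880         # "physical_address" from the hardware error record
fault_vaddr = 0x7f0534f68880     # "fault at" address from the second mysqld kill

pfn = phys_addr >> PAGE_SHIFT            # page frame number
phys_offset = phys_addr & PAGE_MASK      # offset inside the physical page
virt_offset = fault_vaddr & PAGE_MASK    # offset inside the virtual page

print(f"pfn={pfn:#x} phys_offset={phys_offset:#x} virt_offset={virt_offset:#x}")
```

A matching page offset is consistent with the second kill touching the same bad location, but it does not prove it on its own.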
I rebooted it to see whether the error would show up during boot and in the HW logs, and it did:
UEFI0079: One or more uncorrectable Memory errors occurred in the previous boot. Check the System Event Log (SEL) to identify the non-functional DIMM, and then replace the DIMM.
Available Actions:
F1 to Continue and Retry Boot Order
F2 for System Setup (BIOS)
F10 for LifeCycle Controller
- Enable/Configure iDRAC
- Update or Backup/Restore Server Firmware
- Help Install an Operating System
F11 for Boot Manager
I pressed F1 and the boot continued.
The HW logs now show the issue:
CreationTimestamp = 20190708220429.000000-300
ElementName = System Event Log Entry
RecordData = Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
RecordFormat = string Description
RecordID = 2

CreationTimestamp = 20190709045530.000000-300
ElementName = System Event Log Entry
RecordData = Correctable memory error logging disabled for a memory device at location DIMM_B1.
RecordFormat = string Description
RecordID = 3
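To line these entries up against the kernel log, the timestamps can be converted to UTC. This is only a sketch: it assumes the `CreationTimestamp` values follow the CIM datetime layout (`yyyymmddHHMMSS.mmmmmm` followed by a signed UTC offset in minutes), and `parse_cim_datetime` is a helper name made up for illustration. It also says nothing about whether the controller clock itself is set correctly.

```python
from datetime import datetime, timedelta, timezone

def parse_cim_datetime(value: str) -> datetime:
    """Parse a CIM-style datetime such as '20190708220429.000000-300'.

    Assumption: the trailing signed number is the UTC offset in minutes.
    """
    # First 21 characters: date, time and microseconds.
    stamp = datetime.strptime(value[:21], "%Y%m%d%H%M%S.%f")
    # Remainder: sign plus offset in minutes, e.g. "-300" for UTC-5.
    offset_minutes = int(value[21:])
    return stamp.replace(tzinfo=timezone(timedelta(minutes=offset_minutes)))

for raw in ("20190708220429.000000-300", "20190709045530.000000-300"):
    local = parse_cim_datetime(raw)
    print(raw, "->", local.astimezone(timezone.utc).isoformat())
```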
@Papaul should we move that DIMM to another slot, so that if it happens again we can tell whether the fault is in the DIMM or the mainboard?
The total memory reported by the server looks correct:
root@pc2010:~# free -m
              total        used        free      shared  buff/cache   available
Mem:         257622         682      256597           9         342      255649
Swap:          7628           0        7628