Message on restart:
462 - Uncorrectable Memory Error Threshold Exceeded (Processor 2, DIMM 4). The DIMM is mapped out and is currently not available.
Action: Take corrective action for the failing DIMM. Re-map all DIMMs back into the memory map in RBSU. If the issue persists, contact support.

511 - One or more DIMMs have been mapped out due to a memory error, resulting in an unbalanced memory configuration across memory controllers. This may result in non-optimal memory performance.
Action: See the Integrated Management Log (IML) for information on the memory error. Consult documentation for memory population guidelines.
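Since the 462 message says the DIMM is mapped out, a quick sanity check after the reboot is whether the OS-visible memory total dropped accordingly (a sketch; the expected total depends on this host's DIMM population):

  grep MemTotal /proc/meminfo   # OS-visible memory; should be lower while DIMM 4 is mapped out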
HW logs:
/system1/log1/record38
  number=38 severity=Critical date=05/12/2020 time=01:50:28
  description=Uncorrectable Machine Check Exception (Processor 2, APIC ID 0x00000022, Bank 0x00000008, Status 0xBC000000'01010091, Address 0x00000044'78E5EF40, Misc 0x200405C2'88202086)
/system1/log1/record39
  number=39 severity=Critical date=05/12/2020 time=01:50:28
  description=DIMM Failure - Uncorrectable Memory Error (Processor 2, DIMM 4)
/system1/log1/record40
  number=40 severity=Critical date=05/12/2020 time=01:50:52
  description=Uncorrectable Memory Error Threshold Exceeded (Processor 2, DIMM 4). The DIMM is mapped out and is currently not available.
/system1/log1/record41
  number=41 severity=Informational date=05/12/2020 time=01:54:52
  description=One or more DIMMs have been mapped out due to a memory error, resulting in an unbalanced memory configuration across memory controllers. This may result in non-optimal memory performance.
/system1/log1/record42
  number=42 severity=Repaired date=05/12/2020 time=01:56:07
  description=HPE Ethernet 1Gb 4-port 331i Adapter - NIC Connectivity status changed to OK for adapter in slot 0, port 1
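For reference, these IML records can be read back over the iLO SSH CLI with the same show verb that produced the dump above (a sketch; the user and .mgmt address follow our usual naming convention and are assumptions, not taken from this output):

  ssh root@db2097.mgmt.codfw.wmnet
  </>hpiLO-> show /system1/log1            # list all IML records
  </>hpiLO-> show /system1/log1/record40   # the map-out event above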
This looks like the same failure mode as T225378#5245612, but a different DIMM is complaining this time.
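To double-check which slot is affected from the OS side, the SMBIOS memory-device table can be dumped per slot (hedged sketch; whether a mapped-out module still reports a size here varies by firmware, so treat it as a hint rather than proof):

  sudo dmidecode -t 17 | grep -E 'Locator|Size'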
Previous summary:
01:54 <+icinga-wm> PROBLEM - Host db2097 is DOWN: PING CRITICAL - Packet loss = 100%
01:56 <+icinga-wm> RECOVERY - Host db2097 is UP: PING OK - Packet loss = 0%, RTA = 36.19 ms
01:59 <+icinga-wm> PROBLEM - MariaDB read only s6 on db2097 is CRITICAL: Could not connect to localhost:3316 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
01:59 <+icinga-wm> PROBLEM - MariaDB Slave IO: s1 on db2097 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
02:00 <+icinga-wm> PROBLEM - MariaDB Slave IO: s6 on db2097 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
02:00 <+icinga-wm> PROBLEM - mysqld processes on db2097 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
02:00 <+icinga-wm> PROBLEM - MariaDB Slave SQL: s1 on db2097 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
02:02 <+icinga-wm> PROBLEM - MariaDB Slave SQL: s6 on db2097 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
02:04 <+icinga-wm> PROBLEM - MariaDB read only s1 on db2097 is CRITICAL: Could not connect to localhost:3311 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
02:10 <+icinga-wm> PROBLEM - MariaDB Slave Lag: s6 on db2097 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
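Once mysqld is back up on the host, replication state per instance can be verified directly (a sketch; the 3311/3316 ports come from the icinga checks above, and auth via the local client defaults is assumed):

  mysql -h 127.0.0.1 -P 3311 -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master'
  mysql -h 127.0.0.1 -P 3316 -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master'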