Icinga is reporting broken memory on thumbor1004:
Oct 23 06:43:09 thumbor1004 kernel: [586941.707780] mce_notify_irq: 1 callbacks suppressed
Oct 23 06:43:09 thumbor1004 kernel: [586941.707780] mce: [Hardware Error]: Machine check events logged
Oct 23 06:43:09 thumbor1004 kernel: [586941.707792] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Oct 23 06:43:09 thumbor1004 kernel: [586941.707794] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8c000050000800c1
Oct 23 06:43:09 thumbor1004 kernel: [586941.707800] EDAC sbridge MC1: TSC 0
Oct 23 06:43:09 thumbor1004 kernel: [586941.707805] EDAC sbridge MC1: ADDR cc68f5000
Oct 23 06:43:09 thumbor1004 kernel: [586941.707809] EDAC sbridge MC1: MISC 90842000200208c
Oct 23 06:43:09 thumbor1004 kernel: [586941.707811] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1540276989 SOCKET 1 APIC 20
Oct 23 06:43:09 thumbor1004 kernel: [586941.707825] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xcc68f5 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:1)
Oct 23 06:43:09 thumbor1004 mcelog: warning: 16 bytes ignored in each record
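In case it helps when comparing against other hosts, here is a rough Python sketch that pulls the DIMM location out of the EDAC summary line in the log above. The regex and the `parse_edac` helper are my own, written against this one log line only; other kernel versions or EDAC drivers may format the message differently, and it does not attempt to map the EDAC label to a vendor slot name.

```python
import re

# Matches the "EDAC MC<n>: <count> CE|UE <message> on <label> (channel:.. slot:.. page:..)"
# summary line. Pattern is an assumption based on the single log line in this task.
EDAC_SUMMARY = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>CE|UE) (?P<msg>.+?) on "
    r"(?P<label>\S+) \(channel:(?P<channel>\d+) slot:(?P<slot>\d+) "
    r"page:(?P<page>0x[0-9a-fA-F]+)"
)


def parse_edac(line):
    """Return a dict describing the EDAC event, or None if the line does not match."""
    m = EDAC_SUMMARY.search(line)
    if m is None:
        return None
    event = m.groupdict()
    # Convert the numeric fields; the page number is logged in hex.
    for key in ("mc", "count", "channel", "slot"):
        event[key] = int(event[key])
    event["page"] = int(event["page"], 16)
    return event


line = (
    "Oct 23 06:43:09 thumbor1004 kernel: [586941.707825] EDAC MC1: 1 CE memory "
    "scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 "
    "page:0xcc68f5 offset:0x0 grain:32 syndrome:0x0 - area:DRAM "
    "err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:1)"
)

event = parse_edac(line)
print(event["kind"], event["label"], event["channel"], event["slot"])
```

CE here is a correctable error, so the kernel recovered the data; the concern is the DIMM degrading further.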
The host is out of warranty (OOW), but maybe we have a compatible DIMM from a decommissioned server?
racadm has actually been logging this since 2017, but the Icinga check didn't exist back then. Had it existed, we could have gotten the DIMM replaced while the host was still under warranty.
Record: 2
Date/Time: 04/28/2017 00:35:08
Source: system
Severity: Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.