Broken memory on thumbor1004
Open, Normal, Public

Description

Icinga is reporting broken memory on thumbor1004:

Oct 23 06:43:09 thumbor1004 kernel: [586941.707780] mce_notify_irq: 1 callbacks suppressed
Oct 23 06:43:09 thumbor1004 kernel: [586941.707780] mce: [Hardware Error]: Machine check events logged
Oct 23 06:43:09 thumbor1004 kernel: [586941.707792] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Oct 23 06:43:09 thumbor1004 kernel: [586941.707794] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8c000050000800c1
Oct 23 06:43:09 thumbor1004 kernel: [586941.707800] EDAC sbridge MC1: TSC 0
Oct 23 06:43:09 thumbor1004 kernel: [586941.707805] EDAC sbridge MC1: ADDR cc68f5000
Oct 23 06:43:09 thumbor1004 kernel: [586941.707809] EDAC sbridge MC1: MISC 90842000200208c
Oct 23 06:43:09 thumbor1004 kernel: [586941.707811] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1540276989 SOCKET 1 APIC 20
Oct 23 06:43:09 thumbor1004 kernel: [586941.707825] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xcc68f5 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:1)
Oct 23 06:43:09 thumbor1004 mcelog: warning: 16 bytes ignored in each record
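
For anyone triaging similar alerts, here is a minimal Python sketch that pulls the DIMM label and physical address out of the EDAC "CE" line above. The function name `parse_edac_ce` and the regular expression are illustrative assumptions based only on the log format shown in this task, not part of any existing tooling, and the 4 KiB page size is assumed. It also converts the `TIME` field from the MCE record to UTC as a sanity check against the syslog timestamp.

```python
import re
from datetime import datetime, timezone

# Hypothetical pattern, written against the single EDAC line quoted in this task.
EDAC_CE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) CE .*? on (?P<label>\S+) "
    r"\(channel:(?P<channel>\d+) slot:(?P<slot>\d+) "
    r"page:(?P<page>0x[0-9a-f]+) offset:(?P<offset>0x[0-9a-f]+)"
)


def parse_edac_ce(line, page_size=4096):
    """Return the fields of a correctable-error (CE) EDAC line, or None."""
    m = EDAC_CE.search(line)
    if m is None:
        return None
    # Physical address = page frame number * page size + offset in the page.
    address = int(m["page"], 16) * page_size + int(m["offset"], 16)
    return {
        "mc": int(m["mc"]),
        "count": int(m["count"]),
        "label": m["label"],
        "channel": int(m["channel"]),
        "slot": int(m["slot"]),
        "address": hex(address),
    }


def mce_time_utc(epoch_seconds):
    """Convert the TIME field of an MCE record (Unix epoch) to ISO 8601 UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


if __name__ == "__main__":
    sample = (
        "EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 "
        "(channel:1 slot:0 page:0xcc68f5 offset:0x0 grain:32 syndrome:0x0)"
    )
    print(parse_edac_ce(sample))
    print(mce_time_utc(1540276989))
```

On the line from this task, the computed address is 0xcc68f5000, which matches the `ADDR cc68f5000` line in the same log, and the `TIME` value converts to 06:43:09 UTC on Oct 23 2018, matching the syslog timestamp.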

Host is out of warranty (OOW), but maybe we have a compatible DIMM from a decommissioned server?

This was actually logged in racadm as far back as 2017, but the Icinga check didn't exist back then. If it had, we could even have gotten the DIMM replaced while the host was still under warranty.

Record:      2
Date/Time:   04/28/2017 00:35:08
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
MoritzMuehlenhoff triaged this task as Normal priority. Oct 23 2018, 2:21 PM

@MoritzMuehlenhoff I am sure I have one buried in the 300 servers on the floor, but the few that are easy to access are only 8GB.

Cmjohnson moved this task from Backlog to Not urgent on the ops-eqiad board. Oct 25 2018, 3:14 PM

Mentioned in SAL (#wikimedia-operations) [2019-01-09T23:51:58Z] <mutante> thumb1004 - still needs broken RAM replaced, expired downtime, re-ACKed (T207721)