Broken memory on thumbor1004
Closed, Duplicate (Public)

Description

Icinga is reporting broken memory on thumbor1004:

Oct 23 06:43:09 thumbor1004 kernel: [586941.707780] mce_notify_irq: 1 callbacks suppressed
Oct 23 06:43:09 thumbor1004 kernel: [586941.707780] mce: [Hardware Error]: Machine check events logged
Oct 23 06:43:09 thumbor1004 kernel: [586941.707792] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Oct 23 06:43:09 thumbor1004 kernel: [586941.707794] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8c000050000800c1
Oct 23 06:43:09 thumbor1004 kernel: [586941.707800] EDAC sbridge MC1: TSC 0
Oct 23 06:43:09 thumbor1004 kernel: [586941.707805] EDAC sbridge MC1: ADDR cc68f5000
Oct 23 06:43:09 thumbor1004 kernel: [586941.707809] EDAC sbridge MC1: MISC 90842000200208c
Oct 23 06:43:09 thumbor1004 kernel: [586941.707811] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1540276989 SOCKET 1 APIC 20
Oct 23 06:43:09 thumbor1004 kernel: [586941.707825] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xcc68f5 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:1)
Oct 23 06:43:09 thumbor1004 mcelog: warning: 16 bytes ignored in each record
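For anyone triaging similar reports, here is a rough sketch of pulling the DIMM label and physical address out of the EDAC CE line above. The `parse_ce` helper and its regex are hypothetical, written only against the line format seen in this log, and the address math assumes 4 KiB pages (`page << 12`), which is consistent with the `ADDR cc68f5000` line. The `TIME` field is treated as a Unix epoch in UTC.

```python
import re
from datetime import datetime, timezone

# Sample values copied from the kernel log in this task.
CE_LINE = (
    "EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 "
    "(channel:1 slot:0 page:0xcc68f5 offset:0x0 grain:32 syndrome:0x0 -  "
    "area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:1)"
)
MCE_TIME = 1540276989  # TIME field from the "PROCESSOR 0:306e4 TIME ..." line

# Hypothetical pattern: only covers the fields needed to locate the DIMM.
CE_PATTERN = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) CE .* on (?P<label>\S+) "
    r"\(channel:(?P<channel>\d+) slot:(?P<slot>\d+) "
    r"page:(?P<page>0x[0-9a-f]+) offset:(?P<offset>0x[0-9a-f]+)"
)


def parse_ce(line, page_shift=12):
    """Return the DIMM location from an EDAC CE line, or None if it does not match.

    page_shift=12 is an assumption (4 KiB pages).
    """
    m = CE_PATTERN.search(line)
    if m is None:
        return None
    page = int(m["page"], 16)
    offset = int(m["offset"], 16)
    return {
        "mc": int(m["mc"]),
        "count": int(m["count"]),
        "label": m["label"],
        "channel": int(m["channel"]),
        "slot": int(m["slot"]),
        # Rebuild the physical address from the page number and in-page offset.
        "address": (page << page_shift) | offset,
    }


event = parse_ce(CE_LINE)
print(event["label"], hex(event["address"]))
print(datetime.fromtimestamp(MCE_TIME, tz=timezone.utc).isoformat())
```

The reconstructed address matches the `ADDR` line, and the decoded `TIME` matches the syslog timestamp, which suggests the host logs in UTC.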

The host is out of warranty (OOW), but maybe we have a compatible DIMM from a decommissioned server?

This was actually logged in racadm as far back as 2017, but the Icinga check didn't exist back then. Had it existed, we could even have gotten the DIMM replaced while the host was still under warranty.

Record:      2
Date/Time:   04/28/2017 00:35:08
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
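As an illustration of how a record like this could be picked up automatically, here is a minimal sketch that parses the `Key: value` layout above and extracts the flagged DIMM. This is not the actual Icinga check. `parse_record` and `flagged_dimm` are hypothetical helpers, and the `DIMM_<letter><number>` label pattern is an assumption based on this single record.

```python
import re

# Record text copied from the racadm output in this task.
RECORD = """\
Record:      2
Date/Time:   04/28/2017 00:35:08
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
"""


def parse_record(text):
    """Split 'Key: value' lines into a dict, splitting on the first colon only."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def flagged_dimm(fields):
    """Return the DIMM label if the record reports correctable memory errors."""
    desc = fields.get("Description", "")
    if "Correctable memory error" not in desc:
        return None
    # Assumed label format, e.g. DIMM_B1.
    m = re.search(r"DIMM_[A-Z]\d+", desc)
    return m.group(0) if m else None


record = parse_record(RECORD)
print(record["Date/Time"], flagged_dimm(record))
```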

Event Timeline

@MoritzMuehlenhoff I am sure I have one buried in the 300 servers on the floor, but the few that are easy to access are only 8GB.

Mentioned in SAL (#wikimedia-operations) [2019-01-09T23:51:58Z] <mutante> thumb1004 - still needs broken RAM replaced, expired downtime, re-ACKed (T207721)

The warranty expired back in 2017; shouldn't we just replace the host rather than repair it?

Typically that has been the standard response.