
(OoW) wtp2020: correctable memory errors
Closed, Resolved (Public)

Description

Multiple errors logged like this:

Sep 27 10:19:56 wtp2020 kernel: [1270956.777271] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Sep 27 10:19:56 wtp2020 kernel: [1270956.777274] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 7: cc00008000010092
Sep 27 10:19:56 wtp2020 kernel: [1270956.777275] EDAC sbridge MC0: TSC 0
Sep 27 10:19:56 wtp2020 kernel: [1270956.777276] EDAC sbridge MC0: ADDR 747e95ec0
Sep 27 10:19:56 wtp2020 kernel: [1270956.777276] EDAC sbridge MC0: MISC 2140683e00
Sep 27 10:19:56 wtp2020 kernel: [1270956.777278] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1538043596 SOCKET 0 APIC 0
Sep 27 10:19:56 wtp2020 kernel: [1270956.777295] EDAC MC0: 2 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x747e95 offset:0xec0 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:4 rank:1)
Sep 27 10:19:56 wtp2020 kernel: [1270956.777296] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Sep 27 10:19:56 wtp2020 kernel: [1270956.777297] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 11: c800008f00800092
Sep 27 10:19:56 wtp2020 kernel: [1270956.777298] EDAC sbridge MC0: TSC 0
Sep 27 10:19:56 wtp2020 kernel: [1270956.777298] EDAC sbridge MC0: ADDR 0
Sep 27 10:19:56 wtp2020 kernel: [1270956.777299] EDAC sbridge MC0: MISC c908400080009e00
Sep 27 10:19:56 wtp2020 kernel: [1270956.777300] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1538043596 SOCKET 0 APIC 0
Sep 27 10:19:56 wtp2020 mcelog: warning: 16 bytes ignored in each record
Sep 27 10:19:56 wtp2020 mcelog: consider an update
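
For anyone reading the hex values: the Bank status words in the log above can be unpacked by hand. What follows is a minimal sketch, assuming the architectural IA32_MCi_STATUS layout described in the Intel SDM (VAL bit 63, OVER bit 62, UC bit 61, MISCV bit 59, ADDRV bit 58, corrected error count in bits 52:38, model-specific code in bits 31:16, MCA code in bits 15:0) and the memory-controller MCA code format 000F 0000 1MMM CCCC. The name decode_mci_status is a hypothetical helper, not part of mcelog or EDAC.

```python
# Transaction types for memory-controller MCA codes (the MMM field).
# Assumption: values follow the Intel SDM table for memory controller errors.
MEMORY_TRANSACTIONS = {
    0: "generic undefined request",
    1: "memory read",
    2: "memory write",
    3: "address/command",
    4: "memory scrubbing",
}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit machine check bank status word into named fields."""
    mca_code = status & 0xFFFF
    fields = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": mca_code,
    }
    # Memory controller errors match 000F 0000 1MMM CCCC (bit 12 is ignored).
    if (mca_code & 0xEF80) == 0x0080:
        mmm = (mca_code >> 4) & 0x7
        fields["transaction"] = MEMORY_TRANSACTIONS.get(mmm, "reserved")
        fields["channel"] = mca_code & 0xF  # 0xF means "channel not specified"
    return fields


if __name__ == "__main__":
    # Bank 7 status word copied from the log above.
    for name, value in decode_mci_status(0xCC00008000010092).items():
        print(f"{name}: {value}")
```

Fed the Bank 7 value cc00008000010092, this reports the overflow bit set, a corrected count of 2, and a memory read on channel 2, which lines up with the "2 CE memory read error ... OVERFLOW" line that EDAC printed.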

Event Timeline

herron triaged this task as High priority. Oct 2 2018, 5:27 PM

This is back. Any chance of reseating or swapping the memory, @Papaul?

The warranty on this host ended Jan. 19, 2018, so it is out of warranty.

The best we can do is work out whether the slot or the DIMM is bad, and remove the DIMM if it is the faulty part.

No logged errors in SEL:

4 $> ssh root@wtp2020.mgmt.codfw.wmnet
root@wtp2020.mgmt.codfw.wmnet's password: 
/admin1-> racadm getsel
Record:      1
Date/Time:   01/15/2015 23:03:58
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
/admin1->

New alarms are going off for this host:

[Sun Jun 16 08:30:29 2019] mce: [Hardware Error]: Machine check events logged
[Sun Jun 16 08:30:29 2019] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Sun Jun 16 08:30:29 2019] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 7: cc00008000010092
[Sun Jun 16 08:30:29 2019] EDAC sbridge MC0: TSC 0
[Sun Jun 16 08:30:29 2019] EDAC sbridge MC0: ADDR 747e95ec0
[Sun Jun 16 08:30:29 2019] EDAC sbridge MC0: MISC 21406ada00
[Sun Jun 16 08:30:29 2019] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1560673683 SOCKET 0 APIC 0
[Sun Jun 16 08:30:29 2019] EDAC MC0: 2 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x747e95 offset:0xec0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:4 rank:1)
[Sun Jun 16 08:30:29 2019] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Sun Jun 16 08:30:29 2019] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 11: c800008f00800092
[Sun Jun 16 08:30:29 2019] EDAC sbridge MC0: TSC 0
[Sun Jun 16 08:30:29 2019] EDAC sbridge MC0: ADDR 0
[Sun Jun 16 08:30:29 2019] EDAC sbridge MC0: MISC c908400080009e8c
[Sun Jun 16 08:30:29 2019] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1560673683 SOCKET 0 APIC 0
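
A note on timestamps: the TIME field in these records appears to be Unix epoch seconds (in the Sep 27 log it converts to exactly the syslog time, taking the host clock to be UTC). The bracketed human-readable times look like they come from something like dmesg -T, which derives them from uptime and can drift, so the TIME field is probably the safer one to correlate against. A minimal sketch of the conversion; mce_time_to_utc is a hypothetical helper name:

```python
from datetime import datetime, timezone


def mce_time_to_utc(epoch_seconds: int) -> str:
    """Render the TIME value from an MCE record as a UTC timestamp string."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    # TIME values copied from the two logs in this task.
    for value in (1538043596, 1560673683):
        print(value, "->", mce_time_to_utc(value))
```

For 1560673683 this gives 08:28:03 UTC on Jun 16 2019, a couple of minutes earlier than the bracketed 08:30:29.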
wiki_willy renamed this task from wtp2020: correctable memory errors to (OoW) wtp2020: correctable memory errors. Jul 15 2019, 8:47 PM
wiki_willy assigned this task to Papaul.
Papaul lowered the priority of this task from High to Medium. Jul 17 2019, 5:54 PM

No errors are showing in the log and all hardware is showing green; firmware is at version 2.6. We can resolve this task for now and reopen it if the issue comes back.

This system is still showing no sign of any hardware issue in the log. If there is a hardware issue going on, it is not being logged. Can someone look at the OS level to see if there is anything that can help?

From dmesg | grep EDAC:

[12639234.186693] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[12639234.186694] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 11: 8800004f00800092
[12639234.186695] EDAC sbridge MC0: TSC 0 
[12639234.186695] EDAC sbridge MC0: ADDR 0 
[12639234.186696] EDAC sbridge MC0: MISC 4908400080009e8c 
[12639234.186697] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1571091994 SOCKET 0 APIC 0
[12639235.542113] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[12639235.542115] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 7: 8c00004000010092
[12639235.542124] EDAC sbridge MC0: TSC 0 
[12639235.542127] EDAC sbridge MC0: ADDR 747e95ec0 
[12639235.542127] EDAC sbridge MC0: MISC 14268ae00 
[12639235.542129] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1571091995 SOCKET 0 APIC 0
[12639235.542143] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x747e95 offset:0xec0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:4 rank:1)
[12639235.542144] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[12639235.542145] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 11: 8800004f00800092
[12639235.542145] EDAC sbridge MC0: TSC 0 
[12639235.542146] EDAC sbridge MC0: ADDR 0 
[12639235.542147] EDAC sbridge MC0: MISC 4908400080009e8c 
[12639235.542148] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1571091995 SOCKET 0 APIC 0
[12854725.652776] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[12854725.652778] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 11: 8c00004f000800c2
[12854725.652779] EDAC sbridge MC0: TSC 0 
[12854725.652780] EDAC sbridge MC0: ADDR 747e95000 
[12854725.652788] EDAC sbridge MC0: MISC 908400080009e8c 
[12854725.652789] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1571307471 SOCKET 0 APIC 0
[12854725.652803] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x747e95 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c2 socket:0 ha:0 channel_mask:2 rank:1)
[12958498.065904] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[12958498.065907] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 11: 8c00004f000800c2
[12958498.065913] EDAC sbridge MC0: TSC 0 
[12958498.065915] EDAC sbridge MC0: ADDR 747e95000 
[12958498.065915] EDAC sbridge MC0: MISC 908400080009e8c 
[12958498.065917] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1571411237 SOCKET 0 APIC 0
[12958498.065931] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x747e95 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c2 socket:0 ha:0 channel_mask:2 rank:1)
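
To help decide between a bad slot and a bad DIMM, it may be useful to total the corrected errors per DIMM label and page rather than eyeball the output. A rough sketch, assuming the "EDAC MCn: N CE ... on LABEL (channel:.. slot:.. page:..." line shape seen above; the regex and the name tally_corrected_errors are illustrative, not something EDAC or edac-util provides:

```python
import re
import sys
from collections import Counter

# Matches the summary line EDAC prints per corrected error (CE) event.
# The "EDAC sbridge MC0:" detail lines are deliberately not matched.
CE_LINE = re.compile(
    r"EDAC MC\d+: (?P<count>\d+) CE (?P<kind>.+?) on (?P<label>\S+) "
    r"\(channel:(?P<channel>\d+) slot:(?P<slot>\d+) page:(?P<page>0x[0-9a-f]+)"
)


def tally_corrected_errors(lines):
    """Sum CE counts keyed by (DIMM label, page) across dmesg lines."""
    totals = Counter()
    for line in lines:
        match = CE_LINE.search(line)
        if match:
            key = (match["label"], match["page"])
            totals[key] += int(match["count"])
    return totals


if __name__ == "__main__":
    # Usage: dmesg | python3 this_script.py
    for (label, page), count in tally_corrected_errors(sys.stdin).most_common():
        print(f"{count:6d}  {label}  page {page}")
```

On the excerpt above this would show every event landing on page 0x747e95, with the read error attributed to Chan#2_DIMM#0 and the scrubbing errors to Chan#1_DIMM#0.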

Mentioned in SAL (#wikimedia-operations) [2019-11-07T17:00:37Z] <mutante> wtp2020 - 2 hours downtime - shut down (T205712) - go ahead @Papaul

Closing this task since we haven't seen any errors since November.