
(OoW) wtp2020: correctable memory errors
Open, Normal, Public

Description

Multiple errors logged like this:

Sep 27 10:19:56 wtp2020 kernel: [1270956.777271] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Sep 27 10:19:56 wtp2020 kernel: [1270956.777274] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 7: cc00008000010092
Sep 27 10:19:56 wtp2020 kernel: [1270956.777275] EDAC sbridge MC0: TSC 0
Sep 27 10:19:56 wtp2020 kernel: [1270956.777276] EDAC sbridge MC0: ADDR 747e95ec0
Sep 27 10:19:56 wtp2020 kernel: [1270956.777276] EDAC sbridge MC0: MISC 2140683e00
Sep 27 10:19:56 wtp2020 kernel: [1270956.777278] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1538043596 SOCKET 0 APIC 0
Sep 27 10:19:56 wtp2020 kernel: [1270956.777295] EDAC MC0: 2 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x747e95 offset:0xec0 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:4 rank:1)
Sep 27 10:19:56 wtp2020 kernel: [1270956.777296] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Sep 27 10:19:56 wtp2020 kernel: [1270956.777297] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 11: c800008f00800092
Sep 27 10:19:56 wtp2020 kernel: [1270956.777298] EDAC sbridge MC0: TSC 0
Sep 27 10:19:56 wtp2020 kernel: [1270956.777298] EDAC sbridge MC0: ADDR 0
Sep 27 10:19:56 wtp2020 kernel: [1270956.777299] EDAC sbridge MC0: MISC c908400080009e00
Sep 27 10:19:56 wtp2020 kernel: [1270956.777300] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1538043596 SOCKET 0 APIC 0
Sep 27 10:19:56 wtp2020 mcelog: warning: 16 bytes ignored in each record
Sep 27 10:19:56 wtp2020 mcelog: consider an update
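
For reading the status words in the "Machine Check Event" lines above, here is a minimal sketch. It assumes the architectural IA32_MCi_STATUS bit layout described in the Intel SDM (VAL bit 63, OVER bit 62, UC bit 61, EN bit 60, MISCV bit 59, ADDRV bit 58, PCC bit 57, corrected error count in bits 52:38, model-specific code in bits 31:16, MCA error code in bits 15:0). The function name decode_mci_status is made up for illustration and is not part of any existing tool.

```python
def decode_mci_status(status_hex: str) -> dict:
    """Decode a hex MCi_STATUS word (assumed architectural layout)."""
    status = int(status_hex, 16)

    def bit(n: int) -> bool:
        # True if bit n of the status word is set.
        return bool((status >> n) & 1)

    return {
        "valid": bit(63),            # VAL
        "overflow": bit(62),         # OVER
        "uncorrected": bit(61),      # UC (clear means a corrected error)
        "enabled": bit(60),          # EN
        "misc_valid": bit(59),       # MISCV
        "addr_valid": bit(58),       # ADDRV
        "context_corrupt": bit(57),  # PCC
        "corrected_count": (status >> 38) & 0x7FFF,  # bits 52:38
        "model_code": (status >> 16) & 0xFFFF,       # bits 31:16
        "mca_code": status & 0xFFFF,                 # bits 15:0
    }


if __name__ == "__main__":
    for name, value in decode_mci_status("cc00008000010092").items():
        print(f"{name}: {value}")
```

Under that assumption, the Bank 7 value cc00008000010092 decodes to a valid, corrected error (UC clear) with the overflow bit set, a corrected error count of 2, and codes 0001:0092. That lines up with the "2 CE memory read error ... OVERFLOW ... err_code:0001:0092" line in the log.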

Event Timeline

Restricted Application added a subscriber: Aklapper. Sep 28 2018, 2:02 PM
herron triaged this task as High priority. Oct 2 2018, 5:27 PM

This is back. Any chance of reseating or swapping the memory, @Papaul?

RobH added a subscriber: RobH. Feb 21 2019, 1:14 AM

The warranty on this system expired Jan. 19, 2018, so it is out of warranty.

The best we can do is determine whether the slot or the DIMM is bad, and remove the DIMM if it is the faulty part.

RobH added a comment. Feb 21 2019, 1:18 AM

No logged errors in SEL:

4 $> ssh root@wtp2020.mgmt.codfw.wmnet
root@wtp2020.mgmt.codfw.wmnet's password: 
/admin1-> racadm getsel
Record:      1
Date/Time:   01/15/2015 23:03:58
Source:      system
Severity:    Ok
Description: Log cleared.
-------------------------------------------------------------------------------
/admin1->
jijiki added a subscriber: jijiki. Jun 16 2019, 2:21 PM

New alarms going off for this one

[Sun Jun 16 08:30:29 2019] mce: [Hardware Error]: Machine check events logged
[Sun Jun 16 08:30:29 2019] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Sun Jun 16 08:30:29 2019] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 7: cc00008000010092
[Sun Jun 16 08:30:29 2019] EDAC sbridge MC0: TSC 0
[Sun Jun 16 08:30:29 2019] EDAC sbridge MC0: ADDR 747e95ec0
[Sun Jun 16 08:30:29 2019] EDAC sbridge MC0: MISC 21406ada00
[Sun Jun 16 08:30:29 2019] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1560673683 SOCKET 0 APIC 0
[Sun Jun 16 08:30:29 2019] EDAC MC0: 2 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x747e95 offset:0xec0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:4 rank:1)
[Sun Jun 16 08:30:29 2019] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Sun Jun 16 08:30:29 2019] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 11: c800008f00800092
[Sun Jun 16 08:30:29 2019] EDAC sbridge MC0: TSC 0
[Sun Jun 16 08:30:29 2019] EDAC sbridge MC0: ADDR 0
[Sun Jun 16 08:30:29 2019] EDAC sbridge MC0: MISC c908400080009e8c
[Sun Jun 16 08:30:29 2019] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1560673683 SOCKET 0 APIC 0
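
The TIME field in the PROCESSOR lines looks like a Unix timestamp in seconds. A small sketch for converting it, assuming it really is seconds since the epoch and rendering it in UTC; mce_time_to_utc is a hypothetical helper name.

```python
from datetime import datetime, timezone


def mce_time_to_utc(time_field: str) -> str:
    """Render the decimal TIME value from a PROCESSOR line as ISO 8601 UTC."""
    return datetime.fromtimestamp(int(time_field), tz=timezone.utc).isoformat()


if __name__ == "__main__":
    print(mce_time_to_utc("1538043596"))  # TIME value from the Sep 27 2018 record
    print(mce_time_to_utc("1560673683"))  # TIME value from the Jun 16 2019 record
```

The first value comes out as 2018-09-27T10:19:56+00:00, which matches the syslog timestamp in the description. The second comes out as 2019-06-16T08:28:03+00:00, a couple of minutes earlier than the bracketed time printed in the log above.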
wiki_willy renamed this task from wtp2020: correctable memory errors to (OoW) wtp2020: correctable memory errors. Jul 15 2019, 8:47 PM
wiki_willy assigned this task to Papaul.
Papaul lowered the priority of this task from High to Normal. Jul 17 2019, 5:54 PM
Papaul closed this task as Resolved. Jul 18 2019, 2:42 PM

No errors are showing in the log and all hardware is showing green; firmware is at version 2.6. We can resolve this task for now and reopen it if the issue comes back.

And again :)

This system is still showing no sign of any hardware issue in the log. If there is a hardware issue, it is not being logged. Can someone look at the OS level to see if there is anything that can help?

Dzahn added a subscriber: Dzahn. Oct 18 2019, 10:30 PM

from dmesg | grep EDAC

[12639234.186693] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[12639234.186694] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 11: 8800004f00800092
[12639234.186695] EDAC sbridge MC0: TSC 0 
[12639234.186695] EDAC sbridge MC0: ADDR 0 
[12639234.186696] EDAC sbridge MC0: MISC 4908400080009e8c 
[12639234.186697] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1571091994 SOCKET 0 APIC 0
[12639235.542113] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[12639235.542115] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 7: 8c00004000010092
[12639235.542124] EDAC sbridge MC0: TSC 0 
[12639235.542127] EDAC sbridge MC0: ADDR 747e95ec0 
[12639235.542127] EDAC sbridge MC0: MISC 14268ae00 
[12639235.542129] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1571091995 SOCKET 0 APIC 0
[12639235.542143] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x747e95 offset:0xec0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:4 rank:1)
[12639235.542144] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[12639235.542145] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 11: 8800004f00800092
[12639235.542145] EDAC sbridge MC0: TSC 0 
[12639235.542146] EDAC sbridge MC0: ADDR 0 
[12639235.542147] EDAC sbridge MC0: MISC 4908400080009e8c 
[12639235.542148] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1571091995 SOCKET 0 APIC 0
[12854725.652776] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[12854725.652778] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 11: 8c00004f000800c2
[12854725.652779] EDAC sbridge MC0: TSC 0 
[12854725.652780] EDAC sbridge MC0: ADDR 747e95000 
[12854725.652788] EDAC sbridge MC0: MISC 908400080009e8c 
[12854725.652789] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1571307471 SOCKET 0 APIC 0
[12854725.652803] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x747e95 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c2 socket:0 ha:0 channel_mask:2 rank:1)
[12958498.065904] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[12958498.065907] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 11: 8c00004f000800c2
[12958498.065913] EDAC sbridge MC0: TSC 0 
[12958498.065915] EDAC sbridge MC0: ADDR 747e95000 
[12958498.065915] EDAC sbridge MC0: MISC 908400080009e8c 
[12958498.065917] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1571411237 SOCKET 0 APIC 0
[12958498.065931] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x747e95 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c2 socket:0 ha:0 channel_mask:2 rank:1)
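
To get an OS-level summary out of output like the above, the "CE" summary lines can be tallied per DIMM label. This is only a sketch: the regular expression is written against the line format seen in this task, the names CE_LINE, tally_ce and split_addr are hypothetical, and the page/offset split assumes 4 KiB pages.

```python
import re
from collections import Counter

# Matches the "EDAC MCn: <count> CE <kind> on <label> (channel:.. slot:.. page:0x..)"
# summary lines as they appear in this task. The "EDAC sbridge MCn:" lines do not match.
CE_LINE = re.compile(
    r"EDAC MC\d+: (?P<count>\d+) CE (?P<kind>.+?) on (?P<label>\S+) "
    r"\(channel:(?P<channel>\d+) slot:(?P<slot>\d+) page:0x(?P<page>[0-9a-f]+)"
)


def tally_ce(lines):
    """Sum corrected-error counts per (DIMM label, error kind, page)."""
    totals = Counter()
    for line in lines:
        m = CE_LINE.search(line)
        if m:
            totals[(m["label"], m["kind"], m["page"])] += int(m["count"])
    return totals


def split_addr(addr_hex, page_shift=12):
    """Split an ADDR value into (page, offset), assuming 4 KiB pages."""
    addr = int(addr_hex, 16)
    return addr >> page_shift, addr & ((1 << page_shift) - 1)
```

Feeding it the lines above (for example by reading the dmesg | grep EDAC output from standard input and passing the lines to tally_ce) would give one read error on Chan#2_DIMM#0 and two scrubbing errors on Chan#1_DIMM#0, all on page 0x747e95. split_addr("747e95ec0") gives page 0x747e95 and offset 0xec0, the same values the EDAC summary line prints.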
ayounsi removed a subscriber: ayounsi. Oct 19 2019, 8:04 AM

Mentioned in SAL (#wikimedia-operations) [2019-11-07T16:58:50Z] <mutante> wtp2020 - depooled for T205712

Mentioned in SAL (#wikimedia-operations) [2019-11-07T17:00:37Z] <mutante> wtp2020 - 2 hours downtime - shut down (T205712) - go ahead @Papaul

Papaul added a comment. Thu, Nov 7, 5:50 PM

Running ePSA on the system

Papaul added a comment. Thu, Nov 7, 8:13 PM

ePSA passed with no errors.

RobH removed a subscriber: RobH. Thu, Nov 7, 9:50 PM