
Replace memory bank on scb1002
Closed, Resolved, Public

Description

We are seeing repeated EDAC errors in the logs:

[8848485.132443] EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#3_DIMM#0 (channel:3 slot:0 page:0x4e78bf offset:0xb40 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0093 socket:1 ha:0 channel_mask:8 rank:0)
[8848485.132444] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
[8848485.132445] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 11: 8800004400800093
[8848485.132446] EDAC sbridge MC1: TSC 0 
[8848485.132446] EDAC sbridge MC1: ADDR 0 
[8848485.132447] EDAC sbridge MC1: MISC 4900040004000800 
[8848485.132448] EDAC sbridge MC1: PROCESSOR 0:206d7 TIME 1528575500 SOCKET 1 APIC 20

We should identify which RAM bank is faulty and swap it for a spare one, if we have any.
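As a rough aid for narrowing down the faulty module, the location fields in the EDAC line above can be pulled out with a small script. This is only a sketch: the function name `parse_edac_line` and the regular expressions are illustrative assumptions based on the single log line quoted in this task, not a general parser for every EDAC driver's output. Mapping socket/channel/slot to a physical slot label on the board is platform-specific and still has to be checked against the vendor documentation.

```python
import re

# Sample line copied from the task description above.
SAMPLE = (
    "[8848485.132443] EDAC MC1: 1 CE memory read error on "
    "CPU_SrcID#1_Ha#0_Chan#3_DIMM#0 (channel:3 slot:0 page:0x4e78bf "
    "offset:0xb40 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0093 "
    "socket:1 ha:0 channel_mask:8 rank:0)"
)

# Assumption: the driver prints "key:value" pairs with decimal values for
# these four fields, as in the sample. "channel_mask" is not matched because
# the key must be followed directly by a colon.
FIELD_RE = re.compile(r"\b(socket|channel|slot|rank):(\d+)")

# Assumption: the DIMM label is the token between "on " and the opening paren.
LABEL_RE = re.compile(r"on (\S+) \(")


def parse_edac_line(line):
    """Return the DIMM label and numeric location fields from one EDAC line."""
    label = LABEL_RE.search(line)
    fields = {key: int(value) for key, value in FIELD_RE.findall(line)}
    return {"label": label.group(1) if label else None, **fields}


if __name__ == "__main__":
    print(parse_edac_line(SAMPLE))
```

Running this over all EDAC lines in the kernel log and counting results per label would show whether the corrected errors are concentrated on one DIMM.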

Event Timeline

Restricted Application added a subscriber: Aklapper.
Joe triaged this task as Low priority. Jun 18 2018, 10:49 AM

The hardware log does not show any indication of a bad DIMM. I can probably pull a replacement from a decommissioned spare.

/admin1-> racadm getsel
Record: 1
Date/Time: 01/11/2013 00:34:55
Source: system
Severity: Ok

Description: Log cleared.

Record: 2
Date/Time: 08/06/2015 16:14:54
Source: system
Severity: Critical

Description: The chassis is open while the power is off.

Record: 3
Date/Time: 08/06/2015 16:14:59
Source: system
Severity: Ok

Description: The chassis is closed while the power is off.

/admin1->

@Joe can you stress the DIMM? A simple reseating of the DIMM may also work. Let me know if I can power it down and do that.

Thanks

Vvjjkkii renamed this task from Replace memory bank on scb1002 to u9aaaaaaaa. Jul 1 2018, 1:05 AM
Vvjjkkii raised the priority of this task from Low to High.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.

@Joe please let me know when it's okay to take this down. We can schedule it for Tuesday 7 August.

@Cmjohnson Let me know the next time you're in the DC and I'll disable the host for diagnostics.

I misread the memory size and didn't have 4GB DIMMs, so I replaced all of the memory with the same type, just with an 8GB DIMM in each socket. The server now has 2x the memory it once had.

MoritzMuehlenhoff claimed this task.

Thanks, I've repooled the server. Closing the task, will reopen in case there are still issues.