
cp2008 memory replacement
Closed, ResolvedPublic

Assigned To
Authored By
RobH
Apr 2 2018, 5:57 PM
Referenced Files
F16764489: image001.jpg
Apr 6 2018, 7:21 PM
F16764495: image004.jpg
Apr 6 2018, 7:21 PM
F16684552: cp2018.JPG
Apr 5 2018, 12:47 AM
F16684652: cp2018.JPG
Apr 5 2018, 12:47 AM
F16684704: cp2011.JPG
Apr 5 2018, 12:47 AM
F16600427: TSR20180402182618_7M5FF42.zip
Apr 2 2018, 6:32 PM

Description

This task was generated as a sub-task to T190540. On T190540, we discovered multiple cp systems in codfw with memory errors.

cp2008 has reported multiple DIMM errors in the SEL, but passed memtest86. We should still swap out the bad DIMMs. Since the SEL wasn't copied before it was cleared, @RobH parsed the TSR (support report) to pull out historical SEL entries.

The DIMMs to replace are: DIMM_A2, DIMM_B2, DIMM_B6.

Below are selected entries pulled out of the TSR:

<Event AgentID="SEL" Category="System Health" Severity="Critical" Timestamp="2015-10-09T19:33:56-0500" Sequence="51">
  <Message>Correctable memory error rate exceeded for DIMM_A2.</Message>
  <MessageID>MEM0702</MessageID>
  <FQDD>DIMM.Socket.A2</FQDD>
  <MessageArgs>
    <Arg>DIMM_A2</Arg>
  </MessageArgs>
  <RawEventData>0x03,0x00,0x02,0xA4,0x16,0x18,0x56,0xB1,0x00,0x04,0x0C,0x1B,0x07,0xA2,0xC0,0x02</RawEventData>
  <Comment/>
</Event>
<Event AgentID="SEL" Category="System Health" Severity="Warning" Timestamp="2016-02-07T04:12:53-0600" Sequence="80">
  <Message>Correctable memory error rate exceeded for DIMM_B2.</Message>
  <MessageID>MEM0701</MessageID>
  <FQDD>DIMM.Socket.B2</FQDD>
  <MessageArgs>
    <Arg>DIMM_B2</Arg>
  </MessageArgs>
  <RawEventData>0x06,0x00,0x02,0x45,0xC4,0xB6,0x56,0xB1,0x00,0x04,0x0C,0x1B,0x07,0xA1,0xC1,0x20</RawEventData>
  <Comment/>
</Event>
<Event AgentID="SEL" Category="System Health" Severity="Warning" Timestamp="2016-02-07T04:13:01-0600" Sequence="81">
  <Message>Correctable memory error rate exceeded for DIMM_B6.</Message>
  <MessageID>MEM0701</MessageID>
  <FQDD>DIMM.Socket.B6</FQDD>
  <MessageArgs>
    <Arg>DIMM_B6</Arg>
  </MessageArgs>
  <RawEventData>0x07,0x00,0x02,0x4D,0xC4,0xB6,0x56,0xB1,0x00,0x04,0x0C,0x1B,0x07,0xA1,0xC2,0x02</RawEventData>
  <Comment/>
</Event>
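
As an illustration of how the DIMM list above follows from these entries, here is a minimal Python sketch that groups the events by DIMM slot. It is not the procedure actually used on this task. It assumes the events are available as a text fragment like the one quoted above; the `SEL_EVENTS` sample is a trimmed copy of those three entries, and the `dimm_events` helper and the `<Events>` wrapper element are hypothetical names introduced only for this example. The layout of a full TSR archive is not covered here.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Trimmed copies of the three events quoted above
# (MessageArgs, RawEventData and Comment are omitted for brevity).
SEL_EVENTS = """
<Event AgentID="SEL" Category="System Health" Severity="Critical" Timestamp="2015-10-09T19:33:56-0500" Sequence="51">
  <Message>Correctable memory error rate exceeded for DIMM_A2.</Message>
  <MessageID>MEM0702</MessageID>
  <FQDD>DIMM.Socket.A2</FQDD>
</Event>
<Event AgentID="SEL" Category="System Health" Severity="Warning" Timestamp="2016-02-07T04:12:53-0600" Sequence="80">
  <Message>Correctable memory error rate exceeded for DIMM_B2.</Message>
  <MessageID>MEM0701</MessageID>
  <FQDD>DIMM.Socket.B2</FQDD>
</Event>
<Event AgentID="SEL" Category="System Health" Severity="Warning" Timestamp="2016-02-07T04:13:01-0600" Sequence="81">
  <Message>Correctable memory error rate exceeded for DIMM_B6.</Message>
  <MessageID>MEM0701</MessageID>
  <FQDD>DIMM.Socket.B6</FQDD>
</Event>
"""


def dimm_events(fragment):
    """Group SEL events by DIMM slot.

    Returns a dict such as {"DIMM_A2": [(timestamp, severity, message_id)]}.
    """
    # The quoted events have no common parent, so wrap them in a
    # hypothetical <Events> root to make the fragment well-formed XML.
    root = ET.fromstring(f"<Events>{fragment}</Events>")
    grouped = defaultdict(list)
    for event in root.iter("Event"):
        fqdd = event.findtext("FQDD", default="")
        # Skip anything that is not a memory slot (fans, PSUs, ...).
        if not fqdd.startswith("DIMM.Socket."):
            continue
        slot = "DIMM_" + fqdd.rsplit(".", 1)[-1]
        grouped[slot].append(
            (event.get("Timestamp"), event.get("Severity"), event.findtext("MessageID"))
        )
    return dict(grouped)


# Print one line per affected slot with its recorded events.
for slot, events in sorted(dimm_events(SEL_EVENTS).items()):
    print(slot, events)
```

Grouping by the `FQDD` field rather than by the free-text `Message` keeps the match independent of the message wording.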

Event Timeline

RobH triaged this task as Medium priority.Apr 2 2018, 5:57 PM
RobH created this task.
RobH raised the priority of this task from Medium to High.Apr 2 2018, 6:32 PM
RobH updated the task description. (Show Details)

Hello Papaul,

Thank you for replying.

We have created the following cases:

  1. 7M99F42 - 963059814 – 351594756 (Dispatch) ----- Motherboard and 2 DIMMs
  2. 7M6CF42 - 963061588 – 351604843 (Dispatch) ----- Motherboard and 2 DIMMs
  3. 824FF42 - 963052179 – 351606240 (Dispatch) ----- Motherboard and 3 DIMMs
  4. 824BF42 - 963052308 – 351607709 (Dispatch) ----- Motherboard, 2 DIMMs and 1 Fan

As discussed, we have booked an Engineer call for the replacement of system board under the dispatch #.

I have also ensured that the Engineer will be contacting you in advance to ensure smooth completion of the service call. The service is expected to happen tomorrow between 9am-5:30pm, barring any excessive weather or part backorder delays.

Regarding the remaining 3 servers, we are not sure which parts to book, as the logs are very old, as shown below:

A.

cp2018.JPG (491×542 px, 55 KB)

B.

cp2018.JPG (491×542 px, 55 KB)

C.

cp2011.JPG (412×526 px, 46 KB)

Recommendation:

• Clear the SEL log
• Update the BIOS and iDRAC firmware
• Share the TSR with us again.

@ema @BBlack I need to update the BIOS and iDRAC on cp2008 as requested by Dell. Can you please depool the server, since it is not shown as depooled on the sheet?

Thanks.

This comment was removed by RobH.

Old comment removed after IRC discussion: Papaul points out that they aren't refusing the replacement, they just want an updated TSR.

cp2008 has been shut down. It can be flashed and updated as needed. Once that is done, we'll need to coordinate with the traffic team to bring it fully online before moving on to the next cp system firmware update (for online systems).

BIOS and iDRAC update complete.

Hello Papaul,

Thank you for sharing the log.

I am currently in Training, however I got a chance to look at the TSR and analyzed it.

We do see that the firmware on the server has been updated to the latest version. However, we do not see any recent hardware errors in the log, as shown below.

image001.jpg (175×919 px, 31 KB)

You can also see the current Memory status (All Good) on the server as shown below:

image004.jpg (348×922 px, 60 KB)

At this moment we do not see any requirement for the memory to be replaced. However, please let us know if you see any Memory error during POST.

Thank you for your understanding.

@Papaul: Please advise Dell that we saw the errors in the logs we provided, and that we aren't willing to use the faulty hardware in production without replacement of the memory modules at a minimum. The errors happened, and that should be good enough for replacement memory.

If they still refuse, please let me know, and we'll be escalating this up to our account team.

DIMM A2 replaced
DIMM B2 replaced
DIMM B6 replaced

The system has been pushed back into service with the new memory in use.

Also note that I rebooted cp2008 through POST to the Debian kernel selection screen 7 times, without any memory errors during POST.