
cp2022 memory replacement
Closed, ResolvedPublic

Description

This task was generated as a sub-task of T190540. On T190540, we discovered multiple cp systems in codfw with memory errors. As part of that testing, @Papaul was asked to copy down the racadm SEL and then run memtest86. However, the SEL was copied to the task for cp2022 only; for the other systems it could not be populated into the task at creation time.

Please open a warranty replacement for the defective DIMMs listed below, even though they passed memtest86.

cp2022 SEL
"Normal","Sat May 30 2015 03:52:02","Log cleared."
"Warning","Wed Jun 01 2016 17:39:30","Correctable memory error rate exceeded for DIMM_B2."
"Warning","Wed Jun 01 2016 17:39:35","Correctable memory error rate exceeded for DIMM_B6."
"Critical","Wed Jun 01 2016 17:39:39","Correctable memory error rate exceeded for DIMM_B2."
"Normal","Thu Oct 06 2016 16:26:01","A problem was detected in Memory Reference Code (MRC)."
"Critical","Thu Oct 06 2016 16:26:01","Multi-bit memory errors detected on a memory device at location(s) DIMM_B2."
"Warning","Thu Oct 06 2016 16:26:47","Fan 5A RPM is less than the lower warning threshold."
"Critical","Thu Oct 06 2016 16:26:47","Fan 5A RPM is less than the lower critical threshold."
"Warning","Thu Oct 06 2016 16:26:52","Fan 5A RPM is less than the lower warning threshold."
"Normal","Thu Oct 06 2016 16:26:53","Fan 5A RPM is within range."
"Warning","Thu Oct 06 2016 16:32:15","Correctable memory error rate exceeded for DIMM_B2."
"Warning","Thu Oct 06 2016 16:32:15","Correctable memory error rate exceeded for DIMM_B6."
"Critical","Thu Oct 06 2016 16:32:17","Correctable memory error rate exceeded for DIMM_B2."
"Warning","Thu Oct 06 2016 16:32:33","Fan 5A RPM is less than the lower warning threshold."
"Critical","Thu Oct 06 2016 16:32:33","Fan 5A RPM is less than the lower critical threshold."
"Warning","Thu Oct 06 2016 16:32:38","Fan 5A RPM is less than the lower warning threshold."
"Normal","Thu Oct 06 2016 16:32:38","Fan 5A RPM is within range."
"Warning","Wed Oct 26 2016 16:16:01","Correctable memory error rate exceeded for DIMM_B6."
"Critical","Wed Oct 26 2016 16:16:05","Correctable memory error rate exceeded for DIMM_B6."
"Normal","Wed Feb 08 2017 14:15:06","A problem was detected in Memory Reference Code (MRC)."
"Critical","Wed Feb 08 2017 14:15:06","Multi-bit memory errors detected on a memory device at location(s) DIMM_B2."
"Warning","Wed Feb 08 2017 14:15:51","Fan 5A RPM is less than the lower warning threshold."
"Critical","Wed Feb 08 2017 14:15:51","Fan 5A RPM is less than the lower critical threshold."
"Critical","Wed Feb 08 2017 14:15:55","Fan redundancy is lost."
"Warning","Wed Feb 08 2017 14:15:56","Fan 5A RPM is less than the lower warning threshold."
"Normal","Wed Feb 08 2017 14:15:56","Fan 5A RPM is within range."
"Normal","Wed Feb 08 2017 14:16:00","The fans are redundant."
"Warning","Wed Nov 08 2017 09:28:52","Correctable memory error rate exceeded for DIMM_B2."
"Warning","Wed Nov 08 2017 09:28:53","Correctable memory error rate exceeded for DIMM_B6."
"Critical","Wed Nov 08 2017 09:29:03","Correctable memory error rate exceeded for DIMM_B2."
"Warning","Thu Nov 09 2017 12:34:08","Correctable memory error rate exceeded for DIMM_B2."
"Warning","Thu Nov 09 2017 12:34:08","Correctable memory error rate exceeded for DIMM_B6."
"Critical","Thu Nov 09 2017 12:34:18","Correctable memory error rate exceeded for DIMM_B6."
"Warning","Tue Jan 09 2018 18:11:40","Correctable memory error rate exceeded for DIMM_B2."
"Warning","Tue Jan 09 2018 18:11:46","Correctable memory error rate exceeded for DIMM_B6."
"Critical","Tue Jan 09 2018 18:11:51","Correctable memory error rate exceeded for DIMM_B2."
"Normal","Wed Mar 28 2018 15:44:48","A problem was detected in Memory Reference Code (MRC)."
"Critical","Wed Mar 28 2018 15:44:48","Multi-bit memory errors detected on a memory device at location(s) DIMM_B2."
"Warning","Wed Mar 28 2018 15:45:33","Fan 5A RPM is less than the lower warning threshold."
"Normal","Wed Mar 28 2018 15:45:39","Fan 5A RPM is within range."
"Warning","Wed Mar 28 2018 15:56:53","Correctable memory error rate exceeded for DIMM_B2."
"Critical","Wed Mar 28 2018 15:56:53","Correctable memory error rate exceeded for DIMM_B2."
"Warning","Wed Mar 28 2018 15:56:53","Correctable memory error rate exceeded for DIMM_B6."
"Critical","Wed Mar 28 2018 15:56:53","Correctable memory error rate exceeded for DIMM_B6."

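As a rough aid for reading a SEL export like the one above, the following sketch tallies memory-related events per DIMM and severity. It assumes the three-column quoted CSV layout shown in this task (severity, timestamp, message); the sample text is a few lines copied from the cp2022 SEL, and the function name `tally_dimm_events` is hypothetical.

```python
import csv
import io
import re
from collections import Counter

# A few lines copied from the cp2022 SEL in this task; the full log
# could be passed in the same way.
SAMPLE_SEL = '''"Warning","Wed Jun 01 2016 17:39:30","Correctable memory error rate exceeded for DIMM_B2."
"Critical","Thu Oct 06 2016 16:26:01","Multi-bit memory errors detected on a memory device at location(s) DIMM_B2."
"Warning","Thu Oct 06 2016 16:26:47","Fan 5A RPM is less than the lower warning threshold."
"Critical","Wed Oct 26 2016 16:16:05","Correctable memory error rate exceeded for DIMM_B6."
'''

# Matches slot names such as DIMM_B2 or DIMM_B6 inside the message text.
DIMM_RE = re.compile(r"DIMM_[A-Z]\d+")


def tally_dimm_events(sel_text):
    """Count SEL entries per (DIMM slot, severity); non-memory rows are ignored."""
    counts = Counter()
    for row in csv.reader(io.StringIO(sel_text)):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        severity, _timestamp, message = row
        for dimm in DIMM_RE.findall(message):
            counts[(dimm, severity)] += 1
    return counts


if __name__ == "__main__":
    for (dimm, severity), n in sorted(tally_dimm_events(SAMPLE_SEL).items()):
        print(dimm, severity, n)
```

Run against the full log, a tally like this makes it easier to see which slots recur across reboots, which is the pattern that motivated the replacement request.
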
Event Timeline

RobH triaged this task as Normal priority. Apr 2 2018, 5:57 PM
RobH created this task.
ema moved this task from Triage to Hardware on the Traffic board. Apr 3 2018, 4:27 PM
Papaul added a comment. Apr 4 2018, 6:56 PM

Your Service Request
SR#: 963052308


Dear Papaul Tshibamba,

Current Status:

This e-mail serves as confirmation that you have successfully scheduled an onsite service appointment for the following time:

4/5/18 8:00am to 6:00pm

What's Next:

Checking on Status or Updating Service Appointment

You can visit the Support History Page at any time to check for status updates and additional information related to either your service request number or the dispatch number listed below. Any changes or cancellations to your service appointment must be made by 11:59pm (local time) on the business day before your scheduled appointment.

Service Request Information:
Dispatch Information:
Dispatch Number: 351607709
Service Tag: 824BF42
Service Request Number: 963052308
Express Service Code: 17542442306
System Type: POWEREDGE R630,PROWL

Customer Information:
Contact Name: Papaul Tshibamba
Alt Contact Name:
Address: 1649 W Frankford Rd Attn Papaul on 214 772 7488,
City, County, Postal Code, Country code: Carrollton, TX, 75007, US

Papaul added a comment. Apr 5 2018, 8:01 PM

The Dell tech called to say he couldn't make it today. This is now scheduled for first thing Monday morning.

Papaul added a comment. Apr 9 2018, 6:39 PM

DIMM 6 replaced
DIMM 3: the replacement DIMM sent by Dell was bad and needs to be replaced again
Fan #5 replaced

RobH added a comment. Apr 10 2018, 4:57 PM

OK, so I just took this over from Papaul. He replaced the bad memory on the A side earlier today, but after just clearing the log and rebooting, we have more memory errors:

Record: 1
Date/Time: 04/10/2018 16:46:02
Source: system
Severity: Ok

Description: Log cleared.

Record: 2
Date/Time: 04/10/2018 16:53:43
Source: system
Severity: Non-Critical

Description: Correctable memory error rate exceeded for DIMM_B2.

Record: 3
Date/Time: 04/10/2018 16:54:04
Source: system
Severity: Non-Critical

Description: Correctable memory error rate exceeded for DIMM_B6.

Record: 4
Date/Time: 04/10/2018 16:54:16
Source: system
Severity: Critical

Description: Correctable memory error rate exceeded for DIMM_B2.

So it looks like either we have even more bad memory, a bad mainboard, or a combination of the two.
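
To check a record-style log dump like the one above after a log clear, a small parser along these lines could be used. This is only a sketch: it assumes the `Key: value` layout exactly as pasted in this comment, and `parse_records` and `dimms_with_errors` are hypothetical helper names, not part of any Dell tool.

```python
import re

# First two records copied from the comment above.
SAMPLE_LOG = """Record: 1
Date/Time: 04/10/2018 16:46:02
Source: system
Severity: Ok

Description: Log cleared.

Record: 2
Date/Time: 04/10/2018 16:53:43
Source: system
Severity: Non-Critical

Description: Correctable memory error rate exceeded for DIMM_B2.
"""

DIMM_RE = re.compile(r"DIMM_[A-Z]\d+")


def parse_records(text):
    """Split 'Key: value' lines into one dict per 'Record:' block."""
    records = []
    current = None
    for line in text.splitlines():
        # Split on the first colon only, so timestamps keep their colons.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or line without a key
        key, value = key.strip(), value.strip()
        if key == "Record":
            current = {"Record": int(value)}
            records.append(current)
        elif current is not None:
            current[key] = value
    return records


def dimms_with_errors(records):
    """Return the sorted set of DIMM slots named in any record description."""
    slots = set()
    for rec in records:
        slots.update(DIMM_RE.findall(rec.get("Description", "")))
    return sorted(slots)


if __name__ == "__main__":
    print(dimms_with_errors(parse_records(SAMPLE_LOG)))
```

If the same slots show up again right after a clear and reboot, as they did here, that supports suspecting the mainboard rather than only the DIMMs.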

@BBlack we replaced the main board on cp2022 and the new NIC MAC address is: 44:A8:42:2D:1E:80

I asked the Dell tech to leave the memory for the other 3 servers (cp2008, cp2011 and cp2018) with me, so once you confirm that cp2022 looks good we can go ahead and work on the other servers.

thanks

Note: there is no need to reimage the server because the MAC address is the same: 44:A8:42:2D:1E:80

BBlack closed this task as Resolved. Apr 10 2018, 6:37 PM

all green in icinga now and repooled, closing!

Mentioned in SAL (#wikimedia-operations) [2018-04-11T07:45:17Z] <ema> cp2022: restart varnish-be due to child process crash https://phabricator.wikimedia.org/P6979 T191229