
cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error
Closed, Resolved, Public

Description

While rebooting cp2006 today:

Enumerating Boot options...                                                                                                                                   
Enumerating Boot options... Done                                                                                                                              
                                                                                                                                                              
UEFI0107: One or more memory errors have occurred on memory slot: B2.                                                                                         
Remove input power to the system, reseat the DIMM module and restart the                                                                                      
system. If the issues persist, replace the faulty memory module identified in                                                                                 
the message.                                                                                                                                                  

UEFI0058: Uncorrectable Memory Error has occurred because a Dual Inline Memory                                                                                
Module (DIMM) is not functioning.                                                                                                                             
Check the System Event Log (SEL) to identify the non-functioning DIMM, and then              
replace it.

The SEL contains a few related events, one of them from today:

$ sudo ipmi-sel -v | grep memory                                                                                                                              
8   | Feb-05-2016 | 19:05:57 | ECC Uncorr Err   | Memory                   | Assertion Event   | Uncorrectable memory error ; OEM Event Data2 code = C1h ; OEM Event Data3 code = 20h
13  | Jun-01-2016 | 17:52:21 | ECC Uncorr Err   | Memory                   | Assertion Event   | Uncorrectable memory error ; OEM Event Data2 code = C1h ; OEM Event Data3 code = 20h
21  | Mar-23-2018 | 16:55:08 | ECC Uncorr Err   | Memory                   | Assertion Event   | Uncorrectable memory error ; OEM Event Data2 code = C1h ; OEM Event Data3 code = 20h

The same thing happened on cp2010, SEL follows:

15  | Mar-23-2018 | 17:41:46 | ECC Uncorr Err   | Memory                   | Assertion Event   | Uncorrectable memory error ; OEM Event Data2 code = C1h ; OEM Event Data3 code = 20h
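
The pipe-delimited SEL lines above can be checked mechanically rather than by eye. The following is a minimal sketch, assuming the column layout seen in the `ipmi-sel -v` output in this task; the function name `parse_sel_line` and the dictionary keys are illustrative and not part of any existing tooling.

```python
import re
from datetime import datetime

# Three lines copied from the cp2006 SEL output above.
SAMPLE = """\
8   | Feb-05-2016 | 19:05:57 | ECC Uncorr Err   | Memory                   | Assertion Event   | Uncorrectable memory error ; OEM Event Data2 code = C1h ; OEM Event Data3 code = 20h
13  | Jun-01-2016 | 17:52:21 | ECC Uncorr Err   | Memory                   | Assertion Event   | Uncorrectable memory error ; OEM Event Data2 code = C1h ; OEM Event Data3 code = 20h
21  | Mar-23-2018 | 16:55:08 | ECC Uncorr Err   | Memory                   | Assertion Event   | Uncorrectable memory error ; OEM Event Data2 code = C1h ; OEM Event Data3 code = 20h
"""


def parse_sel_line(line):
    """Return a dict for one SEL line, or None if the line is not an event row."""
    fields = [f.strip() for f in line.split("|")]
    # Some dumps in this task omit the "Assertion Event" column, so accept
    # six or more fields and always read the event text from the last one.
    if len(fields) < 6 or not fields[0].isdigit():
        return None
    # %b relies on English month abbreviations (the default C locale).
    when = datetime.strptime(f"{fields[1]} {fields[2]}", "%b-%d-%Y %H:%M:%S")
    event = fields[-1]
    # Collect the OEM data codes, e.g. {"2": "C1h", "3": "20h"}.
    codes = dict(re.findall(r"OEM Event Data(\d) code = (\w+)", event))
    return {"id": int(fields[0]), "when": when, "sensor": fields[3],
            "event": event, "oem": codes}


records = [r for r in map(parse_sel_line, SAMPLE.splitlines()) if r]
uncorrectable = [r for r in records
                 if r["event"].startswith("Uncorrectable memory error")]
latest = max(r["when"] for r in uncorrectable)
print(len(uncorrectable), "uncorrectable events, latest on", latest.date())
```

Grouping the results by the OEM Event Data3 code may help tell whether repeated errors keep hitting the same location, although the meaning of those codes is not documented in this task.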

Memory Testing for cp2001-cp2026

Due to the large number of memory failures in this batch (they were all ordered together, 26, on RT#9336), we'll need to step through the systems and test memory across the order.

We'll also open an email thread with Dell in an attempt to resolve this situation. However, they are unlikely to want to foot the bill for proactively swapping all 416 16GB DIMMs in that 26-system order.

Memory testing procedure:

All tests should be recorded on this google sheet.

  • start at cp2001, and work upwards.
    • Systems that are already depooled can be tested at any time (suggest testing all depooled first, then starting at cp2001 and working upwards.)
    • Systems that are perma-depooled as spare can be tested at any time.
  • of the pooled systems, do not take down more than 1 server per pool (text/upload).
  • systems will be automatically depooled by pybal/confctl, no need to manually depool.
  • copy down any errors in the SEL before memory testing; the SEL has to be wiped prior to tests (otherwise the test will report the old SEL errors). (command: racadm getsel)
  • clear out the SEL on drac (command: racadm delsel)
  • run memory tests on system, monitor result and update this google sheet.
  • if system passes memory testing, reboot it back into its OS and note it on the google sheet. Coordinate in #wikimedia-traffic to bring servers on and offline. Do not take down another system until we confirm the system just tested is back online.

Once we have memtesting of the entire batch, we'll have a better idea on how widespread this issue is.
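
The ordering rules in the procedure above (depooled hosts first, then ascending from cp2001, and never more than one pooled server down per pool) can be sketched as follows. This is only an illustration: the function names, the depooled set, and the host-to-pool mapping are assumptions made for the example, not data from the sheet.

```python
import re


def plan_order(hosts, depooled):
    """Depooled hosts first, then the rest; each group in ascending host number."""
    def host_number(name):
        return int(re.search(r"\d+", name).group())

    first = sorted((h for h in hosts if h in depooled), key=host_number)
    rest = sorted((h for h in hosts if h not in depooled), key=host_number)
    return first + rest


def can_take_down(host, pool_of, in_progress):
    """Allow a pooled host to go down only if no host from its pool is already down."""
    return all(pool_of[h] != pool_of[host] for h in in_progress)


hosts = [f"cp{n}" for n in range(2001, 2027)]       # cp2001-cp2026
depooled = {"cp2006", "cp2010", "cp2017"}           # assumed for illustration
order = plan_order(hosts, depooled)

# Hypothetical pool assignment, only to exercise the check.
pool_of = {"cp2001": "text", "cp2002": "upload", "cp2003": "text"}
print(order[:5])
print(can_take_down("cp2002", pool_of, in_progress={"cp2001"}))
print(can_take_down("cp2003", pool_of, in_progress={"cp2001"}))
```

The second check is deliberately separate from the ordering, since which hosts are down at a given moment is only known while the work is in progress.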

Event Timeline

ema triaged this task as Medium priority. Mar 23 2018, 5:05 PM
ema renamed this task from cp2006: Uncorrectable Memory Error to cp2006, cp2010: Uncorrectable Memory Error. Mar 23 2018, 5:48 PM
ema updated the task description.

Depooled both today, we should do that in general as these arise.

ema renamed this task from cp2006, cp2010: Uncorrectable Memory Error to cp2006, cp2010, cp2017: Uncorrectable Memory Error. Mar 27 2018, 4:16 PM

Same issue on cp2017 today. Host depooled.

6   | Sep-28-2015 | 20:10:59 | ECC Uncorr Err   | Memory                   | Uncorrectable memory error ; OEM Event Data2 code = C0h ; OEM Event Data3 code = 02h
8   | Sep-28-2015 | 20:14:02 | ECC Uncorr Err   | Memory                   | Uncorrectable memory error ; OEM Event Data2 code = C0h ; OEM Event Data3 code = 02h
9   | Sep-28-2015 | 20:14:02 | ECC Uncorr Err   | Memory                   | Uncorrectable memory error ; OEM Event Data2 code = C0h ; OEM Event Data3 code = 02h
15  | Jun-01-2016 | 17:26:58 | ECC Uncorr Err   | Memory                   | Uncorrectable memory error ; OEM Event Data2 code = C0h ; OEM Event Data3 code = 08h
23  | Oct-06-2016 | 17:06:34 | ECC Uncorr Err   | Memory                   | Uncorrectable memory error ; OEM Event Data2 code = C0h ; OEM Event Data3 code = 02h
25  | Oct-06-2016 | 17:06:34 | ECC Uncorr Err   | Memory                   | Uncorrectable memory error ; OEM Event Data2 code = C0h ; OEM Event Data3 code = 08h
33  | Feb-08-2017 | 16:13:38 | ECC Uncorr Err   | Memory                   | Uncorrectable memory error ; OEM Event Data2 code = C0h ; OEM Event Data3 code = 08h
36  | Feb-25-2017 | 19:29:53 | ECC Uncorr Err   | Memory                   | Uncorrectable memory error ; OEM Event Data2 code = C0h ; OEM Event Data3 code = 02h
38  | Feb-25-2017 | 19:31:44 | ECC Uncorr Err   | Memory                   | Uncorrectable memory error ; OEM Event Data2 code = C0h ; OEM Event Data3 code = 02h
44  | Jul-18-2017 | 15:55:10 | ECC Uncorr Err   | Memory                   | Uncorrectable memory error ; OEM Event Data2 code = C0h ; OEM Event Data3 code = 08h
50  | Mar-27-2018 | 16:12:53 | ECC Uncorr Err   | Memory                   | Uncorrectable memory error ; OEM Event Data2 code = C0h ; OEM Event Data3 code = 02h

The memory error situation on the codfw cache hosts is pretty bad. Besides cp2006, cp2010, and cp2017 (found while rebooting), I've now checked the SEL on the other hosts and the following are also affected:

  • cp2008
  • cp2011
  • cp2018
  • cp2022

I'll skip rebooting them for kernel upgrades for now.

In comparison, only one eqiad cp host has the same issue (cp1074) and 3 in esams (cp[3034,3040,3045]). None in eqsin and ulsfo.

ema renamed this task from cp2006, cp2010, cp2017: Uncorrectable Memory Error to cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error. Mar 28 2018, 2:28 PM

These seem to be under warranty for another 2 months, so we should hurry up.

7 out of 22 identical hosts having memory errors sounds like a bad batch. @RobH, perhaps we should escalate this with Dell and make a fuss about it and ask for proactive replacements for the rest of the memory DIMMs? How has this been handled in the past?

RobH added a subscriber: Papaul.

After reviewing with the traffic team, we're going to test memory in all of these. I've updated the task description with the following:

Memory testing procedure:

All tests should be recorded on this google sheet.

  • start at cp2001, and work upwards.
    • Systems that are already depooled can be tested at any time (suggest testing all depooled first, then starting at cp2001 and working upwards.)
    • Systems that are perma-depooled as spare can be tested at any time.
  • of the pooled systems, do not take down more than 1 server per pool (text/upload).
  • systems will be automatically depooled by pybal/confctl, no need to manually depool.
  • copy down any errors in the SEL before memory testing; the SEL has to be wiped prior to tests (otherwise the test will report the old SEL errors). (command: racadm getsel)
  • clear out the SEL on drac (command: racadm delsel)
  • run memory tests on system, monitor result and update this google sheet.
  • if system passes memory testing, reboot it back into its OS and note it on the google sheet. Coordinate in #wikimedia-traffic to bring servers on and offline. Do not take down another system until we confirm the system just tested is back online.

Once we have memtesting of the entire batch, we'll have a better idea on how widespread this issue is.

@Papaul: Please start memtests on these hosts. Work through the offline hosts and spare hosts first, and then start at cp2001 and work upwards following the directions above. Please feel free to coordinate with me or someone on the traffic team as you start working on online hosts.

All of those systems are running outdated IDRAC and BIOS versions. I would like to update the IDRAC and BIOS first before running the memory test.

@BBlack @RobH I have already run the test on 6 of the depooled systems and also upgraded the IDRAC and BIOS on them. You can see the results in the Google sheet. I am now running the test on cp2021, which is the last depooled system on the sheet. If you have a minute, can you depool other cp systems and update the sheet with the ones you depool?

Thanks.

@Papaul:
The remaining systems will need to be depooled and repooled one at a time for work, please coordinate with either myself or a member of traffic team via IRC for the remainder.

Thanks!

Well we should maybe pause at this point and ask if this test is doing any good? It seems odd that 3/6 tested had the SEL entries for multiple uncorrectables over time, yet the memtest comes back fine.

Please note I've asked @Papaul to memtest86+ cp2022 WITHOUT flashing the bios/drac.

Once we have that result, we'll also then start reboot looping cp2022 to attempt to re-create the memory post error in the SEL seen before. If it occurs, we'll then upgrade the bios and attempt to re-create the error.

cp2022 SEL
"Normal","Sat May 30 2015 03:52:02","Log cleared."
"Warning","Wed Jun 01 2016 17:39:30","Correctable memory error rate exceeded for DIMM_B2."
"Warning","Wed Jun 01 2016 17:39:35","Correctable memory error rate exceeded for DIMM_B6."
"Critical","Wed Jun 01 2016 17:39:39","Correctable memory error rate exceeded for DIMM_B2."
"Normal","Thu Oct 06 2016 16:26:01","A problem was detected in Memory Reference Code (MRC)."
"Critical","Thu Oct 06 2016 16:26:01","Multi-bit memory errors detected on a memory device at location(s) DIMM_B2."
"Warning","Thu Oct 06 2016 16:26:47","Fan 5A RPM is less than the lower warning threshold."
"Critical","Thu Oct 06 2016 16:26:47","Fan 5A RPM is less than the lower critical threshold."
"Warning","Thu Oct 06 2016 16:26:52","Fan 5A RPM is less than the lower warning threshold."
"Normal","Thu Oct 06 2016 16:26:53","Fan 5A RPM is within range."
"Warning","Thu Oct 06 2016 16:32:15","Correctable memory error rate exceeded for DIMM_B2."
"Warning","Thu Oct 06 2016 16:32:15","Correctable memory error rate exceeded for DIMM_B6."
"Critical","Thu Oct 06 2016 16:32:17","Correctable memory error rate exceeded for DIMM_B2."
"Warning","Thu Oct 06 2016 16:32:33","Fan 5A RPM is less than the lower warning threshold."
"Critical","Thu Oct 06 2016 16:32:33","Fan 5A RPM is less than the lower critical threshold."
"Warning","Thu Oct 06 2016 16:32:38","Fan 5A RPM is less than the lower warning threshold."
"Normal","Thu Oct 06 2016 16:32:38","Fan 5A RPM is within range."
"Warning","Wed Oct 26 2016 16:16:01","Correctable memory error rate exceeded for DIMM_B6."
"Critical","Wed Oct 26 2016 16:16:05","Correctable memory error rate exceeded for DIMM_B6."
"Normal","Wed Feb 08 2017 14:15:06","A problem was detected in Memory Reference Code (MRC)."
"Critical","Wed Feb 08 2017 14:15:06","Multi-bit memory errors detected on a memory device at location(s) DIMM_B2."
"Warning","Wed Feb 08 2017 14:15:51","Fan 5A RPM is less than the lower warning threshold."
"Critical","Wed Feb 08 2017 14:15:51","Fan 5A RPM is less than the lower critical threshold."
"Critical","Wed Feb 08 2017 14:15:55","Fan redundancy is lost."
"Warning","Wed Feb 08 2017 14:15:56","Fan 5A RPM is less than the lower warning threshold."
"Normal","Wed Feb 08 2017 14:15:56","Fan 5A RPM is within range."
"Normal","Wed Feb 08 2017 14:16:00","The fans are redundant."
"Warning","Wed Nov 08 2017 09:28:52","Correctable memory error rate exceeded for DIMM_B2."
"Warning","Wed Nov 08 2017 09:28:53","Correctable memory error rate exceeded for DIMM_B6."
"Critical","Wed Nov 08 2017 09:29:03","Correctable memory error rate exceeded for DIMM_B2."
"Warning","Thu Nov 09 2017 12:34:08","Correctable memory error rate exceeded for DIMM_B2."
"Warning","Thu Nov 09 2017 12:34:08","Correctable memory error rate exceeded for DIMM_B6."
"Critical","Thu Nov 09 2017 12:34:18","Correctable memory error rate exceeded for DIMM_B6."
"Warning","Tue Jan 09 2018 18:11:40","Correctable memory error rate exceeded for DIMM_B2."
"Warning","Tue Jan 09 2018 18:11:46","Correctable memory error rate exceeded for DIMM_B6."
"Critical","Tue Jan 09 2018 18:11:51","Correctable memory error rate exceeded for DIMM_B2."
"Normal","Wed Mar 28 2018 15:44:48","A problem was detected in Memory Reference Code (MRC)."
"Critical","Wed Mar 28 2018 15:44:48","Multi-bit memory errors detected on a memory device at location(s) DIMM_B2."
"Warning","Wed Mar 28 2018 15:45:33","Fan 5A RPM is less than the lower warning threshold."
"Normal","Wed Mar 28 2018 15:45:39","Fan 5A RPM is within range."
"Warning","Wed Mar 28 2018 15:56:53","Correctable memory error rate exceeded for DIMM_B2."
"Critical","Wed Mar 28 2018 15:56:53","Correctable memory error rate exceeded for DIMM_B2."
"Warning","Wed Mar 28 2018 15:56:53","Correctable memory error rate exceeded for DIMM_B6."
"Critical","Wed Mar 28 2018 15:56:53","Correctable memory error rate exceeded for DIMM_B6."

cp2022 SEL after test

"Normal","Thu Mar 29 2018 20:11:58","Log cleared."
"Warning","Thu Mar 29 2018 20:14:22","Fan 5A RPM is less than the lower warning threshold."
"Critical","Thu Mar 29 2018 20:14:22","Fan 5A RPM is less than the lower critical threshold."
"Critical","Thu Mar 29 2018 20:14:27","Fan redundancy is lost."
"Warning","Thu Mar 29 2018 20:14:28","Fan 5A RPM is less than the lower warning threshold."
"Normal","Thu Mar 29 2018 20:14:28","Fan 5A RPM is within range."
"Normal","Thu Mar 29 2018 20:14:33","The fans are redundant."

IRC Update:

Papaul copied the SEL to the task, cleared it, and ran memtest86+ on cp2022. The tests produced no errors, and 2 additional reboots showed no errors in the SEL.

Next steps:

  • reboot cp2022 in serial console a dozen times and watch it
    • if it has the error, it shows during post AND pushes to SEL
    • if it has the error, try to recreate multiple times, and then flash bios
  • see if new bios shows the error when rebooted a dozen times.
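
The reboot loop in the next steps above could be scripted roughly as follows. This is a sketch only: `reboot` and `read_sel` are hypothetical callables that the operator would have to supply (for example, wrappers around whatever is used to power-cycle the host and dump its SEL), and nothing here talks to real hardware.

```python
def try_reproduce(reboot, read_sel, attempts=12):
    """Reboot up to `attempts` times; stop at the first memory entry in the SEL.

    Returns (attempt_number, matching_entries), or (None, []) if the error
    never showed up.
    """
    for attempt in range(1, attempts + 1):
        reboot()
        # Matches both "Correctable memory error rate exceeded ..." and
        # "Multi-bit memory errors detected ..."; fan entries do not match.
        hits = [entry for entry in read_sel() if "memory error" in entry.lower()]
        if hits:
            return attempt, hits
    return None, []


# Dry run with stand-in callables: the SEL only ever contains a fan warning.
reboots = []
result = try_reproduce(
    reboot=lambda: reboots.append("reboot"),
    read_sel=lambda: ["Fan 5A RPM is less than the lower warning threshold."],
)
print(result, len(reboots))
```

Running the same loop once before and once after the BIOS flash would give two comparable results for the two cases listed above.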

So I've rebooted cp2022 12 times, attempting to re-create the memory error that @ema experienced on this machine (and that is demonstrated in the system's SEL).

I could not get the memory error to occur at any time. Additionally, @Papaul also ran memtest86+ on this system, and it showed no errors.

Basically we're not able to recreate any of the errors seen. On most of the hosts, we updated the bios before the reboot testing, however cp2022 has not had the bios updated, and we still cannot re-create the error.

At this point I'd suggest we simply move through the rest of this particular fleet, upgrading bios/drac and running single memtests. There isn't a whole lot more I can think to do. If we attempt to return them to Dell under warranty repair, they will ask for memtest failure codes, and we cannot furnish them.

Reasonably, we could demand new dimms for all the ones reporting an error in the SEL. However, I'm not sure we can reasonably ask them to replace all the dimms in these hosts, since most have not reported any errors (and we cannot reproduce the errors on the ones that did).

I've reviewed the next steps with @BBlack via IRC.

@Papaul ran memtest on all of the machines reporting failed memory, and they all pass. However, each of the sub-task systems has multiple failures recorded in its SEL for the dimms in question. Each sub-task has a copy of the TSR (tech support report) required by Dell for warranty repairs, along with select entries pulled out of the report to demonstrate the dimm failures.

I don't think we can reasonably expect Dell to replace memory that hasn't reported errors, but these seem perfectly acceptable to request replacement on.

I think with that, this parent task can be resolved, unless @BBlack thinks we need to memtest the remainder of the fleet? With memtest not showing the error, I have no confidence in it being a worthwhile test at this point.
