
cp3034: Uncorrectable Memory Error
Closed, Resolved, Public

Description

While rebooting cp3034 for kernel upgrades today, the host failed to come up and showed the following message:

Enumerating Boot options... Done

UEFI0107: One or more memory errors have occurred on memory slot: B3.
Remove input power to the system, reseat the DIMM module and restart the
system. If the issues persist, then replace the faulty memory module identified
in the message.

UEFI0058: Uncorrectable Memory Error has occurred because a Dual Inline Memory
Module (DIMM) is not functioning.
Check the System Event Log (SEL) to identify the non-functioning DIMM, and then
replace it.

UEFI0030: A keyboard device is not connected to the system.
Connect a keyboard device to the system.


Press F1 to continue, F2 for system setup, F10 for lifecycle controller, F11
for boot manager.

The system did boot properly after a power cycle. However, the SEL contains several "uncorrectable memory error" entries:

$ sudo ipmi-sel -v | grep Uncorrectable
12  | Oct-06-2016 | 14:56:38 | ECC Uncorr Err   | Memory                   | Assertion Event   | Uncorrectable memory error ; OEM Event Data2 code = C1h ; OEM Event Data3 code = 10h
22  | Jul-18-2017 | 11:20:10 | ECC Uncorr Err   | Memory                   | Assertion Event   | Uncorrectable memory error ; OEM Event Data2 code = C0h ; OEM Event Data3 code = 08h
24  | Jul-18-2017 | 11:20:10 | ECC Uncorr Err   | Memory                   | Assertion Event   | Uncorrectable memory error ; OEM Event Data2 code = C1h ; OEM Event Data3 code = 40h
30  | Mar-09-2018 | 15:16:40 | ECC Uncorr Err   | Memory                   | Assertion Event   | Uncorrectable memory error ; OEM Event Data2 code = C1h ; OEM Event Data3 code = 40h
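For tracking how often these events recur, the pipe-separated SEL lines above can be parsed into records. The following is a minimal sketch, assuming the seven-column layout shown in this output; the function names are hypothetical, and the OEM Event Data codes are kept as opaque strings because their mapping to a DIMM slot is not decoded here.

```python
from collections import Counter

# The four SEL lines quoted in this task, copied verbatim.
SAMPLE = """\
12  | Oct-06-2016 | 14:56:38 | ECC Uncorr Err   | Memory                   | Assertion Event   | Uncorrectable memory error ; OEM Event Data2 code = C1h ; OEM Event Data3 code = 10h
22  | Jul-18-2017 | 11:20:10 | ECC Uncorr Err   | Memory                   | Assertion Event   | Uncorrectable memory error ; OEM Event Data2 code = C0h ; OEM Event Data3 code = 08h
24  | Jul-18-2017 | 11:20:10 | ECC Uncorr Err   | Memory                   | Assertion Event   | Uncorrectable memory error ; OEM Event Data2 code = C1h ; OEM Event Data3 code = 40h
30  | Mar-09-2018 | 15:16:40 | ECC Uncorr Err   | Memory                   | Assertion Event   | Uncorrectable memory error ; OEM Event Data2 code = C1h ; OEM Event Data3 code = 40h
"""


def parse_sel_line(line):
    """Parse one pipe-separated SEL line; return None if it is not a record."""
    fields = [f.strip() for f in line.split("|")]
    if len(fields) != 7 or not fields[0].isdigit():
        return None  # header line, blank line, or unexpected layout
    record_id, date, time, name, sensor_type, direction, event = fields
    # The event column holds a description followed by ';'-separated OEM codes.
    parts = [p.strip() for p in event.split(";")]
    oem = {}
    for part in parts[1:]:
        key, _, value = part.partition(" code = ")
        oem[key] = value
    return {
        "id": int(record_id),
        "date": date,
        "time": time,
        "name": name,
        "event": parts[0],
        "oem": oem,
    }


def uncorrectable_events(text):
    """Return all records whose description is an uncorrectable memory error."""
    records = (parse_sel_line(line) for line in text.splitlines())
    return [r for r in records if r and r["event"] == "Uncorrectable memory error"]


def count_by_oem_codes(events):
    """Count events per (Data2, Data3) pair to spot repeated codes."""
    return Counter(
        (e["oem"].get("OEM Event Data2"), e["oem"].get("OEM Event Data3"))
        for e in events
    )
```

Run against the sample, this groups the four entries by code pair, which makes it easy to see that the C1h/40h combination appears on both Jul-18-2017 and Mar-09-2018.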

Event Timeline

Restricted Application added a subscriber: Aklapper.
ema triaged this task as Medium priority. Mar 9 2018, 3:33 PM

See also T183177 (why aren't we getting runtime icinga alerts when these happen, via EDAC?)
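As an illustration of what a runtime check via EDAC could look like, here is a minimal sketch. It assumes the Linux EDAC driver is loaded and exposes per-controller uncorrectable error counters at /sys/devices/system/edac/mc/mc*/ue_count, and it returns Nagios-style status codes (0 OK, 2 CRITICAL, 3 UNKNOWN). The function names are hypothetical, and this is not the check discussed in T183177.

```python
import glob
import os

# Assumed location of the EDAC memory controller counters on Linux.
EDAC_ROOT = "/sys/devices/system/edac/mc"


def read_ue_counts(root=EDAC_ROOT):
    """Return {controller_name: uncorrectable_error_count} for each mc* dir."""
    counts = {}
    for path in sorted(glob.glob(os.path.join(root, "mc*", "ue_count"))):
        controller = os.path.basename(os.path.dirname(path))
        with open(path) as f:
            counts[controller] = int(f.read().strip())
    return counts


def check(root=EDAC_ROOT):
    """Return (status_code, message) following Nagios plugin conventions."""
    counts = read_ue_counts(root)
    if not counts:
        # No counters found: EDAC may not be loaded, so the state is unknown.
        return 3, "UNKNOWN: no EDAC memory controllers found under %s" % root
    total = sum(counts.values())
    if total > 0:
        detail = ", ".join("%s=%d" % item for item in sorted(counts.items()))
        return 2, "CRITICAL: %d uncorrectable memory error(s) (%s)" % (total, detail)
    return 0, "OK: no uncorrectable memory errors on %d controller(s)" % len(counts)
```

A wrapper script would print the message and exit with the returned code. Note that these counters only cover errors seen since the EDAC driver was loaded, so they complement the SEL rather than replace it.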

Also, depooled for now, since we can't rule out the uncorrectable memory errors causing production issues:
16:07 <+logmsgbot> !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp3034.esams.wmnet

mark raised the priority of this task from Medium to High. Jul 3 2018, 12:46 PM
Stashbot subscribed.

Mentioned in SAL (#wikimedia-traffic) [2018-07-04T11:30:17Z] <ema> shutdown cp3048 and cp3034 (both already depooled) for hardware maintenance T190607 T189305

mark subscribed.

Swapped DIMM B3 with DIMM B3 from cp3048 (parts donor). Server booted up just fine afterwards.

Mentioned in SAL (#wikimedia-traffic) [2018-07-04T12:39:08Z] <ema> cp3034 repooled after hw maintenance T189305