
cp1087 powercycled
Open, Medium, Public

Description

Hello!

I had to depool and powercycle cp1087: Icinga reported it down, and indeed neither ssh nor the mgmt serial console tty was available. This is the output of racadm getsel:

-------------------------------------------------------------------------------
Record:      146                                                                                               
Date/Time:   03/30/2021 03:00:44                                                                               
Source:      system                                                                                            
Severity:    Critical                                                                                          
Description: CPU 1 machine check error detected.                                                               
-------------------------------------------------------------------------------                                
Record:      147                                                                                               
Date/Time:   03/30/2021 03:00:44                                                                               
Source:      system                                                                                            
Severity:    Ok                                                                                                
Description: An OEM diagnostic event occurred.                                                                 
-------------------------------------------------------------------------------                                
[..]                                                         
-------------------------------------------------------------------------------                                
Record:      155                                                                                               
Date/Time:   03/30/2021 02:04:04                                                                               
Source:      system                                                                                            
Severity:    Ok                                                                                                
Description: A problem was detected related to the previous server boot.                                       
-------------------------------------------------------------------------------  
Record:      156                                                                                               
Date/Time:   03/30/2021 02:04:04                                                                               
Source:      system                                                                                          
Severity:    Critical                                                                                        
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A6.     
-------------------------------------------------------------------------------                              
Record:      157                                                                                             
Date/Time:   03/30/2021 02:04:04                                                                             
Source:      system                                                                                          
Severity:    Critical                                                                                        
Description: CPU 1 machine check error detected.                                                             
-------------------------------------------------------------------------------                              
Record:      158                                                                                             
Date/Time:   03/30/2021 02:04:04                                                                             
Source:      system                                                                                          
Severity:    Ok                                                                                              
Description: An OEM diagnostic event occurred.                                                               
-------------------------------------------------------------------------------                              
[..]
-------------------------------------------------------------------------------             
Record:      165                                                                                             
Date/Time:   03/30/2021 02:04:05                                                                             
Source:      system                                                                                          
Severity:    Ok                                                                                              
Description: An OEM diagnostic event occurred.
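
For anyone triaging similar output, here is a minimal sketch of pulling out only the Critical records. It assumes the "Key: value" layout shown above; `parse_sel` and `critical` are hypothetical helper names, not part of racadm, and the sample text is copied from records 156-158.

```python
SAMPLE = """\
-------------------------------------------------------------------------------
Record:      156
Date/Time:   03/30/2021 02:04:04
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A6.
-------------------------------------------------------------------------------
Record:      157
Date/Time:   03/30/2021 02:04:04
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
-------------------------------------------------------------------------------
Record:      158
Date/Time:   03/30/2021 02:04:04
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
"""


def parse_sel(text):
    """Parse getsel-style text into a list of dicts, one per record."""
    records = []
    for line in text.splitlines():
        # Split on the first colon only, so timestamps stay intact.
        key, sep, value = line.partition(":")
        key = key.strip()
        if not sep or key.startswith("-"):
            continue  # separator rows, "[..]" elisions, blank lines
        if key == "Record":
            records.append({})  # a "Record" line starts a new entry
        if records:
            records[-1][key] = value.strip()
    return records


def critical(records):
    """Keep only the entries whose Severity is exactly 'Critical'."""
    return [r for r in records if r.get("Severity") == "Critical"]


for rec in critical(parse_sel(SAMPLE)):
    print(rec["Record"], rec["Date/Time"], rec["Description"])
```

Starting a new entry on each "Record" line, rather than on the dashed separators, keeps the sketch tolerant of separator rows that carry stray terminal text.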

I'll leave the next steps to the Traffic team :)

Event Timeline

jijiki triaged this task as Medium priority. Tue, Mar 30, 7:19 AM

Seems ok for the ~14h it's been back online so far. I'm going to re-pool this and tentatively resolve the ticket hoping it's a fluke event, but not clear the SEL. If we get a recurrence, we'll re-open and kick this over to dcops.

BBlack claimed this task.

Mentioned in SAL (#wikimedia-operations) [2021-04-01T06:37:07Z] <elukey> powercycle cp1087 (no ssh, no tty via serial console) - T278729

elukey added a project: ops-eqiad.

It happened again. I just depooled and powercycled the host, and I'm going to add the ops-eqiad tag!

elukey removed BBlack as the assignee of this task. Thu, Apr 1, 6:38 AM
elukey added a subscriber: Cmjohnson.

Looks like a possible DIMM error. Since the server is already depooled, I will run a couple of tests to determine whether it's a DIMM, CPU, or motherboard issue.

Mentioned in SAL (#wikimedia-operations) [2021-04-08T16:16:46Z] <cmjohnson1> update bios cp1087, already depooled for h/w issues T278729

Updated the BIOS and submitted a Dell ticket: "You have successfully submitted request SR1056516502."