
cp1087 down with hardware issues
Closed, ResolvedPublic

Description

Hello!

I had to depool and powercycle cp1087: it was reported down by Icinga, and indeed neither ssh nor a tty on the mgmt serial console was available. This is the output of racadm getsel:

-------------------------------------------------------------------------------
Record:      146                                                                                               
Date/Time:   03/30/2021 03:00:44                                                                               
Source:      system                                                                                            
Severity:    Critical                                                                                          
Description: CPU 1 machine check error detected.                                                               
-------------------------------------------------------------------------------                                
Record:      147                                                                                               
Date/Time:   03/30/2021 03:00:44                                                                               
Source:      system                                                                                            
Severity:    Ok                                                                                                
Description: An OEM diagnostic event occurred.                                                                 
-------------------------------------------------------------------------------                                
[..]                                                         
-------------------------------------------------------------------------------                                
Record:      155                                                                                               
Date/Time:   03/30/2021 02:04:04                                                                               
Source:      system                                                                                            
Severity:    Ok                                                                                                
Description: A problem was detected related to the previous server boot.                                       
-------------------------------------------------------------------------------  
Record:      156                                                                                               
Date/Time:   03/30/2021 02:04:04                                                                               
Source:      system                                                                                          
Severity:    Critical                                                                                        
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A6.     
-------------------------------------------------------------------------------                              
Record:      157                                                                                             
Date/Time:   03/30/2021 02:04:04                                                                             
Source:      system                                                                                          
Severity:    Critical                                                                                        
Description: CPU 1 machine check error detected.                                                             
-------------------------------------------------------------------------------                              
Record:      158                                                                                             
Date/Time:   03/30/2021 02:04:04                                                                             
Source:      system                                                                                          
Severity:    Ok                                                                                              
Description: An OEM diagnostic event occurred.                                                               
-------------------------------------------------------------------------------                              
[..]
-------------------------------------------------------------------------------             
Record:      165                                                                                             
Date/Time:   03/30/2021 02:04:05                                                                             
Source:      system                                                                                          
Severity:    Ok                                                                                              
Description: An OEM diagnostic event occurred.
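
For anyone triaging a similar dump, the records above can be filtered down to the Critical ones with a small script. This is only a sketch: it assumes the `racadm getsel` output has been saved as text and that every record uses the `Record:` / `Date/Time:` / `Source:` / `Severity:` / `Description:` fields seen in this paste. The function names `parse_sel` and `critical_records` are made up for illustration and are not part of any WMF tooling.

```python
import re
from typing import Dict, List

# Field names as they appear in the SEL paste above (assumed to be the full set).
FIELD_RE = re.compile(r"^(Record|Date/Time|Source|Severity|Description):\s*(.*?)\s*$")


def parse_sel(text: str) -> List[Dict[str, str]]:
    """Split pasted SEL output into one dict per record.

    Separator lines, "[..]" elisions and pager residue are ignored
    because they do not match any known field name.
    """
    records: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line.strip())
        if not match:
            continue
        key, value = match.groups()
        # A new "Record:" line closes the previous record.
        if key == "Record" and current:
            records.append(current)
            current = {}
        current[key] = value
    if current:
        records.append(current)
    return records


def critical_records(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the records whose Severity is exactly "Critical"."""
    return [r for r in records if r.get("Severity") == "Critical"]
```

Calling `critical_records(parse_sel(text))` on the paste above should leave only the machine check and multi-bit memory error entries, which makes the DIMM_A6 record easier to spot among the OEM diagnostic events.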

I'll leave the next steps to the Traffic team :)

Event Timeline

jijiki triaged this task as Medium priority. Mar 30 2021, 7:19 AM

Seems ok for the ~14h it's been back online so far. I'm going to re-pool this and tentatively resolve the ticket hoping it's a fluke event, but not clear the SEL. If we get a recurrence, we'll re-open and kick this over to dcops.

BBlack claimed this task.

Mentioned in SAL (#wikimedia-operations) [2021-04-01T06:37:07Z] <elukey> powercycle cp1087 (no ssh, no tty via serial console) - T278729

elukey added a project: ops-eqiad.

Happened again, just depooled and powercycled, going to add the ops-eqiad tag!

elukey removed BBlack as the assignee of this task. Apr 1 2021, 6:38 AM
elukey added a subscriber: Cmjohnson.

Looks like a possible DIMM error. Since the server is already depooled, I will run a couple of tests to determine whether it's a DIMM, CPU or motherboard issue.

Mentioned in SAL (#wikimedia-operations) [2021-04-08T16:16:46Z] <cmjohnson1> update bios cp1087, already deposed for h/w issues T278729

Updated the BIOS and submitted a Dell ticket: "You have successfully submitted request SR1056516502."

Replaced CPU 1 and cleared the iDRAC log. Resolving; if the issue returns, please re-open.

To keep archives happy: repooled after a chat with Brandon :)

The issue came back, the host is down again :(

-------------------------------------------------------------------------------                                 
Record:      1019
Date/Time:   05/29/2021 14:37:53
Source:      system
Severity:    Ok
Description: The persistent correctable memory error rate is at normal levels for a memory device at location DIMM_A6.
-------------------------------------------------------------------------------                                 
Record:      1020
Date/Time:   05/29/2021 14:37:53
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------                                 
Record:      1021
Date/Time:   05/29/2021 14:37:53
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------                                 
Record:      1022
Date/Time:   05/29/2021 14:37:53
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------                                 
Record:      1023
Date/Time:   05/29/2021 14:37:53
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------                                 
Record:      1024
Date/Time:   05/29/2021 14:37:53
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
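
DIMM_A6 shows up both here and in the multi-bit error from March, so it can help to count which memory slots a SEL dump mentions. The following is a rough sketch, not an established tool: the `DIMM_<letter><number>` naming is assumed from the two records in this ticket, and `dimm_mentions` is a hypothetical helper name.

```python
import re
from collections import Counter

# Slot naming assumed from this ticket's records (e.g. "DIMM_A6").
DIMM_RE = re.compile(r"\bDIMM_[A-Z]\d+\b")


def dimm_mentions(sel_text: str) -> Counter:
    """Count how often each DIMM slot name appears in pasted SEL text."""
    return Counter(DIMM_RE.findall(sel_text))
```

A slot that keeps reappearing across separate incidents, as DIMM_A6 does in this task, is a reasonable first candidate to raise with dcops.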

Setting the host to inactive in conftool. I didn't powercycle it since it will probably require some dcops intervention (will defer the decision to the Traffic team).

@Cmjohnson: this is still happening unfortunately. The host is currently down and depooled, please feel free to try anything else that comes to mind. No heads-up needed.

@ema Looks to be a DIMM issue, submitted a ticket to Dell

You have successfully submitted request SR1061284651.

ema renamed this task from cp1087 powercycled to cp1087 down with hardware issues. Jun 2 2021, 7:44 AM

@ema
Replaced DIMM A6, powered on, and the replacement was recognized.
Message PR1: Replaced part detected for device: DDR4 DIMM(Socket A6).

Booted to the OS
Cleared the idrac log
Resolving this on my end. If you marked it as failed in Netbox, please update it when you add the host back to production.

Mentioned in SAL (#wikimedia-operations) [2021-06-03T18:37:35Z] <dzahn@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp1087.eqiad.wmnet with reason: replaced DIMM https://phabricator.wikimedia.org/T278729

Mentioned in SAL (#wikimedia-operations) [2021-06-03T18:37:39Z] <dzahn@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp1087.eqiad.wmnet with reason: replaced DIMM https://phabricator.wikimedia.org/T278729

Tentatively closing.

18:38 < icinga-wm> PROBLEM - Check systemd state on cp1087 is CRITICAL: CRITICAL - degraded: The following units failed: rsyslog.service,syslog.socket

https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state

rsyslogd was down because it was repeatedly segfaulting on startup. I was able to strace the failure and see that it kept segfaulting while reading one of its own files in /var/spool/rsyslog/, which was probably corrupted during a prior crash. Deleting the spool files let rsyslog start up properly, but I think at this point we're better off reimaging instead of waiting to find (or never find) some other, more subtle corruption.

Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts:

cp1087.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202106041952_bblack_19504_cp1087_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['cp1087.eqiad.wmnet']

and were ALL successful.