db2214 crashed
Closed, Resolved (Public)

Description

We got paged for this host today. It has been depooled, but according to Icinga the host is down. I have not investigated further beyond opening this task.

It seems to have recovered a few minutes later but the host has not been pooled again.

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2024-04-04T15:36:08Z] <sukhe@cumin2002> START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db2214.codfw.wmnet with reason: depooled, see T361851

Mentioned in SAL (#wikimedia-operations) [2024-04-04T15:36:23Z] <sukhe@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2214.codfw.wmnet with reason: depooled, see T361851

jcrespo renamed this task from "db2214 is down" to "db2214 crashed". Thu, Apr 4, 3:38 PM
jcrespo subscribed.

Looked like a host server crash.

Sadly, I was unable to log in over HTTP. This is the generic error I got on the command line:

--------------------------------------------------------------------------------                                              
SeqNumber       = 11317                                                                                                       
Message ID      = CTL129                                                                                                      
Category        = Storage                                                                                                     
AgentID         = iDRAC                                                                                                       
Severity        = Information                                                                                                 
Timestamp       = 2024-04-04 15:33:05                                                                                         
Message         = The boot media of the Controller RAID Controller in SL 3 is Disk.Virtual.239:RAID.SL.3-1.                   
Message Arg   1 = RAID Controller in SL 3                                                                                     
Message Arg   2 = Disk.Virtual.239:RAID.SL.3-1                                                                                
FQDD            = RAID.SL.3-1                                                                                                 
--------------------------------------------------------------------------------                                              
SeqNumber       = 11316                                                                                                       
Message ID      = SYS1003                                                                                                     
Category        = Audit                                                                                                       
AgentID         = DE                                                                                                          
Severity        = Information                                                                                                 
Timestamp       = 2024-04-04 16:29:32                                                                                         
Message         = System CPU Resetting.                                                                                       
FQDD            = iDRAC.Embedded.1#HostPowerCtrl                                                                              
--------------------------------------------------------------------------------                                              
SeqNumber       = 11315                                                                                                       
Message ID      = SYS1000                                                                                                     
Category        = Audit                                                                                                       
AgentID         = DE                                                                                                          
Severity        = Information                                                                                                 
Timestamp       = 2024-04-04 16:29:06                                                                                         
Message         = System is turning on.                                                                                       
FQDD            = iDRAC.Embedded.1#HostPowerCtrl                                                                              
--------------------------------------------------------------------------------                                              
SeqNumber       = 11314                                                                                                       
Message ID      = SWC5019                                                                                                     
Category        = System                                                                                                      
AgentID         = DE                                                                                                          
Severity        = Warning                                                                                                     
Timestamp       = 2024-04-04 16:29:00                                                                                         
Message         = Unable to authenticate the BIOS image file because:  Internal Errors: Bypassing bios verification and booting the host.                                                                                                                   
Message Arg   1 =  Internal Errors: Bypassing bios verification and booting the host                                          
--------------------------------------------------------------------------------                                              
SeqNumber       = 11313                                                                                                       
Message ID      = RAC0701                                                                                                     
Category        = Audit                                                                                                       
AgentID         = RACLOG                                                                                                      
Severity        = Information                                                                                                 
Timestamp       = 2024-04-04 16:27:58                                                                                         
Message         = Requested system powerup.                                                                                   
FQDD            = iDRAC.Embedded.1                                                                                            
--------------------------------------------------------------------------------                                              
SeqNumber       = 11312                                                                                                       
Message ID      = SYS1001                                                                                                     
Category        = Audit                                                                                                       
AgentID         = DE                                                                                                          
Severity        = Information                                                                                                 
Timestamp       = 2024-04-04 16:26:25                                                                                         
Message         = System is turning off.                                                                                      
FQDD            = iDRAC.Embedded.1#HostPowerCtrl                                                                              
--------------------------------------------------------------------------------                                              
SeqNumber       = 11311                                                                                                       
Message ID      = SYS1003                                                                                                     
Category        = Audit                                                                                                       
AgentID         = DE                                                                                                          
Severity        = Information                                                                                                 
Timestamp       = 2024-04-04 16:26:25                                                                                         
Message         = System CPU Resetting.                                                                                       
FQDD            = iDRAC.Embedded.1#HostPowerCtrl                                                                              
--------------------------------------------------------------------------------                                              
SeqNumber       = 11310                                                                                                       
Message ID      = NIC100                                                                                                      
Category        = System                                                                                                      
AgentID         = iDRAC                                                                                                       
Severity        = Warning                                                                                                     
Timestamp       = 2024-04-04 16:26:25                                                                                         
Message         = The Embedded NIC 1 Port 1 network link is down.                                                             
Message Arg   1 = Embedded NIC 1                                                                                              
Message Arg   2 = 1                                                                                                           
FQDD            = NIC.Embedded.1-1-1
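
For anyone triaging a dump like the one above, a minimal sketch of turning the dashed-separator `key = value` records into dictionaries and keeping only the `Warning` entries. The function names (`parse_records`, `warnings_only`) and the `SAMPLE_DUMP` constant are hypothetical, introduced here for illustration; the sample is two records copied from the log above, and the layout is assumed to be exactly what was pasted.

```python
import re

# A record boundary is a line made of dashes (trailing spaces allowed).
SEPARATOR = re.compile(r"^-{10,}\s*$")
# Fields look like "Message Arg   1 = value"; keys never contain "=".
FIELD = re.compile(r"^(?P<key>[^=]+?)\s*=\s?(?P<value>.*?)\s*$")

# Two records copied from the log pasted above.
SAMPLE_DUMP = """\
--------------------------------------------------------------------------------
SeqNumber       = 11314
Message ID      = SWC5019
Category        = System
AgentID         = DE
Severity        = Warning
Timestamp       = 2024-04-04 16:29:00
Message         = Unable to authenticate the BIOS image file because:  Internal Errors: Bypassing bios verification and booting the host.
Message Arg   1 =  Internal Errors: Bypassing bios verification and booting the host
--------------------------------------------------------------------------------
SeqNumber       = 11313
Message ID      = RAC0701
Category        = Audit
AgentID         = RACLOG
Severity        = Information
Timestamp       = 2024-04-04 16:27:58
Message         = Requested system powerup.
FQDD            = iDRAC.Embedded.1
"""


def parse_records(text):
    """Split a dashed-separator dump into a list of dicts, one per record."""
    records, current = [], {}
    for line in text.splitlines():
        if SEPARATOR.match(line):
            if current:
                records.append(current)
                current = {}
            continue
        match = FIELD.match(line)
        if match:
            # Collapse the padding inside keys such as "Message Arg   1".
            key = " ".join(match.group("key").split())
            current[key] = match.group("value").strip()
    if current:
        records.append(current)
    return records


def warnings_only(records):
    """Keep only the records whose Severity field is exactly 'Warning'."""
    return [r for r in records if r.get("Severity") == "Warning"]


if __name__ == "__main__":
    for record in warnings_only(parse_records(SAMPLE_DUMP)):
        print(record["Timestamp"], record["Message ID"], record["Message"])
```

Filtering on `Severity` is only a first pass: in the log above, the power-cycle events themselves (`SYS1001`, `SYS1003`) are logged as `Information`, so the full sequence still needs to be read in order.
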

ABran-WMF changed the task status from Open to In Progress. Fri, Apr 5, 8:06 AM
ABran-WMF triaged this task as Medium priority.

T361911 opened to bring back that server to production.

Change #1017081 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] mariadb: toggle notifications for db2214

https://gerrit.wikimedia.org/r/1017081

Server seems to handle CPU/RAM stress quite well. From those tests and the logs in T361851#9689410 it is not possible to identify what made the server crash.

Change #1017081 merged by Arnaudb:

[operations/puppet@production] mariadb: toggle notifications for db2214

https://gerrit.wikimedia.org/r/1017081

Jhancock.wm subscribed.

Here are some more logs.

		2024-04-04 16:29:06 	SYS1000 	System is turning on.	
		2024-04-04 16:29:00 	SWC5019 	Unable to authenticate the BIOS image file because:  Internal Errors: Bypassing bios verification and booting the host.	
		2024-04-04 16:27:58 	RAC0701 	Requested system powerup.	
		2024-04-04 16:26:25 	SYS1001 	System is turning off.	
		2024-04-04 16:26:25 	SYS1003 	System CPU Resetting.	
		2024-04-04 16:26:25 	NIC100 		The Embedded NIC 1 Port 1 network link is down.	
		2024-04-01 17:57:27 	SYS336 		An existing hash value is updated because some system configuration items are changed.
		2024-03-30 06:05:05 	CTL38 		The Patrol Read operation completed for RAID Controller in SL 3.	
		2024-03-30 03:59:38 	CTL37 		A Patrol Read operation started for RAID Controller in SL 3.	
		2024-03-30 03:59:38 	LOG007 		The previous log entry was repeated 5 times.	
		2024-03-25 16:26:12 	SYS336 		An existing hash value is updated because some system configuration items are changed.

The only thing I could find in the history was a few bus errors in late February, which is when I was updating and installing this server after it was racked. Those events stopped the same day and did not repeat.

		2024-02-28 01:35:16 	CPU9000 	An OEM diagnostic event occurred.	
		2024-02-28 01:35:14 	PCI1318 	A fatal error was detected on a component at bus 4 device 0 function 0.	
		2024-02-28 01:35:14 	NIC100 	The Embedded NIC 1 Port 1 network link is down.

		2024-02-27 19:13:40 	SYS1003 	System CPU Resetting.	
		2024-02-27 19:13:39 	NIC100 	The Embedded NIC 1 Port 1 network link is down.	
		2024-02-27 19:13:39 	PCI1318 	A fatal error was detected on a component at bus 4 device 0 function 0.
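
To check whether a given event recurs across excerpts like these, a rough sketch that groups the summary lines by message ID and lists the days on which each ID appears. It assumes the `timestamp, ID, message` layout pasted above; `days_by_message_id` and `SAMPLE_SUMMARY` are hypothetical names, and the sample lines are copied from this task.

```python
import re
from collections import defaultdict

# Matches lines such as "2024-02-27 19:13:39 <tab>PCI1318 <tab>A fatal error ..."
LINE = re.compile(
    r"^\s*(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2})\s+([A-Z]+\d+)\s+(.*?)\s*$"
)

# A few summary lines copied from the excerpts above.
SAMPLE_SUMMARY = "\n".join([
    "\t\t2024-04-04 16:26:25 \tSYS1003 \tSystem CPU Resetting.\t",
    "\t\t2024-02-28 01:35:14 \tPCI1318 \tA fatal error was detected on a component at bus 4 device 0 function 0.\t",
    "\t\t2024-02-27 19:13:40 \tSYS1003 \tSystem CPU Resetting.\t",
    "\t\t2024-02-27 19:13:39 \tPCI1318 \tA fatal error was detected on a component at bus 4 device 0 function 0.",
])


def days_by_message_id(text):
    """Map each message ID to the sorted list of days on which it was logged."""
    seen = defaultdict(set)
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            day, _time, message_id, _message = match.groups()
            seen[message_id].add(day)
    return {message_id: sorted(days) for message_id, days in seen.items()}


if __name__ == "__main__":
    for message_id, days in sorted(days_by_message_id(SAMPLE_SUMMARY).items()):
        print(message_id, ", ".join(days))
```

On the sample lines, `PCI1318` shows up only on the late-February dates, while `SYS1003` appears both then and on 2024-04-04.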

I've found more recent updates for the BIOS and iDRAC and applied them to the server.

Thanks @Jhancock.wm, I'll repool the server and we'll see how it goes.