db2110 crashed
Closed, ResolvedPublic

Description

[05:17:50] <+icinga-wm> PROBLEM - Host db2110 #page is DOWN: PING CRITICAL - Packet loss = 100%

Event Timeline

The uptime is 2:30, so the host itself got rebooted; this isn't just MySQL going down and paging.
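The uptime reasoning above can be sketched as a small check: subtract the uptime from the current time to get the boot time, then see whether that falls close to the alert. This is only an illustration. The function names, the 10-minute slack, the "now" timestamp, and the reading of "2:30" as 2 hours 30 minutes are all assumptions, not values from this task.

```python
from datetime import datetime, timedelta


def boot_time(now: datetime, uptime: timedelta) -> datetime:
    """Return the moment the host booted, given the current time and its uptime."""
    return now - uptime


def rebooted_around(alert: datetime, now: datetime, uptime: timedelta,
                    slack: timedelta = timedelta(minutes=10)) -> bool:
    """True if the host booted within `slack` of the alert time.

    `slack` is a hypothetical tolerance: monitoring notices the outage a
    little after the reset, so the two timestamps never match exactly.
    """
    return abs(boot_time(now, uptime) - alert) <= slack


if __name__ == "__main__":
    alert = datetime(2023, 5, 25, 5, 17, 50)   # time of the icinga page
    now = datetime(2023, 5, 25, 7, 46, 0)      # hypothetical time of the check
    uptime = timedelta(hours=2, minutes=30)    # "2:30", assumed to mean h:mm
    print(rebooted_around(alert, now, uptime))
```

A host where only the database process died would report an uptime far longer than the time since the alert, so the same check would return False.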

Yes, that's what I meant by crashed :)

I haven't been able to find anything on why this host crashed. However, this host is the candidate master for s4, so I am going to move that role to a different host just in case.

Definitely! Thanks! I am going to pick db2179

Change 923261 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Make db2179 candidate master for s4

https://gerrit.wikimedia.org/r/923261

Change 923261 merged by Marostegui:

[operations/puppet@production] mariadb: Make db2179 candidate master for s4

https://gerrit.wikimedia.org/r/923261

It looks like an IME exception:

LifecycleLog

2023-05-25 05:16:13  SYS1003  System CPU Resetting.
    Log Sequence Number: 267
    Detailed Description: System is performing a CPU reset because of system power off, power on or a warm reset like CTRL-ALT-DEL.
    Recommended Action: No response action is required.

2023-05-25 05:16:05  SYS1000  System is turning on.
    Log Sequence Number: 266
    Detailed Description: System is turning on.
    Recommended Action: No response action is required.

2023-05-25 05:16:02  PWR2271  The Intel Management Engine has encountered a Exception Event.
    Log Sequence Number: 265
    Detailed Description: The Intel Management Engine has encountered a Exception Event.
    Recommended Action: Perform an AC Cycle operation on the host server, and then update the BIOS firmware to the latest version. If the issue persists, contact your service provider. For information about recommended BIOS versions, see the BIOS documentation on the support site.

2023-05-25 05:15:54  SYS1001  System is turning off.
    Log Sequence Number: 264
    Detailed Description: System is turning off.
    Recommended Action: No response action is required.

2023-05-25 05:15:54  SYS1003  System CPU Resetting.
    Log Sequence Number: 263
    Detailed Description: System is performing a CPU reset because of system power off, power on or a warm reset like CTRL-ALT-DEL.
    Recommended Action: No response action is required.

2023-05-25 05:15:37  RAC0703  Requested system hardreset.
    Log Sequence Number: 262
    Detailed Description: Requested system hardreset.
    Recommended Action: No response action is required.

2023-05-25 05:15:16  CPU0000  Internal error has occurred check for additional logs.
    Log Sequence Number: 261
    Detailed Description: System event log and OS logs may indicate the source of the error.
    Recommended Action: Review System Event Log and Operating System Logs. These logs can help the user identify the possible issue that is producing the problem.

2023-05-04 09:42:46  SYS1003  System CPU Resetting.
    Log Sequence Number: 260
    Detailed Description: System is performing a CPU reset because of system power off, power on or a warm reset like CTRL-ALT-DEL.
    Recommended Action: No response action is required.

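The log is listed newest first, so the order of events is easier to follow once it is sorted by sequence number. Below is a minimal sketch that does this for the 2023-05-25 entries and filters out the routine power-state messages. The entries are copied from the log above. The `ROUTINE` set and the function names are assumptions made for illustration, not an official Dell classification.

```python
from datetime import datetime

# (sequence number, timestamp, message ID, message), copied from the log above.
ENTRIES = [
    (267, "2023-05-25 05:16:13", "SYS1003", "System CPU Resetting."),
    (266, "2023-05-25 05:16:05", "SYS1000", "System is turning on."),
    (265, "2023-05-25 05:16:02", "PWR2271",
     "The Intel Management Engine has encountered a Exception Event."),
    (264, "2023-05-25 05:15:54", "SYS1001", "System is turning off."),
    (263, "2023-05-25 05:15:54", "SYS1003", "System CPU Resetting."),
    (262, "2023-05-25 05:15:37", "RAC0703", "Requested system hardreset."),
    (261, "2023-05-25 05:15:16", "CPU0000",
     "Internal error has occurred check for additional logs."),
]

# Assumption: these IDs only report power-state transitions, not a cause.
ROUTINE = {"SYS1000", "SYS1001", "SYS1003"}


def timeline(entries):
    """Parse timestamps and return the entries oldest first.

    Sorting by sequence number rather than timestamp keeps the correct
    order for entries logged within the same second (263 and 264).
    """
    parsed = [
        (seq, datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), code, msg)
        for seq, ts, code, msg in entries
    ]
    return sorted(parsed, key=lambda e: e[0])


def notable(entries):
    """Return the chronological entries whose ID is not in ROUTINE."""
    return [e for e in timeline(entries) if e[2] not in ROUTINE]


def span_seconds(entries):
    """Seconds between the first and last entry of the incident."""
    ordered = timeline(entries)
    return (ordered[-1][1] - ordered[0][1]).total_seconds()


if __name__ == "__main__":
    for seq, ts, code, msg in notable(ENTRIES):
        print(seq, ts.time(), code, msg)
    print("incident span (s):", span_seconds(ENTRIES))
```

Read this way, the sequence is: a CPU internal error (261), a requested hard reset (262), and an Intel Management Engine exception (265) during the power cycle.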
Marostegui added subscribers: wiki_willy, Papaul.

@Papaul @wiki_willy this server is out of warranty, right? I don't know if there's much we can do about:

2023-05-25 05:16:13 	SYS1003 	System CPU Resetting.
wiki_willy added a subscriber: Jhancock.wm.

Hi @Marostegui - Papaul is on paternity leave for another week, so I'm going to pass this over to @Jhancock.wm to check out. The server is about 4yrs old, so it's out of warranty, but there might be parts that could be pulled from a decommissioned server if we're able to isolate the issue. Thanks, Willy

Yeah, I wonder if there's anything we can do to troubleshoot this from a hardware point of view.

@Marostegui I am looking for a suitable CPU replacement in our decommissioned servers. In the meantime, Log Event 265 recommends a BIOS update. The BIOS is very out of date on this one, and I am running that task now.

@Marostegui
The BIOS update is complete.
I found a suitable CPU replacement. Do we want to give that a try now, or see if the BIOS update did the trick?
LMK if you wanna swap and if it's safe to do so at this time. Thanks!

Let's go for the CPU swap too. You can do it anytime. The host isn't in use

I forgot to ask: was it CPU1 or CPU2 that was having the issue?

It doesn't say in the error:

	2023-05-25 05:16:13 	SYS1003 	System CPU Resetting.

I replaced both since we're not sure. The server has booted without issues. All components are green in the iDRAC dashboard. It's all yours now!

I do see some slight discoloration on the old CPU2. Not sure if it's from regular use or an undiagnosed issue.
I've put the old CPUs in the server with the tag 11V3DP2.

Thank you! I'll bring MariaDB up on Monday and leave it running for a few days before repooling it, to make sure everything is stable.

The host is repooled. Thanks for your help!