[05:17:50] <+icinga-wm> PROBLEM - Host db2110 #page is DOWN: PING CRITICAL - Packet loss = 100%
Description
Details
Project | Branch | Lines +/- | Subject
---|---|---|---
operations/puppet | production | +2 -2 | mariadb: Make db2179 candidate master for s4
Event Timeline
I haven't been able to find anything on why this host crashed. However, this host is the candidate master for s4, so I am going to move that role to a different host just in case.
Maybe this can be useful in picking the next candidate https://fault-tolerance.toolforge.org/map?cluster=s4
and https://fault-tolerance.toolforge.org/map?cluster=db-master-candidates
Change 923261 had a related patch set uploaded (by Marostegui; author: Marostegui):
[operations/puppet@production] mariadb: Make db2179 candidate master for s4
Change 923261 merged by Marostegui:
[operations/puppet@production] mariadb: Make db2179 candidate master for s4
It looks like an IME exception:
2023-05-25 05:16:13 SYS1003 System CPU Resetting. Log Sequence Number: 267 Detailed Description: System is performing a CPU reset because of system power off, power on or a warm reset like CTRL-ALT-DEL. Recommended Action: No response action is required.
2023-05-25 05:16:05 SYS1000 System is turning on. Log Sequence Number: 266 Detailed Description: System is turning on. Recommended Action: No response action is required.
2023-05-25 05:16:02 PWR2271 The Intel Management Engine has encountered a Exception Event. Log Sequence Number: 265 Detailed Description: The Intel Management Engine has encountered a Exception Event. Recommended Action: Perform an AC Cycle operation on the host server, and then update the BIOS firmware to the latest version. If the issue persists, contact your service provider. For information about recommended BIOS versions, see the BIOS documentation on the support site.
2023-05-25 05:15:54 SYS1001 System is turning off. Log Sequence Number: 264 Detailed Description: System is turning off. Recommended Action: No response action is required.
2023-05-25 05:15:54 SYS1003 System CPU Resetting. Log Sequence Number: 263 Detailed Description: System is performing a CPU reset because of system power off, power on or a warm reset like CTRL-ALT-DEL. Recommended Action: No response action is required.
2023-05-25 05:15:37 RAC0703 Requested system hardreset. Log Sequence Number: 262 Detailed Description: Requested system hardreset. Recommended Action: No response action is required.
2023-05-25 05:15:16 CPU0000 Internal error has occurred check for additional logs. Log Sequence Number: 261 Detailed Description: System event log and OS logs may indicate the source of the error. Recommended Action: Review System Event Log and Operating System Logs. These logs can help the user identify the possible issue that is producing the problem.
2023-05-04 09:42:46 SYS1003 System CPU Resetting. Log Sequence Number: 260 Detailed Description: System is performing a CPU reset because of system power off, power on or a warm reset like CTRL-ALT-DEL. Recommended Action: No response action is required.
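When a log like this is pasted as one run of text, it can help to split it back into entries before reading it. The following is a minimal sketch, not an official tool: it assumes every entry starts with a `YYYY-MM-DD HH:MM:SS` timestamp followed by a three-letter, four-digit message ID, as the entries in this task do. The names `LogEntry`, `parse_lifecycle_log`, and `SAMPLE` are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional


class LogEntry(NamedTuple):
    timestamp: str
    code: str
    message: str
    sequence: Optional[int]


# Assumption: an entry begins with a timestamp and an ID such as CPU0000.
ENTRY_START = re.compile(
    r"(?=\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [A-Z]{3}\d{4} )"
)
ENTRY = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<code>[A-Z]{3}\d{4}) (?P<body>.*)",
    re.S,
)
SEQUENCE = re.compile(r"Log Sequence Number: (\d+)")


def parse_lifecycle_log(text: str) -> List[LogEntry]:
    """Split a flattened log paste into one LogEntry per event."""
    entries = []
    # Splitting on a zero-width lookahead keeps each timestamp with its entry.
    for chunk in ENTRY_START.split(text):
        match = ENTRY.match(chunk.strip())
        if not match:
            continue  # skips the empty leading chunk and any stray text
        body = match.group("body")
        seq = SEQUENCE.search(body)
        # The short message is everything before the sequence-number field.
        message = body.split(" Log Sequence Number:")[0].strip()
        entries.append(
            LogEntry(
                match.group("ts"),
                match.group("code"),
                message,
                int(seq.group(1)) if seq else None,
            )
        )
    return entries


# Two entries copied from the log in this task, joined as in the paste.
SAMPLE = (
    "2023-05-25 05:15:37 RAC0703 Requested system hardreset. "
    "Log Sequence Number: 262 Detailed Description: Requested system "
    "hardreset. Recommended Action: No response action is required. "
    "2023-05-25 05:15:16 CPU0000 Internal error has occurred check for "
    "additional logs. Log Sequence Number: 261 Detailed Description: "
    "System event log and OS logs may indicate the source of the error."
)

if __name__ == "__main__":
    # Print oldest first so the sequence of events reads top to bottom.
    for entry in sorted(parse_lifecycle_log(SAMPLE), key=lambda e: e.sequence or 0):
        print(entry.sequence, entry.timestamp, entry.code, entry.message)
```

Sorting by sequence number puts the CPU0000 internal error ahead of the requested hard reset, which matches the order the events occurred in.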
@Papaul @wiki_willy this server is out of warranty right? I don't know if there's much we can do about
2023-05-25 05:16:13 SYS1003 System CPU Resetting.
Hi @Marostegui - Papaul is on paternity leave for another week, so I'm going to pass this over to @Jhancock.wm to check out. The server is about 4yrs old, so it's out of warranty, but there might be parts that could be pulled from a decommissioned server if we're able to isolate the issue. Thanks, Willy
Yeah, I wonder if there's anything we can do to troubleshoot this from a hardware point of view.
@Marostegui I am looking for a suitable CPU replacement in our decommissioned servers. In the meantime, Log Event 265 recommends a BIOS update. The BIOS is very out of date on this one, and I am running that update now.
@Marostegui
The BIOS update is complete.
I found a suitable CPU replacement. Do we want to give that a try now, or see if the BIOS update did the trick?
LMK if you wanna swap and if it's safe to do so at this time. thanks!
I replaced both CPUs since we're not sure which one is at fault. The server has booted without issues, and all components are green in the iDRAC dashboard. It's all yours now!
I do see some slight discoloration on the old CPU2. Not sure if it's from regular use or an undiagnosed issue.
I've put the old CPUs in the server with the tag 11V3DP2.
Thank you! I'll bring MariaDB up on Monday and leave it running for a few days before repooling it, to make sure everything is stable.