db1253 depooled following host crash
Closed, ResolvedPublic

Description

Crash:

	2026-03-13 17:53:52 	PWR2402 	iDRAC is unable to communicate with power management firmware.	
	
Log Sequence Number:
757
Detailed Description:
The iDRAC controller cannot communicate with the power management firmware due to a problem with the interface to the power management engine or with the power management engine itself. The system may operate in a performance degraded state.
Recommended Action:
Check the Lifecycle Controller Log (LC Log) and make sure that there are no subsequent log entries indicating that Communication with power management firmware has been restored. If this log entry is not present then do one of the following: 1) Disconnect system input power, wait one minute, reconnect system input power. 2) Re-flash system BIOS. 3) Upgrade system BIOS to the latest revision.
18:03 <elukey> !log powercycle db1253 - host not reachable via ssh, no events logged in racadm getsel, no console com2 available (blank screen)
18:06 <+logmsgbot> !log cgoubert@cumin1003 dbctl commit (dc=all): 'Depool db1253', diff saved to https://phabricator.wikimedia.org/P89856 and previous config saved to /var/cache/conftool/dbconfig/20260313-180640-cgoubert.json

Event Timeline

Nothing in racadm getsel, and nothing in mariadb's journalctl or dmesg, so I am inclined to mark this as a hardware-related stall (when I tried to connect via the serial console, console com2 was a blank screen).

Change #1252002 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Disable notifications for db1253

https://gerrit.wikimedia.org/r/1252002

The last log entry before the reboot was:

Mar 13 17:56:36 db1253 sshd[766141]: Connection closed by 208.80.154.78 port 60260 [preauth]

On Grafana, the last datapoints seem to be from 18:00.
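
To keep the sequence of events straight, the timestamps quoted in this task can be put on a single timeline. The following is only a rough sketch: it assumes that the iDRAC log, the host journal, Grafana and the IRC log all use the same clock and timezone, which has not been verified here, and it treats the Grafana and powercycle times as exact to the minute. The names `EVENTS` and `build_timeline` are made up for illustration.

```python
from datetime import datetime

# Timestamps copied from this task. Assumption: all sources share one
# clock/timezone; the 18:00 and 18:03 entries are only minute-accurate.
EVENTS = [
    ("iDRAC PWR2402 logged", "2026-03-13 17:53:52"),
    ("last sshd journal entry", "2026-03-13 17:56:36"),
    ("iDRAC root login over SSH", "2026-03-13 17:57:21"),
    ("last Grafana datapoints (approx.)", "2026-03-13 18:00:00"),
    ("powercycle logged on IRC", "2026-03-13 18:03:00"),
]


def build_timeline(events):
    """Sort events by time and compute seconds elapsed since the first one."""
    parsed = [
        (datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), label)
        for label, ts in events
    ]
    parsed.sort()
    first = parsed[0][0]
    return [
        (label, when, int((when - first).total_seconds()))
        for when, label in parsed
    ]


timeline = build_timeline(EVENTS)
for label, when, offset in timeline:
    print(f"{when:%H:%M:%S}  +{offset:>4}s  {label}")
```

This only orders the events; it says nothing about which one caused which, and any clock skew between the iDRAC and the host would shift the offsets.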

Logs extracted from the iDRAC show it losing connection with the power management firmware, but this should be an effect of the OS crashing rather than the cause:

SeqNumber       = 758
Message ID      = USR0030
Category        = Audit
AgentID         = RACLOG
Severity        = Information
Timestamp       = 2026-03-13 17:57:21
Message         = Successfully logged in using root, from 10.64.16.154 and SSH.
Message Arg   1 = root
Message Arg   2 = 10.64.16.154
Message Arg   3 = SSH
FQDD            = iDRAC.Embedded.1
--------------------------------------------------------------------------------
SeqNumber       = 757
Message ID      = PWR2402
Category        = System
AgentID         = iDRAC
Severity        = Critical
Timestamp       = 2026-03-13 17:53:52
Message         = iDRAC is unable to communicate with power management firmware.
FQDD            = iDRAC.Embedded.1
--------------------------------------------------------------------------------
SeqNumber       = 756
Message ID      = CTL38
Category        = Storage
AgentID         = iDRAC
Severity        = Information
Timestamp       = 2026-03-07 08:45:56
Message         = The Patrol Read operation completed for RAID Controller in SL 3.
Message Arg   1 = RAID Controller in SL 3
FQDD            = RAID.SL.3-1
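
For anyone who wants to filter a dump like the one above, here is a minimal parsing sketch. It assumes the `Key = Value` layout and the dashed separator lines exactly as pasted in this task; the real tool output may differ between firmware versions. The function name `parse_lclog` and the variable names are hypothetical, and the sample text is an abbreviated copy of two records from this task.

```python
SAMPLE = """\
SeqNumber       = 758
Message ID      = USR0030
Severity        = Information
Timestamp       = 2026-03-13 17:57:21
Message Arg   1 = root
--------------------------------------------------------------------------------
SeqNumber       = 757
Message ID      = PWR2402
Severity        = Critical
Timestamp       = 2026-03-13 17:53:52
Message         = iDRAC is unable to communicate with power management firmware.
"""


def parse_lclog(text):
    """Split a 'Key = Value' dump into one dict per record.

    Records are assumed to be separated by lines made only of dashes.
    """
    records = []
    current = {}
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and set(stripped) == {"-"}:
            # Separator line: close the current record, if any.
            if current:
                records.append(current)
                current = {}
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key/value line, skip it
        # Collapse padding so "Message Arg   1" becomes "Message Arg 1".
        current[" ".join(key.split())] = value.strip()
    if current:
        records.append(current)
    return records


records = parse_lclog(SAMPLE)
critical = [r for r in records if r.get("Severity") == "Critical"]
for r in critical:
    print(r["Timestamp"], r["Message ID"], r["Message"])
```

On the sample this leaves only the PWR2402 record, which matches what was read by eye from the full dump.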
FCeratto-WMF changed the task status from Open to In Progress. Mar 16 2026, 10:50 AM
FCeratto-WMF claimed this task.
FCeratto-WMF triaged this task as High priority.

DC-Ops: could you please check if everything is ok on the bios/firmware side and if there are hardware issues?

FCeratto-WMF changed the task status from In Progress to Open. Mar 16 2026, 11:05 AM
Ladsgroup subscribed.

We will still need to do work once it's been fixed, and we need to be kept informed of the progress. Putting back the DBA tag.

Mentioned in SAL (#wikimedia-operations) [2026-03-20T18:14:39Z] <jhathaway@cumin1003> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on db1253.eqiad.wmnet with reason: T420041

Adding the ops-eqiad tag and removing ops-eqdfw. @Jclark-ctr will take a look at it a bit later today.

Icinga downtime and Alertmanager silence (ID=5ce7e720-a20c-4ad5-a612-bbf5c41ccd0a) set by fceratto@cumin1003 for 14 days, 0:00:00 on 1 host(s) and their services with reason: Under repair

db1253.eqiad.wmnet

I physically checked the power cables: they are seated properly, nothing loose.

The iDRAC looked healthy. I went through and updated multiple firmware components:

800W Delta PSU: 00.1B.53 to 00.1B.BF
iDRAC: 5.4.0.0 to 7.30.10.50
BIOS: 1.15.2 to 1.20.2
Backplane: 7.16 to 7.16_A00_01

Thank you @Jclark-ctr - is there anything else to be done on your side or can I claim the task?

The BIOS required a second restart. It has just finished and should be good now. I double-checked the logs again just now and everything still looks good.

@FCeratto-WMF Feel free to message me if anything else is needed.

To summarize, no hardware fault was detected in the iDRAC, racadm, journald or dmesg logs.
After the maintenance freeze we could clone the host and repool it.
We could consider running CPU, memory and I/O stress tests in the meantime, just in case.

What are the crash symptoms? Even if there were no logs, how does the failure show up on graphs and on recovery? Does it reboot? Does it freeze and require a hard reboot? Did it kernel panic?

Technically there are logs (I've updated the header), but they are useless for us, as they are not specific enough. A few issues pointing to (but not necessarily caused by) the IME have happened in the past. There was never a clear explanation; it just ended up working after a few firmware updates.

I don't think it is the OS, but I would just reimage and pray it doesn't happen again.

AFAIK, based on the initial description, the host was not responding to SSH and was powercycled: "No events logged in racadm getsel, no console com2 available (blank screen)".
I'll clone after the freeze and we'll see if anything happens again in the future.

+1 let's clone and repool and let's see.

Preparing to clone from db1202 (db1202 is healthy and not a candidate)

Started cloning db1202.eqiad.wmnet to db1253.eqiad.wmnet - fceratto@cumin1003

Completed depooling of db1202 by fceratto@cumin1003: Depool db1202.eqiad.wmnet to then clone it to db1253.eqiad.wmnet - fceratto@cumin1003

Starting pool of db1202 by fceratto@cumin1003: Pool db1202.eqiad.wmnet in after cloning

Completed pooling of db1202 by fceratto@cumin1003: Pool db1202.eqiad.wmnet in after cloning

Starting pool of db1253 by fceratto@cumin1003: Pool db1253.eqiad.wmnet in after cloning

Completed pooling of db1253 by fceratto@cumin1003: Pool db1253.eqiad.wmnet in after cloning

Finished cloning db1202.eqiad.wmnet to db1253.eqiad.wmnet - fceratto@cumin1003

FCeratto-WMF moved this task from Pending comment to Done on the DBA board.

Pooled in without issues, closing.

Change #1252002 abandoned by Elukey:

[operations/puppet@production] Disable notifications for db1253

https://gerrit.wikimedia.org/r/1252002