cp2035 IPMI and management console issues
Closed, Resolved (Public)

Description

While debugging an IPMI alert on cp2035, I found that ipmiseld has been logging the following error since 2023-03-27 22:26:05:

Mar 27 22:26:05 cp2035 ipmiseld[1620416]: ipmi_sdr_cache_open: internal IPMI error

The management console also appears to be unreachable via SSH.
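
For anyone repeating this kind of check, here is a minimal sketch of how the first occurrence of the ipmiseld error could be located in a stream of syslog-style lines. The function name `first_ipmiseld_error` and the regular expression are illustrative assumptions, not part of any existing tooling; the line format is taken only from the single log line quoted above.

```python
import re

# Pattern modelled on the one log line quoted in this task; other syslog
# formats (e.g. ISO timestamps) would need a different expression.
IPMISELD_ERROR = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+ipmiseld\[(?P<pid>\d+)\]:\s+"
    r"(?P<message>.*internal IPMI error.*)$"
)


def first_ipmiseld_error(lines):
    """Return a dict describing the first matching line, or None if no line matches."""
    for line in lines:
        match = IPMISELD_ERROR.match(line.strip())
        if match:
            return match.groupdict()
    return None
```

The function accepts any iterable of strings, so it could be fed an open log file or the captured output of a journal query.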

Event Timeline

Vgutierrez triaged this task as Medium priority. Mar 28 2023, 9:04 AM

Mentioned in SAL (#wikimedia-operations) [2023-03-28T09:41:04Z] <vgutierrez> resetting cp2035 management card - T333312

Unable to reset the management card:

root@cp2035:~# bmc-device --cold-reset; echo $?
ipmi_cmd_cold_reset: driver timeout
1
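
The same attempt could be wrapped in a small script so that a hung or failing reset is reported cleanly rather than blocking a shell. This is only a sketch: the `cold_reset` helper and the 60-second timeout are assumptions, and the only command it relies on is the `bmc-device --cold-reset` invocation shown above.

```python
import subprocess


def cold_reset(command=("bmc-device", "--cold-reset"), timeout=60):
    """Run the reset command and return (ok, exit_code, output).

    exit_code is None when the command could not be started or was
    killed after the (assumed) timeout.
    """
    try:
        result = subprocess.run(
            list(command), capture_output=True, text=True, timeout=timeout
        )
    except FileNotFoundError:
        return False, None, "command not found"
    except subprocess.TimeoutExpired:
        return False, None, "timed out"
    # Combine both streams, since the tool may report errors on either one.
    output = (result.stdout + result.stderr).strip()
    return result.returncode == 0, result.returncode, output
```

In the situation described here, the call would be expected to return a failure with exit code 1 and the driver timeout message, which matches the manual attempt above.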

Icinga downtime and Alertmanager silence (ID=07b8190f-1479-43ea-ba98-63f852f30e9e) set by vgutierrez@cumin1001 for 2 days, 0:00:00 on 1 host(s) and their services with reason: HW issues

cp2035.codfw.wmnet
Jhancock.wm claimed this task.
Jhancock.wm subscribed.

Confirmed with Sukhe that the host was depooled. Worked remotely with Papaul to update the iDRAC and the BIOS.

Thanks @Jhancock.wm for the fix! I can confirm the issue on the host has been resolved. For posterity: repooling the host.