
an-presto1018.eqiad.wmnet: DRAC is down
Closed, Resolved (Public)

Description

Hello DC Ops,

an-presto1018.eqiad.wmnet is reachable by SSH, but I can't reach its DRAC from our cumin hosts. We need DRAC access so we can reimage the host back to Bullseye (see parent ticket for details). I tried resetting the DRAC from the OS with bmc-device --cold-reset, but that didn't help. Are you able to take a look? FWIW, this is low priority.

Feel free to ping me back on IRC (inflatador) if you need more info. Thanks for your help!

Event Timeline

bking updated the task description.

DC Ops,

Per the IRC conversation in the dc-ops channel, Cathal checked the network plumbing and everything looks good. Running bmc-info from the OS layer returns the correct network info:

System Firmware Version       : 1.14.1
System Name                   :
Primary Operating System Name :
Operating System Name         :
Present OS Version Number     :
BMC URL                       : https://10.65.4.61:443

So it is looking like a physical layer issue, unfortunately.
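
For anyone repeating this check, here is a rough Python sketch of the comparison described above: take the BMC URL that bmc-info reports from the OS and test whether that address accepts a TCP connection from another host. The helper names (parse_bmc_info, bmc_endpoint, tcp_reachable) are made up for illustration and are not part of any existing tooling; the sample text is the output quoted above. A failed connection only shows that the BMC is not answering over the network, it does not by itself prove a physical fault.

```python
import socket
from urllib.parse import urlparse

# Sample taken from the bmc-info output quoted in this comment.
SAMPLE = """\
System Firmware Version       : 1.14.1
System Name                   :
Primary Operating System Name :
Operating System Name         :
Present OS Version Number     :
BMC URL                       : https://10.65.4.61:443
"""


def parse_bmc_info(text):
    """Parse 'Key : Value' lines into a dict (hypothetical helper).

    Splits on the first colon only, so values that contain colons
    (such as the URL) are kept intact. Empty values become ''.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def bmc_endpoint(fields):
    """Return (host, port) from the 'BMC URL' field, defaulting to 443."""
    url = urlparse(fields["BMC URL"])
    return url.hostname, url.port or 443


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


fields = parse_bmc_info(SAMPLE)
host, port = bmc_endpoint(fields)
```

Calling tcp_reachable(host, port) from a host that should be able to reach the management network (an assumption about where this is run) would then separate "BMC has the right config but is unreachable" from "BMC is misconfigured".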

Icinga downtime and Alertmanager silence (ID=c79896bf-7b1a-4996-a194-5ddd94c51f42) set by stevemunene@cumin1002 for 10 days, 0:00:00 on 1 host(s) and their services with reason: Downtimed for further troubleshooting possible Hardware failure

an-presto1018.eqiad.wmnet
Gehel triaged this task as High priority. Fri, Nov 8, 2:23 PM

I think that this is fixed now. I'm able to reimage an-presto1018 and connect to a SOL session, so I think we're all good. Thanks for your help.

Maybe I spoke too soon. I've now hit this error twice during the reimage; it suggests a failure to pull the boot image over TFTP, or something similar.

image.png (726×1 px, 104 KB)

I'll check the NIC firmware version.

That worked, so we're all good.