cloudvirt1033 psu redundancy alert
Closed, Resolved, Public

Description

"Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical]"

cloudvirt1033 has had a failed PSU redundancy since 2020-09-17. A cable may have been knocked loose during unrelated on-site work.

Sensor Type : POWER
<Sensor Name>                   <Status>                 <Type>         
PS1 Status                      AC-Lost                  AC             
PS2 Status                      Present                  AC
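
For anyone scripting a check against output like the table above, here is a minimal sketch in Python. It assumes the column layout shown in this task (columns separated by runs of two or more spaces) and treats any status other than "Present" as a failure. Both are assumptions for illustration, not the logic of the Icinga check, and the function names are hypothetical.

```python
import re

# Sample copied from the task description above.
SAMPLE = """\
Sensor Type : POWER
<Sensor Name>                   <Status>                 <Type>
PS1 Status                      AC-Lost                  AC
PS2 Status                      Present                  AC
"""


def parse_power_sensors(text):
    """Return {sensor_name: status} from the sensor table text."""
    sensors = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, the "Sensor Type" banner, and the <...> header row.
        if not line or line.startswith("<") or line.startswith("Sensor Type"):
            continue
        # Assumption: columns are separated by two or more spaces.
        fields = re.split(r"\s{2,}", line)
        if len(fields) >= 2:
            sensors[fields[0]] = fields[1]
    return sensors


def failed_supplies(sensors):
    """List sensors whose status is not "Present" (assumed healthy value)."""
    return sorted(name for name, status in sensors.items() if status != "Present")


if __name__ == "__main__":
    print(failed_supplies(parse_power_sensors(SAMPLE)))
```

Run against the sample, this flags only PS1, which matches the AC-Lost reading in the description.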

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cloudvirt1033&service=IPMI+Sensor+Status

Once it is re-seated and the alert clears, this task can be resolved.

Event Timeline

wiki_willy added subscribers: Cmjohnson, wiki_willy.

Hi @Cmjohnson - based on Netbox, it looks like cloudvirt1033 is in rack C8, which wasn't part of the PDU upgrades these past weeks. Hopefully it's just a cable that got knocked loose at some point, and an easy fix. Thanks, Willy

This just alerted again. I'll downtime it if I can make my laptop work right.

Wait, the alert may have been the old acked alert re-alerting in VictorOps. I will resolve it in VictorOps. The alert is still red in Icinga, but it is acked, so it should not re-alert soon?

Confirmed that's what it was. I have resolved it so it doesn't do that again.

RobH renamed this task from cloudvirt1033 ipmi alert to cloudvirt1033 psu redundancy alert. Sep 22 2020, 4:45 PM
RobH triaged this task as Medium priority.
RobH updated the task description.
RobH moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.

I cleared the log on the iDRAC. I did have to remove power to it: rack C8 power was all messed up, nothing was correct, and moving a server into that rack required me to fix it immediately. Sometimes removing the redundant power will cause the alert. Let's see if clearing the log entry removes it completely.

The issue seems to have been resolved.

This is still red in Icinga: "Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical]"

I submitted a ticket with Dell for a new power supply.

You have successfully submitted request SR1038434287.

Dell denied my request for the part; somehow it was only ordering outside of the US. I will need to call them.

Called to open a ticket with Dell; they received the information and the TSR and are sending a new part.

New PSU arrived and swapped. System reports healthy.