After enabling PSU checks in T109903, multiple systems in eqiad D8 have ipmi Power_Supply alerts. Are the PDUs in this rack healthy?
Thanks @Cmjohnson! This cleared 4 of the open alerts.
Oddly there are 3 hosts in this rack with open power supply alerts still. Is there any indication of a problem on the servers physically?
analytics1035
analytics1036
analytics1037
Strange. Analytics1035 has cleared while the other two are still in critical state. IPMI sel shows a few recent power events. I wonder what's going on with them.
analytics1036:~# ipmi-sel
ID | Date | Time | Name | Type | Event
1 | May-03-2017 | 13:53:32 | SEL | Event Logging Disabled | Log Area Reset/Cleared
2 | Sep-19-2017 | 16:11:54 | Status | Power Supply | Power Supply input lost (AC/DC)
3 | Sep-19-2017 | 16:12:05 | Status | Power Supply | Power Supply Failure detected ; OEM Event Data2 code = 07h
4 | Oct-03-2017 | 18:32:29 | Status | Power Supply | Power Supply input lost (AC/DC)
analytics1037:~# ipmi-sel
ID | Date | Time | Name | Type | Event
1 | May-03-2017 | 13:52:52 | SEL | Event Logging Disabled | Log Area Reset/Cleared
2 | Sep-19-2017 | 16:10:49 | Status | Power Supply | Power Supply input lost (AC/DC)
3 | Sep-19-2017 | 16:11:02 | Status | Power Supply | Power Supply Failure detected ; OEM Event Data2 code = 07h
4 | Oct-03-2017 | 18:31:17 | Status | Power Supply | Power Supply input lost (AC/DC)
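For anyone who wants to pull just the power supply entries out of a SEL dump like the ones above, here is a minimal Python sketch. It assumes the pipe-separated layout shown in this task; the function names `parse_ipmi_sel` and `power_supply_events` are made up for illustration, and the sample text is the analytics1037 output copied from the comment above.

```python
from typing import Dict, List

# Sample copied from the analytics1037 ipmi-sel output in this task.
SAMPLE = """\
ID | Date | Time | Name | Type | Event
1 | May-03-2017 | 13:52:52 | SEL | Event Logging Disabled | Log Area Reset/Cleared
2 | Sep-19-2017 | 16:10:49 | Status | Power Supply | Power Supply input lost (AC/DC)
3 | Sep-19-2017 | 16:11:02 | Status | Power Supply | Power Supply Failure detected ; OEM Event Data2 code = 07h
4 | Oct-03-2017 | 18:31:17 | Status | Power Supply | Power Supply input lost (AC/DC)
"""


def parse_ipmi_sel(text: str) -> List[Dict[str, str]]:
    """Turn pipe-separated SEL text into one dict per event (hypothetical helper)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = [col.strip() for col in lines[0].split("|")]
    events = []
    for line in lines[1:]:
        # Limit the split so a "|" inside the Event text would stay intact.
        fields = [f.strip() for f in line.split("|", len(header) - 1)]
        if len(fields) == len(header):
            events.append(dict(zip(header, fields)))
    return events


def power_supply_events(events: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the entries whose Type column is 'Power Supply'."""
    return [e for e in events if e["Type"] == "Power Supply"]


if __name__ == "__main__":
    for event in power_supply_events(parse_ipmi_sel(SAMPLE)):
        print(event["Date"], event["Time"], event["Event"])
```

Running it against a saved dump from each host makes it easier to compare timestamps across analytics1036 and analytics1037.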
Odd, the racadm log (Dell's hardware log) shows that the power was restored,
and the physical connections show that there is power.
root@analytics1037.mgmt.eqiad.wmnet's password:
/admin1-> racadm getsel
Record: 1
Date/Time: 05/03/2017 13:52:52
Source: system
Severity: Ok
Description: Log cleared.
Record: 2
Date/Time: 09/19/2017 16:10:49
Source: system
Severity: Critical
Description: The power input for power supply 1 is lost.
Record: 3
Date/Time: 09/19/2017 16:11:02
Source: system
Severity: Critical
Description: Fan failure detected on power supply 1.
Record: 4
Date/Time: 10/03/2017 18:31:17
Source: system
Severity: Ok
Description: The input power for power supply 1 has been restored.
/admin1->
root@analytics1036.mgmt.eqiad.wmnet's password:
/admin1-> racadm getsel
Record: 1
Date/Time: 05/03/2017 13:53:32
Source: system
Severity: Ok
Description: Log cleared.
Record: 2
Date/Time: 09/19/2017 16:11:54
Source: system
Severity: Critical
Description: The power input for power supply 1 is lost.
Record: 3
Date/Time: 09/19/2017 16:12:05
Source: system
Severity: Critical
Description: Fan failure detected on power supply 1.
Record: 4
Date/Time: 10/03/2017 18:32:29
Source: system
Severity: Ok
Description: The input power for power supply 1 has been restored.
/admin1->
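The racadm records above use the same timestamps as the ipmi-sel entries, so it can help to read them programmatically and check the most recent record for a given PSU. The following is a rough Python sketch, assuming the "Key: value" record layout shown in this task; `parse_getsel` and `latest_record_for` are hypothetical names, and the sample is the analytics1036 output copied from above.

```python
from typing import Dict, List, Optional

# Sample copied from the analytics1036 "racadm getsel" output in this task.
SAMPLE = """\
/admin1-> racadm getsel
Record: 1
Date/Time: 05/03/2017 13:53:32
Source: system
Severity: Ok
Description: Log cleared.
Record: 2
Date/Time: 09/19/2017 16:11:54
Source: system
Severity: Critical
Description: The power input for power supply 1 is lost.
Record: 3
Date/Time: 09/19/2017 16:12:05
Source: system
Severity: Critical
Description: Fan failure detected on power supply 1.
Record: 4
Date/Time: 10/03/2017 18:32:29
Source: system
Severity: Ok
Description: The input power for power supply 1 has been restored.
/admin1->
"""

# Only these keys are treated as record fields; prompts and other lines are skipped.
FIELDS = {"Record", "Date/Time", "Source", "Severity", "Description"}


def parse_getsel(text: str) -> List[Dict[str, str]]:
    """Group 'Key: value' lines into one dict per record (hypothetical helper)."""
    records: List[Dict[str, str]] = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if not sep or key not in FIELDS:
            continue
        if key == "Record":
            records.append({})  # a "Record:" line starts a new entry
        if records:
            records[-1][key] = value.strip()
    return records


def latest_record_for(records: List[Dict[str, str]], needle: str) -> Optional[Dict[str, str]]:
    """Return the last record whose description mentions `needle`, if any."""
    matches = [r for r in records if needle.lower() in r.get("Description", "").lower()]
    return matches[-1] if matches else None


if __name__ == "__main__":
    last = latest_record_for(parse_getsel(SAMPLE), "power supply 1")
    if last:
        print(last["Date/Time"], last["Severity"], last["Description"])
```

This only reports what the hardware log says; it does not explain why the Icinga check still shows the hosts as critical.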
@herron The server is out of warranty, but I took a PSU from a decom server and replaced psu1 on analytics1037, and I no longer see the error. Take a look and, if all is okay, please resolve this task.
@Cmjohnson, both analytics1036 and analytics1037 are still showing PSU redundancy errors. analytics1035 is fine now, though.
They all have the same problem. I swapped PSUs on both an1036 and an1037 yesterday, but they still show the failure. The replacement PSUs are failing after they are installed. These servers are now out of warranty.
Replaced both PSUs in analytics1037; the PSU in an1036 cleared the error and is functioning normally as far as I can tell.