
Multiple servers in eqiad D8 showing PSU failures
Closed, Resolved (Public)

Description

After enabling PSU checks in T109903, multiple systems residing in eqiad D8 have IPMI Power_Supply alerts. Are the PDUs in this rack healthy?

Event Timeline

faidon renamed this task from Multiple servers in equad D8 showing PSU failures to Multiple servers in eqiad D8 showing PSU failures. Oct 2 2017, 3:42 PM
faidon assigned this task to Cmjohnson.
faidon triaged this task as High priority.
faidon added a project: ops-eqiad.

//cc @jcrespo db1102 (sanitarium host) and es1019 (a slave) are there

The breaker had tripped for one of the phases on side A. I have reset it.

Thanks @Cmjohnson! This cleared 4 of the open alerts.

Oddly, 3 hosts in this rack still have open power supply alerts. Is there any indication of a physical problem on the servers?

analytics1035
analytics1036
analytics1037

@herron all power supplies are working for those 3 hosts

Strange. analytics1035 has cleared while the other two are still in critical state. The IPMI SEL shows a few recent power events. I wonder what's going on with them.

analytics1036:~# ipmi-sel
ID  | Date        | Time     | Name             | Type                     | Event
1   | May-03-2017 | 13:53:32 | SEL              | Event Logging Disabled   | Log Area Reset/Cleared
2   | Sep-19-2017 | 16:11:54 | Status           | Power Supply             | Power Supply input lost (AC/DC)
3   | Sep-19-2017 | 16:12:05 | Status           | Power Supply             | Power Supply Failure detected ; OEM Event Data2 code = 07h
4   | Oct-03-2017 | 18:32:29 | Status           | Power Supply             | Power Supply input lost (AC/DC)
analytics1037:~# ipmi-sel
ID  | Date        | Time     | Name             | Type                     | Event
1   | May-03-2017 | 13:52:52 | SEL              | Event Logging Disabled   | Log Area Reset/Cleared
2   | Sep-19-2017 | 16:10:49 | Status           | Power Supply             | Power Supply input lost (AC/DC)
3   | Sep-19-2017 | 16:11:02 | Status           | Power Supply             | Power Supply Failure detected ; OEM Event Data2 code = 07h
4   | Oct-03-2017 | 18:31:17 | Status           | Power Supply             | Power Supply input lost (AC/DC)
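
For anyone comparing these logs across hosts, here is a minimal sketch of pulling the power supply events out of pasted `ipmi-sel` output. It assumes the pipe-separated column layout shown above; the names `parse_ipmi_sel`, `power_supply_events` and `IPMI_SEL_SAMPLE` are hypothetical and not part of any tooling used in this task.

```python
from datetime import datetime

# Sample copied from the analytics1036 output in this task.
IPMI_SEL_SAMPLE = """\
analytics1036:~# ipmi-sel
ID  | Date        | Time     | Name             | Type                     | Event
1   | May-03-2017 | 13:53:32 | SEL              | Event Logging Disabled   | Log Area Reset/Cleared
2   | Sep-19-2017 | 16:11:54 | Status           | Power Supply             | Power Supply input lost (AC/DC)
3   | Sep-19-2017 | 16:12:05 | Status           | Power Supply             | Power Supply Failure detected ; OEM Event Data2 code = 07h
4   | Oct-03-2017 | 18:32:29 | Status           | Power Supply             | Power Supply input lost (AC/DC)
"""


def parse_ipmi_sel(text):
    """Parse the pipe-separated table into a list of dicts (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|", 5)]
        # Skip the shell prompt, the header row, and anything malformed.
        if len(fields) != 6 or not fields[0].isdigit():
            continue
        rec_id, date, time, name, sensor_type, event = fields
        rows.append({
            "id": int(rec_id),
            # %b is locale-dependent; this assumes English month abbreviations.
            "when": datetime.strptime(f"{date} {time}", "%b-%d-%Y %H:%M:%S"),
            "name": name,
            "type": sensor_type,
            "event": event,
        })
    return rows


def power_supply_events(rows):
    """Keep only the rows whose sensor type is 'Power Supply'."""
    return [r for r in rows if r["type"] == "Power Supply"]


if __name__ == "__main__":
    for r in power_supply_events(parse_ipmi_sel(IPMI_SEL_SAMPLE)):
        print(r["when"].isoformat(), r["event"])
```

The parser only reads the text as printed. It does not query the BMC, so it cannot tell more about an event than the `Event` column already says.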

Odd, the racadm log (Dell's hardware log) shows that the power was restored,
and the physical connections show that there is power.

root@analytics1037.mgmt.eqiad.wmnet's password:
/admin1-> racadm getsel
Record: 1
Date/Time: 05/03/2017 13:52:52
Source: system
Severity: Ok
Description: Log cleared.

Record: 2
Date/Time: 09/19/2017 16:10:49
Source: system
Severity: Critical
Description: The power input for power supply 1 is lost.

Record: 3
Date/Time: 09/19/2017 16:11:02
Source: system
Severity: Critical
Description: Fan failure detected on power supply 1.

Record: 4
Date/Time: 10/03/2017 18:31:17
Source: system
Severity: Ok
Description: The input power for power supply 1 has been restored.

/admin1->

root@analytics1036.mgmt.eqiad.wmnet's password:
/admin1-> racadm getsel
Record: 1
Date/Time: 05/03/2017 13:53:32
Source: system
Severity: Ok
Description: Log cleared.

Record: 2
Date/Time: 09/19/2017 16:11:54
Source: system
Severity: Critical
Description: The power input for power supply 1 is lost.

Record: 3
Date/Time: 09/19/2017 16:12:05
Source: system
Severity: Critical
Description: Fan failure detected on power supply 1.

Record: 4
Date/Time: 10/03/2017 18:32:29
Source: system
Severity: Ok
Description: The input power for power supply 1 has been restored.

/admin1->
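
To check what the Dell log last recorded for a given PSU, a rough sketch like the following could turn pasted `racadm getsel` output into records. It assumes the `Key: value` layout shown above; `parse_racadm_sel`, `last_psu_record` and `RACADM_SAMPLE` are hypothetical names, and the sample is a shortened copy of the analytics1037 log.

```python
# Keys that appear in each record of the pasted log.
FIELDS = ("Record", "Date/Time", "Source", "Severity", "Description")

# Records 2-4 from the analytics1037 output in this task.
RACADM_SAMPLE = """\
/admin1-> racadm getsel
Record: 2
Date/Time: 09/19/2017 16:10:49
Source: system
Severity: Critical
Description: The power input for power supply 1 is lost.

Record: 3
Date/Time: 09/19/2017 16:11:02
Source: system
Severity: Critical
Description: Fan failure detected on power supply 1.

Record: 4
Date/Time: 10/03/2017 18:31:17
Source: system
Severity: Ok
Description: The input power for power supply 1 has been restored.
/admin1->
"""


def parse_racadm_sel(text):
    """Group 'Key: value' lines into one dict per record (hypothetical helper)."""
    records = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        # Ignore prompts, blank lines, and any line with an unknown key.
        if not sep or key not in FIELDS:
            continue
        if key == "Record":
            records.append({})
        if records:
            records[-1][key] = value.strip()
    return records


def last_psu_record(records, psu="power supply 1"):
    """Return the most recent record whose description mentions the PSU."""
    matches = [r for r in records if psu in r.get("Description", "")]
    return matches[-1] if matches else None


if __name__ == "__main__":
    last = last_psu_record(parse_racadm_sel(RACADM_SAMPLE))
    print(last["Date/Time"], last["Severity"], last["Description"])
```

The substring match on `power supply 1` is deliberately simple. It would need tightening on a chassis with ten or more supplies.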

I wonder if the ipmi issue is related to fan failure on the power supply?

@herron The server is out of warranty, but I took a PSU from a decom server and replaced PSU 1 on analytics1037, and I no longer see the error. Take a look and, if all is okay, please resolve this task.

@Cmjohnson, both analytics1036 and analytics1037 are still showing PSU redundancy errors. analytics1035 is fine now, though.

They all have the same problem. I swapped the PSUs on both an1036 and an1037 yesterday, but they still show the failure. The replacement PSUs fail once they are installed. These servers are now out of warranty.

@herron @faidon I updated the firmware on both servers and the issue has been resolved.

Replaced both PSUs in analytics1037; the PSU in an1036 cleared the error and is functioning normally as far as I can tell.