
Check analytics1037 power supply status
Closed, Resolved (Public)

Description

elukey@analytics1037:~$ sudo ipmi-sel
ID  | Date        | Time     | Name             | Type                     | Event
1   | Oct-18-2017 | 14:51:40 | SEL              | Event Logging Disabled   | Log Area Reset/Cleared
2   | Oct-20-2017 | 15:45:40 | Status           | Power Supply             | Power Supply input lost (AC/DC)
3   | Oct-20-2017 | 15:45:50 | Status           | Power Supply             | Power Supply input lost (AC/DC)
4   | Oct-20-2017 | 15:46:21 | Status           | Power Supply             | Power Supply input lost (AC/DC)
5   | Oct-20-2017 | 15:46:31 | Status           | Power Supply             | Power Supply input lost (AC/DC)

Event Timeline

Icinga is reporting it as critical:

Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical, Status = Critical]
'Inlet Temp'=21.00;3.00:42.00;-7.00:47.00 'Exhaust Temp'=45.00;8.00:70.00;3.00:75.00 'Temp'=64.00 'Temp'=55.00
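To double-check that the temperature readings are not what trips the alert, the perfdata line above can be parsed and compared against its own thresholds. This is only a sketch: it assumes the usual Nagios-style `'label'=value;warn;crit` layout and handles only plain `low:high` ranges. The helper names `parse_perfdata` and `outside` are made up for illustration.

```python
import re

def parse_perfdata(line):
    """Parse Nagios-style perfdata into (label, value, warn_range, crit_range) tuples."""
    results = []
    for label, rest in re.findall(r"'([^']+)'=(\S+)", line):
        fields = rest.split(";")
        value = float(fields[0])
        # Thresholds are optional; the two bare 'Temp' readings carry none.
        warn = fields[1] if len(fields) > 1 and fields[1] else None
        crit = fields[2] if len(fields) > 2 and fields[2] else None
        results.append((label, value, warn, crit))
    return results

def outside(value, rng):
    """True if value lies outside a plain 'low:high' range (assumed format)."""
    if rng is None:
        return False
    low, high = (float(x) for x in rng.split(":"))
    return value < low or value > high

perfdata = ("'Inlet Temp'=21.00;3.00:42.00;-7.00:47.00 "
            "'Exhaust Temp'=45.00;8.00:70.00;3.00:75.00 "
            "'Temp'=64.00 'Temp'=55.00")

for label, value, warn, crit in parse_perfdata(perfdata):
    if outside(value, crit):
        state = "CRITICAL"
    elif outside(value, warn):
        state = "WARNING"
    else:
        state = "OK"
    print(f"{label}: {value} -> {state}")
```

With the values pasted above, every temperature stays inside its range, which is consistent with the Critical state coming from the Power_Supply sensors only.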

I've ack'ed the alarm with the link to this task.

I am not sure what else can be done here. I’ve replaced the PSUs twice,
upgraded f/w and they continue to burn through the fans. This group of
servers is almost a year out of warranty.

I replaced them with used power supplies. If you want to try new ones, we will need
to request the purchase of a new part in a procurement task.

@Cmjohnson let's order new PSUs if possible, we are not planning to replace this hardware soon :(

Cmjohnson added a subscriber: RobH.

This server is out of warranty by 6 months. Assigning to @RobH to determine whether we should order a new one... probably two.

@Cmjohnson: Do we have any power supplies on already decommissioned hardware that would fit in the system with the failed power supply?

I think replacing bad power supplies on out-of-warranty servers is likely a waste of money (as other parts will also go bad on older systems); however, I've emailed Dell asking for the price per power supply. (I will update this task when they reply with a quote. Please note the quote task will have to be a procurement subtask, as we don't discuss pricing or quotations in public tasks.)

Ideally we use out of warranty decommissioned hardware, and pull power supplies from those.

On the server, the LED indicator for the power supply is green and not showing any signs of a problem. I even removed power from each side to ensure that the second PSU would keep the server powered on, and the test worked without any downtime. The racadm log does not show any sign of a problem, and in my opinion the racadm software is a better indicator of h/w issues on the server than ipmi. I have pasted the racadm log and the ipmi log. I think ipmi may be giving a false alarm. The lost/restored log entries are from me removing the PSUs.

/admin1-> racadm getsel
Record: 1
Date/Time: 10/18/2017 14:51:40
Source: system
Severity: Ok

Description: Log cleared.

Record: 2
Date/Time: 10/20/2017 15:45:40
Source: system
Severity: Critical

Description: The power input for power supply 1 is lost.

Record: 3
Date/Time: 10/20/2017 15:45:50
Source: system
Severity: Ok

Description: The input power for power supply 1 has been restored.

Record: 4
Date/Time: 10/20/2017 15:46:21
Source: system
Severity: Critical

Description: The power input for power supply 2 is lost.

Record: 5
Date/Time: 10/20/2017 15:46:31
Source: system
Severity: Ok

Description: The input power for power supply 2 has been restored.

cmjohnson@analytics1037:~$ sudo ipmi-sel
ID  | Date        | Time     | Name             | Type                     | Event
1   | Oct-18-2017 | 14:51:40 | SEL              | Event Logging Disabled   | Log Area Reset/Cleared
2   | Oct-20-2017 | 15:45:40 | Status           | Power Supply             | Power Supply input lost (AC/DC)
3   | Oct-20-2017 | 15:45:50 | Status           | Power Supply             | Power Supply input lost (AC/DC)
4   | Oct-20-2017 | 15:46:21 | Status           | Power Supply             | Power Supply input lost (AC/DC)
5   | Oct-20-2017 | 15:46:31 | Status           | Power Supply             | Power Supply input lost (AC/DC)
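For anyone comparing the two logs by hand, here is a small sketch that lines them up by timestamp. The rows are copied from the pastes above; the helper names `index_by_time` and `correlate` are hypothetical. It only illustrates that every ipmi-sel entry carries the same "input lost" text, while racadm distinguishes lost from restored at the same timestamps. Why ipmi-sel prints them identically is not established in this task.

```python
from datetime import datetime

# (timestamp, description) rows copied from the ipmi-sel and racadm getsel pastes.
ipmi_sel = [
    ("Oct-20-2017 15:45:40", "Power Supply input lost (AC/DC)"),
    ("Oct-20-2017 15:45:50", "Power Supply input lost (AC/DC)"),
    ("Oct-20-2017 15:46:21", "Power Supply input lost (AC/DC)"),
    ("Oct-20-2017 15:46:31", "Power Supply input lost (AC/DC)"),
]
racadm_sel = [
    ("10/20/2017 15:45:40", "The power input for power supply 1 is lost."),
    ("10/20/2017 15:45:50", "The input power for power supply 1 has been restored."),
    ("10/20/2017 15:46:21", "The power input for power supply 2 is lost."),
    ("10/20/2017 15:46:31", "The input power for power supply 2 has been restored."),
]

def index_by_time(rows, fmt):
    """Map each parsed timestamp to its log text."""
    return {datetime.strptime(ts, fmt): text for ts, text in rows}

def correlate(ipmi, racadm):
    """Pair each ipmi-sel entry with the racadm entry at the same second, if any."""
    return [(ts, ipmi[ts], racadm.get(ts)) for ts in sorted(ipmi)]

ipmi = index_by_time(ipmi_sel, "%b-%d-%Y %H:%M:%S")
racadm = index_by_time(racadm_sel, "%m/%d/%Y %H:%M:%S")

for ts, ipmi_text, racadm_text in correlate(ipmi, racadm):
    print(ts, "|", ipmi_text, "|", racadm_text)
```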

Tried to check the ipmi command that the icinga check calls:

elukey@analytics1037:/var/log$ sudo ipmimonitoring -v  | grep -i power
83  | PS Redundancy    | Power Supply             | Critical | N/A        | N/A   | 'Redundancy Lost'
84  | Status           | Power Supply             | Critical | N/A        | N/A   | 'Presence detected' 'Power Supply Failure detected'
85  | Status           | Power Supply             | Critical | N/A        | N/A   | 'Presence detected' 'Power Supply Failure detected'
87  | Power Optimized  | OEM Reserved             | N/A      | N/A        | N/A   | 'OEM Event = 0001h'
136 | Power Cable      | Cable/Interconnect       | Nominal  | N/A        | N/A   | 'Cable/Interconnect is connected'
138 | Power Cable      | Cable/Interconnect       | Nominal  | N/A        | N/A   | 'Cable/Interconnect is connected'
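To pick out just the sensors that would make the check go critical, the output above can be filtered with a few lines of Python. This is a rough sketch, not what the Icinga check actually does: the column names are an assumption (the header line was dropped by the grep), and `parse_rows` and `failing_power_supplies` are made-up helper names.

```python
# Assumed column order; the header row is not visible in the grep output above.
COLUMNS = ("id", "name", "type", "state", "reading", "units", "event")

def parse_rows(output):
    """Split pipe-delimited sensor rows into dicts, skipping anything malformed."""
    rows = []
    for line in output.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) == len(COLUMNS) and fields[0].isdigit():
            rows.append(dict(zip(COLUMNS, fields)))
    return rows

def failing_power_supplies(rows):
    """Return Power Supply sensors whose state is neither Nominal nor N/A."""
    return [r for r in rows
            if r["type"] == "Power Supply" and r["state"] not in ("Nominal", "N/A")]

sample = """\
83  | PS Redundancy    | Power Supply             | Critical | N/A        | N/A   | 'Redundancy Lost'
84  | Status           | Power Supply             | Critical | N/A        | N/A   | 'Presence detected' 'Power Supply Failure detected'
85  | Status           | Power Supply             | Critical | N/A        | N/A   | 'Presence detected' 'Power Supply Failure detected'
87  | Power Optimized  | OEM Reserved             | N/A      | N/A        | N/A   | 'OEM Event = 0001h'
136 | Power Cable      | Cable/Interconnect       | Nominal  | N/A        | N/A   | 'Cable/Interconnect is connected'
138 | Power Cable      | Cable/Interconnect       | Nominal  | N/A        | N/A   | 'Cable/Interconnect is connected'
"""

for row in failing_power_supplies(parse_rows(sample)):
    print(row["id"], row["name"], row["state"], row["event"])
```

On the sample above this lists sensors 83, 84 and 85, matching the three Critical entries in the Icinga message.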

I tried the --flush-cache option just in case, but it didn't help. I'm not sure if there is another way to figure out whether ipmi is returning a false alarm, but given the status of the host (OOW and has undergone maintenance several times) I'd say that it is fine to close this task. The host is a regular Hadoop worker node without any special capabilities, so even if it goes down abruptly it will not cause any harm.

It seems odd that the hardware says it's fine, but the software check doesn't. I'd rather we not close the task if it's showing the alarm, but leave it open and stalled with the server slated for decommission. (Having systems throwing errors but not slated for decom seems to be asking for trouble.)

elukey changed the task status from Open to Stalled. Jan 2 2018, 4:07 PM
RobH triaged this task as Low priority. Feb 8 2018, 7:07 PM

This hasn't recurred in a very long time (not once since this task was created), so resolving.

This server is going to be decommed very soon (OOW), and I acked the alarm a long time ago to avoid it spamming us. Good to close in my opinion, +1