
Rack A2's hosts alarm for PSU broken
Closed, Resolved (Public)

Description

Hi!

Today is PSU failure day: after kafka1013, I found the following broken PSUs:

  • an-worker1078
-------------------------------------------------------------------------------
Record:      2
Date/Time:   01/03/2019 14:40:12
Source:      system
Severity:    Critical
Description: The power input for power supply 1 is lost.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   01/03/2019 14:40:13
Source:      system
Severity:    Critical
Description: Power supply redundancy is lost.
-------------------------------------------------------------------------------
  • an-worker1079
-------------------------------------------------------------------------------
Record:      2
Date/Time:   01/03/2019 14:39:02
Source:      system
Severity:    Critical
Description: Power supply redundancy is lost.
-------------------------------------------------------------------------------
Record:      3
Date/Time:   01/03/2019 14:39:02
Source:      system
Severity:    Critical
Description: The power input for power supply 2 is lost.
-------------------------------------------------------------------------------

Other hosts in A2 are reporting the same issue, so it is likely a rack-level problem :)

Affected hosts:

  • an-worker1078
  • an-worker1079
  • cloudelastic1001
  • db1082
  • db1107
  • ms-be1044
  • ms-be1045
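As a rough illustration of the reasoning above (several hosts in one rack losing a power input within about a minute of each other points at the rack rather than at individual PSUs), here is a minimal Python sketch that parses event-log text in the layout pasted in this task. The function names, the 5-minute window, and the MM/DD/YYYY reading of the timestamps (chosen because this task was filed on Jan 3 2019) are assumptions for illustration, not an existing tool.

```python
import re
from datetime import datetime

# Matches one record in the layout pasted above (Record / Date/Time / Source /
# Severity / Description). The field layout is taken from this task only.
RECORD_RE = re.compile(
    r"Record:\s+(?P<record>\d+)\s*\n"
    r"Date/Time:\s+(?P<when>[^\n]+?)\s*\n"
    r"Source:\s+(?P<source>[^\n]+?)\s*\n"
    r"Severity:\s+(?P<severity>[^\n]+?)\s*\n"
    r"Description:\s+(?P<description>[^\n]+)"
)
INPUT_LOST_RE = re.compile(r"power input for power supply (\d+) is lost", re.IGNORECASE)


def parse_sel(text):
    """Parse pasted event-log text into a list of dicts."""
    events = []
    for m in RECORD_RE.finditer(text):
        events.append({
            "record": int(m.group("record")),
            # Assumption: timestamps are MM/DD/YYYY.
            "when": datetime.strptime(m.group("when"), "%m/%d/%Y %H:%M:%S"),
            "severity": m.group("severity"),
            "description": m.group("description").strip(),
        })
    return events


def lost_inputs(events):
    """Return the sorted PSU numbers whose power input was reported lost."""
    found = set()
    for event in events:
        m = INPUT_LOST_RE.search(event["description"])
        if m:
            found.add(int(m.group(1)))
    return sorted(found)


def looks_rack_level(events_by_host, window_seconds=300):
    """True if two or more hosts lost a power input within the given window."""
    first_loss = []
    for events in events_by_host.values():
        times = [e["when"] for e in events if INPUT_LOST_RE.search(e["description"])]
        if times:
            first_loss.append(min(times))
    if len(first_loss) < 2:
        return False
    return (max(first_loss) - min(first_loss)).total_seconds() <= window_seconds


# Sample input copied from the two logs in the description.
SEL_1078 = """Record:      2
Date/Time:   01/03/2019 14:40:12
Source:      system
Severity:    Critical
Description: The power input for power supply 1 is lost.
Record:      3
Date/Time:   01/03/2019 14:40:13
Source:      system
Severity:    Critical
Description: Power supply redundancy is lost.
"""
SEL_1079 = """Record:      2
Date/Time:   01/03/2019 14:39:02
Source:      system
Severity:    Critical
Description: Power supply redundancy is lost.
Record:      3
Date/Time:   01/03/2019 14:39:02
Source:      system
Severity:    Critical
Description: The power input for power supply 2 is lost.
"""

by_host = {
    "an-worker1078": parse_sel(SEL_1078),
    "an-worker1079": parse_sel(SEL_1079),
}
print(looks_rack_level(by_host))
```

The check only compares timestamps; it does not know which PDU tower each supply is cabled to, so a positive result is a hint to go look at the rack, not a diagnosis.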

Event Timeline

elukey triaged this task as High priority. Jan 3 2019, 3:14 PM
elukey created this task.

Judging by icinga, there are a few other hosts with PS alerts, all in A2. I suspect it has to do with one of the rack PDUs itself.

cloudelastic1001
db1082
db1107
ms-be1044
ms-be1045
elukey renamed this task from PSU broken on two Analytics Hadoop Workers to Rack A2's hosts alarm for PSU broken. Jan 3 2019, 3:20 PM
elukey updated the task description.

I initially replaced the fuse on the wrong side and caused an outage. I then replaced the fuses on the correct phase, but power was not restored. I tried replacing them both a second time and still nothing. I am out of spare fuses. I do have a spare PDU, but we want to replace the PDU anyway and there is an order request for a new set.

  1. Do we want to leave these servers with non-redundant power until we can replace the PDU with a new one, which should be ordered soon?
  2. Do you want me to replace the PDU with the spare, even though we will ultimately have to change it again in the near future?

So the PDU order is on T210776, but it had a mistake: Dell listed the SMART PDUs, not the SWITCHED ones, and I did not catch it. I've sent it back to Dell to get it corrected, and once we have the timeline I think we'll know whether we want to go with 1 or 2.

Since the A2 PDU currently has 24 ports per tower and we're having issues, I think a swap is in order. Whether we use a new PDU or the old spare will be answered later today, when we have a timeline for the order on T210776.

I am creating a subtask to fix db1082, which may have to be reimaged because of the power loss.

^CC @Marostegui so you know why db1082 + db1124 + labsdb replication (s5) are broken or stopped

So I asked for an update on the quote on T210776 and there is nothing yet. Dell acknowledges they received the request and are working on it.

If we do not have a quote back today, I'd recommend swapping this PDU out for one of the spare 48 port PDUs Chris mentioned he has.

I rebuilt db1082. We are not a blocker for any maintenance on those servers, but we would prefer to stop mysql if there is a chance the server will lose power. A power loss does not cause any user-visible outage, but recovering a pooled server is very time consuming for us, while depooling and stopping it takes very little time.

The fuse for the PDU on A2 has been replaced; all power is restored.