lvs4002 power supply failure
Closed, ResolvedPublic

Description

ulsfo had a power strip go bad, and it turns out that either lvs4002 or cp4008 is responsible.

This task will track the repair/replacement of the power supply in lvs4002. When the UL techs were plugging it back in (after replacing the bad power strip in the cabinet), lvs4002 popped and then went out. It was likely a secondary failure, with cp4008 causing the outage, but it's hard to be certain.

RobH created this task.Nov 22 2016, 12:09 AM
Restricted Application added subscribers: Southparkfan, Aklapper. Nov 22 2016, 12:09 AM

Mentioned in SAL (#wikimedia-operations) [2016-11-22T00:45:04Z] <bblack> cr[12]-ulsfo - added metric 15 to lvs4002 in policy LVS_import - T151273

Mentioned in SAL (#wikimedia-operations) [2016-11-22T00:47:29Z] <bblack> cr[12]-ulsfo - added metric 15 to lvs4002 in policy LVS_import (for real this time) - T151273

BBlack added a subscriber: BBlack.Nov 22 2016, 12:49 AM

Above was this on both ulsfo routers:

set policy-options policy-statement LVS_import term lvs4002_T151273 from protocol bgp neighbor 10.128.0.12
set policy-options policy-statement LVS_import term lvs4002_T151273 then metric add 15 
insert policy-options policy-statement LVS_import term lvs4002_T151273 before term service_IPs

Existing metric was +10 for secondaries, so +15 here makes lvs4002 less-preferable than lvs4004, and we can just remove the term when this is fixed.
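
The effect of the added metric can be illustrated with a small sketch. It assumes a simplified model in which the routers pick the neighbor with the lowest resulting metric for a given service IP and every other BGP attribute ties, and that lvs4002 is the primary for the service in question. The base metric of 0 and the function name `preferred` are hypothetical; only the +10 and +15 offsets come from the comment above.

```python
# Simplified model: among otherwise-equal BGP routes, the lowest metric wins.
# BASE = 0 is an assumption for illustration; +10 and +15 are from the task.

def preferred(routes):
    """Return the name of the neighbor whose route has the lowest metric."""
    return min(routes, key=routes.get)

BASE = 0  # assumed metric before any policy term is applied

# Normal state: lvs4002 is primary (no offset), lvs4004 is secondary (+10).
before = {"lvs4002": BASE, "lvs4004": BASE + 10}

# With the lvs4002_T151273 term: routes learned from lvs4002 get +15.
after = {"lvs4002": BASE + 15, "lvs4004": BASE + 10}

print(preferred(before))
print(preferred(after))
```

Under these assumptions, lvs4004's +10 beats lvs4002's +15, so traffic shifts to lvs4004 until the term is removed.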

BBlack added a subscriber: faidon.

This system is out of warranty.

RobH added a comment.Nov 28 2016, 3:55 PM

Since both this system and cp4008, whose power supply recently died, are out of warranty, the current plan is to steal the other power supply from cp4008 to replace the bad one in lvs4002.

faidon assigned this task to RobH.Jan 9 2017, 1:18 AM

@RobH what's the status of this?

RobH added a comment.Jan 13 2017, 8:04 PM

I've not gone on-site to do this yet; it seemed lower priority than my other tasks at the time.

I plan to drive to ULSFO sometime next week to swap power supplies around and take cp4008 offline entirely.

Alternatively, I could steal power supplies from other cp systems and have two cp systems without redundant power, rather than one fewer cp system overall.
@faidon: Is there a preference? Are we better off losing a cp system entirely, or having two run with non-redundant power?

RobH closed this task as Resolved.Jan 20 2017, 7:33 PM

Ok, it seems better to have two cp systems with a single PSU each than lose one entirely. I double checked and @BBlack agreed with that.

So now cp4012 has one fewer power supply (it was located at the top of the stack, so it was the easiest one to remove a PSU from without touching other systems) and lvs4002 has redundant power.

I'm going to note in racktables that cp4012 has a problem and is down a single PSU.

RobH reopened this task as Open.Jan 24 2017, 5:21 PM

I'm reopening this.

lvs4002 had a power supply fail again, in the exact same PSU slot that died before, PSU2.

I had taken another power supply out of cp4012 and moved it into lvs4002. Today lvs4002 had another power failure, again on PSU2. So either the PDU has a bad power port, or something is wrong with the system board/power controller board in lvs4002. Since the system is out of warranty, our only repair option is replacing the PSU. Since that didn't work last time, I'd recommend that we replace the lvs system entirely.

This last power outage on the PDU (the PDU is provided by UnitedLayer) seems to have been triggered by lvs4002's second PSU failure, which tripped the breaker on the PDU, as well as the power panel behind the PDU.

@BBlack suggests that it could be the PDU causing the issues. I'm not sure how we would go about testing that. If we replace the PSU on lvs4002 a second time, I'll swap the power cable and the power port it plugs into on the PDU. If we get another failure then, it's less likely to be the PDU failing on multiple power ports (imo).

RobH added a comment.Jan 24 2017, 5:29 PM

So when we get the replacement power supplies mentioned on T156154, we should swap the power ports used by lvs4002 with those of another system. Then if the other system has a PSU failure, we'll know it's the PDU's power port, not our PSUs.

RobH added a comment.Jan 24 2017, 6:11 PM

I asked Chris if we had any decommissioned R620s in eqiad so we can steal power supplies, but we do not.

We do not have any decommissioned R620s in eqiad.

RobH added a comment.EditedFeb 16 2017, 8:47 PM

Ok, I took a redundant supply from cp4007 and installed it into lvs4002's power supply 2 slot. Less than a minute later, the system killed the new power supply.

Record:      1022
Date/Time:   02/16/2017 20:41:59
Source:      system
Severity:    Ok
Description: The power supplies are redundant.
-------------------------------------------------------------------------------
Record:      1023
Date/Time:   02/16/2017 20:42:04
Source:      system
Severity:    Critical
Description: The power input for power supply 2 is lost.
-------------------------------------------------------------------------------
Record:      1024
Date/Time:   02/16/2017 20:42:14
Source:      system
Severity:    Critical
Description: Power supply redundancy is lost.
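
The log excerpt above can be checked mechanically. The following is a minimal sketch, assuming the record layout shown (one `key: value` pair per line, records separated by a dashed line) and month/day/year timestamps; the function name `parse_records` is hypothetical and not part of any vendor tool.

```python
from datetime import datetime

LOG = """\
Record:      1022
Date/Time:   02/16/2017 20:41:59
Source:      system
Severity:    Ok
Description: The power supplies are redundant.
-------------------------------------------------------------------------------
Record:      1023
Date/Time:   02/16/2017 20:42:04
Source:      system
Severity:    Critical
Description: The power input for power supply 2 is lost.
-------------------------------------------------------------------------------
Record:      1024
Date/Time:   02/16/2017 20:42:14
Source:      system
Severity:    Critical
Description: Power supply redundancy is lost.
"""

def parse_records(text):
    """Split the log into one dict per record, keyed by field name."""
    records, current = [], {}
    for line in text.splitlines():
        # A dashed separator or blank line ends the current record.
        if line.startswith("-") or not line.strip():
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first colon only, so timestamps stay intact.
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records

FMT = "%m/%d/%Y %H:%M:%S"  # assumed month/day/year, matching the comment date
records = parse_records(LOG)
critical = [r for r in records if r["Severity"] == "Critical"]

# Time from "redundant" (record 1022) to the first critical event (record 1023).
redundant_at = datetime.strptime(records[0]["Date/Time"], FMT)
failed_at = datetime.strptime(critical[0]["Date/Time"], FMT)
seconds_to_failure = (failed_at - redundant_at).total_seconds()
print(seconds_to_failure)
```

By the timestamps in the excerpt, the input for power supply 2 was lost 5 seconds after the log recorded the supplies as redundant.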

I'd advise we phase out lvs4002, since it's out of warranty and has killed 3 power supplies (one of its own, and then two that we stole from cp systems to fix lvs4002).

RobH reassigned this task from RobH to BBlack.Feb 17 2017, 9:45 PM

I'm assigning this task to Brandon for followup. In IRC, we discussed that he would likely fail ulsfo over to a 3 lvs system setup. I'm not sure if there is a tracking task associated with that or not. If you have another task for that, please reference it here, as I'll make it a blocker for a new task for decommissioning lvs4002.

Once that is done, a task should be created to decommission lvs4002, referencing this task (and whatever task is used to track the migration of ulsfo to the 3 lvs system setup).

BBlack moved this task from Triage to Hardware on the Traffic board.
BBlack closed this task as Resolved.Oct 23 2017, 4:21 PM

At this point, we'll just do the new 3-server setup on the new lvs400[567] systems in T178436 and ignore this until decom, basically.