cr2-eqdfw: PEM 1 Input Voltage Out Of Range flapping
Closed, Resolved, Public

Description

Since around Oct 16th, the following alarm has been flapping on cr2-eqdfw:

Alarm set: PS SFXPC color=RED, class=CHASSIS, reason=PEM 1 Input Voltage Out Of Range
Alarm cleared: PS SFXPC color=RED, class=CHASSIS, reason=PEM 1 Input Voltage Out Of Range

My guess is that the PEM should be replaced, so maybe it's as easy as a JTAC case, but they might put the blame on the power feed. In that case is it possible to check the power feed?

That should be treated as a loss of redundancy as this PEM can't be trusted anymore.
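
To put a number on "flapping", the set/cleared pairs quoted above can be counted from collected log lines. This is a minimal sketch, assuming the log lines keep exactly the `Alarm set:` / `Alarm cleared:` ... `reason=` shape shown in this description; `count_flaps` and `ALARM_RE` are hypothetical names for illustration, not part of any existing tooling.

```python
import re
from collections import Counter

# Matches the "Alarm set"/"Alarm cleared" lines quoted in the description.
# Assumption: the reason is the last field on the line.
ALARM_RE = re.compile(r"Alarm (?P<action>set|cleared): .*?reason=(?P<reason>.+)$")


def count_flaps(lines):
    """Return {reason: number of completed set -> cleared cycles}."""
    active = set()      # reasons currently in the "set" state
    flaps = Counter()   # completed cycles per reason
    for line in lines:
        m = ALARM_RE.search(line.strip())
        if not m:
            continue
        reason = m.group("reason")
        if m.group("action") == "set":
            active.add(reason)
        elif reason in active:
            # A "cleared" only counts if we saw the matching "set" first.
            active.discard(reason)
            flaps[reason] += 1
    return dict(flaps)


sample = [
    "Alarm set: PS SFXPC color=RED, class=CHASSIS, reason=PEM 1 Input Voltage Out Of Range",
    "Alarm cleared: PS SFXPC color=RED, class=CHASSIS, reason=PEM 1 Input Voltage Out Of Range",
]
print(count_flaps(sample))
```

A "cleared" line without a preceding "set" is ignored, so a log window that starts mid-cycle does not inflate the count.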

Event Timeline

ayounsi triaged this task as Medium priority. Oct 21 2021, 11:35 AM
ayounsi created this task.

I spent too long trying to find how to monitor the supply voltage but it doesn't seem to be possible?

The PSU output voltage does show in "show chassis env pem", on the MX204s anyway, but I can't see that this is available via SNMP. JUNIPER-MIB::jnxOperatingFRUPower returns 0 for the two PEM modules.

Be nice to have these stats graphed but it doesn't seem possible :(
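
If SNMP does not expose these values, one possible workaround is to scrape the CLI text and feed the numbers to whatever does the graphing. The following is only a sketch, assuming the `PEM N status:` / `DC Output` layout seen in the outputs pasted in this task; `parse_pem_dc_output` and the field names are hypothetical. Note that it reads the DC output side, not the input voltage that the alarm is about.

```python
import re

PEM_HEADER_RE = re.compile(r"^PEM (\d+) status:")
# Assumed column order, taken from the "DC Output" header row in the CLI text.
DC_FIELDS = ("voltage_v", "current_a", "power_w", "load_pct")


def parse_pem_dc_output(text):
    """Map PEM slot number -> dict of DC output readings found in the text."""
    readings = {}
    slot = None
    expect_values = False
    for line in text.splitlines():
        stripped = line.strip()
        header = PEM_HEADER_RE.match(stripped)
        if header:
            slot = int(header.group(1))
            expect_values = False
        elif stripped.startswith("DC Output"):
            # The numeric row follows directly after this header row.
            expect_values = slot is not None
        elif expect_values:
            values = [float(v) for v in stripped.split()]
            # zip() stops at the shorter sequence, so a truncated row
            # yields only the fields that are actually present.
            readings[slot] = dict(zip(DC_FIELDS, values))
            expect_values = False
    return readings


sample = """\
PEM 0 status:
  State                      Online
  DC Output           Voltage(V) Current(A)  Power(W)  Load(%)
                        11.00       7             77       11
PEM 1 status:
  State                      Online
  DC Output           Voltage(V) Current(A)  Power(W)  Load(%)
                        11.00       7
"""
print(parse_pem_dc_output(sample))
```

The second PEM in the sample is deliberately cut short to show that an incomplete paste is parsed as far as it goes rather than padded with guessed values.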

papaul@cr2-eqdfw> show chassis environment
Power PEM 0                          OK         35 degrees C / 95 degrees F
      PEM 1                          OK         32 degrees C / 89 degrees F
PEM 0 status:
  State                      Online
  Airflow                    Front to Back
  Temperature                OK   35 degrees C / 95 degrees F
  Temperature                OK   32 degrees C / 89 degrees F
  Firmware version           00.05
  Fan Sensor                 5940 RPM
  DC Output           Voltage(V) Current(A)  Power(W)  Load(%)
                        11.00       7             77       11
PEM 1 status:
  State                      Online
  Airflow                    Front to Back
  Temperature                OK   32 degrees C / 89 degrees F
  Temperature                OK   33 degrees C / 91 degrees F
  Firmware version           00.05
  Fan Sensor                 5940 RPM
  DC Output           Voltage(V) Current(A)  Power(W)  Load(%)
                        11.00       7

It was maybe just a temporary power feed issue. I will check the router again next week and see if all looks good.

@ayounsi I checked the status of both PEMs today and they look good to me. Do you want to close the task?

PEM 0 status:
  State                      Online
  Airflow                    Front to Back
  Temperature                OK   35 degrees C / 95 degrees F
  Temperature                OK   33 degrees C / 91 degrees F
  Firmware version           00.05
  Fan Sensor                 6000 RPM
  DC Output           Voltage(V) Current(A)  Power(W)  Load(%)
                        11.00       7             77       11
PEM 1 status:
  State                      Online
  Airflow                    Front to Back
  Temperature                OK   33 degrees C / 91 degrees F
  Temperature                OK   34 degrees C / 93 degrees F
  Firmware version           00.05
  Fan Sensor                 5910 RPM
  DC Output           Voltage(V) Current(A)  Power(W)  Load(%)
                        11.00       7             77       11

Looks like it's still alerting:

cr2-eqdfw> show system alarms 
1 alarms currently active
Alarm time               Class  Description
2021-10-21 22:36:50 UTC  Major  PEM 1 Input Voltage Out Of Range

I'd suggest following up with JTAC.
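
Since the PEM status can read "OK" while the alarm is still active, a check on the "show system alarms" output is the more reliable signal. A rough sketch of turning that output into records, assuming rows are formatted like the one pasted above with a UTC timestamp; `parse_system_alarms` is a hypothetical helper:

```python
import re
from datetime import datetime, timezone

# Assumption: each alarm row is "<date> <time> UTC  <class>  <description>".
ALARM_ROW_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) UTC\s+(?P<cls>\S+)\s+(?P<desc>.+)$"
)


def parse_system_alarms(text):
    """Return a list of dicts, one per alarm row found in the CLI text."""
    alarms = []
    for line in text.splitlines():
        m = ALARM_ROW_RE.match(line.strip())
        if not m:
            continue  # skips the count line and the column header
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
        alarms.append(
            {
                "time": ts.replace(tzinfo=timezone.utc),
                "class": m.group("cls"),
                "description": m.group("desc").strip(),
            }
        )
    return alarms


sample = """\
1 alarms currently active
Alarm time               Class  Description
2021-10-21 22:36:50 UTC  Major  PEM 1 Input Voltage Out Of Range
"""
for alarm in parse_system_alarms(sample):
    print(alarm["time"].isoformat(), alarm["class"], alarm["description"])
```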

Case opened with Juniper
case #: 2021-1025-348302

Hi Papaul,


Please let me know if we can have someone re-seating the PEM 1.  Let me know if re-seat clears the alarm. If this does not clear the alarm, then please provide requested details to process the RMA.

Equinix Ticket #1-213247924142 submitted to reseat the power supply. Thanks, Willy

Re-seating PEM1 by remote hands didn't clear the alarm. I will get back to Juniper to request an RMA.

Replacement part shipped. RMA below

Your replacement part associated with RMA R200378121 Item # 100 has been successfully shipped. Details of which are provided below.

Replacing PEM1 didn't clear the alarm, so PEM1 is not the issue. I unplugged PEM1 and plugged it into the same PDU as PEM0, and that cleared the alarm. So I think either the PDU that PEM1 is plugged into is bad, or something is wrong with where the PDU itself is plugged in (Equinix side).

The 2 PDUs are not tracked in Netbox. I found the setup task for eqord and eqdfw where the PDUs are listed, and the task where the PDUs were ordered. Please see below:
https://phabricator.wikimedia.org/T91077
https://rt.wikimedia.org/Ticket/Display.html?id=8761

RobH mentioned this in Unknown Object (Task). Nov 1 2021, 11:59 PM

I created Order Number 1-213500699180 to ask Equinix to check the PDU and the power to that PDU, and to let us know whether those PDUs belong to us.

Below is the reply from Equinix:

Site engineers have determined both power supplies are online from the original source and the PDUs belong to Wikimedia.

After putting in the new PDUs, we still have the same problem.

I created Order # 1-214270167279 to be on site next week on the 14th at 3:00 PM and meet with the Equinix smart hands tech, so we can do the troubleshooting while I am on site.

There was a breaker problem. This is now resolved.