
cr2-esams FPC 0 is dead
Closed, ResolvedPublic

Description

cr2-esams' FPC 0 failed today:

Apr  5 06:32:01  re0.cr2-esams chassisd[1614]: CHASSISD_FASIC_HSL_LINK_ERROR: Fchip (CB 0, ID 0): link 32 failed because of crc errors
Apr  5 06:32:01  re0.cr2-esams chassisd[1614]: New CRC errors found on xfchip 0 plane 0 subport 32 xfport 8 new_count 65535 aggr_count 65535
Apr  5 06:32:01  re0.cr2-esams chassisd[1614]: CHASSISD_FASIC_HSL_LINK_ERROR: Fchip (CB 0, ID 0): link 33 failed because of crc errors
Apr  5 06:32:01  re0.cr2-esams chassisd[1614]: Link failure happened for DPC0 PFE0
Apr  5 06:32:01  re0.cr2-esams chassisd[1614]: CHASSISD_SNMP_TRAP7: SNMP trap generated: fabric plane check (jnxFruContentsIndex 12, jnxFruL1Index 1, jnxFruL2Index 0, jnxFruL3Index 0, jnxFruName CB 0, jnxFruType 5, jnxFruSlot 0)
[...]
Apr  5 06:32:07  re0.cr2-esams craftd[1616]: Minor alarm cleared, Check CB 0 Fabric Chip 0
Apr  5 06:32:07  re0.cr2-esams chassisd[1614]: CHASSISD_SNMP_TRAP7: SNMP trap generated: fabric plane online (jnxFruContentsIndex 12, jnxFruL1Index 1, jnxFruL2Index 0, jnxFruL3Index 0, jnxFruName CB 0, jnxFruType 5, jnxFruSlot 0)
Apr  5 06:32:07  re0.cr2-esams alarmd[1788]: Alarm cleared: CB color=YELLOW, class=CHASSIS, reason=Check CB 0 Fabric Chip 1
Apr  5 06:32:07  re0.cr2-esams chassisd[1614]: CHASSISD_SNMP_TRAP7: SNMP trap generated: fabric plane online (jnxFruContentsIndex 12, jnxFruL1Index 1, jnxFruL2Index 0, jnxFruL3Index 0, jnxFruName CB 0, jnxFruType 5, jnxFruSlot 0)
Apr  5 06:32:07  re0.cr2-esams alarmd[1788]: Alarm cleared: CB color=YELLOW, class=CHASSIS, reason=Check CB 1 Fabric Chip 0
Apr  5 06:32:07  re0.cr2-esams chassisd[1614]: CHASSISD_SNMP_TRAP7: SNMP trap generated: fabric plane online (jnxFruContentsIndex 12, jnxFruL1Index 2, jnxFruL2Index 0, jnxFruL3Index 0, jnxFruName CB 1, jnxFruType 5, jnxFruSlot 1)
Apr  5 06:32:07  re0.cr2-esams alarmd[1788]: Alarm cleared: CB color=YELLOW, class=CHASSIS, reason=Check CB 1 Fabric Chip 1
Apr  5 06:32:07  re0.cr2-esams chassisd[1614]: CHASSISD_SNMP_TRAP7: SNMP trap generated: fabric plane online (jnxFruContentsIndex 12, jnxFruL1Index 2, jnxFruL2Index 0, jnxFruL3Index 0, jnxFruName CB 1, jnxFruType 5, jnxFruSlot 1)
Apr  5 06:32:07  re0.cr2-esams craftd[1616]: Minor alarm cleared, Check CB 0 Fabric Chip 1
Apr  5 06:32:07  re0.cr2-esams /kernel: pfe_listener_disconnect: conn dropped: listener idx=0, tnpaddr=0x10000080, reason: offlined by chassisd
Apr  5 06:32:07  re0.cr2-esams craftd[1616]: Minor alarm cleared, Check CB 1 Fabric Chip 0
Apr  5 06:32:07  re0.cr2-esams craftd[1616]: Minor alarm cleared, Check CB 1 Fabric Chip 1
Apr  5 06:32:07  re0.cr2-esams chassisd[1614]: CHASSISD_SNMP_TRAP10: SNMP trap generated: Fru Offline (jnxFruContentsIndex 7, jnxFruL1Index 1, jnxFruL2Index 0, jnxFruL3Index 0, jnxFruName FPC: MPC5E 3D 24XGE+6XLGE @ 0/*/*, jnxFruType 3, jnxFruSlot 0, jnxFruOfflineReason 74, jnxFruLastPowerOff 0, jnxFruLastPowerOn 1792)
Apr  5 06:32:13  re0.cr2-esams chassisd[1614]: fru_nmi_timer: Restart FPC 0 due to NMI timeout
Apr  5 06:32:24  re0.cr2-esams chassisd[1614]: CHASSISD_SNMP_TRAP10: SNMP trap generated: FRU power on (jnxFruContentsIndex 7, jnxFruL1Index 1, jnxFruL2Index 0, jnxFruL3Index 0, jnxFruName FPC: MPC5E 3D 24XGE+6XLGE @ 0/*/*, jnxFruType 3, jnxFruSlot 0, jnxFruOfflineReason 2, jnxFruLastPowerOff 0, jnxFruLastPowerOn 158345539)
Apr  5 06:32:34  re0.cr2-esams chassisd[1614]: CHASSISD_POWER_CHECK: FPC 0 not powering up
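For anyone who wants to pull a timeline out of an excerpt like the one above, here is a minimal Python sketch. It assumes the syslog layout seen in these lines (month, day, time, host, process with optional PID, message) and assumes the year is 2017, since the log lines themselves carry no year. The names parse_line and seconds_between are made up for this illustration and are not part of any Juniper tooling.

```python
import re
from datetime import datetime

# Layout assumed from the excerpt:
#   "Apr  5 06:32:01  re0.cr2-esams chassisd[1614]: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+"
    r"(?P<proc>[^\s\[:]+)(?:\[(?P<pid>\d+)\])?:\s*"
    r"(?P<msg>.*)$"
)


def parse_line(line, year=2017):
    """Return (datetime, process, message), or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    # Collapse the double space syslog uses for single-digit days.
    ts = " ".join(m.group("ts").split())
    when = datetime.strptime(f"{year} {ts}", "%Y %b %d %H:%M:%S")
    return when, m.group("proc"), m.group("msg")


def seconds_between(lines, first_marker, last_marker):
    """Seconds from the first message containing first_marker
    to the last message containing last_marker."""
    start = end = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        when, _proc, msg = parsed
        if start is None and first_marker in msg:
            start = when
        if last_marker in msg:
            end = when
    if start is None or end is None:
        return None
    return (end - start).total_seconds()


# Three lines copied from the excerpt above.
SAMPLE = [
    "Apr  5 06:32:01  re0.cr2-esams chassisd[1614]: CHASSISD_FASIC_HSL_LINK_ERROR: Fchip (CB 0, ID 0): link 32 failed because of crc errors",
    "Apr  5 06:32:13  re0.cr2-esams chassisd[1614]: fru_nmi_timer: Restart FPC 0 due to NMI timeout",
    "Apr  5 06:32:34  re0.cr2-esams chassisd[1614]: CHASSISD_POWER_CHECK: FPC 0 not powering up",
]

if __name__ == "__main__":
    print(seconds_between(SAMPLE, "HSL_LINK_ERROR", "not powering up"))
```

On these three lines, the gap between the first fabric link error and the failed power-up check comes out at 33 seconds.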

show chassis fpc shows it as either:

faidon@re0.cr2-esams> show chassis fpc    
                     Temp  CPU Utilization (%)   Memory    Utilization (%)
Slot State            (C)  Total  Interrupt      DRAM (MB) Heap     Buffer
  0  Offline         ---Unresponsive---

or

faidon@re0.cr2-esams> show chassis fpc 
                     Temp  CPU Utilization (%)   Memory    Utilization (%)
Slot State            (C)  Total  Interrupt      DRAM (MB) Heap     Buffer
  0  Present          Absent
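If this kind of check ever needs to be scripted, the two outputs above could be classified with something like the following Python sketch. It assumes the row layout shown here (slot number, state, then free-form detail) and treats any state other than "Online" as needing attention. The function names are hypothetical, and the text would have to be captured from the router by some other means.

```python
import re

# A data row starts with the slot number, then the state, then optional detail.
# Header and prompt lines do not start with a digit, so they are skipped.
ROW_RE = re.compile(r"^\s*(?P<slot>\d+)\s+(?P<state>\S+)\s*(?P<detail>.*?)\s*$")


def fpc_states(output):
    """Map slot number -> (state, detail) for each row of `show chassis fpc` text."""
    states = {}
    for line in output.splitlines():
        m = ROW_RE.match(line)
        if m:
            states[int(m["slot"])] = (m["state"], m["detail"])
    return states


def needs_attention(output):
    """Slots whose state is anything other than Online (an assumed healthy value)."""
    return sorted(s for s, (state, _) in fpc_states(output).items() if state != "Online")


# The two variants pasted above.
UNRESPONSIVE = """\
                     Temp  CPU Utilization (%)   Memory    Utilization (%)
Slot State            (C)  Total  Interrupt      DRAM (MB) Heap     Buffer
  0  Offline         ---Unresponsive---
"""

ABSENT = """\
                     Temp  CPU Utilization (%)   Memory    Utilization (%)
Slot State            (C)  Total  Interrupt      DRAM (MB) Heap     Buffer
  0  Present          Absent
"""

if __name__ == "__main__":
    print(fpc_states(UNRESPONSIVE))
    print(needs_attention(ABSENT))
```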

I tried both the request chassis fpc slot 0 offline & …online dance and set chassis fpc 0 power off followed by a rollback to cut its power, to no avail.

This needs a Juniper case (which will probably result in an RMA).

Event Timeline

Juniper case 2017-0405-0571 opened.

Juniper is ready to proceed with an RMA. We need to sync up with the DC's remote hands for that.

I've emailed evoswitch to open an inbound shipment ticket. Once I have that reference, I'll update this task so @ayounsi can have Juniper dispatch the replacement part.

Step by step instructions for the remote hands:

  1. Locate the chassis: http://www.juniper.net/techpubs/en_US/release-independent/junos/topics/concept/mpc-mx480-description.html
  2. Locate the faulty MPC on the chassis; it should look like https://www.juniper.net/documentation/en_US/release-independent/junos/topics/reference/general/mpc5e-6x40ge-24x10ge.html
  3. Unpack the newly received part.
  4. Confirm that it is similar to the installed part.
  5. Label the cables with their locations
  6. Unplug all cables from the linecard.
  7. Remove the linecard from the chassis
  8. Install the new linecard (MPC).
  9. Wait a few minutes
  10. Verify the LED status. The OK/FAIL LED is bicolor: steady green means the MPC is functioning normally, blinking green means the MPC is transitioning online or offline, and red means the MPC has failed.
  11. Reconnect the cables to the same ports they were unplugged from
  12. Verify blinking link lights on the interfaces
  13. Follow the instructions provided with the new part to return the broken one to Juniper

Inbound ticket # is 7326745, please go ahead and have them dispatch the part. Update this task with the tracking # and assign to me, and I'll get the inbound ticket updated.

From Juniper:

Thank you for the information on provided, and the RMA request has been processed for the FPC:

  • RMA number: R200119594
  • Product ID: MPC5E-40G10G

Logistics will share with you the RMA details soon, feel free to contact them, or me if you have any concerns about the process.

I'm also being CC'd on those emails from Juniper. Once they reply back with the tracking #, I'll update EvoSwitch for the open shipment ticket and open the ticket for the smart hands request.

From Juniper at about 9am UTC:

Sent by Carrier: UPS
Tracking Number: 1Z223V170461615001
Tracking URL: http://wwwapps.ups.com/WebTracking/track?track=yes&trackNums=1Z223V170461615001
Service Level: NEXT BUSINESS DAY ADV REP PART DELIVERY

@RobH please update EvoSwitch with this information.

Thanks

Done, I've also asked evoswitch support about the followup part swap, and how we can arrange it.

We just saw an Icinga recovery for cr2-esams, and now an mtr from netmon1001 to cp3008, for example, goes through cr2-eqiad -> cr2-esams.

Some 503s were registered, probably while VRRP shifted the primary route / default gateway to cr2:

10:45  <icinga-wm> RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0

10:45  <icinga-wm> RECOVERY - Host cr2-esams is UP: PING OK - Packet loss = 0%, RTA = 84.73 ms

10:49  <icinga-wm> PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]                                    

10:50  <icinga-wm> RECOVERY - Host cr2-esams IPv6 is UP: PING OK - Packet loss = 0%, RTA = 85.76 ms

10:50  <icinga-wm> PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]                                     

10:51  <icinga-wm> PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]

10:57  <icinga-wm> RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]

10:57  <icinga-wm> RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
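To put a number on how long each 5xx alert stayed open, the PROBLEM and RECOVERY lines above can be paired per check. The Python sketch below is one way to do that. It assumes the IRC line format shown here (HH:MM, then <icinga-wm>, then PROBLEM or RECOVERY), assumes all lines fall on the same day, and uses a made-up function name, alert_durations.

```python
import re

# "10:49  <icinga-wm> PROBLEM - <check name> is CRITICAL: ..."
ALERT_RE = re.compile(
    r"^(?P<hh>\d{2}):(?P<mm>\d{2})\s+<icinga-wm>\s+"
    r"(?P<kind>PROBLEM|RECOVERY) - (?P<check>.+?) is [A-Z]+\b"
)


def alert_durations(lines):
    """Map check name -> minutes between its first PROBLEM and the next RECOVERY.

    Checks that never recover within the input, and recoveries with no
    preceding PROBLEM, are left out.
    """
    open_since = {}
    durations = {}
    for line in lines:
        m = ALERT_RE.match(line)
        if m is None:
            continue
        minute = int(m["hh"]) * 60 + int(m["mm"])
        check = m["check"]
        if m["kind"] == "PROBLEM":
            # Keep the earliest PROBLEM time if the alert repeats.
            open_since.setdefault(check, minute)
        elif check in open_since:
            durations[check] = minute - open_since.pop(check)
    return durations


# Lines copied from the IRC excerpt above.
SAMPLE = [
    "10:49  <icinga-wm> PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]",
    "10:51  <icinga-wm> PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]",
    "10:57  <icinga-wm> RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]",
    "10:57  <icinga-wm> RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]",
]

if __name__ == "__main__":
    for check, minutes in alert_durations(SAMPLE).items():
        print(f"{check}: {minutes} min")
```

On these four lines the Upload alert was open for 8 minutes and the Text alert for 6. The Esams 5xx alert has no RECOVERY line in the excerpt, so it would simply be omitted from the result.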

Was it expected to be fixed today? Pretty sure it can't have auto-recovered but it was a bit unexpected :)

From show log messages on cr2-esams:

Apr  8 10:44:00  re0.cr2-esams fpc0 CLKSYNC: Transitioned to centralized mode
Apr  8 10:44:03  re0.cr2-esams fpc0 I2C Failed device: group 0x52 address 0x3f
Apr  8 10:44:04  re0.cr2-esams chassisd[1614]: CHASSISD_SNMP_TRAP10: SNMP trap generated: FRU power on (jnxFruContentsIndex 8, jnxFruL1Index 1, jnxFruL2Index 2, jnxFruL3Index 0, jnxFruName PIC:  @ 0/1/*, jnxFruType 11, jnxFruSlot 0, jnxFruOfflineReason 2, jnxFruLastPowerOff 185759114, jnxFruLastPowerOn 185775601)
Apr  8 10:44:04  re0.cr2-esams chassisd[1614]: CHASSISD_SNMP_TRAP10: SNMP trap generated: FRU power on (jnxFruContentsIndex 8, jnxFruL1Index 1, jnxFruL2Index 3, jnxFruL3Index 0, jnxFruName PIC:  @ 0/2/*, jnxFruType 11, jnxFruSlot 0, jnxFruOfflineReason 2, jnxFruLastPowerOff 185759114, jnxFruLastPowerOn 185775601)
Apr  8 10:44:21  re0.cr2-esams chassisd[1614]: CHASSISD_IFDEV_CREATE_NOTICE: create_pics: created interface device for xe-0/1/0
Apr  8 10:44:21  re0.cr2-esams chassisd[1614]: CHASSISD_IFDEV_CREATE_NOTICE: create_pics: created interface device for xe-0/1/1
Apr  8 10:44:21  re0.cr2-esams chassisd[1614]: CHASSISD_IFDEV_CREATE_NOTICE: create_pics: created interface device for xe-0/1/2
Apr  8 10:44:21  re0.cr2-esams chassisd[1614]: CHASSISD_IFDEV_CREATE_NOTICE: create_pics: created interface device for xe-0/1/3
Apr  8 10:44:21  re0.cr2-esams chassisd[1614]: CHASSISD_IFDEV_CREATE_NOTICE: create_pics: created interface device for xe-0/1/4
Apr  8 10:44:21  re0.cr2-esams chassisd[1614]: CHASSISD_IFDEV_CREATE_NOTICE: create_pics: created interface device for xe-0/1/5
Apr  8 10:44:21  re0.cr2-esams chassisd[1614]: CHASSISD_IFDEV_CREATE_NOTICE: create_pics: created interface device for xe-0/1/6
Apr  8 10:44:21  re0.cr2-esams chassisd[1614]: CHASSISD_IFDEV_CREATE_NOTICE: create_pics: created interface device for xe-0/1/7
Apr  8 10:44:21  re0.cr2-esams chassisd[1614]: CHASSISD_IFDEV_CREATE_NOTICE: create_pics: created interface device for xe-0/1/8
Apr  8 10:44:21  re0.cr2-esams chassisd[1614]: CHASSISD_IFDEV_CREATE_NOTICE: create_pics: created interface device for xe-0/1/9
Apr  8 10:44:21  re0.cr2-esams chassisd[1614]: CHASSISD_IFDEV_CREATE_NOTICE: create_pics: created interface device for xe-0/1/10
Apr  8 10:44:21  re0.cr2-esams chassisd[1614]: CHASSISD_IFDEV_CREATE_NOTICE: create_pics: created interface device for xe-0/1/11
Apr  8 10:44:22  re0.cr2-esams /kernel: kernel overwrite ae2 link-speed with child speed 10000000000
Apr  8 10:44:26  re0.cr2-esams /kernel: ae_linkstate_ifd_change: MDOWN received for interface xe-0/1/1, member of ae2
Apr  8 10:44:26  re0.cr2-esams /kernel: ae_linkstate_ifd_change: MDOWN received for interface xe-0/1/2, member of ae2
Apr  8 10:44:27  re0.cr2-esams rpd[1756]: task_connect: task BGP_43821.91.198.174.246+179 addr 91.198.174.246+179: No route to host
Apr  8 10:44:27  re0.cr2-esams rpd[1756]: bgp_connect_start: connect 91.198.174.246 (Internal AS 43821): No route to host
Apr  8 10:44:28  re0.cr2-esams /kernel: ae_linkstate_ifd_change: MUP received for interface xe-0/1/1, member of ae2
Apr  8 10:44:28  re0.cr2-esams /kernel: ae_linkstate_ifd_change: MUP received for interface xe-0/1/2, member of ae2
Apr  8 10:44:28  re0.cr2-esams chassisd[1614]: CHASSISD_IFDEV_CREATE_NOTICE: create_pics: created interface device for et-0/2/0
Apr  8 10:44:28  re0.cr2-esams chassisd[1614]: CHASSISD_IFDEV_CREATE_NOTICE: create_pics: created interface device for et-0/2/1
Apr  8 10:44:28  re0.cr2-esams chassisd[1614]: CHASSISD_IFDEV_CREATE_NOTICE: create_pics: created interface device for et-0/2/2
Apr  8 10:44:28  re0.cr2-esams /kernel: kernel overwrite ae1 link-speed with child speed 40000000000
Apr  8 10:44:29  re0.cr2-esams /kernel: ae_linkstate_ifd_change: MDOWN received for interface et-0/2/1, member of ae1
Apr  8 10:44:29  re0.cr2-esams /kernel: : port status changed
Apr  8 10:44:29  re0.cr2-esams /kernel: ae_linkstate_ifd_change: MDOWN received for interface et-0/2/0, member of ae1
Apr  8 10:44:29  re0.cr2-esams /kernel: : port status changed
Apr  8 10:44:30  re0.cr2-esams fpc0 CMIC(0/1) link 9 SFP laser bias current low  alarm set

And then a lot of RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer XX.XX.XX (External AS XXXX) changed state from OpenConfirm to Established.

As best as I can tell from looking at a longer section of the cr2-esams logs, it really does look like esams remote hands already swapped in the replacement part and things came up normally (with a brief 503 spike). The part was already on-site a day or so before this according to UPS tracking. The logs are currently spamming some errors about misconfigured BGP peers, but that may well be "normal". A large number of peers are established and working fine.

Update: others noticed the serial number didn't change. So the new part is not yet installed, and we're not sure whether the old part recovered spontaneously or because of some local action (e.g. being reseated during inspection).

Mentioned in SAL (#wikimedia-operations) [2017-04-10T09:17:09Z] <XioNoX> remote hands work started to replace the FPC on cr2-esams T162239

Mentioned in SAL (#wikimedia-operations) [2017-04-10T09:44:34Z] <XioNoX> all interfaces back up on cr2-esams, BGP sessions up as well T162239

Return part UPS tracking #: 1Z81648Y9142072038

Juniper received the faulty part:

Thank you for returning your defective product in relation to your recently created RMA. This notification confirms that Juniper has received the following defective part at our return center location.

VRRP priority rolled back:

ayounsi@cr1-esams# show | compare 
[edit groups vrrp interfaces <*> unit <*> family inet address <*> vrrp-group <*>]
-         priority 110;
[edit groups vrrp interfaces <*> unit <*> family inet6 address <*> vrrp-inet6-group <*>]
-         priority 110;

Good to close.