
asw2-d1-eqiad:VCP failure
Closed, Resolved, Public

Description

Earlier today the following alert started flapping in the asw2-d-eqiad logs:

May 14 16:15:28 asw2-d-eqiad fpc1 Rear QSFP+ PIC Chan# 3: Rx loss set
May 14 16:16:28 asw2-d-eqiad fpc1 Rear QSFP+ PIC Chan# 3: Rx loss cleared

Unfortunately, the logs are not clear on which port is having an issue.
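
To get a sense of how often the alert flaps, the log lines can be tallied per FPC and channel. This is only a minimal sketch: the line format is assumed from the two lines quoted above, `count_rx_loss` is a hypothetical helper, and the channel number is not assumed to match a port number.

```python
import re
from collections import Counter

# Assumed line format, taken from the two log lines quoted in this task.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) (?P<fpc>fpc\d+) "
    r"Rear QSFP\+ PIC Chan# (?P<chan>\d+): Rx loss (?P<state>set|cleared)$"
)

def count_rx_loss(lines):
    """Count 'Rx loss set' events per (fpc, channel); hypothetical helper."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and m.group("state") == "set":
            counts[(m.group("fpc"), int(m.group("chan")))] += 1
    return counts

sample = [
    "May 14 16:15:28 asw2-d-eqiad fpc1 Rear QSFP+ PIC Chan# 3: Rx loss set",
    "May 14 16:16:28 asw2-d-eqiad fpc1 Rear QSFP+ PIC Chan# 3: Rx loss cleared",
]
print(count_rx_loss(sample))
```

A count that keeps growing for a single (fpc, channel) pair confirms the flapping, but it still does not identify the physical port.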

However, from:

asw2-d-eqiad> show chassis pic fpc-slot 1 pic-slot 1    
[...]
                         Fiber                    Xcvr vendor       Wave-    Xcvr
  Port Cable type        type  Xcvr vendor        part number       length   Firmware
  0    40GBASE CU 3M     n/a   FiberStore         QSFP-40G-DAC      n/a      0.0   
  1    40GBASE CU 3M     n/a   FiberStore         QSFP-40G-DAC      n/a      0.0   
  2    40GBASE CU 3M     n/a   @i```Bdl           \x10\x10                n/a      0.0
asw2-d-eqiad> show chassis hardware
[...]
    Xcvr 2       REV 01   74           A                 QSFP+-40G-CU3M

It looks like the 3M DAC connected on fpc1:1/2 to fpc2:0/48 is faulty.
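
The check done by eye above (spotting the unreadable vendor string) could be scripted roughly as follows. This is a sketch under assumptions: the columns are assumed to be separated by two or more spaces, as in the paste above, and `PLAUSIBLE` and `suspicious_ports` are hypothetical names, not part of any Juniper tooling.

```python
import re

# Assumption: healthy vendor and part-number strings contain only letters,
# digits, spaces and a few punctuation characters.
PLAUSIBLE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9 .+_/-]*$")

def suspicious_ports(rows):
    """Return port numbers whose vendor or part-number field looks garbled."""
    flagged = []
    for row in rows:
        fields = re.split(r"\s{2,}", row.strip())
        if len(fields) < 5 or not fields[0].isdigit():
            continue  # header or unrelated line, not a port row
        vendor, part = fields[3], fields[4]
        if not (PLAUSIBLE.match(vendor) and PLAUSIBLE.match(part)):
            flagged.append(int(fields[0]))
    return flagged

sample = [
    "  Port Cable type        type  Xcvr vendor        part number       length   Firmware",
    "  0    40GBASE CU 3M     n/a   FiberStore         QSFP-40G-DAC      n/a      0.0",
    "  1    40GBASE CU 3M     n/a   FiberStore         QSFP-40G-DAC      n/a      0.0",
    r"  2    40GBASE CU 3M     n/a   @i`Bdl           \x10\x10                n/a      0.0",
]
print(suspicious_ports(sample))
```

A flagged port only means the transceiver data read back from the module is unreadable; it does not prove that this cable is the source of the Rx loss alerts.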

Event Timeline

ayounsi triaged this task as High priority. May 14 2020, 4:33 PM
ayounsi created this task.

Mentioned in SAL (#wikimedia-operations) [2020-05-14T16:36:26Z] <XioNoX> asw2-d-eqiad> request virtual-chassis vc-port delete pic-slot 0 port 48 member 2 - T252797

Mentioned in SAL (#wikimedia-operations) [2020-05-14T16:42:38Z] <XioNoX> request virtual-chassis vc-port delete pic-slot 1 port 2 member 1 - T252797

I first disabled the mentioned link on the fpc2 side (so we don't risk fully losing access to fpc1), then on the fpc1 side, to check whether the alert was caused by this DAC.

Unfortunately it looks like the errors are still happening:

May 14 16:47:30 asw2-d-eqiad fpc1 Rear QSFP+ PIC Chan# 3: Rx loss set

Rolling back the changes.

Mentioned in SAL (#wikimedia-operations) [2020-05-14T16:49:58Z] <XioNoX> request virtual-chassis vc-port set pic-slot 1 port 2 member 1 - T252797

Mentioned in SAL (#wikimedia-operations) [2020-05-14T16:51:19Z] <XioNoX> asw2-d-eqiad> request virtual-chassis vc-port set pic-slot 0 port 48 member 2 - T252797
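
The elimination reasoning used here (an error logged while a link is disabled rules that link out) can be sketched as below. It rests on assumptions: the SAL timestamps are treated as the moment each command took effect, and the syslog timestamp is assumed to be in the same timezone as the SAL entries. `t` and `exonerated` are hypothetical helpers.

```python
from datetime import datetime

def t(s):
    # Times as quoted in this task, all on 2020-05-14.
    return datetime.strptime("2020-05-14 " + s, "%Y-%m-%d %H:%M:%S")

def exonerated(windows, errors):
    """Links that were disabled while at least one error was still logged."""
    return sorted(
        link for link, (start, end) in windows.items()
        if any(start <= e <= end for e in errors)
    )

# Disabled windows: from "vc-port delete" to "vc-port set" in the SAL entries.
windows = {
    "member 2 pic-slot 0 port 48": (t("16:36:26"), t("16:51:19")),
    "member 1 pic-slot 1 port 2": (t("16:42:38"), t("16:49:58")),
}
errors = [t("16:47:30")]  # "Rx loss set" seen while both ends were disabled
print(exonerated(windows, errors))
```

Both ends of the link come out as exonerated, which matches the conclusion that this DAC was not causing the alert.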

Mentioned in SAL (#wikimedia-operations) [2020-05-14T16:55:23Z] <XioNoX> asw2-d-eqiad> request virtual-chassis vc-port delete pic-slot 1 port 3 member 1 - T252797

pic-slot 1 port 3 member 1 was a leftover port configured as a VC port, but without any cable connected to it.
Errors are still happening.

Mentioned in SAL (#wikimedia-operations) [2020-05-14T17:02:59Z] <XioNoX> asw2-d-eqiad> request virtual-chassis vc-port delete pic-slot 1 port 1 member 1 - T252797

Disabled the last link, and the errors are still showing up, so I'm confused about where the issue is coming from.

Based on T218059#5075466, it is probably due to the link disabled in T251663 acting up.

@Jclark-ctr, please unplug fpc1:1/0 and, on the other side, fpc8:1/0 (the link should be down as well), and remove/store the optics from both sides, but don't remove the fiber in case we need to connect them back.

Unplugging that link caused fpc1 to lose connectivity to the rest of the VC, even though it is neither a VCP nor enabled.

asw2-d-eqiad fpc1 PFEMAN: Shutting down in 5 seconds, PFEMAN Resync aborted! No peer info on reconnect or master rebooted?
asw2-d-eqiad fpc1 CMLC: Going disconnected; Routing engine chassis socket closed abruptly

It has been re-plugged for the time being: even though the alerts say "critical", the issue doesn't seem to be related to the 2 primary uplinks.

We might also need to replace the link reporting its vendor as @i`Bdl \x10\x10, as it might fail sooner rather than later.

@ayounsi can we close this ticket? Or is there anything I can do?

ayounsi changed the task status from Open to Stalled. Jun 23 2020, 11:36 AM

Larger discussion in T256112. Feel free to un-assign it from you until we figure out the overall plan.