
asw2-a-eqiad FPC5 gets disconnected every 10 minutes
Closed, Resolved · Public

Description

(Note that confusingly asw2-a-eqiad FPC5 != asw2-a5-eqiad; I'm referring to the former here rather than the latter)

This may or may not be related to T201095 (surprised FPC5 wasn't mentioned there?), but asw2-a-eqiad member 5 seems to be getting disconnected from the stack every 10 minutes, like clockwork:

Aug  3 06:23:18  asw2-a-eqiad chassisd[1703]: CHASSISD_VCHASSIS_MEMBER_OP_NOTICE: Member change: vc delete of member 5
Aug  3 06:33:14  asw2-a-eqiad chassisd[1703]: CHASSISD_VCHASSIS_MEMBER_OP_NOTICE: Member change: vc delete of member 5
Aug  3 06:43:09  asw2-a-eqiad chassisd[1703]: CHASSISD_VCHASSIS_MEMBER_OP_NOTICE: Member change: vc delete of member 5
Aug  3 06:53:05  asw2-a-eqiad chassisd[1703]: CHASSISD_VCHASSIS_MEMBER_OP_NOTICE: Member change: vc delete of member 5
Aug  3 07:03:05  asw2-a-eqiad chassisd[1703]: CHASSISD_VCHASSIS_MEMBER_OP_NOTICE: Member change: vc delete of member 5
Aug  3 07:13:00  asw2-a-eqiad chassisd[1703]: CHASSISD_VCHASSIS_MEMBER_OP_NOTICE: Member change: vc delete of member 5
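As a quick sanity check, the 10-minute cadence can be confirmed by parsing the syslog timestamps (a throwaway sketch; the message text is truncated here for brevity):

```python
from datetime import datetime

# Sample of the chassisd lines quoted above (tails elided).
lines = [
    "Aug  3 06:23:18  asw2-a-eqiad chassisd[1703]: ... vc delete of member 5",
    "Aug  3 06:33:14  asw2-a-eqiad chassisd[1703]: ... vc delete of member 5",
    "Aug  3 06:43:09  asw2-a-eqiad chassisd[1703]: ... vc delete of member 5",
    "Aug  3 06:53:05  asw2-a-eqiad chassisd[1703]: ... vc delete of member 5",
]

def flap_intervals(lines, year=2018):
    """Parse the leading syslog timestamps; return gaps between flaps in seconds."""
    times = [
        datetime.strptime(f"{year} {l[:15]}", "%Y %b %d %H:%M:%S")
        for l in lines
    ]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]

print(flap_intervals(lines))  # -> [596.0, 595.0, 596.0]: just under 10 minutes apart
```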

This seems to have been going on for as long as there are logs on the switch (i.e. as far back as Aug 2nd 02:14 UTC). I also got this while logged in on the switch:

Message from syslogd@asw2-a-eqiad at Aug  3 07:22:43  ...
asw2-a-eqiad fpc5 PFEMAN: Shutting down in 5 seconds, reconnect aborted: null sequence number

FPC5 currently has only a couple of hosts that are being provisioned (dbproxy1012 and labstore1008), so service impact is non-existent at the moment -- assuming of course that these flaps aren't cascading into wider VC issues, which wouldn't surprise me...

Event Timeline

faidon triaged this task as High priority. Aug 3 2018, 9:04 AM
faidon created this task.
Restricted Application added a subscriber: Aklapper. Aug 3 2018, 9:04 AM
ayounsi claimed this task. Edited Aug 3 2018, 4:28 PM
ayounsi added a subscriber: ayounsi.

Juniper ticket 2018-0803-0360 created.

According to Kibana, this started ~6h after T201095 got created.

As the switch outputs Critical and Emergency logs, another question is why LibreNMS didn't send alerts about them, even though it detected them (cf. https://librenms.wikimedia.org/alert-log/ ); to be investigated later.
Edit: Added an alert for Emergency-level logs

Also, it doesn't seem related but is worth mentioning: I recently opened case 2018-0710-0720 with Juniper about critical "fpc8 Rear QSFP+ PIC" high-temperature syslogs, where after troubleshooting, JTAC concluded they were false positives / cosmetic only.

JTAC recommendation is to format and re-install the switch member using:
https://kb.juniper.net/InfoCenter/index?page=content&id=KB20643

In their emails they say that only USB is possible, so I don't know if there is a limitation with TFTP or if it's an oversight.
But as we don't have anyone in the DC until Monday, the only option is TFTP.

In the meantime I'm also gathering more logs and a core-dump for them.

Mentioned in SAL (#wikimedia-operations) [2018-08-03T23:07:28Z] <XioNoX> - restart asw2-b5-eqiad into loader - T201145

Regular member restart didn't help.

Juniper's TFTP doesn't seem to support large files, so the USB method is the only option.

asw2-a5 is on the loader> prompt and stable. It doesn't need anyone to go to the DC over the evening/weekend (no ongoing service alerts, etc.), but needs to be addressed first thing on Monday.

The USB method isn't working either; followed up with JTAC. If there's no quick resolution, we have a spare EX4300 to swap it with.

loader> install --format file:///jinstall-ex-4300-14.1X53-D46.7-domestic-signed.tgz
cannot open package (error 22)

The MD5 is good, the drive is detected and FAT32-formatted, and the image is the only file on the drive.

JTAC is RMA'ing the device. As we have spares, we will swap it with a spare one.

@Cmjohnson please swap the switch with another EX4300, and only connect console and the usb drive.

Once it's ready to join the virtual chassis, I'll need you to connect the VC cables, prod ports and mgmt ports.

I swapped out the ex4300 with a current spare, wmf7314. @ayounsi can you give me the details of the RMA and the shipping label? Please email.

Replacing fpc5 didn't solve the issue... Following up...

ayounsi added a comment. Edited Aug 7 2018, 8:53 PM

The root cause seems to be a design issue: in a Virtual Chassis Fabric (VCF), which is what we use everywhere (eqiad and codfw), spines (routing-engines) can't be connected to spines (which we don't do), and leafs (linecards) can't be connected to leafs (which we do everywhere).

In the current asw2-a/b-eqiad design, FPC5 is only connected to multiple leafs (see current diagram), which confuses the fabric into considering it a spine:

ayounsi@asw2-a-eqiad# run show virtual-chassis       
                                                Mstr           Mixed Route Neighbor List
Member ID  Status   Serial No    Model          prio  Role      Mode  Mode ID  Interface
5 (FPC 5)  Prsnt    PE3717450101 ex4300-48t     128   Linecard     Y  F    3  vcp-255/1/0
                                                                           6  vcp-255/1/2

I temporarily disabled the VC links fpc5/fpc3 and fpc5/fpc4, to only leave fpc5 connected to fpc6, then restarted fpc5. After ~15 min its priority dropped to 0 and the disconnects seem to have stopped.
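The cabling rule above can be sketched as a small check: flag any linecard whose VC neighbors include no routing-engine. Member IDs and links below are illustrative, not an exact dump of the fabric:

```python
# Spine (routing-engine) members in this row's design; everything else is a leaf.
SPINES = {2, 7}

def cabling_violations(links):
    """Return leaf members that have no direct VC link to a spine."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return sorted(
        m for m, ns in neighbors.items()
        if m not in SPINES and not (ns & SPINES)
    )

# FPC5 linked only to leafs 3, 4 and 6 -> flagged, as in the incident.
links_before = [(2, 1), (2, 3), (2, 4), (7, 6), (7, 8), (5, 3), (5, 4), (5, 6)]
print(cabling_violations(links_before))  # -> [5]
```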

The permanent fix is non trivial though.

Going full VCF (see diagram 2) with our current switches would only require VC link moves and additions, but would use all 6 QSFP ports on FPC2/FPC7 (non-ideal), and is not possible in codfw (we use 1 QSFP port on each as uplinks).

Going "half" VCF (see diagram 3), similar to the previous design (currently running on all the asw-x-eqiad/codfw), frees up 2 QSFP ports on the spines, but raises a few unknowns:

  • How stable is it when all members are up? (Current experience seems to indicate that it works well.)
  • How would the chassis behave if one of the spines (fpc2/fpc7) goes offline, as some of the leafs (1/3/6/7) would no longer be directly connected to a spine? (Do we have past experience? If not, this should probably be tested.)

Going to a VC (non-Fabric) mode (ring) could also be investigated, but is non-ideal as well, since it carries the limitations of a VC and would require a Virtual Chassis reboot.

It's possible that asw2-b-eqiad gets affected as well (maybe the current issues there are different symptoms of the same problem); asw2-c-eqiad isn't, as that fabric has one fewer member.

To stabilize things, tomorrow I want to connect fpc5 to fpc7 (and to fpc2 if DAC length permits), and do the same with asw2-b-eqiad to avoid the same issue happening there.

In addition, inquire whether the "JNP-QSFP-DAC-10MA" and "JNP-QSFP-DAC-7MA" cables can be used for VC links (they used to not be listed at all in the Hardware Compatibility Tool).

BBlack added a subscriber: mark. Aug 7 2018, 9:30 PM

asw2-a-eqiad now looks like the 3rd diagram (all leafs have at least 1 link to a spine).

fpc4 is connected to fpc2/fpc6/fpc7 (removed the fpc3 link)
fpc5 is connected to fpc2/fpc3/fpc7 (removed the fpc4 and fpc6 links, added fpc2 and fpc7)

The above changes led to a malfunction of asw2-a-eqiad starting at 17:45 UTC, causing ~35% packet loss to hosts in row A. This also impacted hosts on asw for traffic coming from cr1, as the current topology is cr1-asw2-asw-cr2 (cr2 being VRRP master).
fpc1-fpc3 (at 18:30 UTC) and fpc2-fpc4 (at 18:47 UTC) have been disabled and the fabric now matches the following topology.


fpc3 is still showing signs of malfunction, but no hosts are currently on that member and the other members seem stable.

ayounsi added a comment. Edited Aug 9 2018, 4:34 PM

Suggested plan to have asw2-a-eqiad similar to codfw:

Need to be added:
fpc1-fpc3
fpc3-fpc4
fpc5-fpc6

Need to be removed:
fpc2-fpc4
fpc5-fpc7
fpc3-fpc5

In order, based on the hypothesis that it's better to remove links before adding more:

Connect fpc3:1/3-fpc4:0/49 (already connected and disabled)
Connect fpc5:1/3-fpc6:1/1 (keep disabled)

disable fpc3-fpc5
request virtual-chassis vc-port delete pic-slot 1 port 0 member 3
disable fpc5-fpc7
request virtual-chassis vc-port delete pic-slot 0 port 51 member 7

enable fpc1-fpc3
request virtual-chassis vc-port set pic-slot 0 port 0 member 1
enable fpc3-fpc4
request virtual-chassis vc-port set pic-slot 1 port 3 member 3
request virtual-chassis vc-port set pic-slot 0 port 49 member 4
enable fpc5-fpc6
request virtual-chassis vc-port set pic-slot 1 port 3 member 5
request virtual-chassis vc-port set pic-slot 1 port 1 member 6

And have the request virtual-chassis vc-port [set|delete] pic-slot X port Y member Z commands ready to perform the migration, as well as their opposites to quickly roll back if needed.
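A throwaway sketch of preparing those command pairs, so each migration command comes with its rollback ready (the string format follows the Junos commands quoted above; the helper names are mine):

```python
def vc_port_cmd(action, member, pic_slot, port):
    """Render one Junos vc-port command in the syntax used in the plan above."""
    return (f"request virtual-chassis vc-port {action} "
            f"pic-slot {pic_slot} port {port} member {member}")

def with_rollback(action, member, pic_slot, port):
    """Return (command, rollback): the opposite operation, ready if needed."""
    opposite = "delete" if action == "set" else "set"
    return (vc_port_cmd(action, member, pic_slot, port),
            vc_port_cmd(opposite, member, pic_slot, port))

# E.g. disabling fpc3-fpc5 on the fpc3 side, with its undo pre-computed.
cmd, undo = with_rollback("delete", member=3, pic_slot=1, port=0)
print(cmd)
print(undo)
```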

Mentioned in SAL (#wikimedia-operations) [2018-08-09T17:58:51Z] <XioNoX> connecting asw2-a5-eqiad to asw2-a6-eqiad - T201145

Mentioned in SAL (#wikimedia-operations) [2018-08-09T20:44:12Z] <XioNoX> disable fpc3-fpc5 and fpc5-fpc7 - T201145

Mentioned in SAL (#wikimedia-operations) [2018-08-09T20:49:57Z] <XioNoX> enable fpc1-fpc3 - T201145

Mentioned in SAL (#wikimedia-operations) [2018-08-09T20:54:55Z] <XioNoX> enable fpc3-fpc4 - T201145

Mentioned in SAL (#wikimedia-operations) [2018-08-09T21:01:14Z] <XioNoX> enable fpc5-fpc6 - T201145

Mentioned in SAL (#wikimedia-operations) [2018-08-09T21:06:14Z] <XioNoX> disable fpc4-fpc6 - T201145

Mentioned in SAL (#wikimedia-operations) [2018-08-09T21:21:50Z] <XioNoX> disable fpc5-fpc6 - T201145

Mentioned in SAL (#wikimedia-operations) [2018-08-09T21:31:43Z] <XioNoX> disable fpc3-fpc4 - T201145

Mentioned in SAL (#wikimedia-operations) [2018-08-09T22:31:14Z] <XioNoX> disable fpc1-fpc3 - T201145

A chassis reboot cleared that specific issue.

ayounsi added a comment. Edited Sep 5 2018, 6:36 PM

asw2-a-eqiad members upgraded to 14.1X53-D47.3 (QFX host included). For the record, downtime was 10min.
request system software add set [ /var/tmp/jinstall-ex-4300-14.1X53-D47.3-domestic-signed.tgz /var/tmp/jinstall-qfx-5-14.1X53-D47.3-domestic-signed.tgz ] force-host validate no-copy

ayounsi added a comment. Edited Sep 5 2018, 8:37 PM

Migration steps toward a compatible topology.

Opened T203606 for the 7m DAC.

Step 1

  • Disconnect the already-disabled links:
  • fpc1-fpc3 (5m DAC)
  • fpc3-fpc5 (5m DAC)
  • fpc3-fpc4 (3m DAC)
  • fpc4-fpc6 (5m DAC)
  • fpc5-fpc6 (3m DAC)

Step 2

  • Enable fpc2-fpc4 + fpc5-fpc7
  • Disconnect the last cross-leaf links:
  • fpc1-fpc8 (40G optics + fiber)
  • fpc6-fpc8 (5m DAC)
  • Test connectivity between different members (eg FPC1 to FPC8)

It will be useful to know how this behaves when we reconfigure production rows.
E.g. can we temporarily have both spines linked to only some leafs, without causing instability in the fabric, before connecting the remaining links?

Step 3

  • Add missing links
  • fpc1-fpc7 (40G optics + fiber)
  • fpc3-fpc7 (7m DAC)
  • fpc6-fpc2 (7m DAC)
  • Replace fpc4-fpc7 with a 5m DAC
  • Add last link: fpc8-fpc2 (40G optics + fiber)
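One way to vet a step plan like this ahead of time is to replay each add/remove on a toy adjacency and check the fabric never splits into islands. The links and step order below are illustrative, not the actual row A plan:

```python
def connected(links, members):
    """True if every member is reachable from any other over the given links."""
    adj = {m: set() for m in members}
    for a, b in links:
        adj[a].add(b)
        adj[b].add(a)
    seen, stack = set(), [next(iter(members))]
    while stack:
        m = stack.pop()
        if m in seen:
            continue
        seen.add(m)
        stack.extend(adj[m] - seen)
    return seen == set(members)

members = {1, 2, 3, 4, 5, 6, 7, 8}
links = {(1, 2), (2, 3), (3, 4), (4, 7), (5, 7), (6, 7), (8, 7), (2, 8)}
# Hypothetical migration steps; assert after each that the fabric stays whole.
steps = [("add", (3, 7)), ("remove", (2, 3)), ("add", (1, 7)), ("remove", (2, 8))]
for op, link in steps:
    (links.discard if op == "remove" else links.add)(link)
    assert connected(links, members), f"fabric split after {op} {link}"
print("all intermediate states connected")
```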
ayounsi mentioned this in Unknown Object (Task). Sep 5 2018, 8:52 PM

Cabling has been done out of order, but the end result is there (minus the 7m DACs).

During the re-cabling, the fabric was very unstable: frequent disconnects between members, adjacencies appearing/disappearing.
A reboot of the fabric didn't solve the issue.

Disabling the FPC3-FPC2 link made everything stable again, and stayed stable after re-enabling it.

FPC3 seems to be able to reach FPC6:

ayounsi@asw2-a-eqiad> show virtual-chassis vc-path source-interface ge-3/0/0 destination-interface ge-6/0/0    
Fabric forwarding path from ge-3/0/0 (PFE 3) to ge-6/0/0 (PFE 6)
  Hop   0  Member-ID   3  PFE    3
    Next-hop PFE   2
      Interface vcp-255/1/0.32768  Bandwidth   40
  Hop   1  Member-ID   2  PFE    2
    Next-hop PFE   1
      Interface vcp-255/0/48.32768 Bandwidth   40
    Next-hop PFE   4
      Interface vcp-255/0/50.32768 Bandwidth   40
    Next-hop PFE   5
      Interface vcp-255/0/51.32768 Bandwidth   40
    Next-hop PFE   8
      Interface vcp-255/0/52.32768 Bandwidth   40
  Hop   2  Member-ID   1  PFE    1
    Next-hop PFE   7
      Interface vcp-255/1/1.32768  Bandwidth   40
  Hop   2  Member-ID   4  PFE    4
    Next-hop PFE   7
      Interface vcp-255/0/49.32768 Bandwidth   40
  Hop   2  Member-ID   5  PFE    5
    Next-hop PFE   7
      Interface vcp-255/1/1.32768  Bandwidth   40
  Hop   2  Member-ID   8  PFE    8
    Next-hop PFE   7
      Interface vcp-255/1/0.32768  Bandwidth   40
  Hop   3  Member-ID   7  PFE    7
    Next-hop PFE   6
      Interface vcp-255/0/52.32768 Bandwidth   40
  Hop   4  Member-ID   6  PFE    6

FPC3->FPC2->FPC1/4/5/8->FPC7->FPC6. Sub-optimal, but that path only occurs in the rare case of 2 links going down.
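The vc-path output above can be cross-checked with a plain BFS over an assumed adjacency (the link map below is inferred from that output, not dumped from the switch):

```python
from collections import deque

# Assumed member adjacency with the FPC3 shortcuts down: FPC3 hangs off FPC2,
# FPC6 hangs off FPC7, and FPC1/4/5/8 bridge the two spines.
LINKS = {
    3: {2}, 2: {3, 1, 4, 5, 8},
    1: {2, 7}, 4: {2, 7}, 5: {2, 7}, 8: {2, 7},
    7: {1, 4, 5, 8, 6}, 6: {7},
}

def shortest_path(src, dst):
    """Plain BFS; returns one shortest member-to-member path, or None."""
    seen, queue = {src}, deque([[src]])
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in LINKS.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

path = shortest_path(3, 6)
print(len(path) - 1, path)  # 4 hops, matching the vc-path output
```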

Some notes for row B re-cabling:

  • We should stick to the re-cabling plan (which should minimize downtime and let us know exactly where we're at)
  • We should prepare for the worst (~30 min downtime of the whole stack)
  • Prepare the commands to isolate/sacrifice some leafs in case the fabric becomes unstable (eg. FPC3/FPC6 as seen with row A)

Next steps for row A are:

  1. Connect asw2-a to asw-a
  2. Connect non production servers to fpc3 and fpc6 to test connectivity (and be ready to cut FPC3/FPC6 off the VCF if any sign of instability)
  3. Connect the 7m DAC for fpc3/fpc6 when we get them
  4. Re-introduce asw2 between cr1-eqiad and asw-a-eqiad
  5. Move more servers to asw2-a-eqiad
BBlack added a subscriber: BBlack. Sep 7 2018, 7:53 PM

I think it's hard to evaluate the stability of the intended, supported VCF design while in an intermediate state. It's also probably not reasonable to expect the fabric to be stable even in the final state, if it arrives there through a set of cable changes from a known-unstable state. All I'd really expect ("expect" as in: otherwise I'd be shocked and want to call Juniper again and start all over) is that if we recable everything appropriately, and then reboot the whole thing, it should be stable.

When I look at the above:

Cabling has been done out of order, but end result is there. (minus the 7m DAC).

... but without the 7m DACs, it's not actually a legal configuration. You could argue that if one leaf switch were missing one uplink, it would be more or less a legal config acting as it would with one failed cable, handling the fault just fine. But with two uplinks missing, on different leafs to different spines, we're violating the design rules (or inducing the kind of failure Juniper doesn't expect VCF to tolerate well; either way, we're making islands that are only interconnected by hopping through leaves rather than spines).

During the re-cabling, the fabric was very unstable: frequent disconnects between members, adjacencies appearing/disappearing.
A reboot of the fabric didn't solve the issue.

Disabling the FPC3-FPC2 link made everything stable again, and stayed stable after re-enabling it.

Right, so I expect instability during the recabling, but the final state before rebooting the fabric was an unsupported islands-connected-by-leaves state, and flapping the FPC3-FPC2 link flapped one of the islanded parts of the configuration, which I guess got things under control for now, but I wouldn't call it stable.

Some notes for row B re-cabling:

  • We should stick to the re-cabling plan (which should minimize downtime and let us know exactly where we're at)
  • We should prepare for the worst (~30 min downtime of the whole stack)
  • Prepare the commands to isolate/sacrifice some leafs in case the fabric becomes unstable (eg. FPC3/FPC6 as seen with row A)

Next steps for row A are:

  1. Connect asw2-a to asw-a
  2. Connect non production servers to fpc3 and fpc6 to test connectivity (and be ready to cut FPC3/FPC6 off the VCF if any sign of instability)
  3. Connect the 7m DAC for fpc3/fpc6 when we get them
  4. Re-introduce asw2 between cr1-eqiad and asw-a-eqiad
  5. Move more servers to asw2-a-eqiad

I'd argue that in the current state of row A, it's not a great idea to start moving production things back to it, and that we shouldn't repeat this scenario with Row B. I'd re-order things as:

  1. Connect the 7m DAC for fpc3/fpc6 when we get them
  2. Reboot the stack, and it should come up stable in this supported config, without any further jiggling.
  3. Connect asw2-a to asw-a
  4. Connect non production servers to <anywhere in A> to test connectivity
  5. Re-introduce asw2 between cr1-eqiad and asw-a-eqiad
  6. Move more servers to asw2-a-eqiad

And then for the next rows we try the reconfig on, have the 7m DACs (or whatever else is necessary) ready from the get-go, and don't even try unsupported intermediate states.

Mentioned in SAL (#wikimedia-operations) [2018-09-27T17:27:42Z] <XioNoX> reboot asw2-a-eqiad (not in prod) - T201145

The logs mentioned during the meeting seem to show the link between a2 and a8 flapping (possibly a faulty optic) and VC members re-calculating paths around the failure:

Oct  1 19:33:07  asw2-a-eqiad rpd[2040]: EVENT <UpDown> vcp-255/0/52 index 132 <Broadcast Multicast>
Oct  1 19:33:07  asw2-a-eqiad vccpd[1868]: interface vcp-255/0/52 went down
Oct  1 19:33:07  asw2-a-eqiad vccpd[1868]: Member 2, interface vcp-255/0/52.32768 went down
Oct  1 19:33:07  asw2-a-eqiad fpc2 pfex_vc_get_rem_memb_nh:175 no valid port in vc_trunk 2
Oct  1 19:33:07  asw2-a-eqiad fpc2 no valid port in vc_trunk 2
Oct  1 19:33:07  asw2-a-eqiad fpc2 pfex_vc_get_rem_memb_nh:175 no valid port in vc_trunk 2
Oct  1 19:33:07  asw2-a-eqiad fpc2 no valid port in vc_trunk 2
Oct  1 19:33:07  asw2-a-eqiad vccpd[1868]: JTASK_SIGNAL_UNKNOWN: Ignoring unknown signal SIGVTALRM (26)
Oct  1 19:33:07  asw2-a-eqiad fpc7 pfe_bcm_vchassis_port_modid_egress_set:565 egress set: vc_tree_mode 1, memb 1 selfId 7 devrt_raw 0x20 0x2000000000000 vcp_mask 0x22, 0x2222000000000000
Oct  1 19:33:07  asw2-a-eqiad fpc4 pfe_bcm_release_trunk:902 Releasing trunk 2 ref count (5)
Oct  1 19:33:07  asw2-a-eqiad fpc4 pfe_bcm_vchassis_stk_modport_trunk_set:1665 dev 0 ingress_pbm 0 1ff ffffffff ffffffff dest_mod: 8 bcm_trunk_id: 1024
Oct  1 19:33:07  asw2-a-eqiad fpc7 pfe_bcm_vchassis_port_modid_egress_set:565 egress set: vc_tree_mode 1, memb 2 selfId 7 devrt_raw 0x20 0x2000000000000000 vcp_mask 0x22, 0x2222000000000000
Oct  1 19:33:07  asw2-a-eqiad fpc7 pfe_bcm_vchassis_port_modid_egress_set:565 egress set: vc_tree_mode 1, memb 3 selfId 7 devrt_raw 0x20 0x20000000000000 vcp_mask 0x22, 0x2222000000000000
Oct  1 19:33:07  asw2-a-eqiad fpc7 pfe_bcm_vchassis_port_modid_egress_set:565 egress set: vc_tree_mode 1, memb 4 selfId 7 devrt_raw 0x20 0x200000000000000 vcp_mask 0x22, 0x2222000000000000
Oct  1 19:33:07  asw2-a-eqiad fpc7 pfe_bcm_vchassis_port_modid_egress_set:565 egress set: vc_tree_mode 1, memb 5 selfId 7 devrt_raw 0x20 0x2000000000000000 vcp_mask 0x22, 0x2222000000000000
Oct  1 19:33:07  asw2-a-eqiad fpc7 pfe_bcm_vchassis_port_modid_egress_set:565 egress set: vc_tree_mode 1, memb 6 selfId 7 devrt_raw 0x22 0x0 vcp_mask 0x22, 0x2222000000000000
Oct  1 19:33:07  asw2-a-eqiad fpc7 pfe_bcm_vchassis_multi_path_irt_process:3270 processing MC: map = 3f, flag=0, num_node = 8, node_map = 1fe
Oct  1 19:33:07  asw2-a-eqiad fpc7 pfe_bcm_vchassis_multi_path_irt_process:3277 tree process time: UC = 744, MC = 6390, total = 7134 us
Oct  1 19:33:07  asw2-a-eqiad fpc2 pfe_bcm_vchassis_port_modid_egress_set:565 egress set: vc_tree_mode 1, memb 1 selfId 2 devrt_raw 0x20 0x2222000000000000 vcp_mask 0x22, 0x2222000000000000
Oct  1 19:33:07  asw2-a-eqiad fpc2 pfe_bcm_vchassis_port_modid_egress_set:565 egress set: vc_tree_mode 1, memb 2 selfId 2 devrt_raw 0x20 0x2222000000000000 vcp_mask 0x22, 0x2222000000000000
Oct  1 19:33:07  asw2-a-eqiad fpc2 pfe_bcm_vchassis_port_modid_egress_set:565 egress set: vc_tree_mode 1, memb 3 selfId 2 devrt_raw 0x20 0x2222000000000000 vcp_mask 0x22, 0x2222000000000000
Oct  1 19:33:07  asw2-a-eqiad fpc2 pfe_bcm_vchassis_port_modid_egress_set:565 egress set: vc_tree_mode 1, memb 4 selfId 2 devrt_raw 0x20 0x2222000000000000 vcp_mask 0x22, 0x2222000000000000
Oct  1 19:33:07  asw2-a-eqiad fpc2 pfe_bcm_vchassis_port_modid_egress_set:565 egress set: vc_tree_mode 1, memb 5 selfId 2 devrt_raw 0x20 0x2222000000000000 vcp_mask 0x22, 0x2222000000000000
Oct  1 19:33:07  asw2-a-eqiad fpc2 pfe_bcm_vchassis_port_modid_egress_set:565 egress set: vc_tree_mode 1, memb 6 selfId 2 devrt_raw 0x20 0x2222000000000000 vcp_mask 0x22, 0x2222000000000000
Oct  1 19:33:07  asw2-a-eqiad fpc2 pfe_bcm_release_trunk:902 Releasing trunk 6 ref count (1)
Oct  1 19:33:07  asw2-a-eqiad fpc2 pfe_bcm_delete_trunk:872 Deleting trunk 6 of type 2
Oct  1 19:33:07  asw2-a-eqiad fpc2 pfe_bcm_vchassis_stk_modport_trunk_set:1665 dev 0 ingress_pbm 0 1ff ffffffff ffffffff dest_mod: 7 bcm_trunk_id: 1030
Oct  1 19:33:07  asw2-a-eqiad fpc2 pfe_bcm_release_trunk:902 Releasing trunk 2 ref count (1)
Oct  1 19:33:07  asw2-a-eqiad fpc2 pfe_bcm_delete_trunk:872 Deleting trunk 2 of type 1
Oct  1 19:33:07  asw2-a-eqiad fpc2 pfe_bcm_vchassis_stk_modport_trunk_set:1665 dev 0 ingress_pbm 0 1ff ffffffff ffffffff dest_mod: 8 bcm_trunk_id: 1030
Oct  1 19:33:07  asw2-a-eqiad fpc2 pfe_bcm_vchassis_port_modid_egress_set:565 egress set: vc_tree_mode 1, memb 8 selfId 2 devrt_raw 0x20 0x2222000000000000 vcp_mask 0x22, 0x2222000000000000
Oct  1 19:33:07  asw2-a-eqiad fpc2 pfe_bcm_vchassis_block_unused_hg_ports:485 unused hg blocked = 0, used hg = 2222000000000000
Oct  1 19:33:07  asw2-a-eqiad fpc2 pfe_bcm_vchassis_multi_path_irt_process:3270 processing MC: map = bf, flag=0, num_node = 8, node_map = 1fe
Oct  1 19:33:07  asw2-a-eqiad fpc2 pfe_bcm_vchassis_multi_path_irt_process:3277 tree process time: UC = 19024, MC = 1668, total = 20692 us
Oct  1 19:33:08  asw2-a-eqiad fpc2 UPDN msg to kernel for ifd:vcp-255/0/52, flag:1, speed: 40000000000, duplex:2
Oct  1 19:33:08  asw2-a-eqiad mcsnoopd[2065]: EVENT <UpDown> vcp-255/0/52.32768 index 68 <Up Broadcast Multicast>
Oct  1 19:33:08  asw2-a-eqiad vccpd[1868]: VCCPD_PROTOCOL_ADJUP: New adjacency to c042.d045.7ac0 on vcp-255/0/52.32768
Oct  1 19:33:08  asw2-a-eqiad mcsnoopd[2065]: EVENT <UpDown> vcp-255/0/52 index 132 <Up Broadcast Multicast>
Oct  1 19:33:08  asw2-a-eqiad rpd[2040]: EVENT <UpDown> vcp-255/0/52.32768 index 68 <Up Broadcast Multicast>
Oct  1 19:33:08  asw2-a-eqiad rpd[2040]: EVENT <UpDown> vcp-255/0/52 index 132 <Up Broadcast Multicast>
Oct  1 19:33:08  asw2-a-eqiad vccpd[1868]: interface vcp-255/0/52 came up
Oct  1 19:33:08  asw2-a-eqiad vccpd[1868]: Member 2, interface vcp-255/0/52.32768 came up
Oct  1 19:33:08  asw2-a-eqiad fpc7 pfe_bcm_vchassis_port_modid_egress_set:565 egress set: vc_tree_mode 1, memb 1 selfId 7 devrt_raw 0x0 0x2000000000000 vcp_mask 0x22, 0x2222000000000000
Oct  1 19:33:08  asw2-a-eqiad vccpd[1868]: JTASK_SIGNAL_UNKNOWN: Ignoring unknown signal SIGVTALRM (26)
Oct  1 19:33:08  asw2-a-eqiad fpc7 pfe_bcm_vchassis_port_modid_egress_set:565 egress set: vc_tree_mode 1, memb 2 selfId 7 devrt_raw 0x0 0x2000000000000000 vcp_mask 0x22, 0x2222000000000000
Oct  1 19:33:08  asw2-a-eqiad fpc7 pfe_bcm_vchassis_port_modid_egress_set:565 egress set: vc_tree_mode 1, memb 3 selfId 7 devrt_raw 0x0 0x20000000000000 vcp_mask 0x22, 0x2222000000000000
Oct  1 19:33:08  asw2-a-eqiad fpc7 pfe_bcm_vchassis_port_modid_egress_set:565 egress set: vc_tree_mode 1, memb 4 selfId 7 devrt_raw 0x0 0x200000000000000 vcp_mask 0x22, 0x2222000000000000
Oct  1 19:33:08  asw2-a-eqiad fpc7 pfe_bcm_vchassis_port_modid_egress_set:565 egress set: vc_tree_mode 1, memb 5 selfId 7 devrt_raw 0x0 0x2000000000000000 vcp_mask 0x22, 0x2222000000000000
Oct  1 19:33:08  asw2-a-eqiad fpc7 pfe_bcm_vchassis_port_modid_egress_set:565 egress set: vc_tree_mode 1, memb 6 selfId 7 devrt_raw 0x2 0x0 vcp_mask 0x22, 0x2222000000000000
Oct  1 19:33:08  asw2-a-eqiad fpc7 pfe_bcm_vchassis_multi_path_irt_process:3270 processing MC: map = 3f, flag=0, num_node = 8, node_map = 1fe
Oct  1 19:33:08  asw2-a-eqiad fpc7 pfe_bcm_vchassis_multi_path_irt_process:3277 tree process time: UC = 770, MC = 1702, total = 2472 us
Oct  1 19:33:08  asw2-a-eqiad fpc2 pfe_bcm_vchassis_port_modid_egress_set:565 egress set: vc_tree_mode 1, memb 1 selfId 2 devrt_raw 0x22 0x2222000000000000 vcp_mask 0x22, 0x2222000000000000
Oct  1 19:33:08  asw2-a-eqiad fpc2 pfe_bcm_vchassis_port_modid_egress_set:565 egress set: vc_tree_mode 1, memb 2 selfId 2 devrt_raw 0x22 0x2222000000000000 vcp_mask 0x22, 0x2222000000000000
Oct  1 19:33:08  asw2-a-eqiad fpc2 pfe_bcm_vchassis_port_modid_egress_set:565 egress set: vc_tree_mode 1, memb 3 selfId 2 devrt_raw 0x22 0x2222000000000000 vcp_mask 0x22, 0x2222000000000000
Oct  1 19:33:08  asw2-a-eqiad fpc2 pfe_bcm_vchassis_port_modid_egress_set:565 egress set: vc_tree_mode 1, memb 4 selfId 2 devrt_raw 0x22 0x2222000000000000 vcp_mask 0x22, 0x2222000000000000
Oct  1 19:33:08  asw2-a-eqiad fpc2 pfe_bcm_vchassis_port_modid_egress_set:565 egress set: vc_tree_mode 1, memb 5 selfId 2 devrt_raw 0x22 0x2222000000000000 vcp_mask 0x22, 0x2222000000000000
Oct  1 19:33:08  asw2-a-eqiad fpc2 pfe_bcm_vchassis_port_modid_egress_set:565 egress set: vc_tree_mode 1, memb 6 selfId 2 devrt_raw 0x22 0x2222000000000000 vcp_mask 0x22, 0x2222000000000000
Oct  1 19:33:08  asw2-a-eqiad fpc2 pfe_bcm_release_trunk:902 Releasing trunk 6 ref count (2)
Oct  1 19:33:08  asw2-a-eqiad fpc2 pfe_bcm_vchassis_stk_modport_trunk_set:1665 dev 0 ingress_pbm 0 1ff ffffffff ffffffff dest_mod: 7 bcm_trunk_id: 1026
Oct  1 19:33:08  asw2-a-eqiad fpc2 pfe_bcm_vchassis_block_unused_hg_ports:485 unused hg blocked = 0, used hg = 2222000000000000
Oct  1 19:33:08  asw2-a-eqiad fpc2 pfe_bcm_release_trunk:902 Releasing trunk 6 ref count (1)
Oct  1 19:33:08  asw2-a-eqiad fpc2 pfe_bcm_delete_trunk:872 Deleting trunk 6 of type 2
Oct  1 19:33:08  asw2-a-eqiad fpc2 pfe_bcm_vchassis_stk_modport_trunk_set:1665 dev 0 ingress_pbm 0 1ff ffffffff ffffffff dest_mod: 8 bcm_trunk_id: 1030
Oct  1 19:33:08  asw2-a-eqiad fpc2 pfe_bcm_vchassis_port_modid_egress_set:565 egress set: vc_tree_mode 1, memb 8 selfId 2 devrt_raw 0x22 0x2222000000000000 vcp_mask 0x22, 0x2222000000000000
Oct  1 19:33:08  asw2-a-eqiad fpc2 pfe_bcm_vchassis_multi_path_irt_process:3270 processing MC: map = bf, flag=0, num_node = 8, node_map = 1fe
Oct  1 19:33:08  asw2-a-eqiad fpc2 pfe_bcm_vchassis_multi_path_irt_process:3277 tree process time: UC = 19669, MC = 1728, total = 21397 us
Oct  1 19:33:08  asw2-a-eqiad fpc4 pfe_bcm_release_trunk:902 Releasing trunk 0 ref count (2)
Oct  1 19:33:08  asw2-a-eqiad fpc4 pfe_bcm_vchassis_stk_modport_trunk_set:1665 dev 0 ingress_pbm 0 1ff ffffffff ffffffff dest_mod: 8 bcm_trunk_id: 1026

@Cmjohnson, can you replace the optic on asw2-a2-eqiad:et-0/0/52, then on asw2-a8-eqiad:et-0/1/1 if still ongoing?

ayounsi added a comment. Edited Oct 3 2018, 3:55 PM

Optic replaced yesterday and confirmed no more issues.

Steps for today:

  • Verify cr2-eqiad is VRRP master: show vrrp interface ae1 | match state
  • Disable interfaces from cr1-eqiad:ae1 to asw-a
  • Move cr1 router uplinks from asw-a to asw2-a (and document cable IDs if different) [Chris]
xe-2/0/44 -> cr1-eqiad:xe-3/0/0
xe-2/0/45 -> cr1-eqiad:xe-3/1/0
xe-7/0/44 -> cr1-eqiad:xe-4/0/0
xe-7/0/45 -> cr1-eqiad:xe-4/1/0
  • Connect asw2-a with asw-a with 4x10G (and document cable IDs if different) [Chris]
xe-2/0/42 -> asw-a-eqiad:xe-8/1/0
xe-7/0/42 -> asw-a-eqiad:xe-2/1/0
xe-2/0/43 -> asw-a-eqiad:xe-1/1/0
xe-7/0/43 -> asw-a-eqiad:xe-7/0/0
  • Verify traffic is properly flowing through asw2-a
  • Update interfaces descriptions on cr1/asw/asw2

Mentioned in SAL (#wikimedia-operations) [2018-10-03T17:26:17Z] <XioNoX> disable cr1-eqiad:ae1 - T201145

Mentioned in SAL (#wikimedia-operations) [2018-10-03T17:28:12Z] <XioNoX> start of recabling asw2-a-eqiad between asw and cr1 - T201145

Mentioned in SAL (#wikimedia-operations) [2018-10-03T17:42:30Z] <XioNoX> re-enable cr1-eqiad:ae1 - T201145

ayounsi closed this task as Resolved. Oct 3 2018, 6:26 PM

This is now stable. Back to T187960 for the remaining steps.