
csw2-esams's VCP link flapped
Closed, Declined | Public | 0 Estimated Story Points

Description

Some alarms fired this morning EU time around 3:15~3:40 UTC for csw2-esams. From the logs I can see:

elukey@csw2-esams> show log messages.1.gz | match VCCPD_PROTOCOL_
Aug  4 03:05:14  csw2-esams vccpd[1368]: VCCPD_PROTOCOL_ADJDOWN: Lost adjacency to 50c5.8da8.3100 on vcp-1.32768,
Aug  4 03:05:15  csw2-esams vccpd[1368]: VCCPD_PROTOCOL_ADJUP: New adjacency to 50c5.8da8.3100 on vcp-1.32768

elukey@csw2-esams> show virtual-chassis protocol adjacency detail
fpc0:
--------------------------------------------------------------------------

[..]

50c5.8da8.3100
  interface-name: vcp-1.32768, State: Up, Expires in 57 secs
  Priority: 0, Up/Down transitions: 17, Last transition: 08:16:44 ago         <----
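
For anyone wanting to quantify flaps like the one above from exported syslog text, here is a minimal sketch that pairs each VCCPD_PROTOCOL_ADJDOWN with the following VCCPD_PROTOCOL_ADJUP for the same neighbor and port. The function name `parse_flaps` and the `year` parameter are hypothetical; syslog timestamps carry no year, so 2019 is only an assumption taken from the JTAC case number. The regex is fitted to the two lines quoted here and may need adjusting for other message variants.

```python
import re
from datetime import datetime

# Regex fitted to the two vccpd messages quoted in the description.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+vccpd\[\d+\]:\s+"
    r"VCCPD_PROTOCOL_(?P<event>ADJDOWN|ADJUP):.*?adjacency to "
    r"(?P<peer>[0-9a-f]{4}\.[0-9a-f]{4}\.[0-9a-f]{4}) on (?P<port>vcp-\d+\.\d+)"
)

SAMPLE = [
    "Aug  4 03:05:14  csw2-esams vccpd[1368]: VCCPD_PROTOCOL_ADJDOWN: Lost adjacency to 50c5.8da8.3100 on vcp-1.32768,",
    "Aug  4 03:05:15  csw2-esams vccpd[1368]: VCCPD_PROTOCOL_ADJUP: New adjacency to 50c5.8da8.3100 on vcp-1.32768",
]


def parse_flaps(lines, year=2019):
    """Return one record per ADJDOWN/ADJUP pair (year is an assumption)."""
    pending = {}  # (peer, port) -> time the adjacency went down
    flaps = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # not an adjacency message
        # Collapse the padded day ("Aug  4") before parsing.
        stamp = " ".join(m["ts"].split())
        ts = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        key = (m["peer"], m["port"])
        if m["event"] == "ADJDOWN":
            pending[key] = ts
        elif key in pending:
            went_down = pending.pop(key)
            flaps.append({
                "peer": m["peer"],
                "port": m["port"],
                "down_at": went_down,
                "down_seconds": (ts - went_down).total_seconds(),
            })
    return flaps


if __name__ == "__main__":
    for flap in parse_flaps(SAMPLE):
        print(flap["peer"], flap["port"], flap["down_seconds"])
```

Run against the two lines above, this reports a single flap on vcp-1.32768 lasting one second.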

Event Timeline

Seems like this device is seeing its end coming with the esams refresh.

The on-disk logs have rolled over, but the syslog entries are visible at https://logstash.wikimedia.org/goto/08b36fc12fdc83accef419f308f96646

This Virtual Chassis is configured and cabled as a ring. Each member is connected to its previous and next neighbor. 0-2-4-5-0.
Here it seems like the link between FPC0 and FPC5 failed.
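
To illustrate why a single failed ring link leaves every member reachable, here is a small sketch that models the 0-2-4-5-0 ring as a graph and checks connectivity after removing the FPC0-FPC5 link. It is only a graph model: the helper names `ring_links` and `reachable` are hypothetical, and it says nothing about how Junos itself reconverges.

```python
from collections import deque

# Member order as described in this task: 0-2-4-5-0.
RING = [0, 2, 4, 5]


def ring_links(members):
    """Build the set of links joining each member to its next neighbor."""
    count = len(members)
    return {
        frozenset((members[i], members[(i + 1) % count]))
        for i in range(count)
    }


def reachable(start, links):
    """Breadth-first search over undirected links, starting at one member."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for link in links:
            if node in link:
                (other,) = link - {node}
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
    return seen


if __name__ == "__main__":
    links = ring_links(RING)
    # Drop the link that appears to have failed in this incident.
    one_failure = links - {frozenset((0, 5))}
    print(sorted(reachable(0, one_failure)))
    # A second failure elsewhere would split the stack.
    two_failures = one_failure - {frozenset((2, 4))}
    print(sorted(reachable(0, two_failures)))
```

With only the FPC0-FPC5 link removed, all four members remain reachable from FPC0 through the 0-2-4-5 path; removing a second link splits the stack.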

EX4200s use dedicated Virtual Chassis ports, unlike more recent EX models where any port can be converted to a VC port.
The devices still have next-day support; I opened case 2019-0808-0701 with JTAC.

The EX4200 can also have any port converted to a VC port; it just won't be as fast (max 10 Gbps).

JTAC found a core dump on fpc4 and fpc5, sent to Juniper for analysis.

I finished working on them, but I was not able to match the digital trace to any software report such as a bug or PR.
When a core dump accompanies an event that caused an issue and we cannot match it to an existing software defect, we usually send the case to engineering.
In this case I understand that there was no impact, but the software version, 12.3R6.6, is also outdated, and engineering is no longer working on this version.
My recommendation is to upgrade to one of the newer versions.

As this switch stack is quite old and will be decommissioned in October, I don't think it's worth trying an upgrade. The issue hasn't recurred and hopefully won't in the next 2 months.
We can revisit this task if it does.