
D1<->D8 VC link failure
Open, High, Public

Description

The VC link between row D fpc1:1/0 and fpc8:1/0 has been flapping, causing connectivity issues to hosts in both racks.

May  2 06:52:08  asw2-d-eqiad fpc1 [EX-BCM PIC] ex_bcm_linkscan_handler: Link 54 UP
May  2 06:52:08  asw2-d-eqiad fpc1  [EX-BCM PIC] phy_40g_cr4_an_status : Port 54 mii_status = 0x88, ll_adv = 0x1, lp_adv = 0x0
May  2 06:52:08  asw2-d-eqiad fpc1 [EX-BCM PIC] phy_40g_cr4_an_status : Port 54 pause resolution = 0, ll_p = 0x0 lp_p = 0x0
May  2 06:52:08  asw2-d-eqiad fpc1 [EX-BCM PIC] ex_bcm_cr4_get_remote_pause: GET REMOTE PAUSE = 0x0, port 54
May  2 06:52:08  asw2-d-eqiad fpc1 BCM Error: API bcm_port_advert_remote_get(device, port, &ablity) at ex_bcm_get_remote_ability:716 -> Operation disabled
May  2 06:52:08  asw2-d-eqiad fpc1 [EX-BCM PIC] ex_bcm_pic_get_an_info: Failed to get the remote ability for Rear QSFP+ PIC port 0
May  2 06:52:08  asw2-d-eqiad vccpd[1756]: Member 1, interface vcp-255/1/0 went down
May  2 06:52:09  asw2-d-eqiad fpc1 [EX-BCM PIC] ex_bcm_pic_ifd_config: vcp-255/1/0, enable - 1
May  2 06:52:09  asw2-d-eqiad vccpd[1756]: JTASK_SIGNAL_UNKNOWN: Ignoring unknown signal SIGVTALRM (26)
May  2 06:52:09  asw2-d-eqiad fpc8 Devrt num_vc_ports == 0 unit: 0 dest-mod: 1
May  2 06:52:09  asw2-d-eqiad fpc8 Devrt num_vc_ports == 0 unit: 0 dest-mod: 2
May  2 06:52:09  asw2-d-eqiad fpc8 Devrt num_vc_ports == 0 unit: 0 dest-mod: 3
May  2 06:52:09  asw2-d-eqiad vccpd[1756]: Member 1, interface vcp-255/1/0 came up
May  2 06:52:09  asw2-d-eqiad vccpd[1756]: JTASK_SIGNAL_UNKNOWN: Ignoring unknown signal SIGVTALRM (26)
May  2 06:55:51  asw2-d-eqiad fpc1 [EX-BCM PIC] ex_bcm_linkscan_handler: Link 54 UP
May  2 06:55:51  asw2-d-eqiad fpc1  [EX-BCM PIC] phy_40g_cr4_an_status : Port 54 mii_status = 0x88, ll_adv = 0x1, lp_adv = 0x0
May  2 06:55:51  asw2-d-eqiad fpc1 [EX-BCM PIC] phy_40g_cr4_an_status : Port 54 pause resolution = 0, ll_p = 0x0 lp_p = 0x0
May  2 06:55:51  asw2-d-eqiad fpc1 [EX-BCM PIC] ex_bcm_cr4_get_remote_pause: GET REMOTE PAUSE = 0x0, port 54
May  2 06:55:51  asw2-d-eqiad fpc1 BCM Error: API bcm_port_advert_remote_get(device, port, &ablity) at ex_bcm_get_remote_ability:716 -> Operation disabled
May  2 06:55:51  asw2-d-eqiad fpc1 [EX-BCM PIC] ex_bcm_pic_get_an_info: Failed to get the remote ability for Rear QSFP+ PIC port 0
May  2 06:55:51  asw2-d-eqiad vccpd[1756]: Member 1, interface vcp-255/1/0 went down
May  2 06:55:52  asw2-d-eqiad vccpd[1756]: JTASK_SIGNAL_UNKNOWN: Ignoring unknown signal SIGVTALRM (26)
May  2 06:55:52  asw2-d-eqiad vccpd[1756]: Member 1, interface vcp-255/1/0 came up
May  2 06:55:52  asw2-d-eqiad vccpd[1756]: JTASK_SIGNAL_UNKNOWN: Ignoring unknown signal SIGVTALRM (26)
May  2 06:55:53  asw2-d-eqiad vccpd[1756]: JTASK_SIGNAL_UNKNOWN: Ignoring unknown signal SIGVTALRM (26)

Disabling the link solved the issue:
asw2-d-eqiad> request virtual-chassis vc-port delete pic-slot 1 port 0 member 1

FPC1 still has 2 links up (one to fpc2 and one to fpc3), and fpc8 still has links to fpc6 and fpc7, so we still have redundancy.

TODO:

  • Add alerting on relevant syslog messages
  • Decide whether to:
    • replace this cable and re-enable the port
    • remove the link fully
    • re-cable the row to match a standard VCF
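
For the alerting item above, a minimal sketch of what flap detection on the vccpd messages could look like. It is only an illustration: the regex is derived from the "went down" lines quoted in this task, while the function name `find_flapping`, the threshold (2 events) and the window (10 minutes) are arbitrary assumptions, not an existing check. Syslog lines carry no year, so the year is passed in (2020 here).

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Matches lines such as:
# "May  2 06:52:08  asw2-d-eqiad vccpd[1756]: Member 1, interface vcp-255/1/0 went down"
VCCPD_DOWN_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+vccpd\[\d+\]:\s+"
    r"Member (?P<member>\d+), interface (?P<ifname>\S+) went down"
)


def find_flapping(lines, year=2020, threshold=2, window=timedelta(minutes=10)):
    """Return {(host, member, interface): count} for VC ports that went down
    at least `threshold` times within `window`. Threshold and window are
    illustrative assumptions and would need tuning."""
    downs = defaultdict(list)
    for line in lines:
        m = VCCPD_DOWN_RE.match(line)
        if not m:
            continue
        # Collapse the padding in "May  2" before parsing.
        ts_text = " ".join(m["ts"].split())
        ts = datetime.strptime(f"{year} {ts_text}", "%Y %b %d %H:%M:%S")
        downs[(m["host"], int(m["member"]), m["ifname"])].append(ts)

    flapping = {}
    for key, stamps in downs.items():
        stamps.sort()
        for i, start in enumerate(stamps):
            count = sum(1 for t in stamps[i:] if t - start <= window)
            if count >= threshold:
                flapping[key] = count
                break
    return flapping


# Two of the lines from the description, plus one unrelated line.
SAMPLE = [
    "May  2 06:52:08  asw2-d-eqiad vccpd[1756]: Member 1, interface vcp-255/1/0 went down",
    "May  2 06:52:09  asw2-d-eqiad vccpd[1756]: Member 1, interface vcp-255/1/0 came up",
    "May  2 06:55:51  asw2-d-eqiad vccpd[1756]: Member 1, interface vcp-255/1/0 went down",
]

for (host, member, ifname), count in find_flapping(SAMPLE).items():
    print(f"{host} member {member} {ifname}: {count} down events")
```

Keying on the vccpd "went down" message rather than the BCM errors is a design choice for the sketch: it is the one line in the excerpt that names the member and the VC port directly.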

I think we should remove the link fully (as we still have redundancy) and plan the recabling with T196487.

Current cabling (with fpc1-fpc8 removed):

Event Timeline

ayounsi triaged this task as High priority. (Sat, May 2, 7:38 AM)
ayounsi created this task.
Restricted Application added a project: Operations. (Sat, May 2, 7:38 AM)
Restricted Application added a subscriber: Aklapper.
ayounsi renamed this task from D1<->D8 link failure to D1<->D8 VC link failure. (Sat, May 2, 7:38 AM)
ayounsi added subscribers: CDanis, elukey, Joe.
ayounsi updated the task description. (Sat, May 2, 7:42 AM)
ayounsi added a subscriber: faidon. (Mon, May 11, 2:50 PM)

The only downside to removing the link fully is that D1 is then 3 hops away from D8, which doesn't seem to have been an issue since May 2nd.
The upside is that it brings us closer to a proper cabling diagram.
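
The hop count mentioned above can be sanity-checked with a small breadth-first search. This is a sketch only: the fpc1 and fpc8 links are the ones listed in the description, but the full row cabling is not given in this task, so the fpc2-fpc7 link below is a HYPOTHETICAL placeholder chosen to connect the two halves, and `hops` is an illustrative helper.

```python
from collections import deque

LINKS = [
    ("fpc1", "fpc2"), ("fpc1", "fpc3"),  # from the task description
    ("fpc8", "fpc6"), ("fpc8", "fpc7"),  # from the task description
    ("fpc2", "fpc7"),  # HYPOTHETICAL: real inter-member cabling is not in this task
]


def hops(links, src, dst):
    """Return the minimum number of VC links between two members, or None."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


print(hops(LINKS, "fpc1", "fpc8"))                       # without the direct link
print(hops(LINKS + [("fpc1", "fpc8")], "fpc1", "fpc8"))  # with the direct link restored
```

With the real link list substituted in, the same function would also show whether any member becomes unreachable when a given link is removed.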

Mentioned in SAL (#wikimedia-operations) [2020-05-14T15:25:16Z] <XioNoX> disable asw2-d1-eqiad:et-1/1/0 - T251663