Faulty A6/A7 VC link
Closed, Resolved | Public | 0 Estimated Story Points

Description

The VC link between asw2-A6:1/0 and asw2-A7:0/52 failed yesterday in a way that made it silently drop packets.

We can see the errors with:

asw2-a-eqiad> show virtual-chassis vc-port statistics member 6 vcp-255/1/0 extensive
RX
CRC alignment errors:      43488980
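
To compare this counter across two captures (for example before and after re-enabling the link), the value can be pulled out of the CLI output with a short script. This is only a sketch: it assumes the output contains a line formatted like the one shown above, and the function names are made up for illustration.

```python
import re

# Matches a counter line such as "CRC alignment errors:      43488980".
# Assumption: the line format is the one pasted in this task.
CRC_LINE = re.compile(r"CRC alignment errors:\s+(\d+)")


def crc_alignment_errors(cli_output: str) -> int:
    """Return the first CRC alignment error counter found in the CLI output."""
    match = CRC_LINE.search(cli_output)
    if match is None:
        raise ValueError("no 'CRC alignment errors' line found")
    return int(match.group(1))


def crc_errors_increasing(before: str, after: str) -> bool:
    """True if the counter grew between two captures of the same command."""
    return crc_alignment_errors(after) > crc_alignment_errors(before)


if __name__ == "__main__":
    sample = "RX\nCRC alignment errors:      43488980\n"
    print(crc_alignment_errors(sample))
```

A counter that keeps growing between captures means the link is still corrupting frames; a counter that stays flat after the stats are cleared suggests the link is healthy.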

Disabling the asw2-a6 side of that link fixed the issue.

On the physical side it's a regular fiber with a QSFP optic on each end, so three parts are most likely at fault: either of the two optics or the fiber. There are two possible causes: a physical "bump" to an optic or the fiber that damaged it, or a power fluctuation that damaged an optic.
Because it's a VC link there is no way of getting the optics levels (afaik).

Next steps:

  • Check how many spare QSFPs we have (not needed, DAC)
  • Investigate the fiber for any obvious damage
  • Even if no obvious damage, run a new fiber
  • Keep an mtr/ping running between mw1312<->puppetmaster1001
  • Warn people on -sre that we're enabling the previously faulty interface
  • Enable the interface: request virtual-chassis vc-port set interface member 6 vcp-255/1/0
  • Monitor packet loss
  • Check that the VC path goes directly from fpc6 to fpc7, for example with: show virtual-chassis vc-path source-interface ge-6/0/7 destination-interface ge-7/0/16
  • If there are issues, disable the interface, replace the fpc7:0/52 QSFP, and try the above again
  • If there are still issues, do the same with the fpc6:1/0 QSFP
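
The "monitor packet loss" step can be partly automated by reading the loss percentage from a ping summary. The sketch below assumes the summary line format printed by Linux iputils ping ("... N% packet loss ..."); the function name and the sample text are illustrative only, not output captured for this task.

```python
import re

# Assumed summary line format (Linux iputils ping), e.g.
# "10 packets transmitted, 9 received, 10% packet loss, time 9012ms".
LOSS = re.compile(r"(\d+(?:\.\d+)?)% packet loss")


def packet_loss_percent(ping_output: str) -> float:
    """Return the packet loss percentage reported in a ping summary."""
    match = LOSS.search(ping_output)
    if match is None:
        raise ValueError("no packet loss summary found")
    return float(match.group(1))


if __name__ == "__main__":
    # Illustrative sample, not real output from mw1312.
    sample = "10 packets transmitted, 9 received, 10% packet loss, time 9012ms"
    loss = packet_loss_percent(sample)
    if loss > 0:
        print(f"packet loss detected: {loss}%")
```

Any non-zero value after the interface is re-enabled would be the cue to disable it again and move on to the QSFP replacement steps.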

Event Timeline

ayounsi triaged this task as High priority. Jul 24 2019, 1:56 AM
ayounsi created this task.

Mentioned in SAL (#wikimedia-operations) [2019-07-24T14:54:28Z] <XioNoX> cleared vc ports stats on asw2-a-eqiad - T228823

Mentioned in SAL (#wikimedia-operations) [2019-07-24T15:02:38Z] <XioNoX> re-enable vc link between asw2-a6 and asw2-a7 - T228823

All done, no more errors or packet loss.