cr2-eqiad:FPC3 partial failure (PIC2/3)
Closed, Resolved, Public

Description

The following are down:

  • 2 of the 3 links between cr1 and cr2
  • one of the transports to codfw
  • one of the Telia transit links
  • one of the Equinix peering links

Everything failed over as expected. Opening a JTAC case.

cr2-eqiad> show system alarms 
1 alarms currently active
Alarm time               Class  Description
2022-07-10 23:43:46 UTC  Major  FPC 3 Major Errors - XM Chip Error code: 0x70139
Jul 10 23:43:30  re0.cr2-eqiad fpc3 XMCHIP(1): XMCHIP(1): FI: Reorder cell timeout - Stream 12, Count 6
Jul 10 23:43:30  re0.cr2-eqiad fpc3 XMCHIP(1): XMCHIP(1): FI: Packet CRC error - Stream 12, Count 1
Jul 10 23:43:46  re0.cr2-eqiad fpc3 XMCHIP(1): XMCHIP(1): FI: Reorder cell timeout - Stream 12, Count 5
Jul 10 23:43:46  re0.cr2-eqiad fpc3 XMCHIP(1): XMCHIP(1): FI: Packet CRC error - Stream 19, Count 2
Jul 10 23:43:46  re0.cr2-eqiad fpc3 XMCHIP(1): XMCHIP(1): FI: Aliasing on allocates error - Pipe count 0, Count 4
Jul 10 23:43:46  re0.cr2-eqiad fpc3 XMCHIP(1): XMCHIP(1): FI: Cells dropped due to reorder sequence jumping - Count 1
Jul 10 23:43:47  re0.cr2-eqiad fpc3 Cmerror Op Set: XMCHIP(1): XMCHIP(1): FI: Link sanity checks - Type 4, Seq Number 646, Stream 150, Link0 0x4, Link1 0x10, Link2 0xfff
Jul 10 23:43:47  re0.cr2-eqiad fpc3 Error (0x70139), module: XMCHIP(1), type: FI: Link sanity check and high rate cell underflow errors
Jul 10 23:43:48  re0.cr2-eqiad fpc3 XMCHIP(1): XMCHIP(1): FI: Aliasing on allocates error - Pipe count 0, Count 1
Jul 10 23:43:48  re0.cr2-eqiad fpc3 XMCHIP(1): XMCHIP(1): FI: Aliasing on allocates error - Pipe count 0, Count 1
Jul 10 23:44:00  re0.cr2-eqiad fpc3 PFE 1: 'PFE Disable' action performed. Bringing down ifd xe-3/2/0 191
Jul 10 23:44:00  re0.cr2-eqiad fpc3 PFE 1: 'PFE Disable' action performed. Bringing down ifd xe-3/2/1 192
Jul 10 23:44:00  re0.cr2-eqiad fpc3 PFE 1: 'PFE Disable' action performed. Bringing down ifd xe-3/2/2 193
[...]
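
For anyone triaging a similar event, here is a minimal Python sketch that lists which interfaces a 'PFE Disable' action brought down, grouped by FPC and PFE. The regular expression is inferred only from the three log lines above, and the function name `disabled_interfaces` is hypothetical; other Junos releases or platforms may format these messages differently.

```python
import re

# Pattern inferred from the log excerpt in this task (an assumption, not a
# documented Junos format), e.g.:
#   ... fpc3 PFE 1: 'PFE Disable' action performed. Bringing down ifd xe-3/2/0 191
PFE_DISABLE = re.compile(
    r"fpc(?P<fpc>\d+) PFE (?P<pfe>\d+): 'PFE Disable' action performed\. "
    r"Bringing down ifd (?P<ifd>\S+)"
)


def disabled_interfaces(lines):
    """Return {(fpc, pfe): [interface, ...]} for every matching log line."""
    result = {}
    for line in lines:
        match = PFE_DISABLE.search(line)
        if match:
            key = (int(match["fpc"]), int(match["pfe"]))
            result.setdefault(key, []).append(match["ifd"])
    return result


# Sample input copied from the log excerpt above; the first line is a
# non-matching message and is skipped.
sample = [
    "Jul 10 23:43:47  re0.cr2-eqiad fpc3 Error (0x70139), module: XMCHIP(1), type: FI: Link sanity check and high rate cell underflow errors",
    "Jul 10 23:44:00  re0.cr2-eqiad fpc3 PFE 1: 'PFE Disable' action performed. Bringing down ifd xe-3/2/0 191",
    "Jul 10 23:44:00  re0.cr2-eqiad fpc3 PFE 1: 'PFE Disable' action performed. Bringing down ifd xe-3/2/1 192",
    "Jul 10 23:44:00  re0.cr2-eqiad fpc3 PFE 1: 'PFE Disable' action performed. Bringing down ifd xe-3/2/2 193",
]

if __name__ == "__main__":
    for (fpc, pfe), interfaces in disabled_interfaces(sample).items():
        print(f"FPC {fpc} PFE {pfe}: {', '.join(interfaces)}")
```

Fed the full syslog rather than this truncated excerpt, the same function would list every interface on the affected PFE, which can be compared against the links reported down in the description.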

Event Timeline

ayounsi triaged this task as High priority.
Restricted Application added a subscriber: Aklapper.

High severity Case Number 2022-0711-508366

ayounsi added a project: ops-eqiad.

Juniper agreed to an RMA; I forwarded the email thread to Chris for the shipping details.

@Cmjohnson please sync up with Netops once received for the linecard swap.

Mentioned in SAL (#wikimedia-operations) [2022-07-19T16:18:16Z] <XioNoX> drain traffic away from cr2-eqiad:fpc3 - T312745

Replaced the line card, and placed the old one in the same packaging. Juniper did send us a UPS shipping label with tracking number 1Z7AF3889061293486. It will most likely go out in the morning.

Since the replacement, the error rate on one of the interfaces has gone through the roof: https://librenms.wikimedia.org/graphs/to=1658306400/id=12731/type=port_errors/from=1658133600/

We can start by cleaning the optics and replacing cr2-eqiad:xe-3/0/3.
Please let me know once it's done so I can re-enable the interface.

Never mind, this is tracked in T313337.

RMA shipped out by Chris on Tuesday, July 26