
cr1-codfw:fpc5 partial failure
Closed, Resolved; Public

Description

About 14 hours ago, these two interfaces went down:
xe-5/3/0 down down Core: cr2-codfw:xe-5/3/0
xe-5/3/1 down down Transit: Telia

Looking at the logs, which had almost rolled over, I was able to find:

cr1-codfw> show log messages.8.gz | match xe-5/3/1
Jun  1 16:24:27  re0.cr1-codfw MGMT:rpd[5491]: EVENT <UpDown> xe-5/3/1 index 216 <Broadcast Multicast> address #0 64.87.88.f2.73.69
Jun  1 16:24:27  re0.cr1-codfw MGMT:rpd[5491]: STP handler: IFD=xe-5/3/1, op=change, state=Discarding, Topo change generation=0
Jun  1 16:24:27  re0.cr1-codfw fpc5 PFE 3: 'PFE Disable' action performed. Bringing down ifd xe-5/3/1 216
Jun  1 16:24:27  re0.cr1-codfw MGMT:rpd[5491]: STP handler: IFD=xe-5/3/1, op=change, state=Discarding, Topo change generation=0
Jun  1 16:24:27  re0.cr1-codfw rpd[5452]: EVENT <UpDown> xe-5/3/1.0 index 389 <Broadcast Multicast> address #0 64.87.88.f2.73.69
Jun  1 16:24:27  re0.cr1-codfw rpd[5452]: EVENT UpDown xe-5/3/1.0 index 389 80.239.192.102/30 -> 80.239.192.103 <Broadcast Multicast Localup>
Jun  1 16:24:27  re0.cr1-codfw rpd[5452]: EVENT UpDown xe-5/3/1.0 index 389 2001:2000:3080:af4::2/64 -> zero-len <Broadcast Multicast Localup>
Jun  1 16:24:27  re0.cr1-codfw rpd[5452]: EVENT UpDown xe-5/3/1.0 index 389 fe80::6687:88ff:fef2:7369/64 -> zero-len <Broadcast Multicast Localup>
Jun  1 16:24:27  re0.cr1-codfw rpd[5452]: EVENT <UpDown> xe-5/3/1 index 216 <Broadcast Multicast> address #0 64.87.88.f2.73.69
Jun  1 16:24:27  re0.cr1-codfw rpd[5452]: krt unsolic client: Received IPv6 address 2001:2000:3080:af4::2 on ifl xe-5/3/1.0. Flag:2.
Jun  1 16:24:27  re0.cr1-codfw rpd[5452]: krt unsolic client: Received IPv6 address fe80::6687:88ff:fef2:7369 on ifl xe-5/3/1.0. Flag:2.
Jun  1 16:24:27  re0.cr1-codfw rpd[5452]: STP handler: IFD=xe-5/3/1, op=change, state=Discarding, Topo change generation=0
Jun  1 16:24:27  re0.cr1-codfw mib2d[4968]: SNMP_TRAP_LINK_DOWN: ifIndex 537, ifAdminStatus up(1), ifOperStatus down(2), ifName xe-5/3/1
Jun  1 16:24:27  re0.cr1-codfw kernel: if_msg_ifd_cmd_tlv_decode ifd xe-5/3/1 #216 down with ASIC Error
Jun  1 16:24:30  re0.cr1-codfw fpc5 IFFPC: IFD(xe-5/3/1, 216) ASIC error notification
re0.cr1-codfw> show log messages.8.gz | match fpc5
Jun  1 16:23:59  re0.cr1-codfw fpc5 MQCHIP(3) WI upoh flow control exception
Jun  1 16:24:12  re0.cr1-codfw fpc5 jnh_update_ifd_standby_state IFD: 215, Enable: False
Jun  1 16:24:14  re0.cr1-codfw fpc5 Error (0x10409), module: TOE-MQ-3:0:0, type: MQ_TOE TX Blocked Major error
Jun  1 16:24:27  re0.cr1-codfw fpc5 PFE 3: 'PFE Disable' action performed. Bringing down ifd xe-5/3/0 215
Jun  1 16:24:27  re0.cr1-codfw fpc5 PFE 3: 'PFE Disable' action performed. Bringing down ifd xe-5/3/1 216
Jun  1 16:24:27  re0.cr1-codfw fpc5 PFE 3: 'PFE Disable' action performed. Bringing down ifd xe-5/3/2 217
Jun  1 16:24:27  re0.cr1-codfw fpc5 PFE 3: 'PFE Disable' action performed. Bringing down ifd xe-5/3/3 218
Jun  1 16:24:30  re0.cr1-codfw fpc5 Cmerror Op Set: TOE-MQ-3:0:0: TOE MQ.3.0.0 : SetErr - ** WEDGE DETECTED IN PFE 3 stream 0 TOE host packet transfer: TOE toAsic path blocked (code 0x9)
Jun  1 16:24:30  re0.cr1-codfw fpc5 IFFPC: IFD(xe-5/3/0, 215) ASIC error notification
Jun  1 16:24:30  re0.cr1-codfw fpc5 IFFPC: IFD(xe-5/3/1, 216) ASIC error notification
Jun  1 16:24:30  re0.cr1-codfw fpc5 IFFPC: IFD(xe-5/3/2, 217) ASIC error notification
Jun  1 16:24:30  re0.cr1-codfw fpc5 IFFPC: IFD(xe-5/3/3, 218) ASIC error notification
Jun  1 16:24:31  re0.cr1-codfw fpc5 MIC(5/3) link 0 SFP laser bias current low  alarm set
Jun  1 16:24:31  re0.cr1-codfw fpc5 MIC(5/3) link 0 SFP output power low  alarm set
Jun  1 16:24:31  re0.cr1-codfw fpc5 MIC(5/3) link 0 SFP laser bias current low  warning set
Jun  1 16:24:31  re0.cr1-codfw fpc5 MIC(5/3) link 0 SFP output power low  warning set
Jun  1 16:24:31  re0.cr1-codfw fpc5 MIC(5/3) link 1 SFP laser bias current low  alarm set
Jun  1 16:24:31  re0.cr1-codfw fpc5 MIC(5/3) link 1 SFP output power low  alarm set
Jun  1 16:24:31  re0.cr1-codfw fpc5 MIC(5/3) link 1 SFP laser bias current low  warning set
Jun  1 16:24:31  re0.cr1-codfw fpc5 MIC(5/3) link 1 SFP output power low  warning set
Jun  1 16:24:32  re0.cr1-codfw fpc5 Error (0x20004), module: Host Loopback, type: Host Loopback Path Id 3
Jun  1 16:24:35  re0.cr1-codfw fpc5 Cmerror Op Set: Host Loopback: HOST LOOPBACK WEDGE DETECTED IN PATH ID 3
Jun  1 16:24:36  re0.cr1-codfw fpc5 PFE[3] Liveness Thread Stopped, interval = 0
Jun  1 16:24:37  re0.cr1-codfw fpc5 PFE[3] CC[0] Fabric Probe Stopped, interval = 50 ms
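
The fpc5 log excerpt above shows that the 'PFE Disable' action took down all four ports on PFE 3, not only the two interfaces listed in the description. A minimal Python sketch for pulling that list out of saved log text follows. The function name and regular expression are illustrative assumptions based only on the line format seen in this task, not a Juniper-provided parser.

```python
import re
from collections import defaultdict

# Assumed line shape, taken from the excerpt in this task:
#   ... fpc5 PFE 3: 'PFE Disable' action performed. Bringing down ifd xe-5/3/0 215
PFE_DISABLE = re.compile(
    r"(?P<fpc>fpc\d+) PFE (?P<pfe>\d+): 'PFE Disable' action performed\. "
    r"Bringing down ifd (?P<ifd>\S+) (?P<index>\d+)"
)


def interfaces_disabled_by_pfe(lines):
    """Return {(fpc, pfe): sorted interface names} for 'PFE Disable' lines."""
    affected = defaultdict(set)
    for line in lines:
        match = PFE_DISABLE.search(line)
        if match:
            key = (match["fpc"], int(match["pfe"]))
            # A set absorbs lines that show up in more than one log query.
            affected[key].add(match["ifd"])
    return {key: sorted(names) for key, names in affected.items()}


# Sample lines copied from the log excerpt in this task.
SAMPLE_LOG = """\
Jun  1 16:23:59  re0.cr1-codfw fpc5 MQCHIP(3) WI upoh flow control exception
Jun  1 16:24:27  re0.cr1-codfw fpc5 PFE 3: 'PFE Disable' action performed. Bringing down ifd xe-5/3/0 215
Jun  1 16:24:27  re0.cr1-codfw fpc5 PFE 3: 'PFE Disable' action performed. Bringing down ifd xe-5/3/1 216
Jun  1 16:24:27  re0.cr1-codfw fpc5 PFE 3: 'PFE Disable' action performed. Bringing down ifd xe-5/3/2 217
Jun  1 16:24:27  re0.cr1-codfw fpc5 PFE 3: 'PFE Disable' action performed. Bringing down ifd xe-5/3/3 218
Jun  1 16:24:27  re0.cr1-codfw fpc5 PFE 3: 'PFE Disable' action performed. Bringing down ifd xe-5/3/1 216
"""

if __name__ == "__main__":
    for (fpc, pfe), names in interfaces_disabled_by_pfe(SAMPLE_LOG.splitlines()).items():
        print(f"{fpc} PFE {pfe}: {', '.join(names)}")
```

Grouping by (FPC, PFE) rather than by interface makes it easier to see that the whole PFE was disabled at once.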

And indeed:

cr1-codfw> show system alarms 
2 alarms currently active
Alarm time               Class  Description
2020-06-01 17:47:54 UTC  Major  FPC 0 Hard errors
2020-06-01 16:24:14 UTC  Major  FPC 5 Major Errors - TOE Error code: 0x10409

This went unnoticed because the alerts had already been ACKed for the FPC0 issue.
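
One possible way to keep a second alarm from hiding behind an existing ACK is to compare each active alarm's description against the set already acknowledged. The Python sketch below parses the `show system alarms` text shown in this task and returns the alarms not covered by an ACK. The function names and the idea of tracking ACKs as a set of descriptions are assumptions for illustration; they do not describe how the monitoring here actually works.

```python
import re
from datetime import datetime

# Assumed row shape, taken from the output in this task:
#   2020-06-01 16:24:14 UTC  Major  FPC 5 Major Errors - TOE Error code: 0x10409
ALARM_ROW = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<tz>\S+)\s+"
    r"(?P<cls>\S+)\s+(?P<desc>.+)$"
)


def parse_alarms(output):
    """Parse 'show system alarms' text into a list of dicts, skipping headers."""
    alarms = []
    for line in output.splitlines():
        match = ALARM_ROW.match(line.strip())
        if match:
            alarms.append({
                "time": datetime.strptime(match["time"], "%Y-%m-%d %H:%M:%S"),
                "tz": match["tz"],
                "class": match["cls"],
                "description": match["desc"].strip(),
            })
    return alarms


def unacknowledged(alarms, acked_descriptions):
    """Return alarms whose description is not in the (hypothetical) ACKed set."""
    return [a for a in alarms if a["description"] not in acked_descriptions]


# Output copied from this task.
SAMPLE_ALARMS = """\
2 alarms currently active
Alarm time               Class  Description
2020-06-01 17:47:54 UTC  Major  FPC 0 Hard errors
2020-06-01 16:24:14 UTC  Major  FPC 5 Major Errors - TOE Error code: 0x10409
"""

if __name__ == "__main__":
    acked = {"FPC 0 Hard errors"}  # hypothetical: the ACK that covered FPC0
    for alarm in unacknowledged(parse_alarms(SAMPLE_ALARMS), acked):
        print(alarm["time"], alarm["class"], alarm["description"])
```

Matching on the full description is deliberately strict: an ACK for one FPC then cannot cover a different FPC's alarm.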

Event Timeline

ayounsi triaged this task as High priority. Jun 2 2020, 6:44 AM
ayounsi created this task.
Restricted Application added subscribers: Liuxinyu970226, Aklapper.

Opened JTAC case 2020-0601-0882. At this point, it's too much of a coincidence not to suspect a backplane issue.

Please find the below KB for TOE chip memory errors reported on routers with MPC-3D-16XGE-SFPP FPCs.
https://kb.juniper.net/InfoCenter/index?page=content&id=KB31235
These messages could indicate a hardware issue or a transient memory issue.
Reboot the FPC and monitor.

Change 601741 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/dns@master] Depool codfw for network work

https://gerrit.wikimedia.org/r/601741

Change 601741 merged by Ayounsi:
[operations/dns@master] Depool codfw for network work

https://gerrit.wikimedia.org/r/601741

Mentioned in SAL (#wikimedia-operations) [2020-06-02T14:49:03Z] <XioNoX> prefer eqsin-ulsfo tunnel - T254216

FPC reboot solved the issue. Will re-open if it re-appears.