On May 25 at 06:18:06, one of the main linecards on cr1-codfw (hosting all the ports to the switch stacks) failed in a way that took the parent AE interfaces down, so they stopped passing traffic.
Failover to cr2-codfw went smoothly, as expected, and no service outage was caused.
Mark opened JTAC case 2019-0525-0032 and included an RSI.
As it was late at night Pacific Time before a long weekend and the situation was stable, we decided not to make any changes on the system.
JTAC got back to us; the TL;DR is:
- Major alarm triggered on FPC0 due to a parity error in the host interface, which led to a TOE wedge.
And the matching log:
May 25 06:18:06 re0.cr1-codfw fpc0 XMCHIP(0):XMCHIP(0): HOSTIF: Protect: Parity error for SRAM in bank 0
Recommended action is:
I would suggest you to reboot the FPC0 under maintenance window and check if the alarm and error clear.
I followed up with a concern:
Our concern here is that even if the error clears, it could come back soon after and cause an outage on our production infrastructure.
Is it possible to ensure this won't happen? Especially as it looks like a hardware error.
JTAC replied:
These errors are mostly transient in nature and after rebooting the FPC, it clears and does not come back.
Action Plan:
I would suggest you reboot the FPC0 followed by physical reseat under maintenance window and check if the alarm and error clears.
In case the errors resurface, they will process an RMA for the FPC; however, if they don't reappear, it means there is no hardware failure and no RMA is needed.
Next steps I want to take are:
- Explicitly disable et- interfaces (to switch ports)
- Shut down the linecard: request chassis fpc offline slot 0
- Physically unseat/reseat linecard
- Bring the linecard back online: request chassis fpc online slot 0
- Monitor for 2 hours for syslog messages or symptoms similar to the one above
- Re-enable the et- interfaces one after the other (about 30min apart), while monitoring for errors
If no errors appear, leave it as is and keep monitoring; if errors return, shut down the linecard and follow up with JTAC for an RMA.
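
The steps above can be sketched roughly in Junos CLI terms. The et-0/0/0 interface name is a placeholder for whichever et- ports face the switch stacks, and the exact syntax may vary by Junos release; this is an illustration of the plan, not the exact commands run:

```
# 1. Explicitly disable the et- interfaces (configuration mode):
set interfaces et-0/0/0 disable
commit

# 2. Take the linecard offline (operational mode), then physically reseat it:
request chassis fpc offline slot 0

# 3. Bring the linecard back online:
request chassis fpc online slot 0

# 4. Monitor for a recurrence of the alarm and the parity-error message:
show chassis alarms
show log messages | match XMCHIP

# 5. Re-enable the interfaces one at a time (configuration mode):
delete interfaces et-0/0/0 disable
commit
```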