
cr1-codfw linecard failure
Closed, Resolved · Public · 0 Estimated Story Points

Description

On May 25 at 06:18:06, one of the main linecards on cr1-codfw (hosting all the ports to the switch stacks) failed in such a way that the parent AE interfaces went down and stopped passing traffic.
Failover to cr2-codfw went smoothly, as expected, and no service outage was caused.

Mark opened JTAC case 2019-0525-0032 and included an RSI.
As it was late at night Pacific Time before a long weekend and the situation was stable, we decided not to make any changes on the system.

JTAC got back to us, TLDR is:

  • Major alarm triggered on FPC0 due to a parity error in the host interface, which led to a TOE wedge.

And the matching log:
May 25 06:18:06 re0.cr1-codfw fpc0 XMCHIP(0):XMCHIP(0): HOSTIF: Protect: Parity error for SRAM in bank 0
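
For watching syslog for a recurrence, a minimal Python sketch of a matcher for this message could look like the following. The pattern is modeled only on the single log line quoted above, so the exact format on other Junos releases or chips is an assumption, and `find_parity_errors` is a hypothetical helper name.

```python
import re

# Pattern modeled on the one log line quoted in this task (assumption:
# other releases or linecards may format the message differently).
PARITY_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<fpc>fpc\d+) "
    r".*HOSTIF: Protect: Parity error for SRAM in bank (?P<bank>\d+)"
)


def find_parity_errors(lines):
    """Return one dict (ts, host, fpc, bank) per line matching the error."""
    hits = []
    for line in lines:
        match = PARITY_RE.search(line)
        if match:
            hits.append(match.groupdict())
    return hits


sample = (
    "May 25 06:18:06 re0.cr1-codfw fpc0 XMCHIP(0):XMCHIP(0): "
    "HOSTIF: Protect: Parity error for SRAM in bank 0"
)
```

Calling `find_parity_errors([sample])` returns a single entry with the timestamp, host, FPC and SRAM bank split out, which is enough to alert on or to count recurrences.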

Recommended action is:

I would suggest you to reboot the FPC0 under maintenance window and check if the alarm and error clear.

I followed up with a concern:

Our concern here is that even if the error clears, it could come back soon after and cause an outage on our production infrastructure.
Is it possible to ensure this won't happen? Especially as it looks like a hardware error.

But the reply was:

These errors are mostly transient in nature and after rebooting the FPC, it clears and does not come back.
Action Plan :
I would suggest you reboot the FPC0 followed by physical reseat under maintenance window and check if the alarm and error clears.
In case, if the errors re-surface will process RMA for FPC, however if it doesn’t reappear it means there is no hardware failure and no RMA is needed.

Next steps I want to do are:

  • Explicitly disable et- interfaces (to switch ports)
  • Shut down the linecard: request chassis fpc offline slot 0
  • Physically unseat/reseat the linecard
  • Start the linecard: request chassis fpc online slot 0
  • Monitor for 2 hours for syslog messages or symptoms similar to the ones above
  • Enable et- interfaces one after the other (within 30min each), while monitoring for errors

If there are no errors, leave it as is for monitoring; if there are errors, shut down the linecard and follow up with JTAC for an RMA.
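
The steps above can be sketched as an ordered checklist in Python. This is only an illustration: `build_plan` is a hypothetical helper, the `set interfaces … disable` / `delete interfaces … disable` configuration statements are an assumption about how the interfaces would be disabled and re-enabled (the task does not say, and a commit would be needed after each), and "30min each" is read here as the observation window per interface.

```python
def build_plan(interfaces, fpc_slot=0, soak_hours=2, step_minutes=30):
    """Return the planned steps as (phase, action) tuples, in order."""
    plan = []
    # 1. Disable every et- interface before touching the linecard.
    for ifname in interfaces:
        plan.append(("disable", f"set interfaces {ifname} disable"))
    # 2-4. Offline, physical reseat, online (commands as quoted in the task).
    plan.append(("offline", f"request chassis fpc offline slot {fpc_slot}"))
    plan.append(("reseat", "physically unseat/reseat linecard"))
    plan.append(("online", f"request chassis fpc online slot {fpc_slot}"))
    # 5. Soak period with no traffic on the card.
    plan.append(("soak", f"monitor syslog for {soak_hours}h"))
    # 6. Re-enable interfaces one at a time, watching for errors in between.
    for ifname in interfaces:
        plan.append(("enable", f"delete interfaces {ifname} disable"))
        plan.append(("watch", f"monitor {ifname} for up to {step_minutes}min"))
    return plan


if __name__ == "__main__":
    for phase, action in build_plan(["et-0/0/0", "et-0/0/1"]):
        print(f"{phase:8} {action}")
```

Keeping the soak and per-interface windows as parameters makes it easy to lengthen them, since the 2h figure is not backed by anything specific.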

Event Timeline

ayounsi triaged this task as High priority. (May 28 2019, 6:03 PM)
ayounsi created this task.
Restricted Application added a project: Operations. (May 28 2019, 6:03 PM)
Restricted Application added a subscriber: Aklapper.
ayounsi updated the task description. (May 28 2019, 6:05 PM)

Plan seems reasonable based on the info in the description! Maybe wait longer than 2h after the linecard is restarted? Or do we suspect that any recurrence is much less likely with no traffic?

I picked 2h for the sake of picking a number that sounds right, but it's not backed by anything. Any value works for me.

Mentioned in SAL (#wikimedia-operations) [2019-05-29T14:47:29Z] <XioNoX> disable et- interfaces on cr1-codfw - T224511

Mentioned in SAL (#wikimedia-operations) [2019-05-29T14:48:14Z] <XioNoX> request chassis fpc offline slot 0 on cr1-codfw - T224511

Mentioned in SAL (#wikimedia-operations) [2019-05-29T14:51:46Z] <XioNoX> request chassis fpc online slot 0 on cr1-codfw - T224511

Mentioned in SAL (#wikimedia-operations) [2019-05-29T17:13:57Z] <XioNoX> enable cr1-codfw:et-0/0/0 - T224511

Mentioned in SAL (#wikimedia-operations) [2019-05-29T17:44:23Z] <XioNoX> enable cr1-codfw:et-0/0/1 - T224511

Mentioned in SAL (#wikimedia-operations) [2019-05-29T19:09:03Z] <XioNoX> enable cr1-codfw:et-0/2/0 - T224511

Mentioned in SAL (#wikimedia-operations) [2019-05-29T19:45:41Z] <XioNoX> enable cr1-codfw:et-0/2/1 - T224511
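
To compare the actual timing with the plan, the gaps between the SAL entries above can be computed with a short Python sketch. The timestamps are copied from the log entries; `gaps` is a hypothetical helper name.

```python
from datetime import datetime

# (timestamp, action) pairs copied from the SAL entries above.
SAL = [
    ("2019-05-29T14:51:46Z", "fpc online slot 0"),
    ("2019-05-29T17:13:57Z", "enable et-0/0/0"),
    ("2019-05-29T17:44:23Z", "enable et-0/0/1"),
    ("2019-05-29T19:09:03Z", "enable et-0/2/0"),
    ("2019-05-29T19:45:41Z", "enable et-0/2/1"),
]


def gaps(entries):
    """Return (action, time elapsed since the previous entry) pairs."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    times = [datetime.strptime(ts, fmt) for ts, _ in entries]
    return [
        (entries[i][1], times[i] - times[i - 1])
        for i in range(1, len(entries))
    ]


for action, delta in gaps(SAL):
    print(f"{action}: +{delta}")
```

This gives +2:22:11 between bringing the linecard online and enabling the first interface, then +0:30:26, +1:24:40 and +0:36:38 between the following interfaces.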

ayounsi closed this task as Resolved. (May 29 2019, 7:47 PM)
ayounsi updated the task description.

Everything seems back to normal. Please reopen if the same issue happens again and we will proceed with an RMA.