
Enabling graceful-switchover causes core dumps on cr1-codfw
Closed, Resolved · Public


Enabling graceful-switchover on cr1-codfw causes errors in the logs and core dumps.

Opened Juniper case 2018-0403-0831.

Event Timeline


Juniper's reply:

During the cleanup process, ksyncd will check for public nexthops to make sure that there are no public next hops remaining. If ksyncd finds a public nexthop hanging without getting cleaned up, it will set initialization error(KSYNCD_ERROR_INIT), which leads to this connection/initialization error. From the message logs, it looks like ksyncd is facing some issue during NH index cleanup, which looks suspicious. From the RSI, I can see that the FXP0 is in a logical system which is not supported and hence GRES is not completing correctly.
Please remove the fxp0 from the logical systems and then re-enable GRES.
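For illustration only, JTAC's suggestion could be applied roughly as in the sketch below. The logical-system name `LS-MGMT` is a hypothetical placeholder (the actual name is not given in this task), and the fxp0 addressing would have to be re-created in the main configuration to match what the logical system held before.

```text
# Sketch only: LS-MGMT is a hypothetical logical-system name.
# Run from configuration mode on the master RE.

# Take GRES out of the active configuration while fxp0 is moved
deactivate chassis redundancy graceful-switchover

# Remove fxp0 from the logical system
delete logical-systems LS-MGMT interfaces fxp0

# (re-add the fxp0 unit and address under the main [edit interfaces] hierarchy here)

# Re-enable GRES and commit on both REs
activate chassis redundancy graceful-switchover
commit synchronize
```

Afterwards, running `show system switchover` on the backup RE should indicate whether the backup is ready for a graceful switchover.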

I followed up with Juniper, since we have graceful-switchover enabled on other routers that also have fxp0 in a logical system.

Relevant KB entry:
JTAC's opinion on why it works on some routers is that we are being "lucky".
This raises the risk of an RE failure not being handled properly.
Best case: the RE goes down and the redundant router takes all the load.
Worst case: a partial failure where the RE failover fails in a way that blackholes traffic.

Closing this task, following up with actions to do in T191667.