
cr3-esams crash
Closed, Resolved (Public)

Description

Tracking task for cr3-esams routing engine troubles

Event Timeline

Opened Juniper case 2019-1026-0004.

Hi,

Earlier today re0 on our MX480 chassis crashed and got stuck at the "db>" prompt. See the attached file (re0-cr3-esams-crash.log) for the data-gathering commands I ran, following https://kb.juniper.net/InfoCenter/index?page=content&id=KB20635&actp=METADATA and https://kb.juniper.net/InfoCenter/index?page=content&id=KB10751&actp=METADATA&act=login
Running the "panic" command seems to have caused re1 (in a healthy state at that point) to restart at the same time as re0, bringing the whole router down.

Once I got both REs back on console, I noticed that re0 was not healthy: "show chassis hardware" times out with "error: the chassis-control subsystem is not responding to management requests", no OSPF or BGP sessions are established, etc.

Trying to do an RE failover times out with the same error as above.

The "re0-cr3-esams-crash.log" shows everything I typed (and the output) on the console link to re0.

The device also has 3 core dumps, and I'll attach the RSI as well as the logs shortly.

The current state of the router is that it can only be reached on its management and console port, no revenue ports are working.

JTAC is investigating the logs and core dumps.

Change 546334 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Smokeping, remove cr3-esams while it's not working

https://gerrit.wikimedia.org/r/546334

Change 546334 merged by Ayounsi:
[operations/puppet@production] Smokeping, remove cr3-esams while it's not working

https://gerrit.wikimedia.org/r/546334

We have found matching PR1179822: chassisd might crash if the lo0 filter is configured without allowing communication between the RE and the VM host on the RE. As a result, the internal interfaces are incorrectly examined by the lo0 filter, none of the FPCs will be online, and no interfaces will be created.
Workaround
Please allow the internal communication between the RE and the VM host in the lo0 filter (if an lo0 filter is being used)
Read more at: https://prsearch.juniper.net/InfoCenter/index?page=prcontent&id=PR1179822

Going to push the following:

[edit firewall family inet filter loopback4]
       term return-tcp { ... }
+      term allow_vmhost {
+          from {
+              source-address {
+                  192.168.1.0/24;
+              }
+              destination-address {
+                  192.168.1.0/24;
+              }
+          }
+          then accept;
+      }
       term deny_all { ... }
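
To illustrate why the position of the new term matters, here is a minimal Python sketch of first-match filter evaluation. It models only the two terms visible in the diff above (allow_vmhost and deny_all); the elided terms are left out, the "discard" action for deny_all is an assumption, and the specific host addresses are hypothetical examples rather than the addresses the RE and VM host actually use.

```python
import ipaddress

# Prefix taken from the allow_vmhost term in the diff.
VMHOST_NET = ipaddress.ip_network("192.168.1.0/24")


def term_allow_vmhost(src, dst):
    """Accept when both source and destination fall inside 192.168.1.0/24."""
    if src in VMHOST_NET and dst in VMHOST_NET:
        return "accept"
    return None  # no match: evaluation continues with the next term


def term_deny_all(src, dst):
    """Catch-all term; the 'discard' action is assumed, the source elides it."""
    return "discard"


def evaluate(terms, src, dst):
    """Return the action of the first matching term (first match wins)."""
    src = ipaddress.ip_address(src)
    dst = ipaddress.ip_address(dst)
    for term in terms:
        action = term(src, dst)
        if action is not None:
            return action
    return "discard"  # Junos filters end with an implicit discard


# Simplified filter before and after the change.
before = [term_deny_all]
after = [term_allow_vmhost, term_deny_all]

# Hypothetical internal addresses inside the 192.168.1.0/24 range.
print(evaluate(before, "192.168.1.1", "192.168.1.2"))
print(evaluate(after, "192.168.1.1", "192.168.1.2"))
# Traffic from outside the internal range still reaches deny_all.
print(evaluate(after, "203.0.113.5", "192.168.1.2"))
```

Because terms are evaluated in order, allow_vmhost has to sit before deny_all, as it does in the diff; placed after it, the term would never be reached.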

Mentioned in SAL (#wikimedia-operations) [2019-10-29T08:15:26Z] <XioNoX> push term allow_vmhost ro cr3-esams loopback4 filter - T236598

All the interfaces are back up and cr3-esams is now reachable and in service.

One issue persists, re0 can't reach re1:

ayounsi@re0.cr3-esams# commit check 
warning: Could not connect to re1 : No route to host
warning: Cannot connect to other RE, ignoring it
configuration check succeeds

re1 is unresponsive, even through console.
We have 2 options to try to power cycle it:

  • Have someone onsite unseat/reseat the card (non disruptive)
  • Power cycle the whole router (disruptive)

I'd suggest the 2nd option as it can be done remotely (after making sure peers are disabled and the VRRP master is on cr2).
If that doesn't solve the issue we will have to proceed with an RMA.

Power cycled CB1 (hosting re1) following https://kb.juniper.net/InfoCenter/index?page=content&id=KB14278&cat=JUNOS&actp=LIST and RE1 is now back online in a healthy state.

Change 547647 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/homer/public@master] Add term vmhost to cr loopback4 filter

https://gerrit.wikimedia.org/r/547647

Change 547647 merged by Ayounsi:
[operations/homer/public@master] Add term vmhost to cr loopback4 filter

https://gerrit.wikimedia.org/r/547647