
cr1-eqsin 4 onboard interfaces down
Closed, Resolved · Public

Description

The 4 onboard ports on cr1-eqsin disappeared

Traffic impact: https://librenms.wikimedia.org/graphs/lazy_w=652/to=1525455000/device=159/type=device_bits/from=1525451400/legend=no/
eqsin was depooled at 17:15.
The initial drop corresponds to the time BGP took to reconverge.

Symptoms are similar to T187807.

The 4 ports are:
1*Transit
2*Peering
1*Core: asw-eqsin:xe-0/0/2

Relevant logs:
May  4 17:01:05  re0.cr1-eqsin chassisd[1711]: mrlic: Built-in port license is not installed. Deactivating builtin-ports accordingly.
May  4 17:01:06  re0.cr1-eqsin afeb0 jnh_update_ifd_standby_state IFD: 169, Enable: True
May  4 17:01:06  re0.cr1-eqsin afeb0 Phy Diag: Client xe-2/0/3 not in phy test list
May  4 17:01:06  re0.cr1-eqsin afeb0 Phy Diag: Client xe-2/0/0 not in phy test list
May  4 17:01:06  re0.cr1-eqsin afeb0 Phy Diag: Client xe-2/0/1 not in phy test list
May  4 17:01:06  re0.cr1-eqsin afeb0 Phy Diag: Client xe-2/0/2 not in phy test list
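
As a rough illustration of how the chassisd message above could be spotted in collected syslog, here is a minimal Python sketch. The pattern is derived only from the single log line quoted in this task; the function name and the sample list are hypothetical, and other Junos releases may word the message differently.

```python
import re

# Pattern modelled on the chassisd line quoted in this task (assumption:
# the wording and field layout are the same on other hosts and releases).
LICENSE_DEACTIVATION = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+chassisd\[\d+\]:\s+mrlic:\s+"
    r"Built-in port license is not installed"
)


def find_license_deactivations(lines):
    """Return (timestamp, host) for each line reporting a missing built-in port license."""
    hits = []
    for line in lines:
        match = LICENSE_DEACTIVATION.match(line)
        if match:
            hits.append((match.group("timestamp"), match.group("host")))
    return hits


# Two of the log lines from this task, used as sample input.
sample = [
    "May  4 17:01:05  re0.cr1-eqsin chassisd[1711]: mrlic: Built-in port license is not installed. Deactivating builtin-ports accordingly.",
    "May  4 17:01:06  re0.cr1-eqsin afeb0 Phy Diag: Client xe-2/0/3 not in phy test list",
]
print(find_license_deactivations(sample))
```

Only the chassisd line matches; the Phy Diag lines are a consequence of the ports being deactivated and are ignored by this sketch.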
The config shows that the licenses are installed:
ayounsi@re0.cr1-eqsin# run show system license    
License usage: 
                                 Licenses     Licenses    Licenses    Expiry
  Feature name                       used    installed      needed 
  scale-subscriber                      0         1000           0    permanent
  scale-l2tp                            0         1000           0    permanent
  scale-mobile-ip                       0         1000           0    permanent
  MX104-2x10Gig-port-0-1                0            1           0    permanent
  MX104-2x10Gig-port-2-3                0            1           0    permanent

Licenses installed: 
  License identifier: [redacted]
  License version: 2
  Features:
    MX104-2x10Gig-port-0-1 - MX104 2X10Gig Builtin Port(xe-2/0/0 & xe-2/0/1) upgrade
      permanent
    MX104-2x10Gig-port-2-3 - MX104 2X10Gig Builtin Port(xe-2/0/2 & xe-2/0/3) upgrade
      permanent
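
The "License usage" table above can be checked mechanically. The following Python sketch parses that table and reports any required feature with no installed license; the helper names are hypothetical and the parsing assumes the five-column layout shown in this task. Note that in this incident the table reported the licenses as installed while chassisd treated them as missing, so a check like this would not have caught the problem on its own.

```python
def parse_license_usage(output):
    """Parse the 'License usage' table into {feature: (used, installed, needed, expiry)}."""
    usage = {}
    in_table = False
    for line in output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Feature name"):
            in_table = True  # data rows start after this header line
            continue
        if in_table:
            if not stripped:
                break  # a blank line ends the table
            fields = stripped.split()
            if len(fields) != 5:
                continue  # skip rows that do not match the assumed layout
            name, used, installed, needed, expiry = fields
            usage[name] = (int(used), int(installed), int(needed), expiry)
    return usage


def missing_features(usage, required):
    """Return the required features that have no installed license."""
    return [name for name in required if usage.get(name, (0, 0, 0, ""))[1] < 1]


# Excerpt of the output pasted in this task.
SAMPLE_OUTPUT = """License usage:
                                 Licenses     Licenses    Licenses    Expiry
  Feature name                       used    installed      needed
  scale-subscriber                      0         1000           0    permanent
  MX104-2x10Gig-port-0-1                0            1           0    permanent
  MX104-2x10Gig-port-2-3                0            1           0    permanent

Licenses installed:
"""

REQUIRED = ["MX104-2x10Gig-port-0-1", "MX104-2x10Gig-port-2-3"]
usage = parse_license_usage(SAMPLE_OUTPUT)
print(missing_features(usage, REQUIRED))
```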

Will follow up with JTAC.


Event Timeline

ayounsi created this task. May 4 2018, 5:37 PM
Restricted Application added a project: Operations. May 4 2018, 5:37 PM
Restricted Application added a subscriber: Aklapper.
ema moved this task from Triage to Network on the Traffic board. May 7 2018, 9:47 AM

Troubleshooting actions taken so far, based on JTAC's suggested next steps:

[edit system]
-   commit synchronize;
[edit chassis redundancy]
-    graceful-switchover;
[edit routing-options]
-   graceful-restart;
[edit protocols]
-   layer2-control {
-       nonstop-bridging;
-   }
# commit synchronize full

At this point the interfaces came back up.

# show | compare
[edit system]
+   commit synchronize;
[edit chassis redundancy]
+    graceful-switchover;
[edit routing-options]
+   graceful-restart;
# commit synchronize

Interfaces back down.

# commit full

Interfaces back up.

rpd was at ~100% CPU and the router was unable to route for ~15 min after that commit full.

The vendor has been asked to follow up on two points:
1/ Ensure the interfaces don't go down again in the future
2/ Investigate the rpd issue

Mentioned in SAL (#wikimedia-operations) [2018-05-07T19:58:01Z] <XioNoX> removing onboard ports license from cr1-eqsin config - T193897

So you can either use the configuration statement, in which case the license status should not be affected as long as the configuration is active on both REs, or use request system license add. The latter installs the license directly into the OS on that RE, so it does not depend on a configuration statement being active and properly synchronized.

In more detail, there are two ways to install a license:
1/ In the configuration: set system license keys "xxxxx"
2/ With an operational command: request system license add xxx

Here the license was installed with method 1/.
Committing an unrelated configuration change left the configuration out of sync between the two routing engines, which somehow caused Junos to think the license was missing.
According to support (quoted above), using method 2/ should remove this failure scenario.
This has been done and the router has been stable since.

The rpd issue is still being investigated, but it is unrelated to the interfaces going down.

Mentioned in SAL (#wikimedia-operations) [2018-05-07T21:25:18Z] <XioNoX> re-pool eqsin - T193897

ayounsi closed this task as Resolved. May 8 2018, 4:04 PM
Vvjjkkii renamed this task from cr1-eqsin 4 onboard interfaces down to aldaaaaaaa. Jul 1 2018, 1:12 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed ayounsi as the assignee of this task.
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
ema renamed this task from aldaaaaaaa to cr1-eqsin 4 onboard interfaces down. Jul 2 2018, 9:34 AM
ema closed this task as Resolved.
ema assigned this task to ayounsi.
ema updated the task description.
MacFan4000 raised the priority of this task from High to Needs Triage. Jul 2 2018, 9:51 AM
MacFan4000 added a subscriber: Aklapper.