
Juniper HA audit
Open, Normal, Public

Description

tl;dr

  • GRES should be enabled on all Juniper dual-RE switches
  • Because of a Junos limitation with our current configuration, GRES should be disabled on dual-RE routers (see T191371)

Remove from:

cr1-eqiad - done
cr2-eqiad - done
cr1-eqsin - done
cr2-esams
  • Nonstop bridging should NOT be enabled on any devices (only useful for STP, which we don't use)

Remove from:

asw-esams.mgmt.esams.wmnet
asw2-b-eqiad.mgmt.eqiad.wmnet
fasw-c-eqiad.mgmt.eqiad.wmnet
asw-b-codfw.mgmt.codfw.wmnet
asw1-eqsin.mgmt.eqsin.wmnet
cr1-eqsin.wikimedia.org
fasw-c-codfw.mgmt.codfw.wmnet
asw-c-codfw.mgmt.codfw.wmnet
asw-a-codfw.mgmt.codfw.wmnet
asw2-a-eqiad.mgmt.eqiad.wmnet
asw-d-codfw.mgmt.codfw.wmnet
asw2-c-eqiad.mgmt.eqiad.wmnet
asw2-d-eqiad.mgmt.eqiad.wmnet
  • Nonstop active routing should be enabled on all switches with > 1 RE (handles LACP, BFD, OSPF, BGP, VRRP)
asw1-eqsin.mgmt.eqsin.wmnet - done
asw2-c-eqiad.mgmt.eqiad.wmnet - done
asw2-b-eqiad.mgmt.eqiad.wmnet - done
asw2-ulsfo.mgmt.ulsfo.wmnet - done
csw2-esams.mgmt.esams.wmnet - done
asw-a-eqiad.mgmt.eqiad.wmnet - not supported
asw-b-eqiad.mgmt.eqiad.wmnet - not supported
asw-c-eqiad.mgmt.eqiad.wmnet - not supported
  • graceful-restart should be enabled on devices with 1 RE

Nonstop active routing and graceful-restart are mutually exclusive.
Enable on:

mr1-eqsin.wikimedia.org - done
pfw3-eqiad.wikimedia.org - done
mr1-eqiad.wikimedia.org - done
msw1-codfw.mgmt.codfw.wmnet - done
mr1-ulsfo.wikimedia.org - done
pfw3-codfw.wikimedia.org - done
mr1-codfw.wikimedia.org - done
msw1-eqiad.mgmt.eqiad.wmnet - done
asw2-a5-eqiad.mgmt.eqiad.wmnet

More details on https://www.juniper.net/documentation/en_US/junos/topics/concept/high-availability-features-in-junos-introducing.html
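
For reference, the rules in the tl;dr map roughly onto the Junos statements below. This is a minimal sketch and not taken from the devices' actual configuration. In particular, the nonstop-bridging hierarchy varies by platform and release (it is assumed here to live under protocols layer2-control), and the lines starting with # are annotations for the reader, so check each device before applying anything.

```
# Dual-RE switches: GRES plus nonstop active routing
set chassis redundancy graceful-switchover
set routing-options nonstop-routing
set system commit synchronize

# Dual-RE routers: remove GRES (see T191371)
delete chassis redundancy graceful-switchover

# Any device: remove nonstop bridging (hierarchy is an assumption)
delete protocols layer2-control nonstop-bridging

# Single-RE devices: graceful restart instead of nonstop active routing
set routing-options graceful-restart
```

Since nonstop active routing and graceful-restart are mutually exclusive, a device should end up with only one of the two routing-options statements.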

Event Timeline

ayounsi triaged this task as Normal priority. (Apr 6 2018, 9:01 PM)
ayounsi created this task.
Restricted Application added a project: Operations. (Apr 6 2018, 9:01 PM)
Restricted Application added a subscriber: Aklapper.
ayounsi updated the task description. (Apr 11 2018, 1:37 AM)

Change 425552 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/dns@master] Revert "Depolling eqsin due to router issue"

https://gerrit.wikimedia.org/r/425552

Change 425552 merged by BBlack:
[operations/dns@master] Revert "Depolling eqsin due to router issue"

https://gerrit.wikimedia.org/r/425552

According to JTAC, the nonstop-routing issue was most likely caused by a Junos bug: the following commit sometimes enables nonstop-routing before disabling graceful-restart, even though the two are mutually exclusive.

[edit routing-options]
-   graceful-restart;
+   nonstop-routing;

JTAC's suggested approach is to disable graceful-restart and enable nonstop-routing in two separate commits.
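
A two-commit sequence along those lines could look like the sketch below. The user@device prompt is a placeholder, and the exact commit form is an assumption: on a dual-RE device without commit synchronize configured under [edit system], it would likely need to be "commit synchronize" instead.

```
[edit]
user@device# delete routing-options graceful-restart
user@device# commit

[edit]
user@device# set routing-options nonstop-routing
user@device# commit
```

Keeping the two changes in separate commits means graceful-restart is already gone from the active configuration when nonstop-routing is enabled, which should sidestep the ordering problem described above.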