
Juniper HA audit
Closed, Resolved · Public

Description

tl;dr

  • GRES should be enabled on all Juniper dual-RE devices

Because of a Junos limitation with our current configuration, disable GRES on dual-RE routers (see T191371).
Test then re-enable GRES on:

  • cr1-eqiad
  • cr2-eqiad
  • cr1-eqsin
  • cr2-esams

  • Nonstop bridging should NOT be enabled on any devices (only useful for STP, which we don't use)

Remove from:

  • asw2-a-eqiad.mgmt.eqiad.wmnet
  • asw2-b-eqiad.mgmt.eqiad.wmnet
  • asw2-c-eqiad.mgmt.eqiad.wmnet
  • asw2-d-eqiad.mgmt.eqiad.wmnet
  • asw-a-codfw.mgmt.codfw.wmnet
  • asw-b-codfw.mgmt.codfw.wmnet
  • asw-c-codfw.mgmt.codfw.wmnet
  • asw-d-codfw.mgmt.codfw.wmnet
  • fasw-c-eqiad.mgmt.eqiad.wmnet
  • fasw-c-codfw.mgmt.codfw.wmnet
  • asw2-esams.mgmt.esams.wmnet
  • asw1-eqsin.mgmt.eqsin.wmnet

  • Nonstop active routing should be enabled on all switches with > 1 RE (handles LACP, BFD, OSPF, BGP, VRRP)
  • asw1-eqsin.mgmt.eqsin.wmnet

  • Nonstop active routing should NOT be enabled on any router

  • graceful-restart should be enabled on all devices where NSR is not configured (nonstop active routing and graceful-restart are mutually exclusive)

  • Write matching homer changes to make it systematic.

More details on https://www.juniper.net/documentation/en_US/junos/topics/concept/high-availability-features-in-junos-introducing.html
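
The rules above can be sketched as a small lookup, for example as a starting point for the Homer changes. This is only an illustration: the `HAPolicy` class, the `expected_ha` function and the `"router"`/`"switch"` role labels are hypothetical names, and "dual-RE" is assumed to mean any device with more than one RE.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HAPolicy:
    """Expected HA features for one device (hypothetical structure)."""
    gres: bool
    nonstop_bridging: bool
    nonstop_routing: bool
    graceful_restart: bool


def expected_ha(role: str, re_count: int) -> HAPolicy:
    """Return the HA features a device should have under the rules above.

    role is "router" or "switch" (labels assumed for this sketch);
    re_count is the number of routing engines.
    """
    if role not in ("router", "switch"):
        raise ValueError(f"unknown role: {role!r}")
    multi_re = re_count > 1
    # NSR only on switches with more than one RE, never on routers.
    nsr = role == "switch" and multi_re
    return HAPolicy(
        gres=multi_re,             # GRES on all multi-RE devices
        nonstop_bridging=False,    # never enabled (only useful for STP)
        nonstop_routing=nsr,
        graceful_restart=not nsr,  # mutually exclusive with NSR
    )
```

For instance, `expected_ha("router", 2)` yields GRES plus graceful-restart, while `expected_ha("switch", 2)` yields GRES plus nonstop active routing.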

Event Timeline

ayounsi triaged this task as Medium priority. Apr 6 2018, 9:01 PM
ayounsi created this task.
Restricted Application added a subscriber: Aklapper.

Change 425552 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/dns@master] Revert "Depolling eqsin due to router issue"

https://gerrit.wikimedia.org/r/425552

Change 425552 merged by BBlack:
[operations/dns@master] Revert "Depolling eqsin due to router issue"

https://gerrit.wikimedia.org/r/425552

According to JTAC, the nonstop-routing issue was most likely caused by a Junos bug where the following commit sometimes enables nonstop-routing before disabling graceful-restart, even though the two are mutually exclusive.

[edit routing-options]
-   graceful-restart;
+   nonstop-routing;

Suggested approach by JTAC is to disable graceful-restart and enable nonstop-routing in two different commits.
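
A minimal sketch of what that split could look like in tooling, assuming changes are expressed as Junos set-style commands. The `split_commits` helper and the two constants are hypothetical names for illustration, not part of Homer or any Juniper library.

```python
GR_DELETE = "delete routing-options graceful-restart"
NSR_SET = "set routing-options nonstop-routing"


def split_commits(commands: list[str]) -> list[list[str]]:
    """Split a batch of config commands into separate commits.

    If the batch both removes graceful-restart and enables nonstop-routing,
    the removal goes into its own first commit so the two statements are
    never changed together. Otherwise the batch is returned as one commit.
    """
    if GR_DELETE in commands and NSR_SET in commands:
        first = [GR_DELETE]
        second = [c for c in commands if c != GR_DELETE]
        return [first, second]
    return [list(commands)] if commands else []
```

Each inner list would then be applied and committed on its own, in order.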

Next steps here, as this is something that needs to be squared away.

Decide what should be configured for which type of devices, respectively:

  • Dual-RE routers: GRES + GR or NSR
    • The GRES bug should be fixed with T247073
    • On paper, NSR is better for an RE crash: as it syncs the states, a failover after a crash should be seamless. On the other hand, real-life experience shows that syncing states is brittle and things (REs included) don't fail as we want them to
    • In addition, it's not clear if NSR would help in the case where we change the TCP-MSS of an interface, as it's mutually exclusive with graceful-restart (see T246338)
    • From the doc: "NSR is advantageous in networks in which neighbor routers (or switches) do not support graceful restart protocol extensions." Most likely only a few IXP peers don't support GR (data point: ~90 out of 400 in AMS-IX, 30 out of 190 in Equinix Ashburn)
    • ISSU is not mature enough
    • For those reasons I think that GRES + graceful-restart is the way to go for now
  • All other statements from the description still stand
  • Once agreed, audit all devices to make sure we capture what needs to change
  • If a device isn't running a compatible Junos version, document it so we take care of it when possible
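
For the audit step, a rough sketch of how a device's configuration could be checked against whatever gets agreed. It assumes the config is available in set format (e.g. from `show configuration | display set`); the `audit` and `configured_features` helpers and the feature keys are hypothetical names, and the expected set is passed in by the caller.

```python
# Junos statements for each HA feature, in "display set" form.
FEATURE_STATEMENTS = {
    "gres": "set chassis redundancy graceful-switchover",
    "graceful_restart": "set routing-options graceful-restart",
    "nonstop_routing": "set routing-options nonstop-routing",
    "nonstop_bridging": "set protocols layer2-control nonstop-bridging",
}


def configured_features(display_set: str) -> set[str]:
    """Return the HA features that appear in a set-format configuration.

    Limitation of this sketch: a statement such as
    "set routing-options graceful-restart disable" still counts as present.
    """
    lines = [line.strip() for line in display_set.splitlines()]
    found = set()
    for name, stmt in FEATURE_STATEMENTS.items():
        if any(l == stmt or l.startswith(stmt + " ") for l in lines):
            found.add(name)
    return found


def audit(display_set: str, expected: set[str]) -> dict[str, list[str]]:
    """Compare configured HA features with the expected set."""
    found = configured_features(display_set)
    return {
        "missing": sorted(expected - found),
        "unexpected": sorted(found - expected),
    }
```

A device with GRES and NSR configured, audited against `{"gres", "graceful_restart"}`, would report `graceful_restart` as missing and `nonstop_routing` as unexpected.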

Change 592938 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/homer/public@master] Add graceful-switchover multiple RE devices

https://gerrit.wikimedia.org/r/592938

Change 577564 had a related patch set uploaded (by Ayounsi; owner: CDanis):
[operations/homer/public@master] add graceful-restart to CRs

https://gerrit.wikimedia.org/r/577564

Change 592938 merged by jenkins-bot:
[operations/homer/public@master] Add graceful-switchover to multiple RE devices

https://gerrit.wikimedia.org/r/c/operations/homer/public/+/592938

Change 577564 merged by jenkins-bot:
[operations/homer/public@master] add graceful-restart to CRs

https://gerrit.wikimedia.org/r/c/operations/homer/public/+/577564

Change 609139 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/homer/public@master] Remove nonstop-bridging from switches

https://gerrit.wikimedia.org/r/c/operations/homer/public/+/609139

Mentioned in SAL (#wikimedia-operations) [2020-08-03T13:11:52Z] <XioNoX> remove nonstop-bridging from asw-a-codfw - T191667

Mentioned in SAL (#wikimedia-operations) [2020-08-03T13:12:58Z] <XioNoX> remove nonstop-bridging from asw-b-codfw - T191667

Mentioned in SAL (#wikimedia-operations) [2020-08-03T13:14:25Z] <XioNoX> remove nonstop-bridging from asw-c-codfw - T191667

Mentioned in SAL (#wikimedia-operations) [2020-08-03T13:15:40Z] <XioNoX> remove nonstop-bridging from asw-d-codfw - T191667

Mentioned in SAL (#wikimedia-operations) [2020-08-03T14:27:45Z] <XioNoX> remove nonstop-bridging from fasw-c-codfw - T191667

Mentioned in SAL (#wikimedia-operations) [2020-08-04T07:28:10Z] <XioNoX> remove nonstop-bridging from asw2-esams - T191667

Mentioned in SAL (#wikimedia-operations) [2020-08-04T07:29:11Z] <XioNoX> remove nonstop-bridging from eqiad asw2 switches - T191667

Mentioned in SAL (#wikimedia-operations) [2020-08-04T07:32:30Z] <XioNoX> remove nonstop-bridging from fasw-c-eqiad switches - T191667

Change 609139 merged by jenkins-bot:
[operations/homer/public@master] Remove nonstop-bridging from switches

https://gerrit.wikimedia.org/r/609139

ayounsi updated the task description.