
Suboptimal anycast routing from leaf switches
Closed, Resolved, Public

Description

Configuring anycast services on the new network design, where L3 extends to the ToR, exposed a limitation that is easily seen in this output:

cr1-drmrs> show route 10.3.0.1 
10.3.0.1/32        *[BGP/170] 5d 02:15:08, MED 0, localpref 100
                      AS path: (65001) 64605 I, validation-state: unverified
                    >  to 185.15.58.138 via xe-0/1/2.0
                    [BGP/170] 06:53:39, localpref 100
                      AS path: 4265006001 64605 I, validation-state: unverified
                    >  to 185.15.58.143 via et-0/0/1.0

As the servers peer with the ToR, they're now one hop further away from the core routers.
Furthermore, the core routers are in a confederation where, all else being equal, prefixes learned from a confederation peer are preferred over "external" peers.
This means that, in the current state of things, the drmrs core routers prefer to send traffic to eqiad rather than keeping it local (the same will happen with the eqiad expansion).

See Juniper's BGP path selection.

2 fixes are being considered to solve that issue:

1/ Set a local-pref on prefixes learned from the local switches, then remove that local-pref when re-advertising the prefixes to the other sites (making it local to drmrs)

Removing it is required because local-pref is carried across confederation peers. Setting it in drmrs without removing it would attract all traffic for the given prefix to drmrs.
This solution is the easiest to implement, as it requires a config change in drmrs only.

2/ Do AS path prepending to anycast prefixes learned directly from the core routers to match the AS path length on the new design infra.
So 10.3.0.1 on cr1-eqiad, will be seen with AS path "64605 64605 I"

The same prefix will be received in drmrs with AS path "65001 64605 64605 I", longer (and less preferred) than the local path.

This second option (proposed by Cathal, thanks!) seems cleaner to me in the long run. The downside is a more complex rollout: as each site is changed, its traffic will briefly shift to a different site, except for the last site to be changed (as all the other paths will already be longer).
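Option 1 can be sketched as a toy model in Python. This is only an illustration of the set-then-reset idea, not the actual router policy: the function names, the dictionary layout, and the value 200 are assumptions, and only the local-pref comparison is modelled.

```python
DEFAULT_LOCAL_PREF = 100
LOCAL_ANYCAST_PREF = 200  # hypothetical value, not taken from the task


def import_from_local_switch(route: dict) -> dict:
    """Import side in drmrs: raise local-pref on prefixes learned from local switches."""
    return {**route, "local_pref": LOCAL_ANYCAST_PREF}


def export_to_other_site(route: dict) -> dict:
    """Export side towards other sites: reset local-pref so the raised value stays local."""
    return {**route, "local_pref": DEFAULT_LOCAL_PREF}


def preferred(routes: list) -> dict:
    """Only the local-pref comparison is modelled: the highest value wins."""
    return max(routes, key=lambda r: r["local_pref"])


local = import_from_local_switch({"prefix": "10.3.0.1/32", "learned_from": "drmrs switch"})
remote = {"prefix": "10.3.0.1/32", "learned_from": "eqiad", "local_pref": DEFAULT_LOCAL_PREF}

# In drmrs, the locally learned route now wins.
print(preferred([local, remote])["learned_from"])

# What the other sites receive carries the default local-pref again,
# so drmrs does not attract their traffic for this prefix.
print(export_to_other_site(local)["local_pref"])
```

Without the reset on export, the raised value would reach the other sites and they would all prefer the drmrs path.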

Event Timeline

ayounsi triaged this task as High priority. Feb 22 2022, 4:30 PM
ayounsi created this task.
Restricted Application added a subscriber: Aklapper.

2/ Do AS path prepending to anycast prefixes learned directly from the core routers to match the AS path length on the new design infra.
So 10.3.0.1 on cr1-eqiad, will be seen with AS path "64605 64605 I"

The same prefix will be received in drmrs with AS path "65001 64605 64605 I", longer (and less preferred) than the local path.

As far as I understand the prefix will be received in drmrs as "(65001) 64605 64605 I", with the Confed AS not counting towards AS path length.

So the AS-path should seem equal to the CR in drmrs, which should then prefer the route learnt directly from ASW, as it will prefer a local eBGP route to an iBGP (confed) learnt route.

Overall this is still my preferred approach, as it seems a clean way to equalize the BGP routes between those announced directly to CRs and those announced via intermediary ASWs. Not touching local-pref or MED leaves those knobs free to drive other policy goals instead.

But the local-pref option is valid, and I've no particular objection if people want to go that way either.
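The reasoning in this comment can be checked with a small Python sketch. It is a deliberately simplified model of best-path selection, assuming only three tie-breakers in this order: higher local-pref, shorter AS path (with confederation segments in parentheses not counted), then eBGP over confederation/iBGP. The `Path` class and function names are illustrative, and real routers evaluate more steps (origin, MED, IGP cost, and so on).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    name: str
    as_path: str           # as printed by "show route", e.g. "(65001) 64605 I"
    local_pref: int = 100
    external: bool = True  # True for plain eBGP, False for a confederation/iBGP peer


def as_path_length(as_path: str) -> int:
    """Count AS numbers, skipping confederation segments "(...)" and the origin code."""
    length = 0
    in_confed = False
    for token in as_path.split():
        if token.startswith("("):
            in_confed = True
        if not in_confed and token.isdigit():
            length += 1
        if token.endswith(")"):
            in_confed = False
    return length


def best_path(paths):
    # Lowest key wins: higher local-pref, then shorter AS path, then eBGP first.
    return min(
        paths,
        key=lambda p: (-p.local_pref, as_path_length(p.as_path), not p.external),
    )


local = Path("local switch", "4265006001 64605 I")

# Before prepending: the path via eqiad has the shorter counted AS path.
before = [Path("via eqiad", "(65001) 64605 I", external=False), local]
print(best_path(before).name)

# After prepending on the core routers: lengths are equal, eBGP wins.
after = [Path("via eqiad", "(65001) 64605 64605 I", external=False), local]
print(best_path(after).name)
```

Under these assumptions the first call selects the path via eqiad, matching the `show route` output in the description, and the second selects the local switch path.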

Change 765268 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] Prepend AS to anycast prefixes learned on the core routers

https://gerrit.wikimedia.org/r/765268

Change 765268 merged by jenkins-bot:

[operations/homer/public@master] Prepend AS to anycast prefixes learned on the core routers

https://gerrit.wikimedia.org/r/765268

Mentioned in SAL (#wikimedia-operations) [2022-02-24T14:19:49Z] <XioNoX> Prepend AS to anycast prefixes learned on the core routers - T302315

Change 765549 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] Update local_anycast to reflect the anycast prepending

https://gerrit.wikimedia.org/r/765549

Change 765549 merged by jenkins-bot:

[operations/homer/public@master] Update local_anycast to reflect the anycast prepending

https://gerrit.wikimedia.org/r/765549

Change 765568 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Change CR policy for creating aggregate Anycast routes

https://gerrit.wikimedia.org/r/765568

Current status: this is virtually solved, which removes the last software blocker for drmrs. The change above (765568) will be needed to allow advertising anycast prefixes from drmrs (DoH/test AuthDNS).

Change 765568 merged by jenkins-bot:

[operations/homer/public@master] Change CR policy for creating aggregate Anycast routes

https://gerrit.wikimedia.org/r/765568

The change has now been rolled out. All seems OK: the aggregate route is still being created at the POPs where it was previously, and is announced externally.

It's applied to the CRs in drmrs too; however, I believe they are missing these lines of config, which would be needed for it to have any effect:

set routing-options aggregate route 198.35.27.0/24 policy BGP_from_anycast
set routing-options aggregate route 198.35.27.0/24 community 14907:13
set routing-options aggregate route 185.71.138.0/24 policy BGP_from_anycast
set routing-options aggregate route 185.71.138.0/24 community 14907:13

@ayounsi I will leave it to you to add this as I know you're working on drmrs and there may be reasons I'm not aware of why it's not set currently. But overall hopefully all is good once it's added.

@cmooney thanks!
@ssingh let me know when we're good to advertise DoH from drmrs
@BBlack let me know when we're good to advertise nsa.wikimedia.org from drmrs

DoH is advertised from drmrs, I'll leave it to Traffic to decide about the anycast NS.