BGP Policy on aggregate routes prevents them being created in some circumstances.
Closed, Resolved (Public)

Description

A gap was found in the logic for originating aggregate routes into BGP on our core routers. It came to light following a query from Bell Canada asking why we were announcing IPv4 prefix 208.80.154.0/23 to them only from eqiad and not from eqord.

The current configuration is set up to create certain aggregate routes if any longer prefixes from within them are present in the BGP RIB. In our case it should mean that the presence of routes originated from LVS at remote sites will trigger the creation of the configured aggregates. For the above prefix, the relevant config on cr2-eqord is:

set routing-options aggregate route 208.80.154.0/23 policy BGP_from_LVS

set policy-options policy-statement BGP_from_LVS term BGP_core_and_local_LVS from protocol bgp
set policy-options policy-statement BGP_from_LVS term BGP_core_and_local_LVS from as-path core_and_local_LVS
set policy-options policy-statement BGP_from_LVS term BGP_core_and_local_LVS then accept
set policy-options policy-statement BGP_from_LVS then reject

set policy-options as-path core_and_local_LVS "^(65002|65001)? 64600.*"

The one constraint on creating the aggregate is the as-path regex in the last line. All LVS instances use ASN 64600, so the regex basically says "routes originated from the LVS ASN, with at most one of AS65002 (codfw) or AS65001 (eqiad) ahead of it in the path, but not both". The intent of the configuration is to create the aggregate route only if it is being learnt directly from the site where it is being used (codfw or eqiad), but avoid doing so if it is being learnt from another site. The idea is we don't want to announce a route in Chicago for an Ashburn prefix if our network wants to get there via Dallas, for instance if our transport links from Chicago to Ashburn are down.
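
To make the regex semantics concrete, here is a toy Python model of it. It is only a sketch: it assumes the path can be treated as a flat list of AS numbers with any confederation sub-ASes first, and the function name is made up for illustration. It is not how Junos evaluates as-path regexes internally.

```python
# Toy model of the Junos as-path regex "^(65002|65001)? 64600.*".
# Junos matches on whole AS numbers, so the path is handled as a list of
# integers rather than a string. Illustration only, not the Junos engine.

SITE_SUB_AS = {65001, 65002}   # eqiad and codfw confederation sub-ASes
LVS_AS = 64600                 # ASN used by all LVS instances


def matches_core_and_local_lvs(as_path):
    """Return True for: optional site sub-AS, then 64600, then anything."""
    path = list(as_path)
    # "(65002|65001)?" consumes at most one leading site sub-AS.
    if path and path[0] in SITE_SUB_AS:
        path = path[1:]
    # "64600" must come next; ".*" accepts whatever follows.
    return bool(path) and path[0] == LVS_AS


print(matches_core_and_local_lvs([64600]))                # True
print(matches_core_and_local_lvs([65001, 64600]))         # True
print(matches_core_and_local_lvs([65002, 65001, 64600]))  # False
```

The last case is the one that matters here: a route that has crossed both sub-ASes is rejected by the policy.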

A problem can occur because of the interrelationship of BGP and OSPF, however, and how the BGP best-path algorithm works when confederations are in use. Consider the current BGP best-path for prefix 208.80.154.224/32 (originated by lvs1013 in eqiad), on cr2-eqord:

cmooney@cr2-eqord> show route protocol bgp 208.80.154.224/32 terse     

inet.0: 833225 destinations, 2049355 routes (832285 active, 1 holddown, 1511 hidden)
Restart Complete
+ = Active Route, - = Last Active, * = Both

A V Destination        P Prf   Metric 1   Metric 2  Next hop        AS path
* ? 208.80.154.224/32  B 170        100          0                  (65002 65001) 64600 I
  unverified                                       >208.80.154.208
  ?                    B 170        100          0                  (65001) 64600 I
  unverified                                       >208.80.154.208
  ?                    B 170        100          0                  (65001) 64600 I
  unverified                                       >208.80.154.208

Local preference and MED are the same on all 3 of these routes. The router has ended up using the router ID of each as the tie-break, resulting in the one learnt from codfw being selected. Notably, the fact that there are 2 sub-ASes (65002 65001) in the path for this route, as opposed to only 1 (65001) on those learnt directly from eqiad, is not considered when selecting the best path. This is normal behaviour with BGP confederations: the sub-AS path is not considered when comparing AS-path length (RFC5065 5.3.3).
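
The tie-break can be sketched in a few lines of Python. This is a heavily simplified model, assuming only local-pref, AS-path length, MED and router ID are compared (the real Junos algorithm has more steps); the `Route` class and `preference_key` function are hypothetical names, and the router IDs are the ones shown in the production output further down this task.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address


@dataclass(frozen=True)
class Route:
    confed_path: tuple   # confederation sub-ASes, e.g. (65002, 65001)
    as_path: tuple       # ordinary AS path, e.g. (64600,)
    local_pref: int
    med: int
    router_id: str


def preference_key(route):
    # Lower tuple wins. Confederation sub-ASes are deliberately left out of
    # the path length, mirroring the behaviour described above.
    return (
        -route.local_pref,                  # higher local-pref preferred
        len(route.as_path),                 # shorter AS path preferred
        route.med,                          # lower MED preferred
        int(IPv4Address(route.router_id)),  # lowest router ID as tie-break
    )


routes = [
    Route((65002, 65001), (64600,), 100, 0, "208.80.153.193"),  # via codfw
    Route((65001,), (64600,), 100, 0, "208.80.154.196"),        # direct from eqiad
    Route((65001,), (64600,), 100, 0, "208.80.154.197"),        # direct from eqiad
]

best = min(routes, key=preference_key)
print(best.router_id, best.confed_path)  # 208.80.153.193 (65002, 65001)
```

With everything else equal, 208.80.153.193 is numerically lower than either eqiad router ID, so the codfw-learnt route wins even though its sub-AS path is longer.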

For our normal routing to this prefix it is not in any way an issue that BGP has selected the route that propagated through codfw to get to eqord. Whether learnt from codfw, or direct from eqiad, the next-hop in the BGP message is the IP of the originating LVS server in eqiad, 10.64.1.13:

cmooney@cr2-eqord> show route protocol bgp 208.80.154.224/32 detail | match "Source|Protocol Next Hop" 
                Source: 208.80.153.193
                Protocol next hop: 10.64.1.13
                Source: 208.80.154.196
                Protocol next hop: 10.64.1.13
                Source: 208.80.154.197
                Protocol next hop: 10.64.1.13

So regardless of which BGP route is selected, the same (indirect) next-hop IP is going to be used, and in normal circumstances traffic will route directly to eqiad due to lower IGP cost:

cmooney@cr2-eqord> show route 10.64.1.13 

inet.0: 833306 destinations, 2049530 routes (832369 active, 0 holddown, 1493 hidden)
Restart Complete
+ = Active Route, - = Last Active, * = Both

10.64.0.0/22       *[OSPF/10] 1w5d 21:47:37, metric 242
                    > to 208.80.154.208 via xe-0/1/5.0

Going back to the BGP policy on the aggregate route, however, there is an issue. Despite the traffic still routing directly to eqiad, the fact BGP has selected the route learnt from codfw means the as-path regex isn't matched. The regex only permits the BGP routes learnt directly from eqiad, neither of which is the selected best path:

cmooney@cr2-eqord> show route protocol bgp 208.80.154.224/32 aspath-regex "^(65002|65001)? 64600.*" 

inet.0: 833377 destinations, 2049658 routes (832435 active, 0 holddown, 1498 hidden)
Restart Complete
+ = Active Route, - = Last Active, * = Both

208.80.154.224/32   [BGP/170] 1w5d 21:57:29, MED 0, localpref 100, from 208.80.154.196
                      AS path: (65001) 64600 I, validation-state: unverified
                    > to 208.80.154.208 via xe-0/1/5.0
                    [BGP/170] 1w5d 21:57:29, MED 0, localpref 100, from 208.80.154.197
                      AS path: (65001) 64600 I, validation-state: unverified
                    > to 208.80.154.208 via xe-0/1/5.0
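
Putting the pieces together, the effect on the aggregate can be modelled roughly as follows. This sketch assumes that only the active (best) route for each more-specific prefix is offered to the aggregate policy, which is what the behaviour above suggests; `policy_accepts` is a hypothetical stand-in for BGP_from_LVS, not the real policy engine.

```python
from ipaddress import ip_network

AGGREGATE = ip_network("208.80.154.0/23")
SITE_SUB_AS = (65001, 65002)


def policy_accepts(confed_path, as_path):
    # Stand-in for BGP_from_LVS: at most one site sub-AS, then AS 64600.
    one_site_at_most = len(confed_path) <= 1 and all(
        asn in SITE_SUB_AS for asn in confed_path
    )
    return one_site_at_most and as_path[:1] == (64600,)


def aggregate_is_active(active_routes):
    """active_routes: (prefix, confed_path, as_path) tuples for best paths only."""
    for prefix, confed_path, as_path in active_routes:
        net = ip_network(prefix)
        more_specific = (
            net.subnet_of(AGGREGATE) and net.prefixlen > AGGREGATE.prefixlen
        )
        if more_specific and policy_accepts(confed_path, as_path):
            return True
    return False


# Best path learnt via codfw: no contributor passes, so no aggregate.
via_codfw = [("208.80.154.224/32", (65002, 65001), (64600,))]
# Best path learnt directly from eqiad: the aggregate is created.
direct = [("208.80.154.224/32", (65001,), (64600,))]

print(aggregate_is_active(via_codfw))  # False
print(aggregate_is_active(direct))     # True
```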

Event Timeline

CDanis triaged this task as Medium priority. May 19 2021, 3:09 PM
CDanis added a project: netops.

How to overcome this is a tricky question. I don't believe an AS-path filter on the aggregate can be used, as:

  • We know that a route with a longer sub-AS path can be selected as best.
  • Due to the use of confederations the next-hop is preserved.
  • The outbound interface for that next-hop is not based on anything in BGP.
  • The AS-path on the selected BGP route thus doesn't tell us anything about the status of the eqord -> eqiad link(s).

In terms of a solution there are two broad approaches, and possibly some hybrids of both:

  1. Change the BGP policy so that the route with the longer sub-AS path will not be selected, keeping policy on the aggregate the same.
  2. Change the policy on the aggregate so it will be created no matter what (sub) AS-path is on the BGP best route, but use some other criteria to not create it if the IGP route isn't optimal.

A possibility for number 1 would be:

  • Attach communities to prefixes at the sites they are originated (say ingress from LVS, or by LVS itself).
  • Match those communities ingress from peers, and increase local-pref, say for routes learnt from eqiad peer with eqiad origin community.

Or number 2:

  • Configure a different policy on the aggregate route statement for each prefix.
  • Match the "next-hop" IP in the BGP route so it will only match if the next-hop is via direct link to site using that aggregate.

Overall I'm more inclined to do number 1. But there are probably lots of ways to approach this, and we will need to consider which is best.

After discussion with @ayounsi on IRC he suggested looking at the use of the following command to address this:

set protocols bgp group <group_name> metric-out minimum-igp

I've used this in the past for other reasons and 100% agree it is the best way to proceed. Ultimately it addresses the root of the issue: the difference between what the BGP process and the OSPF process see. The command causes routers announcing prefixes to set the BGP MED/metric value based on the IGP/OSPF cost to the next-hop IP.

So in our case we expect the routers at codfw to set the value to the metric of the OSPF cost to get to eqiad, when announcing the route to eqord. In other words to set it to match the OSPF metric on the WAN links from codfw to eqiad.

There was some question over what the routers in eqiad would do when announcing locally originated BGP routes to eqord. As the route originates in eqiad, and the next-hop is local/direct, there is no OSPF cost on the route; it's just local/connected. To determine the behaviour in this scenario I labbed up the setup.

NOTE: All below is from local lab vMX instances running on my machine, not production boxes.

The initial lab configuration resulted in the following on the pretend cr2-eqord device:

admin@cr2-eqord> show route protocol bgp 208.80.154.224/32 detail 

inet.0: 13 destinations, 15 routes (13 active, 0 holddown, 0 hidden)
208.80.154.224/32 (2 entries, 1 announced)
        *BGP    Preference: 170/-101
                Next hop type: Indirect, Next hop index: 0
                Address: 0xb675cf0
                Next-hop reference count: 3
                Source: 208.80.153.193
                Next hop type: Router, Next hop index: 586
                Next hop: 208.80.154.208 via ge-0/0/1.0, selected
                Session Id: 0x140
                Protocol next hop: 10.64.1.13
                Indirect next hop: 0xba9c580 1048574 INH Session ID: 0x144
                State: <Active Int Ext>
                Local AS: 65020 Peer AS: 65002
                Age: 1:17 	Metric: 0 	Metric2: 240 
                Validation State: unverified 
                Task: BGP_65002.208.80.153.193+179
                Announcement bits (4): 0-KRT 4-BGP_RT_Background 5-Resolve tree 1 6-Resolve tree 2 
                AS path: (65002 65001) 64600 ? 
                Accepted
                Localpref: 100
                Router ID: 208.80.153.193
         BGP    Preference: 170/-101
                Next hop type: Indirect, Next hop index: 0
                Address: 0xb675cf0
                Next-hop reference count: 3
                Source: 208.80.154.197
                Next hop type: Router, Next hop index: 586
                Next hop: 208.80.154.208 via ge-0/0/1.0, selected
                Session Id: 0x140
                Protocol next hop: 10.64.1.13
                Indirect next hop: 0xba9c580 1048574 INH Session ID: 0x144
                State: <NotBest Int Ext>
                Inactive reason: Not Best in its group - Router ID
                Local AS: 65020 Peer AS: 65001
                Age: 1:18 	Metric: 0 	Metric2: 240 
                Validation State: unverified 
                Task: BGP_65001.208.80.154.197+179
                AS path: (65001) 64600 ? 
                Accepted
                Localpref: 100
                Router ID: 208.80.154.197

As in production you can see that the codfw announced prefix is preferred, tie-breaking on router-id value.

I then added the following command on the simulated cr2-eqiad and cr2-codfw routers:

set protocols bgp group Confed_eqord metric-out minimum-igp

This changed the preferred route as follows:

admin@cr2-eqord> show route protocol bgp 208.80.154.224/32 detail    

inet.0: 13 destinations, 15 routes (13 active, 0 holddown, 0 hidden)
208.80.154.224/32 (2 entries, 1 announced)
        *BGP    Preference: 170/-101
                Next hop type: Indirect, Next hop index: 0
                Address: 0xb675cf0
                Next-hop reference count: 3
                Source: 208.80.154.197
                Next hop type: Router, Next hop index: 586
                Next hop: 208.80.154.208 via ge-0/0/1.0, selected
                Session Id: 0x140
                Protocol next hop: 10.64.1.13
                Indirect next hop: 0xba9c580 1048574 INH Session ID: 0x146
                State: <Active Int Ext>
                Local AS: 65020 Peer AS: 65001
                Age: 23 	Metric: 0 	Metric2: 240 
                Validation State: unverified 
                Task: BGP_65001.208.80.154.197+179
                Announcement bits (4): 0-KRT 4-BGP_RT_Background 5-Resolve tree 1 6-Resolve tree 2 
                AS path: (65001) 64600 ? 
                Accepted
                Localpref: 100
                Router ID: 208.80.154.197
         BGP    Preference: 170/-101
                Next hop type: Indirect, Next hop index: 0
                Address: 0xb675cf0
                Next-hop reference count: 3
                Source: 208.80.153.193
                Next hop type: Router, Next hop index: 586
                Next hop: 208.80.154.208 via ge-0/0/1.0, selected
                Session Id: 0x140
                Protocol next hop: 10.64.1.13
                Indirect next hop: 0xba9c580 1048574 INH Session ID: 0x146
                State: <NotBest Int Ext Changed>
                Inactive reason: Not Best in its group - Route Metric or MED comparison
                Local AS: 65020 Peer AS: 65002
                Age: 23 	Metric: 340 	Metric2: 240 
                Validation State: unverified 
                Task: BGP_65002.208.80.153.193+179
                AS path: (65002 65001) 64600 ? 
                Accepted
                Localpref: 100
                Router ID: 208.80.153.193

Which works! To answer the question we had: after the introduction of the new command the eqiad router sets the MED to 0, given the next-hop is local and there is no OSPF cost to the local LVS from that router. The route from codfw is instead learnt with MED 340, matching the configured OSPF cost on the cr2-codfw -> cr2-eqiad link. So the router in eqord picks the route learnt from eqiad due to its lower MED, and there is no longer a discrepancy between the BGP best route and the outbound interface for the next-hop.
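
The lab result can be summarised with a small Python sketch. It assumes the advertised MED is simply the IGP cost to the next-hop, with a local/connected next-hop counting as 0 (as observed above), and it does not model how the value tracks later IGP changes; `advertised_med` is a hypothetical helper name.

```python
def advertised_med(igp_cost_to_next_hop):
    # Assumption: a local/connected next-hop (None here) is advertised as 0.
    return 0 if igp_cost_to_next_hop is None else igp_cost_to_next_hop


# Lab values from above: the codfw router reaches the eqiad next-hop at
# OSPF cost 340, while the eqiad router has the next-hop locally.
meds = {
    "eqiad": advertised_med(None),
    "codfw": advertised_med(340),
}

# With local-pref and AS-path length equal, the lower MED wins before
# router ID is ever consulted.
best = min(meds, key=meds.get)
print(best, meds[best])  # eqiad 0
```

Because the MED now reflects the IGP distance of the announcing router from the next-hop, BGP and OSPF agree on which site is closest.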

I think this should solve the issue generally, and would advise we configure this command on all our BGP peerings over transport links. I can't think of any particular issue but would welcome comments / observations if there is something I've missed.

That sounds great! Let's test it out next week. Thanks.

Mentioned in SAL (#wikimedia-operations) [2021-06-10T10:47:19Z] <topranks> T283163: Adding "metric-out minimum-igp" to BGP group Confed_eqord on eqiad, codfw and eqdfw CRs.

Ok configuration has been added to cr1-eqiad, cr2-eqiad and cr2-codfw (routers with transport links to eqord).

Looks to have been successful. We are now announcing 208.80.154.0/23 from eqord which was the desired result. I will wait until Monday to make sure there is no unexpected fallout then apply this to all Confed_x BGP groups across the estate.

NOTE: Below output is from actual production routers.
BEFORE
cmooney@cr2-eqord> show route protocol bgp 208.80.154.224/32 detail 

inet.0: 834514 destinations, 2055768 routes (833531 active, 1 holddown, 1604 hidden)
Restart Complete
208.80.154.224/32 (3 entries, 1 announced)
        *BGP    Preference: 170/-101
                Next hop type: Indirect, Next hop index: 0
                Address: 0x7aa2d1ec
                Next-hop reference count: 12
                Source: 208.80.153.193
                Next hop type: Router, Next hop index: 774
                Next hop: 208.80.154.208 via xe-0/1/5.0, selected
                Session Id: 0xbeba
                Protocol next hop: 10.64.1.13
                Indirect next hop: 0x7d22440 1048614 INH Session ID: 0xbde5
                State: <Active Int Ext>
                Local AS: 65020 Peer AS: 65002
                Age: 3d 2:20:42 	Metric: 0 	Metric2: 242 
                Validation State: unverified 
                Task: BGP_65002.208.80.153.193
                Announcement bits (6): 0-KRT 5-Aggregate 6-RT 8-BGP_RT_Background 9-Resolve tree 1 10-Resolve tree 2 
                AS path: (65002 65001) 64600 I 
                Accepted
                Localpref: 100
                Router ID: 208.80.153.193
         BGP    Preference: 170/-101
                Next hop type: Indirect, Next hop index: 0
                Address: 0x7aa2d1ec
                Next-hop reference count: 12
                Source: 208.80.154.196
                Next hop type: Router, Next hop index: 774
                Next hop: 208.80.154.208 via xe-0/1/5.0, selected
                Session Id: 0xbeba
                Protocol next hop: 10.64.1.13
                Indirect next hop: 0x7d22440 1048614 INH Session ID: 0xbde5
                State: <NotBest Int Ext>
                Inactive reason: Not Best in its group - Router ID
                Local AS: 65020 Peer AS: 65001
                Age: 3d 2:20:42 	Metric: 0 	Metric2: 242 
                Validation State: unverified 
                Task: BGP_65001.208.80.154.196
                AS path: (65001) 64600 I 
                Accepted
                Localpref: 100
                Router ID: 208.80.154.196
         BGP    Preference: 170/-101
                Next hop type: Indirect, Next hop index: 0
                Address: 0x7aa2d1ec
                Next-hop reference count: 12
                Source: 208.80.154.197
                Next hop type: Router, Next hop index: 774
                Next hop: 208.80.154.208 via xe-0/1/5.0, selected
                Session Id: 0xbeba
                Protocol next hop: 10.64.1.13
                Indirect next hop: 0x7d22440 1048614 INH Session ID: 0xbde5
                State: <NotBest Int Ext>
                Inactive reason: Not Best in its group - Router ID
                Local AS: 65020 Peer AS: 65001
                Age: 3d 2:20:42 	Metric: 0 	Metric2: 242 
                Validation State: unverified 
                Task: BGP_65001.208.80.154.197
                AS path: (65001) 64600 I 
                Accepted
                Localpref: 100
                Router ID: 208.80.154.197

cmooney@cr2-eqord> 

cmooney@cr2-eqord> show route advertising-protocol bgp 208.115.136.231    

inet.0: 834514 destinations, 2055770 routes (833532 active, 0 holddown, 1604 hidden)
Restart Complete
  Prefix		  Nexthop	       MED     Lclpref    AS path
* 185.15.56.0/24          Self                                    I
* 198.73.209.0/24         Self                                    11820 ?
* 208.80.152.0/23         Self                                    I

AFTER
cmooney@cr2-eqord> show route protocol bgp 208.80.154.224/32 detail    

inet.0: 834436 destinations, 2055626 routes (833437 active, 18 holddown, 1604 hidden)
Restart Complete
208.80.154.224/32 (3 entries, 1 announced)
        State: <Record Pending>
        *BGP    Preference: 170/-101
                Next hop type: Indirect, Next hop index: 0
                Address: 0x7aa2d1ec
                Next-hop reference count: 13
                Source: 208.80.154.196
                Next hop type: Router, Next hop index: 774
                Next hop: 208.80.154.208 via xe-0/1/5.0, selected
                Session Id: 0xbeba
                Protocol next hop: 10.64.1.13
                Indirect next hop: 0x7d22440 1048614 INH Session ID: 0xbde5
                State: <Active Int Ext>
                Local AS: 65020 Peer AS: 65001
                Age: 3d 2:26:43 	Metric: 0 	Metric2: 242 
                Validation State: unverified 
                Task: BGP_65001.208.80.154.196
                Announcement bits (6): 0-KRT 5-Aggregate 6-RT 8-BGP_RT_Background 9-Resolve tree 1 10-Resolve tree 2 
                AS path: (65001) 64600 I 
                Accepted
                Localpref: 100
                Router ID: 208.80.154.196
         BGP    Preference: 170/-101
                Next hop type: Indirect, Next hop index: 0
                Address: 0x7aa2d1ec
                Next-hop reference count: 13
                Source: 208.80.154.197
                Next hop type: Router, Next hop index: 774
                Next hop: 208.80.154.208 via xe-0/1/5.0, selected
                Session Id: 0xbeba
                Protocol next hop: 10.64.1.13
                Indirect next hop: 0x7d22440 1048614 INH Session ID: 0xbde5
                State: <NotBest Int Ext>
                Inactive reason: Not Best in its group - Router ID
                Local AS: 65020 Peer AS: 65001
                Age: 3d 2:26:43 	Metric: 0 	Metric2: 242 
                Validation State: unverified 
                Task: BGP_65001.208.80.154.197
                AS path: (65001) 64600 I 
                Accepted
                Localpref: 100
                Router ID: 208.80.154.197
         BGP    Preference: 170/-101
                Next hop type: Indirect, Next hop index: 0
                Address: 0x7aa2d1ec
                Next-hop reference count: 13
                Source: 208.80.153.193
                Next hop type: Router, Next hop index: 774
                Next hop: 208.80.154.208 via xe-0/1/5.0, selected
                Session Id: 0xbeba
                Protocol next hop: 10.64.1.13
                Indirect next hop: 0x7d22440 1048614 INH Session ID: 0xbde5
                State: <NotBest Int Ext>
                Inactive reason: Not Best in its group - Route Metric or MED comparison
                Local AS: 65020 Peer AS: 65002
                Age: 2 	Metric: 342 	Metric2: 242 
                Validation State: unverified 
                Task: BGP_65002.208.80.153.193
                AS path: (65002 65001) 64600 I 
                Accepted
                Localpref: 100
                Router ID: 208.80.153.193



cmooney@cr2-eqord> show route advertising-protocol bgp 208.115.136.231    

inet.0: 834392 destinations, 2055551 routes (833409 active, 0 holddown, 1603 hidden)
Restart Complete
  Prefix		  Nexthop	       MED     Lclpref    AS path
* 185.15.56.0/24          Self                                    I
* 198.73.209.0/24         Self                                    11820 ?
* 208.80.152.0/23         Self                                    I
* 208.80.154.0/23         Self                                    I

Mentioned in SAL (#wikimedia-operations) [2021-06-14T10:51:59Z] <topranks> T283163: Adding "metric-out minimum-igp" to all internal/Confed BGP groups on CR routers.

This configuration has been rolled out now across all CR routers.

All looks ok. There is a slight increase in traffic in via eqord, and a slight decrease in eqiad, but nothing massive that should cause concern. I will keep monitoring the situation as the day goes on, and if there are no issues will close this task in a day or two.