There is a gap in the logic for originating aggregate routes into BGP on our core routers. This came to light following a query from Bell Canada asking why we were announcing IPv4 prefix 184.108.40.206/23 to them only from eqiad, and not from eqord.
The current configuration is set up to create certain aggregate routes if any longer prefixes from within them are present in the BGP RIB. In our case it should mean that the presence of routes originated from LVS at remote sites will trigger the creation of the configured aggregates. Looking at the above prefix the relevant config on cr2-eqord is like this:
set routing-options aggregate route 220.127.116.11/23 policy BGP_from_LVS
set policy-options policy-statement BGP_from_LVS term BGP_core_and_local_LVS from protocol bgp
set policy-options policy-statement BGP_from_LVS term BGP_core_and_local_LVS from as-path core_and_local_LVS
set policy-options policy-statement BGP_from_LVS term BGP_core_and_local_LVS then accept
set policy-options policy-statement BGP_from_LVS then reject
set policy-options as-path core_and_local_LVS "^(65002|65001)? 64600.*"
The only condition on creating the aggregate is the as-path regex in the last line. All LVS instances use ASN 64600, so the regex basically says "routes originated from the LVS ASN, preceded by at most one of AS65002 (codfw) or AS65001 (eqiad) in the path, but not both". The intent of the configuration is to create the aggregate route only if it is being learnt directly from the site where it is being used (codfw or eqiad), and to avoid doing so if it is being learnt via another site. The idea is that we don't want to announce a route in Chicago for an Ashburn prefix if our network would reach it via Dallas, for instance if our transport links from Chicago to Ashburn are down.
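To make the regex semantics concrete, here is a minimal Python sketch that approximates the Junos as-path regular expression with a Python `re` pattern. It is an approximation only: Junos matches on whole AS numbers rather than characters, and the translation below (the `AS_PATH_RE` name, the helper function, and the space-separated path strings with confederation brackets dropped) is an assumption made for illustration.

```python
import re

# Hypothetical character-level translation of the Junos as-path regex
#   "^(65002|65001)? 64600.*"
# Junos terms are whole AS numbers, so ".*" is rendered here as
# "zero or more further AS numbers".
AS_PATH_RE = re.compile(r"^(?:(?:65002|65001) )?64600(?: \d+)*$")


def matches_core_and_local_lvs(as_path: str) -> bool:
    """Return True if a space-separated AS path matches the approximated regex."""
    return AS_PATH_RE.match(as_path) is not None


if __name__ == "__main__":
    for path in ("64600", "65001 64600", "65002 64600", "65002 65001 64600"):
        print(f"{path!r:22} -> {matches_core_and_local_lvs(path)}")
```

Under this approximation a path containing both sub-ASes, such as `65002 65001 64600`, is rejected, while a path with a single sub-AS (or none) ahead of 64600 is accepted.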
A problem can occur, however, because of the interaction between BGP and OSPF and the way the BGP best-path algorithm works when confederations are in use. Consider the current BGP best path for prefix 18.104.22.168/32 (originated by lvs1013 in eqiad) on cr2-eqord:
cmooney@cr2-eqord> show route protocol bgp 22.214.171.124/32 terse
inet.0: 833225 destinations, 2049355 routes (832285 active, 1 holddown, 1511 hidden)
Restart Complete
+ = Active Route, - = Last Active, * = Both
A V Destination        P Prf     Metric 1   Metric 2  Next hop         AS path
* ? 126.96.36.199/32   B 170          100          0                   (65002 65001) 64600 I
  unverified                                          >188.8.131.52
  ?                    B 170          100          0                   (65001) 64600 I
  unverified                                          >184.108.40.206
  ?                    B 170          100          0                   (65001) 64600 I
  unverified                                          >220.127.116.11
Local preference and MED are the same on all 3 of these routes. The router has ended up using the router-id of each as the tie-break, resulting in the one learnt from codfw being selected. Notably, the fact that there are 2 sub-ASes (65002 65001) in the path for this route, as opposed to only 1 (65001) on those learnt directly from eqiad, is not considered when selecting the best path. This is normal behaviour with BGP confederations: the sub-AS path is not counted when comparing as-path length (RFC5065 5.3.3).
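The path-length rule can be illustrated with a short Python sketch. It models an AS path as a list of segments and counts only the non-confederation segments, which is the behaviour described above. The `Segment` type and the function name are hypothetical, and the sketch covers only the length comparison, not the full best-path algorithm (AS_SET handling and later tie-breaks are omitted).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Segment:
    """One AS_PATH segment; confed=True marks a confederation sub-AS sequence."""
    asns: Tuple[int, ...]
    confed: bool = False


def as_path_length(path: List[Segment]) -> int:
    """Length used for best-path comparison: confederation segments are skipped."""
    return sum(len(seg.asns) for seg in path if not seg.confed)


# The two path shapes seen on cr2-eqord for the LVS /32:
#   (65002 65001) 64600   learnt via codfw
#   (65001) 64600         learnt directly from eqiad
via_codfw = [Segment((65002, 65001), confed=True), Segment((64600,))]
direct_eqiad = [Segment((65001,), confed=True), Segment((64600,))]

if __name__ == "__main__":
    print(as_path_length(via_codfw), as_path_length(direct_eqiad))
```

Both paths come out at the same length, so this step cannot prefer the direct eqiad routes and the decision falls through to later tie-breaks.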
For our normal routing to this prefix it is not in any way an issue that BGP has selected the route that propagated through codfw to get to eqord. Whether learnt from codfw, or direct from eqiad, the next-hop in the BGP message is the IP of the originating LVS server in eqiad, 10.64.1.13:
cmooney@cr2-eqord> show route protocol bgp 18.104.22.168/32 detail | match "Source|Protocol Next Hop"
                Source: 22.214.171.124
                Protocol next hop: 10.64.1.13
                Source: 126.96.36.199
                Protocol next hop: 10.64.1.13
                Source: 188.8.131.52
                Protocol next hop: 10.64.1.13
So regardless of which BGP route is selected, the same (indirect) next-hop IP is going to be used, and in normal circumstances traffic will route directly to eqiad due to the lower IGP cost:
cmooney@cr2-eqord> show route 10.64.1.13
inet.0: 833306 destinations, 2049530 routes (832369 active, 0 holddown, 1493 hidden)
Restart Complete
+ = Active Route, - = Last Active, * = Both
10.64.0.0/22       *[OSPF/10] 1w5d 21:47:37, metric 242
                    > to 184.108.40.206 via xe-0/1/5.0
Going back to the BGP policy on the aggregate route, however, there is an issue. Although the traffic still routes directly to eqiad, the fact that BGP has selected the route learnt from codfw means the as-path regex isn't matched. The regex only permits the BGP routes learnt directly from eqiad, neither of which is the selected best path:
cmooney@cr2-eqord> show route protocol bgp 220.127.116.11/32 aspath-regex "^(65002|65001)? 64600.*"
inet.0: 833377 destinations, 2049658 routes (832435 active, 0 holddown, 1498 hidden)
Restart Complete
+ = Active Route, - = Last Active, * = Both
18.104.22.168/32    [BGP/170] 1w5d 21:57:29, MED 0, localpref 100, from 22.214.171.124
                      AS path: (65001) 64600 I, validation-state: unverified
                    > to 126.96.36.199 via xe-0/1/5.0
                    [BGP/170] 1w5d 21:57:29, MED 0, localpref 100, from 188.8.131.52
                      AS path: (65001) 64600 I, validation-state: unverified
                    > to 184.108.40.206 via xe-0/1/5.0
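Putting the pieces together, the failure can be sketched in Python as follows. The sketch assumes that only the active (best) route for a more-specific prefix is evaluated against the aggregate's policy, which is how the behaviour above is being interpreted. The `Route` records, the `policy_accepts` helper and the `learnt_from` labels are hypothetical simplifications, not real router state.

```python
from typing import List, NamedTuple, Tuple


class Route(NamedTuple):
    learnt_from: str               # illustrative label, not a real peer address
    confed_path: Tuple[int, ...]   # confederation sub-ASes, e.g. "(65002 65001)"
    origin_as: int
    active: bool                   # True for the route BGP selected as best


ALLOWED_SUB_AS = {65001, 65002}
LVS_ASN = 64600


def policy_accepts(route: Route) -> bool:
    """Approximates BGP_from_LVS: at most one allowed sub-AS, then the LVS ASN."""
    if route.origin_as != LVS_ASN or len(route.confed_path) > 1:
        return False
    return all(asn in ALLOWED_SUB_AS for asn in route.confed_path)


def aggregate_is_created(routes: List[Route]) -> bool:
    """Assumption: only active routes can contribute to the aggregate."""
    return any(r.active and policy_accepts(r) for r in routes)


# The three routes for the LVS /32 as seen on cr2-eqord
routes = [
    Route("codfw", (65002, 65001), 64600, active=True),
    Route("eqiad-a", (65001,), 64600, active=False),
    Route("eqiad-b", (65001,), 64600, active=False),
]

if __name__ == "__main__":
    print(aggregate_is_created(routes))  # False: the active route fails the policy
```

If one of the routes learnt directly from eqiad were the active one instead, the same check would return True and the aggregate would be created.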