
Use next-hop-self for iBGP sessions
Closed, ResolvedPublic

Description

As pointed out by @cmooney, using next-hop-self on advertisements between iBGP peers (core routers) should solve the Equinix Ashburn IXP issue (see T295650 and T293726#7454820).

Even if IX_PEER_A is directly reachable from cr1 and cr2, if it only peers with cr2, the prefixes will be sent from cr2 to cr1 with cr2's IP as next-hop (instead of IX_PEER_A's IP previously).
Another way to put it is that traffic will follow the BGP sessions.
Turning up peering with IX_PEER_A on cr1 will make that peer redundant as now cr1 will send traffic directly to that peer.

I can't think of any downside, but this needs to be carefully rolled out. Maybe de-pooling ulsfo then testing the change there.


https://wikitech.wikimedia.org/wiki/Incidents/2021-10-22_eqiad_return_path_timeouts
https://wikitech.wikimedia.org/wiki/Incidents/2021-11-23_Core_Network_Routing

Event Timeline

ayounsi created this task.
Change 738899 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Add policy-statement to CRs which sets next-hop self in iBGP.

https://gerrit.wikimedia.org/r/738899

Change 738899 merged by jenkins-bot:

[operations/homer/public@master] Add policy-statement to CRs which sets next-hop self in iBGP.

https://gerrit.wikimedia.org/r/738899

So of course there is a complication.

Currently we have a single BGP session between adjacent CR routers, peered over the loopback IPv4 addresses either side. Both IPv4 and IPv6 routes are exchanged over this single session.

That works fine as things are, when the next-hop for each prefix is unmodified (i.e. it remains as the external peer's address at an exchange or whatever). If we implement the "next-hop self" logic, however, we run into a problem.

It works fine for the IPv4 prefixes, but for IPv6 routes, as the BGP session is built over IPv4 addresses, the BGP daemon does not know what IPv6 address to set. Even if there are IPv6 addresses on the interfaces involved (in this case lo0.0), BGP doesn't know anything about those. So it will send the route with a next-hop of ::ffff:<ipv4_addr>. The other CR receives this, but has no route to that address and thus cannot use the route.
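
To make the ::ffff:<ipv4_addr> form concrete, here is a minimal Python sketch using the standard `ipaddress` module. It only illustrates the address arithmetic and is not a model of the BGP daemon. The helper name `mapped_next_hop` is hypothetical; the two addresses are the cr3-ulsfo loopbacks from the config further down.

```python
import ipaddress


def mapped_next_hop(v4_local: str) -> ipaddress.IPv6Address:
    """Return the IPv4-mapped IPv6 address (::ffff:a.b.c.d) for an IPv4 address."""
    v4 = ipaddress.IPv4Address(v4_local)
    # The mapped form places the 32 IPv4 bits behind the ::ffff:0:0/96 prefix.
    return ipaddress.IPv6Address((0xFFFF << 32) | int(v4))


nh = mapped_next_hop("198.35.26.192")
loopback6 = ipaddress.IPv6Address("2620:0:863:ffff::1")

print(nh.ipv4_mapped)   # the IPv4 address embedded in the mapped next-hop
print(nh == loopback6)  # False: the mapped address is not the real IPv6 loopback
```

The mapped address is unrelated to the router's real IPv6 loopback, so unless something routes ::ffff:0:0/96 the receiving CR has nothing to resolve the next-hop against.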

The solution to this issue is to create separate BGP peerings between CRs at a site, one between the IPv4 loopback addresses, and one between the IPv6 loopback addresses. And use the peering over each type of address to exchange only routes matching that address family. This is best practice anyway I believe.

Config

The config for this is still manual on our CRs, although it is on the agenda for automation. Given that is the case, after discussion with @ayounsi on irc, I think the best way forward is to make the minimal required changes to move to this situation. So that means keeping the same, single BGP group for the iBGP sessions, and moving the "address family" and "local address" statements into the terms for each neighbor definition.

For example, the eventual config for cr3-ulsfo, would be as follows:

policy-options {
    policy-statement iBGP_nh_self {
        /* Sets next-hop to router's own interface IP when announcing prefixes in iBGP - T295672 */
        then {
            next-hop self;
        }
    }
}

protocols {
    bgp {
        group Confed_ulsfo {
            type internal;
            metric-out minimum-igp;
            import iBGP_rpki;
            export iBGP_nh_self;
            peer-as 65004;
            local-as 65004 no-prepend-global-as;
            neighbor 198.35.26.193 {
                local-address 198.35.26.192;
                family inet {
                    any;
                }
            }
            neighbor 2620:0:863:ffff::2 {
                local-address 2620:0:863:ffff::1;
                family inet6 {
                    any;
                }
            }
        }
    }
}
Implementation

I believe the following order of events is probably best to implement this, at a given site:

  1. Move the "local address", "family inet" and "family inet6" statements into the "neighbor" section for the existing IPv4 session in the local BGP group.
  2. Remove those statements from the global config for the local BGP group.
  3. Add the new neighbor definition for the IPv6 peering between loopbacks, only enabling this neighbor for "family inet6".
    • At this point we should have 2 sessions, 1 over IPv4 exchanging both types of routes, and 1 over IPv6 exchanging only v6 routes. Once we validate that the same set of IPv6 routes are being learnt over the v6 peering we can proceed.
  4. Remove the "family inet6" statement for the v4 neighbor, to cease exchanging IPv6 routes over the v4 addresses.
    • We need to validate routing tables and all looks ok at this stage.
  5. Add the "export iBGP_nh_self;" statement to the new IPv6 neighbor definition.
    • This should cause the IPv6 sent routes to have their next-hop changed to the loopback address of the routers either side. Logic of doing the IPv6 neighbor only / doing it first is that happy eyeballs should mostly mitigate problems in the unlikely scenario that something goes wrong.
  6. Remove the "export iBGP_nh_self;" statement from the IPv6 neighbor definition, and instead add it at the group level, so it affects both peerings
    • Again obviously we need to validate all looks ok.
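
The validation after step 3 (checking that the same IPv6 routes are learnt over both sessions) amounts to a set comparison. Below is a rough Python sketch of that check. It assumes the prefixes have already been extracted from the router output into sets; the function names and the 2001:db8::/32 documentation prefixes are placeholders, not anything taken from the CRs.

```python
def compare_sessions(v4_session: set, v6_session: set) -> dict:
    """Compare IPv6 prefixes learnt over the IPv4 session with those learnt over the IPv6 session."""
    return {
        "only_v4_session": v4_session - v6_session,
        "only_v6_session": v6_session - v4_session,
    }


def safe_to_proceed(diff: dict) -> bool:
    """True only when both sessions carry exactly the same set of prefixes."""
    return not diff["only_v4_session"] and not diff["only_v6_session"]


# Placeholder data: in practice these sets would come from the received routes of each neighbor.
learnt_over_v4 = {"2001:db8:1::/48", "2001:db8:2::/48", "2001:db8:3::/48"}
learnt_over_v6 = {"2001:db8:1::/48", "2001:db8:2::/48"}

diff = compare_sessions(learnt_over_v4, learnt_over_v6)
print(sorted(diff["only_v4_session"]))  # prefixes that would be lost at step 4
print(safe_to_proceed(diff))
```

Any prefix that shows up only on the IPv4 session is one that would disappear when "family inet6" is removed from the v4 neighbor in step 4.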

@ayounsi could you review when you have time and advise if this sounds / looks ok? If we are agreed I can start making more detailed plans for the config at each site, and progress to implementation, using ulsfo as the initial/test site.

Also worth mentioning that from my limited understanding of the "apply-path" stuff I think the "loopback6" filter will allow the new IPv6 session without any changes. Thanks.

For the sake of completeness, another option could be to add the ::ffff:<v4> IP to the loopback interface, but that would be more of a workaround than a long-term solution.

Your plan looks good to me, and you are right about the apply-path. Once the neighbor is added, you can double check with:
cr3-ulsfo# show policy-options prefix-list bgp-sessions | display inheritance

Out of scope, there is:

  • changing family X any to unicast (as you pointed it out on irc)
  • Automating that configuration statement
  • Adding BFD
  • Standardizing the group name across all sites (it's named iBGP in esams, possibly because it also includes knams)

Change 739479 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/dns@master] Depool ulsfo to allow for safe reconfig of CR routers there

https://gerrit.wikimedia.org/r/739479

Change 739479 merged by Cathal Mooney:

[operations/dns@master] Depool ulsfo to allow for safe reconfig of CR routers there

https://gerrit.wikimedia.org/r/739479

Mentioned in SAL (#wikimedia-operations) [2021-11-17T10:14:08Z] <topranks> De-pool ulsfo in DNS to allow safe reconfiguration / test of changes to CR routers iBGP (T295672)

Mentioned in SAL (#wikimedia-operations) [2021-11-17T10:45:16Z] <topranks> Commencing manual config on cr3-ulsfo and cr4-ulsfo (site depooled) to reconfigure iBGP (T295672)

Mentioned in SAL (#wikimedia-operations) [2021-11-17T12:17:18Z] <topranks> Re-pooling ulsfo after completing routing changes on cr3-ulsfo and cr4-ulsfo (T295672)

Change went well in ulsfo earlier. De-pooled the site in DNS first and then proceeded with steps as outlined above.

All went as expected. It did take several minutes for iBGP to reconverge after adding the export policy, during which time the CRs began using their local transit links to get to networks they had previously used iBGP routes for (due to shorter AS-path learnt from adjacent CR etc.). So de-pooling seems wise to avoid any active flows being affected by a change in return path during reconvergence.

That said, both CRs had full internet reachability throughout; the active path just changed and then changed back during the move.

Looking at some random prefixes on cr3-ulsfo you can see how the change affected the routes learnt in BGP, but that ultimately the next-hop (route to other CR loopback in OSPF), didn't change.

I'll start planning to make the equivalent change in eqiad, after which we can re-enable the second Equinix IXP port there.

Before
cmooney@cr3-ulsfo> show route table inet.0 receive-protocol bgp 198.35.26.193 | last 10 | no-more 
  223.255.245.0/24        129.250.204.5        0       100        2914 6453 4755 45820 45954 45954 45954 45954 ?
  223.255.246.0/24        129.250.204.5        0       100        2914 6453 4755 45820 45954 45954 45954 45954 ?
  223.255.247.0/24        129.250.204.5        0       100        2914 6453 4755 45820 45954 45954 45954 45954 ?
  223.255.248.0/24        129.250.204.5        0       100        2914 174 63199 ?
  223.255.249.0/24        129.250.204.5        0       100        2914 174 63199 ?
  223.255.250.0/24        129.250.204.5        0       100        2914 174 63199 ?
  223.255.251.0/24        129.250.204.5        0       100        2914 174 63199 ?
  223.255.252.0/24        129.250.204.5        0       100        2914 4134 58519 ?
  223.255.253.0/24        129.250.204.5        0       100        2914 4134 58519 ?
* 223.255.254.0/24        198.32.176.113       0       250        7473 3758 55415 ?
cmooney@cr3-ulsfo> show route protocol bgp 223.255.254.0/24 exact detail 

inet.0: 858488 destinations, 2465043 routes (858366 active, 4 holddown, 213 hidden)
Restart Complete
223.255.254.0/24 (4 entries, 1 announced)
        *BGP    Preference: 170/-251
                Next hop type: Indirect, Next hop index: 0
                Address: 0x6d74341c
                Next-hop reference count: 40544
                Source: 198.35.26.193
                Next hop type: Router, Next hop index: 0
                Next hop: 198.35.26.197 via ae0.2 weight 0x1, selected
                Session Id: 0x0
                Next hop: 198.35.26.199 via et-0/0/1.401 weight 0xf000
                Session Id: 0x0
                Protocol next hop: 198.32.176.113
                Indirect next hop: 0x7d3a700 1048663 INH Session ID: 0x563a
                State: <Active Int Ext>
                Local AS: 65004 Peer AS: 65004
                Age: 12:56 	Metric: 0 	Metric2: 2 
                Validation State: unknown 
                Task: BGP_65004.198.35.26.193
                Announcement bits (6): 0-KRT 5-Aggregate 7-RT 9-BGP_RT_Background 10-Resolve tree 1 11-Resolve tree 2 
                AS path: 7473 3758 55415 ? 
                Communities: 3758:201 7473:10000 7473:12156 7473:12167 7473:12177 7473:12186 7473:12206 7473:12207 7473:12216 7473:12217 7473:12226 7473:12227 7473:12236 7473:12237 7473:20000 7473:41101 14907:3 unknown iana opaque 0x4300:0x0:0x1
                Accepted
                Localpref: 250
                Router ID: 198.35.26.193
cmooney@cr3-ulsfo> show route 223.255.254.0/24         

inet.0: 858486 destinations, 2465031 routes (858368 active, 0 holddown, 213 hidden)
Restart Complete
+ = Active Route, - = Last Active, * = Both

223.255.254.0/24   *[BGP/170] 00:13:24, MED 0, localpref 250, from 198.35.26.193
                      AS path: 7473 3758 55415 ?, validation-state: unknown
                    > to 198.35.26.197 via ae0.2
                      to 198.35.26.199 via et-0/0/1.401
After
cmooney@cr3-ulsfo> show route table inet.0 receive-protocol bgp 198.35.26.193 | last 10 | no-more    
  223.255.245.0/24        198.35.26.193        0       100        2914 6453 4755 45820 45954 45954 45954 45954 ?
  223.255.246.0/24        198.35.26.193        0       100        2914 6453 4755 45820 45954 45954 45954 45954 ?
  223.255.247.0/24        198.35.26.193        0       100        2914 6453 4755 45820 45954 45954 45954 45954 ?
  223.255.248.0/24        198.35.26.193        0       100        2914 174 63199 ?
  223.255.249.0/24        198.35.26.193        0       100        2914 174 63199 ?
  223.255.250.0/24        198.35.26.193        0       100        2914 174 63199 ?
  223.255.251.0/24        198.35.26.193        0       100        2914 174 63199 ?
  223.255.252.0/24        198.35.26.193        0       100        2914 4134 58519 ?
  223.255.253.0/24        198.35.26.193        0       100        2914 4134 58519 ?
* 223.255.254.0/24        198.35.26.193        0       250        7473 3758 55415 ?
cmooney@cr3-ulsfo> show route protocol bgp 223.255.254.0/24 exact detail                             

inet.0: 858429 destinations, 2464894 routes (858314 active, 4 holddown, 173 hidden)
Restart Complete
223.255.254.0/24 (4 entries, 1 announced)
        *BGP    Preference: 170/-251
                Next hop type: Indirect, Next hop index: 0
                Address: 0x7892530c
                Next-hop reference count: 655496
                Source: 198.35.26.193
                Next hop type: Router, Next hop index: 0
                Next hop: 198.35.26.197 via ae0.2 weight 0x1, selected
                Session Id: 0x0
                Next hop: 198.35.26.199 via et-0/0/1.401 weight 0xf000
                Session Id: 0x0
                Protocol next hop: 198.35.26.193
                Indirect next hop: 0x7d15b00 1048582 INH Session ID: 0x21e4f
                State: <Active Int Ext>
                Local AS: 65004 Peer AS: 65004
                Age: 3:09 	Metric: 0 	Metric2: 2 
                Validation State: unknown 
                Task: BGP_65004.198.35.26.193
                Announcement bits (6): 0-KRT 5-Aggregate 7-RT 9-BGP_RT_Background 10-Resolve tree 1 11-Resolve tree 2 
                AS path: 7473 3758 55415 ? 
                Communities: 3758:201 7473:10000 7473:12156 7473:12167 7473:12177 7473:12186 7473:12206 7473:12207 7473:12216 7473:12217 7473:12226 7473:12227 7473:12236 7473:12237 7473:20000 7473:41101 14907:3 unknown iana opaque 0x4300:0x0:0x1
                Accepted
                Localpref: 250
                Router ID: 198.35.26.193
cmooney@cr3-ulsfo> show route 223.255.254.0/24                                                       

inet.0: 858462 destinations, 2464953 routes (858350 active, 1 holddown, 173 hidden)
Restart Complete
+ = Active Route, - = Last Active, * = Both

223.255.254.0/24   *[BGP/170] 00:03:24, MED 0, localpref 250, from 198.35.26.193
                      AS path: 7473 3758 55415 ?, validation-state: unknown
                    > to 198.35.26.197 via ae0.2
                      to 198.35.26.199 via et-0/0/1.401
cmooney@cr3-ulsfo> ping 223.255.254.253 source 198.35.26.192   
PING 223.255.254.253 (223.255.254.253): 56 data bytes
64 bytes from 223.255.254.253: icmp_seq=0 ttl=243 time=191.794 ms
64 bytes from 223.255.254.253: icmp_seq=1 ttl=243 time=183.523 ms
^C
--- 223.255.254.253 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max/stddev = 183.523/187.659/191.794/4.135 ms
cmooney@cr3-ulsfo> traceroute 223.255.254.253 source 198.35.26.192 no-resolve wait 1 
traceroute to 223.255.254.253 (223.255.254.253) from 198.35.26.192, 30 hops max, 52 byte packets
 1  198.35.26.197  0.675 ms  0.896 ms  0.831 ms
 2  198.32.176.50  1.155 ms  1.068 ms  1.079 ms
 3  203.208.171.185  183.938 ms  183.986 ms 203.208.172.233  1.604 ms
     MPLS Label=338769 CoS=0 TTL=1 S=1
 4  203.208.158.45  181.892 ms 203.208.182.249  186.383 ms 203.208.151.217  227.053 ms
     MPLS Label=1188 CoS=0 TTL=1 S=1
 5  203.208.166.234  180.477 ms 203.208.182.250  180.368 ms 203.208.166.234  167.959 ms
 6  203.208.158.190  174.684 ms 203.208.177.218  181.019 ms 203.208.153.186  175.172 ms
 7  165.21.138.89  184.554 ms 165.21.138.93  179.388 ms 165.21.138.89  174.240 ms
 8  165.21.138.69  181.332 ms 203.208.177.218  180.365 ms 165.21.138.69  187.976 ms
 9  165.21.138.93  180.632 ms 165.21.138.134  183.236 ms 165.21.138.130  182.925 ms
10  128.106.31.174  181.198 ms  203.509 ms  168.754 ms
11  * * 165.21.138.130  179.629 ms
12  128.106.31.174  175.733 ms  175.627 ms  169.306 ms
13  * * *
14  * * *
15  * * *

Change 739601 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Modifying globbing for partman recipie for rpki VMs

https://gerrit.wikimedia.org/r/739601

Change 739601 merged by Cathal Mooney:

[operations/puppet@production] Modifying globbing for partman recipie for rpki VMs

https://gerrit.wikimedia.org/r/739601

Change 739703 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/dns@master] Depool eqiad at DNS level to faciliate iBGP reconfig on CRs

https://gerrit.wikimedia.org/r/739703

Change 739703 merged by Cathal Mooney:

[operations/dns@master] Depool eqiad at DNS level to faciliate iBGP reconfig on CRs

https://gerrit.wikimedia.org/r/739703

Mentioned in SAL (#wikimedia-operations) [2021-11-18T08:01:40Z] <topranks> Depooling eqiad in authdns to allow for reconfiguration of CR routers on site (T295672)

Mentioned in SAL (#wikimedia-operations) [2021-11-18T10:12:53Z] <topranks> Re-pooling eqiad in DNS after completing iBGP policy changes on cr1-eqiad and cr2-eqiad T295672

Change completed successfully in eqiad.

Before
cmooney@re0.cr1-eqiad> show route receive-protocol bgp 208.80.154.197             

inet.0: 860268 destinations, 3385014 routes (859884 active, 2 holddown, 2248 hidden)
Restart Complete
  Prefix		  Nexthop	       MED     Lclpref    AS path
* 1.0.0.0/24              206.126.237.30       0       280        13335 ?
* 1.0.4.0/22              206.126.237.175      0       250        4826 38803 ?
* 1.0.4.0/24              206.126.237.175      0       250        4826 38803 ?
* 1.0.5.0/24              206.126.237.175      0       250        4826 38803 ?
* 1.0.6.0/24              206.126.237.175      0       250        4826 38803 ?
* 1.0.7.0/24              206.126.237.175      0       250        4826 38803 ?
* 1.0.64.0/18             206.126.237.239      0       250        4637 7670 18144 ?
* 1.0.128.0/17            206.126.236.73       0       250        1273 38040 23969 ?
* 1.0.128.0/18            206.126.236.6        0       250        6762 38040 23969 ?
* 1.0.128.0/19            206.126.236.6        0       250        6762 38040 23969 ?
* 1.0.128.0/24            206.126.237.239      0       250        4637 4651 23969 ?
* 1.0.129.0/24            206.126.236.6        0       250        6762 38040 23969 ?
cmooney@re0.cr1-eqiad> show route protocol bgp 1.0.0.0/24 detail             

inet.0: 860270 destinations, 3385010 routes (859886 active, 2 holddown, 2248 hidden)
Restart Complete
1.0.0.0/24 (5 entries, 1 announced)
        *BGP    Preference: 170/-281
                Next hop type: Indirect, Next hop index: 0
                Address: 0x1cc83b8c
                Next-hop reference count: 3584
                Source: 208.80.154.197
                Next hop type: Router, Next hop index: 0
                Next hop: 208.80.154.194 via ae0.0 weight 0x1, selected
                Session Id: 0x0
                Next hop: 208.80.153.215 via xe-4/2/2.12 weight 0xf000
                Session Id: 0x0
                Protocol next hop: 206.126.237.30
                Indirect next hop: 0x7934080 1048751 INH Session ID: 0x2f21b
                State: <Active Int Ext>
                Local AS: 65001 Peer AS: 65001
                Age: 3d 20:22:47 	Metric: 0 	Metric2: 3 
                Validation State: valid 
                Task: BGP_65001.208.80.154.197+179
                Announcement bits (6): 0-KRT 5-Aggregate 7-RT 9-BGP_RT_Background 10-Resolve tree 1 12-Resolve tree 2 
                AS path: 13335 ? 
                Aggregator: 13335 172.70.32.1
                Communities: 13335:10350 13335:19000 13335:20050 13335:20500 13335:20530 14907:3 14907:12 unknown iana opaque 0x4300:0x0:0x0
                Accepted
                Localpref: 280
                Router ID: 208.80.154.197
cmooney@re0.cr1-eqiad> show route 1.0.0.0/24                        

inet.0: 860263 destinations, 3385006 routes (859880 active, 1 holddown, 2248 hidden)
Restart Complete
+ = Active Route, - = Last Active, * = Both

1.0.0.0/24         *[BGP/170] 3d 20:23:08, MED 0, localpref 280, from 208.80.154.197
                      AS path: 13335 ?, validation-state: valid
                    > to 208.80.154.194 via ae0.0
                      to 208.80.153.215 via xe-4/2/2.12
After
cmooney@re0.cr1-eqiad> show route receive-protocol bgp 208.80.154.197  

inet.0: 860427 destinations, 3570864 routes (860040 active, 136406 holddown, 2249 hidden)
Restart Complete
  Prefix		  Nexthop	       MED     Lclpref    AS path
* 1.0.0.0/24              208.80.154.197       0       280        13335 ?
* 1.0.4.0/22              208.80.154.197       0       250        4826 38803 ?
* 1.0.4.0/24              208.80.154.197       0       250        4826 38803 ?
* 1.0.5.0/24              208.80.154.197       0       250        4826 38803 ?
* 1.0.6.0/24              208.80.154.197       0       250        4826 38803 ?
* 1.0.7.0/24              208.80.154.197       0       250        4826 38803 ?
* 1.0.64.0/18             208.80.154.197       0       250        4637 7670 18144 ?
* 1.0.128.0/17            208.80.154.197       0       250        1273 38040 23969 ?
* 1.0.128.0/18            208.80.154.197       0       250        6762 38040 23969 ?
* 1.0.128.0/19            208.80.154.197       0       250        6762 38040 23969 ?
* 1.0.128.0/24            208.80.154.197       0       250        4637 4651 23969 ?
* 1.0.129.0/24            208.80.154.197       0       250        6762 38040 23969 ?
cmooney@re0.cr1-eqiad> show route protocol bgp 1.0.0.0/24 detail 

inet.0: 860390 destinations, 3519959 routes (860007 active, 1 holddown, 2247 hidden)
Restart Complete
1.0.0.0/24 (5 entries, 1 announced)
        *BGP    Preference: 170/-281
                Next hop type: Indirect, Next hop index: 0
                Address: 0x4d20bac
                Next-hop reference count: 1254602
                Source: 208.80.154.197
                Next hop type: Router, Next hop index: 0
                Next hop: 208.80.154.194 via ae0.0 weight 0x1, selected
                Session Id: 0x0
                Next hop: 208.80.153.215 via xe-4/2/2.12 weight 0xf000
                Session Id: 0x0
                Protocol next hop: 208.80.154.197
                Indirect next hop: 0x792cb40 1048656 INH Session ID: 0x30f09
                State: <Active Int Ext>
                Local AS: 65001 Peer AS: 65001
                Age: 5:55 	Metric: 0 	Metric2: 3 
                Validation State: valid 
                Task: BGP_65001.208.80.154.197+179
                Announcement bits (6): 0-KRT 5-Aggregate 7-RT 9-BGP_RT_Background 10-Resolve tree 1 12-Resolve tree 2 
                AS path: 13335 ? 
                Aggregator: 13335 172.70.32.1
                Communities: 13335:10350 13335:19000 13335:20050 13335:20500 13335:20530 14907:3 14907:12 unknown iana opaque 0x4300:0x0:0x0
                Accepted
                Localpref: 280
                Router ID: 208.80.154.197
cmooney@re0.cr1-eqiad> show route 1.0.0.0/24 

inet.0: 860385 destinations, 3519971 routes (860003 active, 0 holddown, 2247 hidden)
Restart Complete
+ = Active Route, - = Last Active, * = Both

1.0.0.0/24         *[BGP/170] 00:05:40, MED 0, localpref 280, from 208.80.154.197
                      AS path: 13335 ?, validation-state: valid
                    > to 208.80.154.194 via ae0.0
                      to 208.80.153.215 via xe-4/2/2.12

I'll move on to do the other sites next week.

So we had some unexpected consequences over the weekend following this change.

Example mail from ISP below:

> Cc'ing Wikimedia NOC.
>
> We have a user who is seeing issues reaching 208.80.154.224 from at least 216.116.128.0/24, as you can see from the traces below.
>
> We readvertise this prefix to you, and everything appears as expected on our side, as we hand off across the Equinix Ashburn IX, and we see no problems reaching you from our space.  Is there possibly something on the return path, RPF, or some ACL which you see on your side which might be causing this?  Their example traces below, and they're cc'd on this thread.
>

And some checks on this route showing the routing loop our side and BGP prefix learnt on cr2-eqiad for it:

https://phabricator.wikimedia.org/P17784

In brief, cr2-eqiad was sending packets for affected prefixes like 216.116.128.0/24 to cr1-eqiad. cr1-eqiad was learning those routes from cr2-eqord over a multi-hop BGP session; however, its path to get to cr2-eqord was via cr2-eqiad.

cr2-eqiad was selecting the route learnt from cr1-eqiad as it had lower IGP cost to the next-hop. That makes sense as they are on the same site, OSPF cost is 3 from cr2-eqiad to cr1-eqiad's loopback (which became the next-hop following change on Thursday morning). OSPF cost from cr2-eqiad to cr2-eqord, by contrast, is 240.

But that said, cr1-eqiad is configured with this statement:

set protocols bgp group Confed_eqiad metric-out minimum-igp

Which should mean that the MED attribute on eqord BGP routes cr1-eqiad sends to cr2-eqiad would be set to the OSPF cost to the BGP next-hop (i.e. to cr2-eqord loopback). That IGP cost is 243 (3 to cr2-eqiad + 240 to cr2-eqord), so if it were copied to the MED as expected the route would not be used. I checked on Thursday during the change that this was happening and it appeared to be, MED is 243 on example routes here:

cmooney@re0.cr2-eqiad> show route table inet6.0 receive-protocol bgp 2620:0:861:ffff::1    

inet6.0: 138824 destinations, 532205 routes (138182 active, 4 holddown, 3296 hidden)
Restart Complete
  Prefix		  Nexthop	       MED     Lclpref    AS path
  2001:200::/32           2620:0:861:ffff::1   243     100        (65020) 2914 2500 2500 ?
  2001:200:900::/40       2620:0:861:ffff::1   243     100        (65020) 2914 2907 2907 7660 ?

Going back to Arzhel's paste, line 90 shows that the metric cr1 saw for the protocol next-hop was indeed 243.

However line 242 shows that this route was received on cr2-eqiad with a MED of 0 from cr1-eqiad, rather than the expected 243. So it was used.
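
The selection behaviour described here can be sketched as a heavily simplified best-path comparison in Python. This is an illustration, not Junos's actual algorithm: it looks only at local-pref, AS-path length, MED and IGP cost to the next-hop, and skips origin, internal/external preference and the later tie-breakers. The IGP costs (3 and 240) and the MED values 0 and 243 come from this thread. The MED of 0 on the route learnt directly from cr2-eqord is an assumption, chosen because the route via cr1-eqiad is described as winning on IGP cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    learnt_from: str
    local_pref: int
    as_path_len: int
    med: int
    igp_cost: int  # IGP cost to the protocol next-hop


def preference_key(p: Path):
    # Lower tuples win: higher local-pref, then shorter AS path, then lower MED, then lower IGP cost.
    return (-p.local_pref, p.as_path_len, p.med, p.igp_cost)


def best(paths):
    return min(paths, key=preference_key)


# Route learnt directly from cr2-eqord (MED 0 is an assumption, see lead-in).
direct = Path("cr2-eqord", local_pref=100, as_path_len=2, med=0, igp_cost=240)

# Same prefix re-advertised by cr1-eqiad with next-hop self, as actually seen (MED 0).
via_cr1_med_0 = Path("cr1-eqiad", local_pref=100, as_path_len=2, med=0, igp_cost=3)
# What "metric-out minimum-igp" was expected to produce (MED 243).
via_cr1_med_243 = Path("cr1-eqiad", local_pref=100, as_path_len=2, med=243, igp_cost=3)

print(best([direct, via_cr1_med_0]).learnt_from)    # cr1-eqiad: MED ties, lower IGP cost wins
print(best([direct, via_cr1_med_243]).learnt_from)  # cr2-eqord: lower MED wins before IGP cost
```

With the MED at 0 on both paths the decision falls through to IGP cost, which is what pulls traffic towards cr1-eqiad and produces the loop.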

The solution was to shut down the multi-hop session from cr1-eqiad to cr2-eqord. As a result cr1-eqiad learns eqord routes from cr2-eqiad, rather than directly, and doesn't reflect them back to cr2-eqiad again (iBGP routes are not propagated).

We need to dig further to understand why this happened. It is likely that the multi-hop session is not advantageous, but there are lots of things to consider here, so more thought and more testing are required. For now traffic patterns are normal and it is not causing any issue to leave the cr1-eqiad to cr2-eqord BGP session shut down.

I want to get to the bottom of this before rolling this change out any further.

Ok, to try to get more clarity on the situation I briefly re-enabled the cr1-eqiad to cr2-eqord BGP session. But despite this I am not really seeing what is going on.

Consider these two routes learnt on cr2-eqiad while that session was up. Both are learnt in iBGP from cr1, both have been learnt by cr1 from eqord, but one has a MED of 243 and one does not (and thus one is used and one is not).

  Prefix                  Nexthop              MED     Lclpref    AS path
  1.32.215.0/24           208.80.154.196       243     100        (65020) 2914 64050 ?
* 1.32.218.0/24           208.80.154.196       0       100        (65020) 2914 64050 ?

I can see, looking at the routes on cr1-eqiad, that the MED attribute is 243 for the first route and 0 for the second one. Perhaps I am missing it, but I can't see any other significant difference between these routes that would cause this.

Either way, it's clear the multi-hop session between cr1-eqiad and cr2-eqord cannot co-exist with the "next-hop self" setting on the iBGP peering between cr1-eqiad and cr2-eqiad. As the latter is needed to support connection to the same Equinix IXP LAN from both these routers I think we need to keep that shut down.

Earlier in the week I attempted to remove the "metric-out minimum-igp" from the iBGP session between cr1-eqiad and cr2-eqiad, resulting in a routing-loop on the network which affected reachability from eqiad to BGP routes learnt from remote sites. To my knowledge the problem didn't affect any user services, as the traffic between satellite sites and eqiad required to service user requests flows between IPs that are learnt via OSPF, and the issue only affected BGP.

Evaluation of what happened

The issue occurred for a few reasons:

  1. When the "next-hop self" change was made on Thurs Nov 18th, the OSPF cost to reach the BGP next hop changed to 2 (direct link to other CR).
  2. But... the presence of "metric-out minimum-igp" on these peerings meant that the MED was getting used as a tie-break, so routers still selected the correct external path to remote BGP prefixes, rather than iBGP learnt routes.
  3. When that was removed (with the aim of equalising the cost of routes between cr1-eqiad and cr2-eqiad to balance traffic), remote BGP prefixes were suddenly de-selected.
    1. The MED being set on routes learnt from remote sites was still very high (WAN peerings still had "minimum IGP" both sides)
    2. But the MED was 0 on routes exchanged between the local CRs, so the propagated routes in iBGP won.

Clearly this was predictable, and a higher-level of thinking / pre-testing / simulation should have been carried out before progressing with that change. Lessons learnt for the future.

Proposed way forward

To address the specific requirement of our two connections to the Equinix IXP from alternate routers in eqiad, the "next-hop self" approach is definitely still the best way to proceed. It is also probably desirable to have the local CR router's loopback IP as the next-hop for all external routes learnt locally. i.e. from peering/transit or indeed PyBal/Anycast/K8s servers.

Given that is the case I would propose to revise the iBGP export policy as follows.

Firstly, remove the command to set the MED on routes exchanged in iBGP from the group definition itself. This is not advantageous for local PyBal routes, where the MED is set by the servers themselves:

delete protocols bgp group Confed_eqiad metric-out minimum-igp

In order to discriminate between routes learnt from other WMF sites and those learnt from local BGP sessions, create a new as-path definition on CRs, matching the ASN range we use for BGP sub-ASes:

set policy-options as-path-group REMOTE_SITES as-path REMOTE_SITES "^[65001-65099] .*"
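
As a rough illustration of what that as-path group is meant to select, the Python sketch below checks whether the first AS number in a path falls within 65001-65099. It mirrors the intent only and does not reproduce Junos's AS-path regex engine; in particular, how Junos matches the confederation segment shown in parentheses is not verified here. The helper names are hypothetical. The sample paths are taken from outputs in this thread.

```python
import re

REMOTE_SITE_ASNS = range(65001, 65100)  # 65001-65099 inclusive


def first_asn(as_path: str):
    """Return the first AS number in a path string such as "(65020) 2914 ?", or None."""
    match = re.match(r"\s*\(?\s*(\d+)", as_path)
    return int(match.group(1)) if match else None


def is_remote_site(as_path: str) -> bool:
    """True when the path starts with one of the sub-AS numbers used for other sites."""
    asn = first_asn(as_path)
    return asn is not None and asn in REMOTE_SITE_ASNS


for path in ["(65020) 2914 2500 2500 ?", "13335 ?", "64600 ?"]:
    print(path, "->", is_remote_site(path))
```

Note that in a Python regex `[65001-65099]` would be a character class rather than a numeric range, which is why the sketch parses the number and compares it numerically.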

We remove the "iBGP_nh_self" export policy configured for iBGP peers, and replace it with this:

policy-statement IBGP-OUT {
    term REMOTE_SITE_BGP {
        from as-path-group REMOTE_SITES;
        then {
            metric {
                igp;
            }
            accept;
        }
    }
    term LOCAL_SITE_BGP {
        then {
            next-hop self;
        }
    }
}

This policy has 2 effects:

  1. The MED is only set to the OSPF cost for routes learnt from remote sites, and the next-hop is left untouched for those.
    1. NOTE "igp" is used instead of "minimum-igp" so OSPF cost changes up or down cause the MED to change.
      1. The property of "minimum-igp", whereby it only adjusts a previously announced MED if the change would reduce it, slowed the fix to the routing-loop issue.
      2. Using "igp" instead means it will always update MED based on IGP cost changes.
  2. All other routes, i.e. those learnt from local sessions at the site, have the next-hop rewritten to the CR's own loopback.
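
The two effects above, and the difference between "igp" and "minimum-igp", can be sketched in Python as follows. This is a toy model under stated assumptions and not how Junos evaluates policy: the `Route` type, the function names and the IGP cost of 240 are illustrative, while the prefixes and next-hops are borrowed from the lab output further down. `advertised_med` models "minimum-igp" exactly as described in the list above, i.e. it only ever lowers a previously announced MED.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Route:
    prefix: str
    first_asn: Optional[int]  # first AS number in the AS path
    next_hop: str
    med: int


def ibgp_out(route: Route, igp_cost_to_next_hop: int, own_loopback: str) -> Route:
    """Toy model of the IBGP-OUT policy."""
    if route.first_asn is not None and 65001 <= route.first_asn <= 65099:
        # term REMOTE_SITE_BGP: MED follows the IGP cost, next-hop left untouched.
        return replace(route, med=igp_cost_to_next_hop)
    # term LOCAL_SITE_BGP: next-hop rewritten to the CR's own loopback.
    return replace(route, next_hop=own_loopback)


def advertised_med(mode: str, previous: Optional[int], igp_cost: int) -> int:
    """MED to announce after an IGP cost change, for "igp" versus "minimum-igp"."""
    if mode == "igp":
        return igp_cost  # always tracks the current IGP cost
    if mode == "minimum-igp":
        return igp_cost if previous is None else min(previous, igp_cost)
    raise ValueError(f"unknown mode: {mode}")


remote = Route("8.8.4.4/32", first_asn=65020, next_hop="198.1.3.2", med=0)
local = Route("8.8.8.8/32", first_asn=15169, next_hop="198.1.1.1", med=0)

print(ibgp_out(remote, igp_cost_to_next_hop=240, own_loopback="2.2.2.2"))
print(ibgp_out(local, igp_cost_to_next_hop=240, own_loopback="2.2.2.2"))
print(advertised_med("minimum-igp", previous=243, igp_cost=500))  # stays at 243
print(advertised_med("igp", previous=243, igp_cost=500))          # follows the cost up to 500
```
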
Test Setup

To validate this approach I've tested the setup on my local machine using vMX virtual machines.

A simple topology was used in GNS3, as shown in the below diagram:

[Figure: lab_topo.png, lab topology diagram]

Ignore the dotted lines; they are for mgmt SSH and not relevant. I also tested with more WAN devices simulating codfw and ulsfo, but to keep this explanation simple let's just consider these nodes.

IPv4 loopback IPs were set up as follows:

cr1-eqiad: 1.1.1.1/32
cr2-eqiad: 2.2.2.2/32
cr2-eqord: 3.3.3.3/32

Additionally, all simulated "transport" links between devices were addressed with IPv4 subnets from the 10.0.0.0/8 network, and all networks with eBGP peerings (to either 'IXP' / internet routers, or local simulated LVS servers) were taken from 198.0.0.0/8. A handful of test routes were announced with eBGP from the simulated "IXP" and "LVS" nodes. OSPF and the BGP confederation were configured between the three simulated CRs as they are in production.

Evaluating situation with no export policy on iBGP

So in the first case let's look at the BGP routes learnt on cr1-eqiad from cr2-eqiad, without any policy applied to the iBGP session. Note also the multi-hop BGP session from cr1-eqiad to cr2-eqord is still disabled.

root@CR1> show route receive-protocol bgp 2.2.2.2    

inet.0: 19 destinations, 20 routes (19 active, 0 holddown, 0 hidden)
  Prefix                  Nexthop              MED     Lclpref    AS path
  1.2.3.4/32              198.1.2.1            0       100        64600 ?
* 6.7.8.9/32              198.1.4.2            0       100        (65020) 64600 ?
* 8.8.4.4/32              198.1.3.2            0       100        (65020) 15169 ?
* 8.8.8.8/32              198.1.1.1            0       100        15169 ?

Let's assess each route:

1.2.3.4/32: This is an LVS VIP IP announced locally at the site. It's not selected as cr1 is learning it from its own direct BGP session to the LVS:

root@CR1> show route protocol bgp 1.2.3.4/32 exact 

inet.0: 19 destinations, 20 routes (19 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

1.2.3.4/32         *[BGP/170] 03:13:36, MED 0, localpref 100
                      AS path: 64600 ?, validation-state: unverified
                    > to 198.1.2.1 via ge-0/0/1.0
                    [BGP/170] 00:06:00, MED 0, localpref 100, from 2.2.2.2
                      AS path: 64600 ?, validation-state: unverified
                    > to 198.1.2.1 via ge-0/0/1.0

6.7.8.9/32: This is an LVS IP learnt from eqord, as can be seen from the AS-path. Its next-hop is the LVS IP in eqord itself, which cr1 knows from OSPF:

root@CR1> show route 6.7.8.9/32 exact 

inet.0: 19 destinations, 20 routes (19 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

6.7.8.9/32         *[BGP/170] 00:08:44, MED 0, localpref 100, from 2.2.2.2
                      AS path: (65020) 64600 ?, validation-state: unverified
                    > to 10.10.10.2 via ge-0/0/3.0

Its path to this is through ge-0/0/3, which is the link to cr2-eqiad as expected:

root@CR1> show lldp neighbors 
Local Interface    Parent Interface    Chassis Id          Port info          System Name
ge-0/0/3           -                   2c:6b:f5:15:05:c0   529                CR2

8.8.4.4/32: This is a simulated internet route, learnt from external ASN 15169 in eqord.

As with the LVS-announced IP from eqord, you can see the eqord AS in the path, and the next-hop address is the eBGP peer address in eqord. The next-hop interface is the same as for the last route, which was also from eqord:

root@CR1> show route 8.8.4.4/32 exact 

inet.0: 19 destinations, 20 routes (19 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

8.8.4.4/32         *[BGP/170] 00:38:04, MED 0, localpref 100, from 2.2.2.2
                      AS path: (65020) 15169 ?, validation-state: unverified
                    > to 10.10.10.2 via ge-0/0/3.0

8.8.8.8/32: This is a simulated internet route, learnt from external ASN 15169 on cr2-eqiad.

This one is designed to simulate our dual connection to the Equinix IXP. You can see the BGP route from the other CR is being used, but as the next-hop is reachable on cr1's own local connection to the peering LAN, it is not going to send this traffic to cr2:

root@CR1> show route 8.8.8.8/32 exact      

inet.0: 19 destinations, 20 routes (19 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

8.8.8.8/32         *[BGP/170] 00:15:26, MED 0, localpref 100, from 2.2.2.2
                      AS path: 15169 ?, validation-state: unverified
                    > to 198.1.1.1 via ge-0/0/5.0
root@CR1> show interfaces descriptions | match ge-0/0/5 
ge-0/0/5        up    up   Equinix IXP eth1

This route exhibits the potential problem we saw in T295650.
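
The recursive next-hop lookup behind this behaviour can be approximated with a small Python sketch using longest-prefix match. It is a simplification: the table holds only two entries, the /24 prefix length on the peering LAN is an assumption (the lab output does not show it), and `resolve` is a hypothetical helper name.

```python
import ipaddress

# Simplified view of cr1's routing table. The /24 on the peering LAN is an assumption.
TABLE = {
    "198.1.1.0/24": "ge-0/0/5.0",  # directly connected IXP peering LAN
    "2.2.2.2/32": "ge-0/0/3.0",    # cr2 loopback, learnt from OSPF
}


def resolve(next_hop):
    """Return the outgoing interface for a BGP next-hop using longest-prefix match."""
    addr = ipaddress.ip_address(next_hop)
    matches = [ipaddress.ip_network(p) for p in TABLE if addr in ipaddress.ip_network(p)]
    if not matches:
        return None
    best = max(matches, key=lambda n: n.prefixlen)
    return TABLE[str(best)]


print(resolve("198.1.1.1"))  # unmodified next-hop: resolves over the local IXP port
print(resolve("2.2.2.2"))    # next-hop self on cr2: resolves over the link to cr2
```

With the peer's address as next-hop, cr1 forwards straight onto the peering LAN even though it has no BGP session with that peer. With cr2's loopback as next-hop, traffic follows the BGP session.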

Evaluating situation with the new iBGP export policy applied

Let's look at how the routes on CR1 change after the new export policy is applied:

root@CR1> show route receive-protocol bgp 2.2.2.2          

inet.0: 19 destinations, 20 routes (19 active, 0 holddown, 0 hidden)
  Prefix                  Nexthop              MED     Lclpref    AS path
  1.2.3.4/32              2.2.2.2              0       100        64600 ?
* 6.7.8.9/32              198.1.4.2            250     100        (65020) 64600 ?
* 8.8.4.4/32              198.1.3.2            250     100        (65020) 15169 ?
* 8.8.8.8/32              2.2.2.2              0       100        15169 ?

The two main changes are immediately visible. The routes that are not learnt from the remote site have the next-hop set to the loopback of cr2-eqiad. Additionally the MED on the remote routes now reflects the OSPF cost that cr2 sees to the next-hop in eqord.

You can see the 8.8.8.8 route will now go via the link to cr2, and route across the IXP peering LAN from that router as needed.

Both of the eqord routes have an unmodified next-hop, so as previously they will route via the direct link to CR2 to get to those eqord next-hops (known from OSPF):

root@CR1> show route 6.7.8.9/32 exact   

inet.0: 19 destinations, 20 routes (19 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

6.7.8.9/32         *[BGP/170] 00:04:35, MED 250, localpref 100, from 2.2.2.2
                      AS path: (65020) 64600 ?, validation-state: unverified
                    > to 10.10.10.2 via ge-0/0/3.0
root@CR1> show route 8.8.4.4/32 exact 

inet.0: 19 destinations, 20 routes (19 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

8.8.4.4/32         *[BGP/170] 00:04:41, MED 250, localpref 100, from 2.2.2.2
                      AS path: (65020) 15169 ?, validation-state: unverified
                    > to 10.10.10.2 via ge-0/0/3.0

At this point we can re-enable the multi-hop BGP session between cr1-eqiad and cr2-eqord, and not upset these paths.

root@CR1> show route 6.7.8.9/32 exact    

inet.0: 19 destinations, 22 routes (19 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

6.7.8.9/32         *[BGP/170] 00:00:06, MED 0, localpref 100, from 3.3.3.3
                      AS path: (65020) 64600 ?, validation-state: unverified
                    > to 10.10.10.2 via ge-0/0/3.0
                    [BGP/170] 00:06:17, MED 250, localpref 100, from 2.2.2.2
                      AS path: (65020) 64600 ?, validation-state: unverified
                    > to 10.10.10.2 via ge-0/0/3.0

As can be seen, cr1-eqiad is now selecting the BGP route it learns directly from cr2-eqord. But as the next-hop on this route is the same as the one cr2-eqiad was announcing, the actual path doesn't change. It selects the route from cr2-eqord as it has a lower MED (cr2-eqord is configured to set the MED to the IGP metric when it announces to cr1-eqiad, but as it is directly connected that is 0).
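
The tie-break between the two candidates can be sketched in Python as below. This is a heavily simplified model of best-path selection covering only local preference, AS-path length and MED; the real Junos algorithm has more steps, and the dictionary layout and `best_path` name are assumptions. The AS-path length is taken as 1 for both routes on the assumption that the confederation segment (65020) is not counted.

```python
def best_path(candidates):
    """Pick the preferred route: highest local-pref, then shortest AS path, then lowest MED."""
    return min(candidates, key=lambda r: (-r["localpref"], r["as_path_len"], r["med"]))


# The two routes for 6.7.8.9/32 shown in the output above
candidates = [
    {"from": "3.3.3.3", "localpref": 100, "as_path_len": 1, "med": 0},
    {"from": "2.2.2.2", "localpref": 100, "as_path_len": 1, "med": 250},
]
print(best_path(candidates)["from"])  # 3.3.3.3
```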

Conclusion

I think this is the best option for now, and a safe choice which minimizes disruption elsewhere.

Longer term I think we do need to review our core/WAN routing setup and explore our options. I still suspect the multi-hop BGP sessions are not needed, and also that the "metric-out minimum-igp", while it fixes T283163, might add additional complexity we may not want. But for now I think we can leave all that as is, in order not to disrupt more than is required without doing a more in-depth review.

@ayounsi if you have some time Monday let's discuss and agree the best way forward.

Change 742462 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Add drmrs loopbacks and interconnect range to ntp allowed config

https://gerrit.wikimedia.org/r/742462

^^ ignore above - pasted wrong task ID. and sorry for spam.

As a general note we need to be careful with rolling out config fixes in reaction to unexpected issues.
Even if it's thoroughly tested and I agree with your thorough proposal, it increases the config's complexity by tiny increments, making future changes (small or big) more risky.
As you pointed out, looking at our BGP confederation holistically is long overdue! (partially with T167841, possibly after looking at OSPF with T200277 to have sound foundations).

The alternatives I can think about are:

  • Keep the 2nd Equinix port disabled (current status-quo)
  • Move the 2nd Equinix port to cr2, configured as a LAG (requires re-config from us and from Equinix)

The first one would only be if we were to focus on BGP as a whole in the (very) short term (eg. 1-2 months).

But as in T290877 we were hesitating between configuring a LAG and splitting it to two distinct routers, we should probably weigh the latter against your fix.

To be clear, I agree that your proposal is a good solution however I'm wondering what's most future-proof.

ayounsi changed the task status from Open to In Progress. Dec 7 2021, 8:04 AM

To be clear, I agree that your proposal is a good solution however I'm wondering what's most future-proof.

As you said, we need a thorough review, and I've got this on my task list for next quarter (obviously we will need to collaborate). I would see this as an interim config to allow us to use the second Equinix port until then. I'd suspect that, whatever comes out of the review, iBGP prefixes announced at a given site would continue to use next-hop self, but obviously I can't say for sure.

Keep the 2nd Equinix port disabled (current status-quo)

The 2nd Equinix port is up right now and has been for the past few weeks. However I feel that relying on MED to select best path from eqiad to remote prefixes (as is currently the state, given IGP cost to them is 2 due to current next-hop self config), is not desirable. This change would address that.

Move the 2nd Equinix port to cr2, configured as a LAG (requires re-config from us and from Equinix)

I'm not a great fan of the LAG idea, or putting both links on one router. So I'd recommend we either implement this and keep the second Equinix port up, or shut it down, revert the configuration to what it was before any of these changes began (i.e. remove next-hop self on the iBGP sessions), and wait until the more thorough review / longer term plan is clear.

Ok, the fix from T295672#7531535 sounds good to me then!

Change 745218 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Updated iBGP policy to process local and remote routes differently.

https://gerrit.wikimedia.org/r/745218

Change 745218 merged by Cathal Mooney:

[operations/homer/public@master] Updated iBGP policy to process local and remote routes differently.

https://gerrit.wikimedia.org/r/745218

Mentioned in SAL (#wikimedia-operations) [2021-12-09T11:19:18Z] <topranks> Changing export policy applied on eqiad CRs for local confed to not rewrite next-hop for routes learnt from other WMF POPs (T295672)

Mentioned in SAL (#wikimedia-operations) [2021-12-09T11:44:16Z] <topranks> Re-enabling multihop BGP session from cr1-eqiad to cr2-eqord (T295672)

Mentioned in SAL (#wikimedia-operations) [2021-12-09T12:00:08Z] <topranks> Changing export policy applied on ulsfo CRs for local confed to not rewrite next-hop for routes learnt from other WMF POPs (T295672)

cmooney lowered the priority of this task from High to Low. Dec 9 2021, 12:20 PM

Ok I have applied the changes on cr1-eqiad and cr2-eqiad. All went as expected.

Paste with some before/after checks showing differences in routing tables here: P18080

Usage on transport links to/from eqiad remains the same, as expected.

I also re-enabled the multihop BGP session from cr1-eqiad to cr2-eqord, which had been down since shortly after the initial change, due to a routing loop present with it up and cr1-eqiad setting itself as next-hop for those routes when announcing to cr2. Before/After checks here: P18081

Lastly I've made the same adjustment on cr3-ulsfo and cr4-ulsfo. This is the only other site that had the "next-hop self" config change made prior to the issues in eqiad. It now is in sync with the setup in eqiad. Some checks here: P18082

Next steps

The question remains whether to update all other CRs with this config. It ought not to materially affect flows, although routes learnt at one site and propagated to another over iBGP would carry the loopback of the announcing router as next-hop, rather than an address on the link subnet the route was learnt on. The upshot may be that traffic is sent to just that router, rather than being ECMP'd across both CRs at that site (OSPF has equal routes to the subnet). This is unlikely, as internal routes are normally learnt by both CRs, which aren't selecting an iBGP route as best. Still, it may be better to wait until T297355 is completed. There is no compelling need like there was in eqiad with the dual IXP ports.

Change 745536 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/homer/public@master] Removing routing policy iBGP_nh_self as it is no longer used.

https://gerrit.wikimedia.org/r/745536

Change 745536 merged by Cathal Mooney:

[operations/homer/public@master] Removing routing policy iBGP_nh_self as it is no longer used.

https://gerrit.wikimedia.org/r/745536

Closing this task. The setup in general needs to be considered under T297355.