
Cleanup confed BGP peerings and policies
Open, Medium, Public

Description

Our BGP confederation peerings and policies have been somewhat inconsistent since the original deployment, and we haven't invested much in them since.

  • The first major issue is that we haven't really thought through the tradeoffs: multihop BGP peerings or not (with an almost arbitrary, hard-to-calculate max hop count), peering on loopbacks or on neighboring interfaces, an (almost) full mesh or sessions only between adjacent routers, next-hop-self or not. There are pros and cons to each, and I don't believe we are consistent right now.
  • The second issue is that our aggregates between sites need to be cleaned up a little bit to at least establish proper boundaries (e.g. each site's private IP space on cr* only with protocol direct, each site's private mgmt space on mr1* only with protocol direct).
  • After that is done, we may or may not want to consider splitting our IGP into one per subAS -- there are pros and cons with each of these.

Addressing the above two issues would help with rerouting/link recovery/packet loss in the case of various fiber cuts across our US-wide network (see also T167306).

  • Finally, the more user-visible issue that we have right now is that we're underutilizing eqord: we currently do not announce our supernets from eqord. The reason is that I hadn't found an easy way to guarantee that they wouldn't be announced if both eqiad<->eqord and eqord<->codfw were down, but eqord<->ulsfo and ulsfo<->codfw were up. The only solution I could think of was splitting eqord into its own subAS and then doing a cross-subAS import policy with a ^65001 regexp.
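For illustration, a minimal sketch of what that cross-subAS import policy could look like in Junos, assuming eqord were split into its own subAS and 65001 stayed eqiad's subAS (the as-path and policy names are hypothetical):

[edit policy-options]
    /* paths learned directly from the eqiad subAS */
    as-path via_eqiad "^65001 .*";
    policy-statement supernet_if_direct_from_eqiad {
        term direct_from_eqiad {
            from {
                protocol bgp;
                as-path via_eqiad;
            }
            then accept;
        }
        then reject;
    }

Paths that detoured via ulsfo/codfw would start with a different subAS and be rejected by this policy.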

Event Timeline


Best practice for confederations is to limit the IGP to within each sub-AS. The main advantages are reducing the blast radius if OSPF misbehaves, and increasing convergence speed.

Externally, run EBGP sessions (with BFD) between the interfaces of routers in each sub-AS (like standard EBGP between ASNs in the DFZ); this avoids needing an IGP to share loopback IPs, as well as needing ebgp-multihop.
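A hedged sketch of that style of peering in Junos (the group name and neighbor address are hypothetical; 14907 is our public AS, subAS numbers as used elsewhere in this task):

[edit routing-options]
    autonomous-system 65001;
    /* confederation AS visible externally; subASes as members */
    confederation 14907 members [ 65001 65002 ];
[edit protocols bgp]
    group confed_codfw {
        type external;
        peer-as 65002;
        /* directly connected interface IP: no IGP-learned loopback
           reachability and no ebgp-multihop required */
        neighbor 198.51.100.2;
        /* BFD for fast failure detection on the link */
        bfd-liveness-detection {
            minimum-interval 300;
            multiplier 3;
        }
    }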

(e.g. each site's private IP space on cr* only with protocol direct, each site's private mgmt space on mr1* only with protocol direct).

I'm having trouble visualizing this; I probably need some more explanation.
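For reference, one possible reading of that proposal as a Junos sketch (the /12 and policy name are hypothetical; the point is that only protocol direct routes contribute to each site's aggregate):

[edit routing-options]
    aggregate {
        route 10.64.0.0/12 policy site_private_from_direct;
    }
[edit policy-options]
    policy-statement site_private_from_direct {
        term direct_only {
            /* only locally connected subnets contribute */
            from protocol direct;
            then accept;
        }
        /* nothing learned via BGP/OSPF can contribute */
        then reject;
    }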

Finally, the more user-visible issue that we have right now is that we're underutilizing eqord: we currently do not announce our supernets from eqord. The reason is that I hadn't found an easy way to guarantee that they wouldn't be announced if both eqiad<->eqord and eqord<->codfw were down, but eqord<->ulsfo and ulsfo<->codfw were up. The only solution I could think of was splitting eqord into its own subAS and then doing a cross-subAS import policy with a ^65001 regexp.

I'll need more explanation/conversation about this as well.

Some thoughts:
What is the probability of eqord becoming an (almost) island?
If it's high enough, separating it into its own subAS might be a good idea regardless of that routing question, especially given its geographic distance from eqiad (its current sub-AS).
If it's low enough, and knowing that the consequence would be sub-optimal routing (higher latency, about +100 ms) but no service outage, it might be okay to live with it, and manually stop advertising the supernet from there the day it happens.

Finally, the more user-visible issue that we have right now is that we're underutilizing eqord: we currently do not announce our supernets from eqord. The reason is that I hadn't found an easy way to guarantee that they wouldn't be announced if both eqiad<->eqord and eqord<->codfw were down, but eqord<->ulsfo and ulsfo<->codfw were up. The only solution I could think of was splitting eqord into its own subAS and then doing a cross-subAS import policy with a ^65001 regexp.

Now that we also have a GRE tunnel from eqord to eqiad through a transit provider, it is very unlikely that eqord would have its codfw/eqiad transport links down while its transit links stay up. I'd say it's now safe to advertise our prefixes from there.

For the eqord issue, this should work.
The `208.80.152.0/22` prefix gets created only if the router has (or learns) at least one contributing prefix (included in the /22) with a next-hop (directly connected routes are ignored).
On top of that we use the new policy BGP_from_local_LVS to only accept (and thus consider as contributing) prefixes learned from BGP with an AS path starting with 64600, which only matches the LVS in the same sub-AS.

This means that if cr2-eqord loses BGP connectivity to eqiad, it won't have any contributing prefixes and will remove the aggregate prefix.

I tested a simplified version of it in Juniper vLabs, and it's working as expected.

[edit routing-options rib inet6.0]
+    aggregate {
+        route 2620:0:860::/46 policy BGP_from_local_LVS;
+    }
[edit routing-options aggregate]
+    route 208.80.152.0/22 policy BGP_from_local_LVS;
[edit policy-options]
+   policy-statement BGP_from_local_LVS {
+       term BGP_local_LVS {
+           from {
+               protocol bgp;
+               as-path "^64600.*";
+           }
+           then accept;
+       }
+       then reject;
+   }
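One way to check the behavior, using standard Junos show commands (prefix from the config above):

show route 208.80.152.0/22 exact extensive
show route protocol aggregate

The extensive output lists the aggregate's contributing routes, so with the policy applied it should show only BGP routes whose AS path starts with 64600, and the aggregate should disappear once none are left.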

We can tune that in several ways:

  • Only advertise eqiad's /23 and /48 instead of /46 + /22
  • Rename the policy to BGP_from_core_LVS and add 65002/65001 as valid AS_PATHs
  • Don't filter on AS_PATH and ensure POPs don't advertise eqiad/codfw prefixes to eqiad/codfw

It's also something that could be useful in eqdfw and knams, in case of a fiber cut there too.

That's an awesome idea, nice!

We can't advertise just the /23 + /48 from eqord, as these would be more-specifics of what eqiad itself advertises - and thus all of the eqiad traffic would flow through eqord :)

Furthermore, I think it's fine for eqord to advertise the /22s even if it's cut off from eqiad but still (directly) connected to codfw. The problems we're trying to address here would be a) being cut off entirely (an island), and b) being cut off from eqiad + codfw, with traffic re-routing through ulsfo, i.e. internet -> eqord -> ulsfo -> codfw (-> eqiad), a very high-latency path. So indeed, we can add 65002 as a valid AS_PATH as you suggest.

We can also add 65001 as a valid AS_PATH, and after that we can consider making eqord a separate confed subAS. The original idea of making it part of 65001 has always been something I've had second thoughts on - it's at ~equal distance between eqiad and codfw after all.

Sounds good. Final version below, including both AS 65002 and AS 65001 as optional, to keep it generic.
Tested the regex using show route aspath-regex "^(65002|65001)? 64600.*"
Will push IPv6 first, then IPv4 24h later if everything is fine.

[edit routing-options rib inet6.0]
+    aggregate {
+        route 2620:0:860::/46 policy BGP_from_core_LVS;
+    }
[edit routing-options aggregate]
+    route 208.80.152.0/22 policy BGP_from_core_LVS;
[edit policy-options]
+   policy-statement BGP_from_core_LVS {
+       term BGP_core_LVS {
+           from {
+               protocol bgp;
+               as-path core_and_local_LVS;
+           }
+           then accept;
+       }
+       then reject;
+   }
[edit policy-options]
    as-path too-many-hops { ... }
+   as-path core_and_local_LVS "^(65002|65001)? 64600.*";

Mentioned in SAL (#wikimedia-operations) [2019-08-14T21:37:53Z] <XioNoX> advertise core v6 range (2620:0:860::/46) from eqord - T167841

Mentioned in SAL (#wikimedia-operations) [2019-08-15T16:27:44Z] <XioNoX> advertise core v4 range (208.80.152.0/22) from eqord - T167841

Change 547678 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/homer/public@master] Add BGP_from_core_LVS policy

https://gerrit.wikimedia.org/r/547678

Change 547678 merged by Ayounsi:
[operations/homer/public@master] Add BGP_from_LVS policy

https://gerrit.wikimedia.org/r/547678

The first major issue is that we haven't really thought through the tradeoffs: multihop BGP peerings or not (with an almost arbitrary, hard-to-calculate max hop count), peering on loopbacks or on neighboring interfaces, an (almost) full mesh or sessions only between adjacent routers, next-hop-self or not. There are pros and cons to each, and I don't believe we are consistent right now.

Agreed, there is a lot to be thought through here. Might it be worth creating a pros/cons list here for discussion? None of the options are terrible; all of them have trade-offs, as with anything.

The second issue is that our aggregates between sites need to be cleaned up a little bit to at least establish proper boundaries (e.g. each site's private IP space on cr* only with protocol direct, each site's private mgmt space on mr1* only with protocol direct). After that is done, we may or may not want to consider splitting our IGP into one per subAS -- there are pros and cons with each of these.

I'm with @ayounsi on this one; I'd appreciate a bit more clarity on the issue here to properly understand it.

After that is done, we may or may not want to consider splitting our IGP into one per subAS -- there are pros and cons with each of these.

This is probably the least pressing, I think. If we were a massive global ISP, we'd hands down want to limit the size of the IGP area. But we are well within any kind of limit on the size of a flooding domain, so I don't think there is a scaling issue for us to worry about here. It comes down to more subtle things like failover, convergence, and interaction with BGP / traffic engineering.

Traffic Engineering

The other thing I'd throw into the discussion (just to confuse and annoy everyone) is whether we could see any value in running MPLS/SR across the WAN at some stage. I'm too fresh to have a strong feeling here, tbh. But perhaps, given the way our network is set up, the traffic engineering capabilities might be useful to us. It's been some time since I worked with MPLS/RSVP-TE, and I've never done SR, so I would need to do a good bit of research on how to configure it if we decided to go down that road.
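For reference, a minimal sketch of what a first SR (SPRING) step under OSPF could look like on Junos - purely illustrative, assuming source-packet-routing support in our Junos release; the label block and index values are made up:

[edit protocols ospf]
    source-packet-routing {
        /* hypothetical global SRGB label block */
        srgb start-label 800000 index-range 4096;
        /* hypothetical per-router node SID index */
        node-segment ipv4-index 11;
    }

That alone only gives shortest-path MPLS forwarding; the traffic-engineering part (steering via explicit segment lists) would sit on top, and is where most of the complexity and the research would be.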

I strongly believe in avoiding complexity where possible, so I'd only go down this path if we truly thought there were good benefits. Just raising it here as it's relevant to the BGP/IGP discussion.