Cleanup confed BGP peerings and policies
Open, NormalPublic

Description

Our BGP confederation peerings and policies have been a bit inconsistent since we set them up and we haven't invested much into them since the original deployment.

  • The first major issue is that we haven't really thought through the tradeoffs between doing multihop BGP peerings or not (with an almost arbitrary/hard to calculate max hop), between loopbacks or neighboring interfaces, (almost) meshed or between adjacent routers, next-hop-self or not. There are pros and cons with each and I don't believe we are consistent right now.
  • The second issue is that our aggregates between sites need to be cleaned up a little bit to at least establish proper boundaries (e.g. each site's private IP space on cr* only with protocol direct, each site's private mgmt space on mr1* only with protocol direct).
  • After that is done, we may or may not want to consider splitting our IGP into one per subAS -- there are pros and cons with each of these.

Addressing the above two issues would help with rerouting/link recovery/packet loss in the case of fiber cuts across our US-wide network (also see T167306).

  • Finally, the more user-visible issue that we have right now is that we're underutilizing eqord: we currently do not announce our supernets from eqord. The reason for this is that I hadn't found an easy way to guarantee that they wouldn't be announced if both eqiad<->eqord and eqord<->codfw were down, but eqord<->ulsfo and ulsfo<->codfw were up. The only solution that I could think of was splitting eqord into its own subAS and then doing a cross-subAS import policy with a ^65001 regexp.
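
To make the failure case above concrete, here is a minimal Python sketch that models the transport links as a graph and finds the shortest surviving path. The link set is a simplified assumption: it contains the four links named above plus an eqiad<->codfw link, and the function name `shortest_path` is purely illustrative. It is not a model of the real network or of BGP path selection.

```python
from collections import deque

# Simplified, assumed topology: the four links named in the description plus
# an eqiad<->codfw link. The real network has more sites and links.
LINKS = {
    ("eqiad", "eqord"),
    ("eqord", "codfw"),
    ("eqord", "ulsfo"),
    ("ulsfo", "codfw"),
    ("eqiad", "codfw"),
}


def shortest_path(src, dst, down=frozenset()):
    """Breadth-first search over the links that are still up.

    `down` is a set of (a, b) pairs; a link is considered down regardless of
    the order in which its two ends are written.
    Returns the list of sites on the shortest path, or None if unreachable.
    """
    neighbors = {}
    for a, b in LINKS:
        if (a, b) in down or (b, a) in down:
            continue
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in sorted(neighbors.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# The problematic case: eqord has lost both direct core links.
core_links_down = {("eqiad", "eqord"), ("eqord", "codfw")}
detour = shortest_path("eqord", "eqiad", down=core_links_down)
```

In this model, `detour` goes through ulsfo and codfw before reaching eqiad, which is the long path that traffic attracted by an eqord announcement would have to take.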

Event Timeline

faidon created this task. Jun 13 2017, 10:45 PM
Restricted Application added a project: Operations. Jun 13 2017, 10:45 PM
Restricted Application added a subscriber: Aklapper.
elukey added a subscriber: elukey. Jun 14 2017, 7:16 AM

The best practice for confederations is to confine the IGP to each sub-AS. The main advantages are a smaller blast radius if OSPF misbehaves, and faster convergence.

Between sub-ASes, run EBGP sessions (with BFD) directly between the interfaces of adjacent routers, like standard EBGP between ASes in the DFZ. This removes the need for an IGP to share loopback IPs, as well as the need for ebgp-multihop.

(e.g. each site's private IP space on cr* only with protocol direct, each site's private mgmt space on mr1* only with protocol direct).

I'm having trouble visualizing this; I probably need some more explanation.

Finally, the more user-visible issue that we have right now is that we're underutilizing eqord: we currently do not announce our supernets from eqord. The reason for this is that I hadn't found an easy way to guarantee that they wouldn't be announced if both eqiad<->eqord and eqord<->codfw were down, but eqord<->ulsfo and ulsfo<->codfw were up. The only solution that I could think of was splitting eqord into its own subAS and then doing a cross-subAS import policy with a ^65001 regexp.

I'll need more explanations/conversation about this as well.

Some thoughts:
What is the probability of eqord becoming an (almost) island?
If it's high enough, separating it into its own subAS might be a good idea regardless of that routing question, especially given its geographic distance from eqiad (whose subAS it currently belongs to).
If it's low enough, and knowing that the consequence would be sub-optimal routing (higher latency, about +100ms) but no service outage, it might be okay to live with it and manually stop advertising the supernet from there the day it happens.

ayounsi moved this task from Backlog to Configuration on the netops board. Jun 27 2017, 2:38 PM

Finally, the more user-visible issue that we have right now is that we're underutilizing eqord: we currently do not announce our supernets from eqord. The reason for this is that I hadn't found an easy way to guarantee that they wouldn't be announced if both eqiad<->eqord and eqord<->codfw were down, but eqord<->ulsfo and ulsfo<->codfw were up. The only solution that I could think of was splitting eqord into its own subAS and then doing a cross-subAS import policy with a ^65001 regexp.

Now that we also have a GRE tunnel from eqord to eqiad through a transit provider, the risk of eqord having its codfw/eqiad transport links down while its transit links are up is very low. I'd say it's now safe to advertise our prefixes from there.

For the eqord issue, this should work.
The `208.80.152.0/22` prefix gets created only if the router has (or learns) at least one contributing prefix (i.e. one included in the /22) that has a next-hop (directly connected routes are ignored).
On top of that, we use the new policy BGP_from_local_LVS to only accept (and thus consider as contributing) prefixes learned from BGP with an AS path starting with 64600, which only matches the LVS in the same confederation subAS.

This means that if cr2-eqord loses BGP connectivity to eqiad, it won't have any contributing prefixes, and will remove the aggregate prefix.

I tested a simplified version of it in Juniper vLabs, and it's working as expected.

[edit routing-options rib inet6.0]
+    aggregate {
+        route 2620:0:860::/46 policy BGP_from_local_LVS;
+    }
[edit routing-options aggregate]
+    route 208.80.152.0/22 policy BGP_from_local_LVS;
[edit policy-options]
+   policy-statement BGP_from_local_LVS {
+       term BGP_local_LVS {
+           from {
+               protocol bgp;
+               as-path "^64600.*";
+           }
+           then accept;
+       }
+       then reject;
+   }
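
For readers less familiar with conditional aggregates, the behaviour described above can be approximated with a small Python sketch. This is only a model under stated assumptions, not how Junos implements it: the `Route` class, the function names and the example prefixes are all hypothetical, and the policy check is reduced to "protocol is bgp and the AS path begins with 64600".

```python
from dataclasses import dataclass
from ipaddress import ip_network

LVS_AS = 64600  # AS number the quoted policy expects at the start of the path


@dataclass(frozen=True)
class Route:
    """Hypothetical, minimal view of a route in the routing table."""
    prefix: str
    protocol: str          # e.g. "bgp" or "direct"
    as_path: tuple = ()    # sequence of AS numbers, leftmost first


def accepted_by_policy(route):
    """Rough equivalent of BGP_from_local_LVS: BGP routes whose AS path starts with 64600."""
    return route.protocol == "bgp" and route.as_path[:1] == (LVS_AS,)


def aggregate_is_active(aggregate, routes):
    """The aggregate exists only if at least one more-specific route passes the policy."""
    agg = ip_network(aggregate)
    for route in routes:
        net = ip_network(route.prefix)
        more_specific = (
            net.version == agg.version and net != agg and net.subnet_of(agg)
        )
        if more_specific and accepted_by_policy(route):
            return True
    return False


# Hypothetical routing table on cr2-eqord while the BGP path to the LVS is up.
healthy = [
    Route("208.80.154.224/32", "bgp", (64600,)),   # assumed LVS service address
    Route("208.80.154.0/26", "direct"),            # assumed local subnet
]
# Same table after the LVS-originated route has been withdrawn.
cut_off = [r for r in healthy if r.protocol != "bgp"]
```

With these inputs, `aggregate_is_active("208.80.152.0/22", healthy)` is true and the same call with `cut_off` is false, which mirrors the aggregate being withdrawn once no contributing prefix remains.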

We can tune that in several ways:

  • Only advertise eqiad's /23 and /48 instead of /46 + /22
  • Rename the policy to BGP_from_core_LVS and add 65002/65001 as valid AS_PATHs
  • Don't filter on AS_PATH and ensure POPs don't advertise eqiad/codfw prefixes to eqiad/codfw

It's also something that could be useful in eqdfw and knams, in case of a fiber cut there too.

That's an awesome idea, nice!

We can't advertise just the /23 + /48 from eqord as these would be more-specifics to what eqiad itself advertises - and thus all of the eqiad traffic would flow through eqord :)
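
The more-specifics problem comes down to longest-prefix matching, which the following Python sketch illustrates. The /23 used here (`208.80.154.0/23`) is an assumed stand-in for "eqiad's /23", and `best_route` is a hypothetical helper; real BGP best-path selection involves many more attributes, but prefix length is compared before any of them.

```python
from ipaddress import ip_address, ip_network


def best_route(destination, announcements):
    """Pick the announcement with the longest prefix covering `destination`.

    `announcements` is a list of (prefix, announcing_site) pairs.
    Returns (prefix, announcing_site), or None if nothing covers the address.
    """
    dst = ip_address(destination)
    matches = [
        (ip_network(prefix), site)
        for prefix, site in announcements
        if dst in ip_network(prefix)
    ]
    if not matches:
        return None
    net, site = max(matches, key=lambda m: m[0].prefixlen)
    return str(net), site


# eqiad announces the /22; eqord (hypothetically) announces only a /23 inside it.
announcements = [
    ("208.80.152.0/22", "eqiad"),
    ("208.80.154.0/23", "eqord"),   # assumed value for "eqiad's /23"
]
winner = best_route("208.80.154.10", announcements)
```

Here `winner` is the /23 via eqord, even though eqiad also covers the address with its /22: every remote network would prefer the eqord announcement for that range.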

Furthermore, I think it's fine for eqord to advertise the /22s even if it's cut off from eqiad but still (directly) connected to codfw. The problems we're trying to address here are a) being cut off entirely (an island), and b) being cut off from both eqiad and codfw, with traffic re-routing through ulsfo, i.e. internet -> eqord -> ulsfo -> codfw (-> eqiad), a very high-latency path. So indeed, we can add 65002 as a valid AS_PATH, as you suggest.

We can also add 65001 as a valid AS_PATH, and after that we can consider making eqord a separate confed subAS. The original idea of making it part of 65001 is something I've always had second thoughts on: it's roughly equidistant from eqiad and codfw, after all.

ayounsi added a comment. Edited Aug 14 2019, 9:23 PM

Sounds good. Here is the final version, which accepts both AS 65002 and AS 65001 as an optional leading AS to keep it generic.
Tested the regex using show route aspath-regex "^(65002|65001)? 64600.*"
Will push IPv6 first, then IPv4 24h later if everything is fine.

[edit routing-options rib inet6.0]
+    aggregate {
+        route 2620:0:860::/46 policy BGP_from_core_LVS;
+    }
[edit routing-options aggregate]
+    route 208.80.152.0/22 policy BGP_from_core_LVS;
[edit policy-options]
+   policy-statement BGP_from_core_LVS {
+       term BGP_core_LVS {
+           from {
+               protocol bgp;
+               as-path core_and_local_LVS;
+           }
+           then accept;
+       }
+       then reject;
+   }
[edit policy-options]
    as-path too-many-hops { ... }
+   as-path core_and_local_LVS "^(65002|65001)? 64600.*";
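
As a quick sanity check of what the as-path expression above is meant to accept, here is a Python approximation. Junos AS path regexes operate on whole AS numbers rather than characters, so this is a hand translation into Python's `re` syntax over a space-separated path. The translation and the function name are assumptions, and confederation segments are flattened into a plain list of AS numbers.

```python
import re

# Hand-translated equivalent of "^(65002|65001)? 64600.*":
#   - an optional first AS that is 65002 or 65001,
#   - then 64600,
#   - then any number of further AS numbers.
CORE_AND_LOCAL_LVS = re.compile(r"(?:(?:65002|65001) )?64600(?: \d+)*")


def matches_core_and_local_lvs(as_path):
    """Return True if the AS path (a sequence of ints) fits the pattern above."""
    text = " ".join(str(asn) for asn in as_path)
    return CORE_AND_LOCAL_LVS.fullmatch(text) is not None


examples = {
    (64600,): matches_core_and_local_lvs((64600,)),
    (65002, 64600): matches_core_and_local_lvs((65002, 64600)),
    (65003, 64600): matches_core_and_local_lvs((65003, 64600)),
    (65002, 65001, 64600): matches_core_and_local_lvs((65002, 65001, 64600)),
}
```

In this model a path is accepted when 64600 is first, or second behind exactly one of the two listed subAS numbers. A path crossing two subASes, such as 65002 65001 64600, or an unlisted one such as 65003 64600, is rejected.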

Mentioned in SAL (#wikimedia-operations) [2019-08-14T21:37:53Z] <XioNoX> advertise core v6 range (2620:0:860::/46) from eqord - T167841

Mentioned in SAL (#wikimedia-operations) [2019-08-15T16:27:44Z] <XioNoX> advertise core v4 range (208.80.152.0/22) from eqord - T167841

ayounsi updated the task description. Aug 15 2019, 6:51 PM