
Agree strategy for Kubernetes BGP peering to top-of-rack switches
Closed, ResolvedPublic

Description

The switches in rows E and F in Eqiad, and in drmrs, are set up to provide a routed access layer. In other words each rack has a different subnet / vlan, and the switch acts as the default gateway for local servers on that subnet.

This causes an issue for our Kubernetes hosts, which use the Calico framework to configure networking and set up BGP peerings to the physical network. Currently each K8s server peers with the two core routers at each site, to their loopback IPs, so the peering will establish no matter what row/subnet the server is in. The BGP peer IPs are defined globally for each site, for example:

https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/admin_ng/values/ml-serve-eqiad/calico-values.yaml

This works well for the existing setup, where all switches are layer-2 only and the core routers are the IP gateway for all hosts. With layer-3 access switches this will not work, however. The K8s hosts will instead need to form a BGP adjacency with the top-of-rack switch.

There appears to be some scope in Calico to define BGP peers per-node, using a label-selector:

https://projectcalico.docs.tigera.io/networking/bgp#configure-a-per-node-bgp-peer
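For reference, a per-node peer as described there is just a BGPPeer resource scoped to one node via its "node" field (or a nodeSelector). A rough sketch, with purely illustrative names, IPs and ASN:

apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: rack1-tor               # hypothetical name
spec:
  node: somenode.example.wmnet  # applies only to this node
  peerIP: 192.0.2.1             # illustrative ToR address
  asNumber: 64512               # illustrative ASN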

There may be other options also, host or network side. Creating this task to facilitate discussion on the topic and try to arrive at a solution.

Event Timeline


Change 786264 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] Add calico BGP peering settings for ml-serve100[5-8]

https://gerrit.wikimedia.org/r/786264

Created https://gerrit.wikimedia.org/r/786264 to kick off the discussion about the next steps, let me know if the configuration is ok or not :)

The follow-up question I have is about https://gerrit.wikimedia.org/r/c/operations/homer/public/+/784703 - should it stay the same, or should the new peers be the ToR IPs?

@elukey thanks for the patch, certainly looks ok to me, if indeed it works in terms of the Calico config :)

Regarding the Homer patch, the IPs are correct; however, we will need to move the config from sites.yaml to devices.yaml, so the config is applied only to the required switches rather than defined under the site itself. I'll prep a patch for that shortly.

I've got most of the BGP policy in place now; there was quite a bit of refactoring work to do based on discussions we'd had previously in netops. I'll go ahead and prep the other patch now, after which we can focus on the Calico side.

Change 786264 merged by Elukey:

[operations/deployment-charts@master] Add calico BGP peering settings for ml-serve100[5-8]

https://gerrit.wikimedia.org/r/786264

The above patch is working; however, I'm not 100% sure the resulting config is what we need. Looking, for instance, at ml-serve1005, it has established a BGP peering to the top-of-rack switch, but it is still trying (and failing) to connect to the CR routers:

cmooney@ml-serve1005:~$ ss -tulpna | grep ":179" | sort
tcp   ESTAB     0      0                       10.64.131.3:35021                   10.64.131.1:179         
tcp   ESTAB     0      0      [2620:0:861:10a:10:64:131:3]:47779           [2620:0:861:10a::1]:179         
tcp   LISTEN    0      8                           0.0.0.0:179                         0.0.0.0:*           
tcp   LISTEN    0      8                              [::]:179                            [::]:*           
tcp   SYN-SENT  0      1                       10.64.131.3:33441                208.80.154.197:179         
tcp   SYN-SENT  0      1                       10.64.131.3:59643                208.80.154.196:179         
tcp   SYN-SENT  0      1      [2620:0:861:10a:10:64:131:3]:40939          [2620:0:861:ffff::2]:179         
tcp   SYN-SENT  0      1      [2620:0:861:10a:10:64:131:3]:45565          [2620:0:861:ffff::1]:179

I assume this is because those peers are defined in the config as 'global', i.e. with no 'node' constraint:

cmooney@ml-serve1005:~$ sudo calicoctl get bgpPeer
NAME                 PEERIP               NODE                       ASN     
cr1-eqiad-ipv4       208.80.154.196       (global)                   14907   
cr1-eqiad-ipv6       2620:0:861:ffff::1   (global)                   14907   
cr2-eqiad-ipv4       208.80.154.197       (global)                   14907   
cr2-eqiad-ipv6       2620:0:861:ffff::2   (global)                   14907   
lsw1-e2-eqiad-ipv4   10.64.131.1          ml-serve1005.eqiad.wmnet   14907   
lsw1-e2-eqiad-ipv6   2620:0:861:10a::1    ml-serve1005.eqiad.wmnet   14907   
lsw1-e3-eqiad-ipv4   10.64.132.1          ml-serve1006.eqiad.wmnet   14907   
lsw1-e3-eqiad-ipv6   2620:0:861:10b::1    ml-serve1006.eqiad.wmnet   14907   
lsw1-f2-eqiad-ipv4   10.64.135.1          ml-serve1007.eqiad.wmnet   14907   
lsw1-f2-eqiad-ipv6   2620:0:861:10e::1    ml-serve1007.eqiad.wmnet   14907   
lsw1-f3-eqiad-ipv4   10.64.136.1          ml-serve1008.eqiad.wmnet   14907   
lsw1-f3-eqiad-ipv6   2620:0:861:10f::1    ml-serve1008.eqiad.wmnet   14907

I'm also curious how this config scales: can we define multiple hosts after "node" in calico-values.yaml, for when we have multiple nodes connected to each switch?

Or do we need to look instead at using nodeSelector and defining the ToR for each host somewhere? I see that the kubernetes* hosts have per-server variables defined in puppet here:

cmooney@wikilap:~/repos/puppet/hieradata/hosts$ grep failure-domain.beta.kubernetes.io *
kubernetes1005.yaml:  - failure-domain.beta.kubernetes.io/region=eqiad
kubernetes1005.yaml:  - failure-domain.beta.kubernetes.io/zone=row-a
kubernetes1006.yaml:  - failure-domain.beta.kubernetes.io/region=eqiad
kubernetes1006.yaml:  - failure-domain.beta.kubernetes.io/zone=row-c
kubernetes1007.yaml:  - failure-domain.beta.kubernetes.io/region=eqiad
kubernetes1007.yaml:  - failure-domain.beta.kubernetes.io/zone=row-a
kubernetes1008.yaml:  - failure-domain.beta.kubernetes.io/region=eqiad
kubernetes1008.yaml:  - failure-domain.beta.kubernetes.io/zone=row-a
kubernetes1009.yaml:  - failure-domain.beta.kubernetes.io/region=eqiad
kubernetes1009.yaml:  - failure-domain.beta.kubernetes.io/zone=row-b
kubernetes1010.yaml:  - failure-domain.beta.kubernetes.io/region=eqiad
kubernetes1010.yaml:  - failure-domain.beta.kubernetes.io/zone=row-b
kubernetes1011.yaml:  - failure-domain.beta.kubernetes.io/region=eqiad
kubernetes1011.yaml:  - failure-domain.beta.kubernetes.io/zone=row-c
kubernetes1012.yaml:  - failure-domain.beta.kubernetes.io/region=eqiad
kubernetes1012.yaml:  - failure-domain.beta.kubernetes.io/zone=row-c
kubernetes1013.yaml:  - failure-domain.beta.kubernetes.io/region=eqiad
kubernetes1013.yaml:  - failure-domain.beta.kubernetes.io/zone=row-d
kubernetes1014.yaml:  - failure-domain.beta.kubernetes.io/region=eqiad
kubernetes1014.yaml:  - failure-domain.beta.kubernetes.io/zone=row-d
kubernetes1015.yaml:  - failure-domain.beta.kubernetes.io/region=eqiad
kubernetes1015.yaml:  - failure-domain.beta.kubernetes.io/zone=row-b
kubernetes1016.yaml:  - failure-domain.beta.kubernetes.io/region=eqiad
kubernetes1016.yaml:  - failure-domain.beta.kubernetes.io/zone=row-d
kubernetes1017.yaml:  - failure-domain.beta.kubernetes.io/region=eqiad
kubernetes1017.yaml:  - failure-domain.beta.kubernetes.io/zone=row-a
kubernetes1018.yaml:  - failure-domain.beta.kubernetes.io/region=eqiad
kubernetes1018.yaml:  - failure-domain.beta.kubernetes.io/zone=row-a
kubernetes1019.yaml:  - failure-domain.beta.kubernetes.io/region=eqiad
kubernetes1019.yaml:  - failure-domain.beta.kubernetes.io/zone=row-b
kubernetes1020.yaml:  - failure-domain.beta.kubernetes.io/region=eqiad
kubernetes1020.yaml:  - failure-domain.beta.kubernetes.io/zone=row-c
kubernetes1021.yaml:  - failure-domain.beta.kubernetes.io/region=eqiad
kubernetes1021.yaml:  - failure-domain.beta.kubernetes.io/zone=row-d
kubernetes1022.yaml:  - failure-domain.beta.kubernetes.io/region=eqiad
kubernetes1022.yaml:  - failure-domain.beta.kubernetes.io/zone=row-b
kubernetes2005.yaml:  - failure-domain.beta.kubernetes.io/region=codfw
kubernetes2005.yaml:  - failure-domain.beta.kubernetes.io/zone=row-a
kubernetes2006.yaml:  - failure-domain.beta.kubernetes.io/region=codfw
kubernetes2006.yaml:  - failure-domain.beta.kubernetes.io/zone=row-b
kubernetes2007.yaml:  - failure-domain.beta.kubernetes.io/region=codfw
kubernetes2007.yaml:  - failure-domain.beta.kubernetes.io/zone=row-a
kubernetes2008.yaml:  - failure-domain.beta.kubernetes.io/region=codfw
kubernetes2008.yaml:  - failure-domain.beta.kubernetes.io/zone=row-a
kubernetes2009.yaml:  - failure-domain.beta.kubernetes.io/region=codfw
kubernetes2009.yaml:  - failure-domain.beta.kubernetes.io/zone=row-b
kubernetes2010.yaml:  - failure-domain.beta.kubernetes.io/region=codfw
kubernetes2010.yaml:  - failure-domain.beta.kubernetes.io/zone=row-b
kubernetes2011.yaml:  - failure-domain.beta.kubernetes.io/region=codfw
kubernetes2011.yaml:  - failure-domain.beta.kubernetes.io/zone=row-c
kubernetes2012.yaml:  - failure-domain.beta.kubernetes.io/region=codfw
kubernetes2012.yaml:  - failure-domain.beta.kubernetes.io/zone=row-c
kubernetes2013.yaml:  - failure-domain.beta.kubernetes.io/region=codfw
kubernetes2013.yaml:  - failure-domain.beta.kubernetes.io/zone=row-d
kubernetes2014.yaml:  - failure-domain.beta.kubernetes.io/region=codfw
kubernetes2014.yaml:  - failure-domain.beta.kubernetes.io/zone=row-d
kubernetes2015.yaml:  - failure-domain.beta.kubernetes.io/region=codfw
kubernetes2015.yaml:  - failure-domain.beta.kubernetes.io/zone=row-c
kubernetes2016.yaml:  - failure-domain.beta.kubernetes.io/region=codfw
kubernetes2016.yaml:  - failure-domain.beta.kubernetes.io/zone=row-d
kubernetes2017.yaml:  - failure-domain.beta.kubernetes.io/region=codfw
kubernetes2017.yaml:  - failure-domain.beta.kubernetes.io/zone=row-c
kubernetes2018.yaml:  - failure-domain.beta.kubernetes.io/region=codfw
kubernetes2018.yaml:  - failure-domain.beta.kubernetes.io/zone=row-a
kubernetes2019.yaml:  - failure-domain.beta.kubernetes.io/region=codfw
kubernetes2019.yaml:  - failure-domain.beta.kubernetes.io/zone=row-a
kubernetes2020.yaml:  - failure-domain.beta.kubernetes.io/region=codfw
kubernetes2020.yaml:  - failure-domain.beta.kubernetes.io/zone=row-b
kubernetes2021.yaml:  - failure-domain.beta.kubernetes.io/region=codfw
kubernetes2021.yaml:  - failure-domain.beta.kubernetes.io/zone=row-c
kubernetes2022.yaml:  - failure-domain.beta.kubernetes.io/region=codfw
kubernetes2022.yaml:  - failure-domain.beta.kubernetes.io/zone=row-d
kubestage1003.yaml:  - failure-domain.beta.kubernetes.io/region=eqiad
kubestage1003.yaml:  - failure-domain.beta.kubernetes.io/zone=row-b
kubestage1004.yaml:  - failure-domain.beta.kubernetes.io/region=eqiad
kubestage1004.yaml:  - failure-domain.beta.kubernetes.io/zone=row-d
kubestage2001.yaml:  - failure-domain.beta.kubernetes.io/region=codfw
kubestage2001.yaml:  - failure-domain.beta.kubernetes.io/zone=row-a
kubestage2002.yaml:  - failure-domain.beta.kubernetes.io/region=codfw
kubestage2002.yaml:  - failure-domain.beta.kubernetes.io/zone=row-b

Can we define an additional label there for the rack and match on that? The ml-serve nodes do not have individual files defined for them, though; is it perhaps possible to define this in hieradata/role/eqiad/ml_k8s/master.yaml?

We discussed this issue in the serviceops channel yesterday, and the idea is indeed to use labels. The ML clusters do not define failure-domain.beta.kubernetes.io labels at the moment, but we are going to add them ASAP (I'd say manually, with a comment in Hiera, to be deployed when the nodes are reimaged).
The idea for the new ToR-specific configuration is to add a label called wikimedia.org/node-location (or something similar), for example:

lsw1-f3-eqiad-ipv6:
  nodeSelector: wikimedia.org/node-location == lsw1-f3-eqiad
  asNumber: 14907
  peerIP: 2620:0:861:10f::1

And then we'll add the proper label to each Kubernetes node. This will allow multiple nodes in the same rack to share the same config.
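As a sketch of what that could look like on a node (label key and value follow the proposal above; ml-serve1005 is the rack E2 host from the earlier output):

apiVersion: v1
kind: Node
metadata:
  name: ml-serve1005.eqiad.wmnet
  labels:
    # proposed label that the per-rack BGPPeer nodeSelector would match on
    wikimedia.org/node-location: lsw1-e2-eqiad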

For the global Calico config, this is a really good point. As far as I know we configure Calico in a full-mesh setup, but this probably needs to change sooner than expected due to the row E/F configurations. One thing that we could do, to prevent ml-serve1005 from trying to establish BGP sessions with the cr{1,2}-eqiad routers, would be to change the Calico config like the following:

cr1-eqiad-ipv4:
  asNumber: 14907
  peerIP: "208.80.154.196"
  nodeSelector: !(wikimedia.org/node-location starts with 'lsw1-f') && !(wikimedia.org/node-location starts with 'lsw1-e')

This will change the topology of the BGP mesh though: some nodes will have multiple BGP sessions with the core routers, and others only one towards the ToR (more precisely, one for IPv4 and one for IPv6).

This will change the topology of the BGP mesh though: some nodes will have multiple BGP sessions with the core routers, and others only one towards the ToR (more precisely, one for IPv4 and one for IPv6).

In principle that doesn't seem like a huge issue, but I'm not 100% sure what it means for things like e.g. service IPs.

Thanks for the updates. Sounds like a good plan!

In terms of the configuration for the CR "nodeSelector" filter, I think we're probably best to change it to match both 'lsw' and 'asw':

nodeSelector: !(wikimedia.org/node-location starts with 'lsw') && !(wikimedia.org/node-location starts with 'asw')

We currently have both types with a routed access layer ('lsw' are the leaf nodes of the Eqiad EVPN spine/leaf in the new rows; the 'asw' in drmrs are layer-3 but retain the old naming, as there is no physical leaf-spine topology there). We'll have more 'lsw' in eqiad and codfw in the coming months.

This will change the topology of the BGP mesh though: some nodes will have multiple BGP sessions with the core routers, and others only one towards the ToR (more precisely, one for IPv4 and one for IPv6).

In principle that doesn't seem like a huge issue, but I'm not 100% sure what it means for things like e.g. service IPs.

Yeah I'd be surprised if that is going to make any practical difference.

On reflection, the above won't work if we're going to add the 'node-location' label for all existing hosts, which I assume is the intent. So the selector for the CR filter should probably be as follows:

nodeSelector: !(wikimedia.org/node-location starts with 'lsw')

If we get k8s nodes in drmrs that may have to change, but for now it's fine.

FWIW I pulled the locations for all the existing Kubernetes hosts in case it would help: P26711

However, this made me realize that a bunch of the K8s devices currently doing BGP to the core routers are VMs: P26713

For now that is not an issue, as we don't have any Ganeti clusters in the new Eqiad rows. Exactly how we will connect Ganeti hosts there has not been agreed. If we stick with row-wide VLANs/subnets for Ganeti, then we probably want to set the "node-location" to the Ganeti group the VM belongs to (see T262446), which corresponds to the row. The VMs can then be configured with the same gateway IP as their BGP peer across the row, resulting in a connection to the 'anycast gw' IP on their directly connected switch. But again, that's for down the road; we'll need to consider/test it in relation to the overall discussion on Ganeti in these rows.

That's great! FYI, thanks to John, rack and Ganeti cluster info from Netbox is available in Puppet, which could help automate the label assignment. T229397: Puppet: get data (row, rack, site, and other information) from Netbox

Similarly, the LLDP fact can be used to know what the ToR is.

If there is any kind of anycast with the k8s prefixes (same prefix advertised from multiple locations), we should also prepend the AS once on the core routers to keep path lengths consistent across the infra.

We discussed this issue in the serviceops channel yesterday, and the idea is indeed to use labels. The ML clusters do not define failure-domain.beta.kubernetes.io labels at the moment, but we are going to add them ASAP (I'd say manually, with a comment in Hiera, to be deployed when the nodes are reimaged).
The idea for the new ToR-specific configuration is to add a label called wikimedia.org/node-location (or something similar), for example:

lsw1-f3-eqiad-ipv6:
  nodeSelector: wikimedia.org/node-location == lsw1-f3-eqiad
  asNumber: 14907
  peerIP: 2620:0:861:10f::1

Yup, that was always the plan. My only gripe is that it's going to be a pretty long and cryptic config. But at least it's not going to be changing often. We'll have to deduplicate it a bit across clusters too, before it becomes unwieldy.

And then we'll add the proper label to each Kubernetes node. This will allow multiple nodes in the same rack to share the same config.

For the global Calico config, this is a really good point. As far as I know we configure Calico in a full-mesh setup, but this probably needs to change sooner than expected due to the row E/F configurations. One thing that we could do, to prevent ml-serve1005 from trying to establish BGP sessions with the cr{1,2}-eqiad routers, would be to change the Calico config like the following:

I fear we don't configure Calico in a full-mesh setup. It wouldn't work in our environment, as for all cross-rack/row communications (I am using legacy row terminology here, not the E/F one) the next hop must be the CRs, not another k8s node. The default behavior of iBGP (and thus of bird/Calico) is next-hop-self. That's configurable using keepOriginalNextHop in Calico, but it wasn't back when we implemented our setup. In any case, it wouldn't solve anything, as again we would need to go via the CRs. So, for a full-mesh setup to make sense right now it would require a layer-2 network connecting every k8s node to every other k8s node. But a full mesh also doesn't scale well, and we plan to scale to hundreds of nodes eventually.

The fact that the full mesh is disabled can be verified with:

akosiaris@ml-serve1006:~$ sudo calicoctl get bgpConfiguration
NAME      LOGSEVERITY   MESHENABLED   ASNUMBER   
default                 false         64606

as well as with a git grep nodeToNodeMeshEnabled in deployment-charts
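For reference, the resource behind that output would look roughly like this (reconstructed from the calicoctl output above, not copied from the repo):

apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  nodeToNodeMeshEnabled: false  # full mesh disabled
  asNumber: 64606               # cluster AS number shown above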

cr1-eqiad-ipv4:
  asNumber: 14907
  peerIP: "208.80.154.196"
  nodeSelector: !(wikimedia.org/node-location starts with 'lsw1-f') && !(wikimedia.org/node-location starts with 'lsw1-e')

This will change the topology of the BGP mesh though: some nodes will have multiple BGP sessions with the core routers, and others only one towards the ToR (more precisely, one for IPv4 and one for IPv6).

Yup. I may be wrong here, but this is what we want, right? We only have 1 ToR per rack anyway, and peering with the CRs isn't desirable in the new rows.

This will change the topology of the BGP mesh though: some nodes will have multiple BGP sessions with the core routers, and others only one towards the ToR (more precisely, one for IPv4 and one for IPv6).

In principle that doesn't seem like a huge issue, but I'm not 100% sure what it means for things like e.g. service IPs.

Service IPs are not advertised via BGP. We planned to do so, but it ended up being a mess with all the NAT that needed to happen, and we did not follow through after all. Gory details at T290967. There will be 0 repercussions.

That's great! FYI, thanks to John, rack and Ganeti cluster info from Netbox is available in Puppet, which could help automate the label assignment. T229397: Puppet: get data (row, rack, site, and other information) from Netbox

YES!!! I've looked at this recently and it's pretty awesome. We can get away from all those per-host Hiera files and on to something more automated (which should be usable in this case too).

Similarly, the LLDP fact can be used to know what the ToR is.

If there is any kind of anycast with the k8s prefixes (same prefix advertised from multiple locations), we should also prepend the AS once on the core routers to keep path lengths consistent across the infra.

No, I don't think we have that use case yet. We prefer to switch between DCs via the discovery. But good point for a future use case if it ever presents itself.

Finally, Calico's recommendations for an L3 interconnect fabric (like our case) are at https://projectcalico.docs.tigera.io/reference/architecture/design/l3-interconnect-fabric. They are a bit high-level, but they explore the various models and give recommendations. I don't think our model is any of those though?

If there is any kind of anycast with the k8s prefixes (same prefix advertised from multiple locations), we should also prepend the AS once on the core routers to keep path lengths consistent across the infra.

Yeah, I had considered that alright; in checking I found what Alex has confirmed (we don't have it right now).

My only gripe is that it's going to be a pretty long and cryptic config. But at least it's not going to be changing often. We'll have to deduplicate it a bit across clusters too, before it becomes unwieldy.

Yeah, apologies for that. Happy to help in whatever way we can to get this config together. At the very least, as we add new racks/switches we should be able to complete the definition/add the IP. The peer on each subnet will also be the default GW configured on the host, and we can probably ensure it's the first usable IP on the subnet. I suspect that doesn't help much, but I wanted to mention it.

So, for a full-mesh setup to make sense right now it would require a layer-2 network connecting every k8s node to every other k8s node. But a full mesh also doesn't scale well, and we plan to scale to hundreds of nodes eventually.

Correct. A full-mesh between the Calico/K8s nodes would effectively be "over the top" of the physical network. And the physical network then remains in the dark about what IPs are used where, which can be inefficient. Much better, IMO, to peer between the physical switches and end-hosts.

Finally, Calico's recommendations for an L3 interconnect fabric (like our case) are at https://projectcalico.docs.tigera.io/reference/architecture/design/l3-interconnect-fabric. They are a bit high-level, but they explore the various models and give recommendations. I don't think our model is any of those though?

The new switches in rows E/F are configured similarly to option 1 at the top of that link, in that we run OSPF + (EVPN) iBGP.

The page has sensible suggestions/designs, but I'm not sure there is any need for us to modify anything. eBGP to the K8s nodes makes sense, and re-using the same ASN on all of them makes configuration easier. The current configuration "fakes" our public ASN (14907) when peering from each ToR switch to the Calico nodes. That is a bit of a hack, but it means the nodes don't need to worry about what the ASN of the switch is, or whether they differ per rack/row/region etc.

If we needed to learn routes on K8s nodes via BGP, esp. those that were originated by other K8s nodes, then we might need to review things (by default they'd reject prefixes with their own ASN in the path). But as we use a static default route for outbound traffic that's not a concern right now.

If there is any kind of anycast with the k8s prefixes (same prefix advertised from multiple locations), we should also prepend the AS once on the core routers to keep path lengths consistent across the infra.

Yeah, I had considered that alright; in checking I found what Alex has confirmed (we don't have it right now).

My only gripe is that it's going to be a pretty long and cryptic config. But at least it's not going to be changing often. We'll have to deduplicate it a bit across clusters too, before it becomes unwieldy.

Yeah, apologies for that. Happy to help in whatever way we can to get this config together. At the very least, as we add new racks/switches we should be able to complete the definition/add the IP. The peer on each subnet will also be the default GW configured on the host, and we can probably ensure it's the first usable IP on the subnet. I suspect that doesn't help much, but I wanted to mention it.

No worries, we knew this work would need to be done when we reviewed the new rows design document. We'll have to see whether we'll need some automation to generate the config on our side more easily. I am not too worried about that; I expect most IPs/switch names to be relatively static. I'd just like to avoid duplication between the various Kubernetes clusters we have (https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters), as I expect they'll largely share the same configuration.

What's more interesting to me is that, given the plan to ramp up MediaWiki traffic to Kubernetes in the next 6 months or so, we'll need to increase the size of our Wikikube k8s cluster from the current 18 nodes to either the high tens or the low hundreds. Even in the legacy setup (pre rows E/F), adding new nodes requires manual, error-prone Gerrit changes like this one: 35b0c9a4832068d08. With the need to also keep node <-> ToR mappings around, this will become pretty error-prone and difficult to maintain. So, some automation for Homer to make adding new Kubernetes nodes to the ToR switch configuration easier will become of the essence.

Finally, Calico's recommendations for an L3 interconnect fabric (like our case) are at https://projectcalico.docs.tigera.io/reference/architecture/design/l3-interconnect-fabric. They are a bit high-level, but they explore the various models and give recommendations. I don't think our model is any of those though?

The new switches in rows E/F are configured similarly to option 1 at the top of that link, in that we run OSPF + (EVPN) iBGP.

It is the closest analogy for me as well, aside from the fact that we don't do the "As Per Rack" stuff (correct me if I am wrong).

The page has sensible suggestions/designs, but I'm not sure there is any need for us to modify anything. eBGP to the K8s nodes makes sense, and re-using the same ASN on all of them makes configuration easier. The current configuration "fakes" our public ASN (14907) when peering from each ToR switch to the Calico nodes. That is a bit of a hack, but it means the nodes don't need to worry about what the ASN of the switch is, or whether they differ per rack/row/region etc.

Agreed. This makes configuration way easier.

If we needed to learn routes on K8s nodes via BGP, esp. those that were originated by other K8s nodes, then we might need to review things (by default they'd reject prefixes with their own ASN in the path). But as we use a static default route for outbound traffic that's not a concern right now.

Regarding that, let me recap a bit, for posterity, on why we don't yet peer k8s nodes in a mesh:

  • In the rows A-D case, peering k8s nodes in a per-row mesh (peering across rows wouldn't work, as pointed out above) could provide a benefit, as traffic destined for another k8s node in the same row would not need to reach the core routers but would rather stay in the row. This would be an optimization for what is called East-West traffic within the row, and would remove some load from the upstream links between rows and core routers. The thing is, cumulative North-South and East-West traffic for the Kubernetes clusters peaks at 300MB/s (or 2.4Gbps), with the average being 235MB/s (or ~1.9Gbps)[1]. That's across 4 rows and 18 nodes, so we are talking about ~0.5Gbps per row. And we know that currently most of our traffic is probably (I don't think we have numbers though) North-South (our traffic is heavily end-user HTTP after all), not East-West (although the more we deploy microservices, the more this changes). So for now this isn't causing a problem, as the switch upstream links have 40Gbps capacity.
  • In the rows E/F case, peering k8s nodes in a per-rack mesh (peering across racks or rows wouldn't work, for the same reasons as above) wouldn't even provide that benefit: a) the amount of East-West traffic is going to be even smaller given the smaller number of machines in a rack; b) traffic would need to reach the ToR anyway, since it performs both switching and routing, even for intra-rack endpoints. Yes, switching may be cheaper than routing for the ToR, but unless we see an issue with the ToRs not being able to handle routing intra-row East-West traffic, there is little incentive to mesh k8s nodes.

Do correct the above wherever I am wrong, please.

[1] https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-site=eqiad&var-cluster=kubernetes&var-instance=All&var-datasource=thanos&viewPanel=84&from=now-30d&to=now

Even in the legacy setup (pre rows E/F), adding new nodes requires manual, error-prone Gerrit changes like this one: 35b0c9a4832068d08

Yes, I totally agree. One thing I had previously looked at, and thought was not an option for us, is the Juniper "dynamic neighbor" feature for BGP peers. Basically this allows you to define an IP range/subnet as the peer, after which the router will form a peering session with any remote device on the subnet that tries to establish one. I'd previously tested this and it works great; however, until recently it wouldn't work for us, as you could only configure a single peer-as for the "dynamic-peer", and we have various peer ASNs connected off all our private subnets. I've looked at this again today though, and Juniper have introduced a "peer as-list" command in version 20 which seems to allow for it:

https://phabricator.wikimedia.org/P27787

So we could basically add that kind of config to all top-of-rack switches by default, and no further per-server configuration would be needed on the network side. You can configure a password on the sessions too if that tightens things up security-wise. I'd need to run that by @ayounsi, but it seems to me to be a good way forward.

That doesn't help on the Calico side, which needs to know the specific IP of the top-of-rack switch. It would be great if there were some way to automagically have it use the first IPs on the local subnet range. There might be some way to always have the same "well known" peer IP on all top-of-racks too, but there are complexities there we'd need to work through.

let me recap a bit, for posterity, on why we don't yet peer k8s nodes in a mesh

Thanks for that. Certainly it could be considered for rows A-D, but if the east/west traffic is low I agree the benefit is slight. My preference would be to lean into the new, routed design like in rows E/F instead. As we migrate more rows to that setup the sub-optimal traffic flows go away. And fwiw modern switching silicon can route (L3) as fast as switch (L2), so there is no concern there in terms of performance.

Quick question about how to proceed: would it make sense to start testing by adding manual labels in the ml-serve-eqiad cluster (since we have new E/F nodes there) to see if everything works as expected, etc.? After this verification we could start thinking about how/where to get the node label info, and how to better share/maintain the Calico BGP configs.

If the idea is OK, I'd propose to use labels like wikimedia.org/node-location == lsw1-f3-eqiad, as mentioned beforehand. If you don't like the idea, tell me what you'd prefer; no strong opinions :)

@elukey yes I think that makes sense, no need to hold off on testing. Your suggested label naming makes sense so let's go with that.

I'm also planning to adjust the switch configs as described in my previous post. I've lab-tested it, but when we go to apply it to the switches there will be a brief interruption to the existing BGP sessions. We'll need to discuss the impact and work out how to sequence the change to avoid service disruption. Hopefully it won't be anything too tricky; I'll touch base with you when the change is prepped and we can discuss.

Thanks!

Even in the legacy setup (pre rows E/F), adding new nodes requires manual, error-prone Gerrit changes like this one: 35b0c9a4832068d08

Yes, I totally agree. One thing I had previously looked at, and thought was not an option for us, is the Juniper "dynamic neighbor" feature for BGP peers. Basically this allows you to define an IP range/subnet as the peer, after which the router will form a peering session with any remote device on the subnet that tries to establish one. I'd previously tested this and it works great; however, until recently it wouldn't work for us, as you could only configure a single peer-as for the "dynamic-peer", and we have various peer ASNs connected off all our private subnets. I've looked at this again today though, and Juniper have introduced a "peer as-list" command in version 20 which seems to allow for it:

https://phabricator.wikimedia.org/P27787

So we could basically add that kind of config to all top-of-rack switches by default, and no further per-server configuration would be needed on the network side. You can configure a password on the sessions too if that tightens things up security-wise. I'd need to run that by @ayounsi, but it seems to me to be a good way forward.

This is awesome!!! Thanks for that, TIL.

That doesn't help on the Calico side, which needs to know the specific IP of the top-of-rack switch. It would be great if there were some way to automagically have it use the first IPs on the local subnet range. There might be some way to always have the same "well known" peer IP on all top-of-racks too, but there are complexities there we'd need to work through.

The main issue, as I see it currently, is getting the local subnet data into helm. This isn't executed at the level of the k8s nodes, but rather on the deployment server (deploy1002), so we can't use anything the node (or puppet) knows about itself. We'd need e.g. to dump the relevant section of network/data/data.yaml from puppet into /etc/helm in some nicely addressable hash (keyed by DC+rack perhaps?) and use the Netbox puppet location data to match.
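To illustrate the idea (purely hypothetical file name and layout; the IPs are the lsw1-e2/e3 gateways already listed in this task):

# hypothetical /etc/helm/tor-gateways.yaml, keyed by site and rack,
# generated from puppet's network data
eqiad:
  e2:
    ipv4: 10.64.131.1
    ipv6: "2620:0:861:10a::1"
  e3:
    ipv4: 10.64.132.1
    ipv6: "2620:0:861:10b::1"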

I do see an alternative (which you might hate): creating "fake" k8s node objects to represent the leafs, and then using peerSelectors and nodeSelectors to match up nodes to "fake" nodes.

Both require some work, and I am not yet seeing very clearly all the inevitable shortcomings they'll have. One that comes to mind is e.g. keeping the "fake" k8s node objects updated. We'll need some experimentation.

And fwiw modern switching silicon can route (L3) as fast as switch (L2), so there is no concern there in terms of performance.

TIL. Thanks!

Quick question about how to proceed: would it make sense to start testing by adding manual labels in the ml-serve-eqiad cluster (since we have new E/F nodes there) to see if everything works as expected, etc.? After this verification we could start thinking about how/where to get the node label info, and how to better share/maintain the Calico BGP configs.

Yes, definitely.

If the idea is OK, I'd propose to use labels like wikimedia.org/node-location == lsw1-f3-eqiad, as mentioned beforehand. If you don't like the idea, tell me what you'd prefer; no strong opinions :)

I am fine with whatever for experimentation, provided they don't stick around.

Down the road, we already have

failure-domain.beta.kubernetes.io/region: eqiad
failure-domain.beta.kubernetes.io/zone: row-d

for all wikikube nodes. These aren't nicely automated yet (which actually makes it flexible right now that we are discussing this), but rather are just some YAML data under hosts/kubernetes*.yaml. I don't see ml-serve having any yet, which means we can just add them right now.

Those are well-known and standardized [1]. They are also deprecated and meant to be replaced by topology.kubernetes.io/zone and topology.kubernetes.io/region, tracked in T270191.
I'd much rather we ended up with the latter ones for this eventually, and not something that we invented. Region and zone anyway map nicely to DC and L2 equipment in both the legacy and the E/F row cases, in my mind.

[1] https://kubernetes.io/docs/reference/labels-annotations-taints

If the idea is OK, I'd propose to use labels like wikimedia.org/node-location == lsw1-f3-eqiad, as mentioned beforehand. If you don't like the idea, tell me what you'd prefer; no strong opinions :)

I am fine with whatever for experimentation, provided they don't stick around.

Down the road, we already have

failure-domain.beta.kubernetes.io/region: eqiad
failure-domain.beta.kubernetes.io/zone: row-d

for all wikikube nodes. These aren't nicely automated yet (which actually makes it flexible right now that we are discussing this), but rather are just some YAML data under hosts/kubernetes*.yaml. I don't see ml-serve having any yet, which means we can just add them right now.

Those are well-known and standardized [1]. They are also deprecated and meant to be replaced by topology.kubernetes.io/zone and topology.kubernetes.io/region, tracked in T270191.
I'd much rather we ended up with the latter ones for this eventually, and not something that we invented. Region and zone anyway map nicely to DC and L2 equipment in both the legacy and the E/F row cases, in my mind.

[1] https://kubernetes.io/docs/reference/labels-annotations-taints

Ack, I think that makes sense. Just to understand the mapping, at the moment we have something like:

lsw1-e2-eqiad-ipv4:
  node: ml-serve1005.eqiad.wmnet
  asNumber: 14907
  peerIP: 10.64.131.1

IIUC the peerIP is specific to the L2 device / ToR the host is connected to, whereas topology.kubernetes.io/zone and topology.kubernetes.io/region seem less granular (namely, I'd see them usable to map to rows, not to specific switches). I am missing something for sure; can you add an example of labeling for, say, ml-serve1005?
Should I use something like:

failure-domain.beta.kubernetes.io/region: eqiad
failure-domain.beta.kubernetes.io/zone: row-e2 (or lsw1-e2)

or

topology.kubernetes.io/zone: row-e2 (or lsw1-e2)
topology.kubernetes.io/region: eqiad

?

Ack, I think that makes sense. Just to understand the mapping, at the moment we have something like:

lsw1-e2-eqiad-ipv4:
  node: ml-serve1005.eqiad.wmnet
  asNumber: 14907
  peerIP: 10.64.131.1

This mapping, while fine for experimentation, won't really scale though. So, one thing we could do is replace node with nodeSelector and match on region and zone, e.g.

lsw1-e2-eqiad-ipv4: # though it does NOT need to be unique if we don't have to specify a peerIP
  nodeSelector: topology.kubernetes.io/region == 'eqiad' && topology.kubernetes.io/zone == 'rack-e2'
  asNumber: 14907
  peerIP: 10.64.131.1

This is already an improvement as we no longer need to create 1 stanza per k8s node (left to be tested ofc)

IIUC the peerIP is specific to the L2 device / ToR the host is connected to, whereas topology.kubernetes.io/zone and topology.kubernetes.io/region seem less granular (namely, I'd see them usable to map to rows, not to specific switches). I am missing something for sure; can you add an example of labeling for, say, ml-serve1005?

Definitely, the peerIP is unique to the switch. While I envision topology.kubernetes.io/region to only have 2 values (eqiad, codfw) for the foreseeable future, I think topology.kubernetes.io/zone will need to support both legacy rows and E/F racks. So the valid value set would be row-a, row-b, row-c, row-d, rack-e1, rack-e2, rack-e3...rack-f1, rack-f2.... This isn't really difficult to do; I already have the 1st part in https://gerrit.wikimedia.org/r/c/operations/puppet/+/791597 and I think I have a good idea on how to do the 2nd part (the data is anyway there thanks to @jbond, it's just a bit of if-clausing and mangling to make sure it's consistent).

Should I use something like:

failure-domain.beta.kubernetes.io/region: eqiad
failure-domain.beta.kubernetes.io/zone: row-e2 (or lsw1-e2)

or

topology.kubernetes.io/zone: row-e2 (or lsw1-e2)
topology.kubernetes.io/region: eqiad

?

The topology.kubernetes.io labels aren't supported pre-1.17, see T270191#7270115. For now we will have to go with the failure-domain.beta.kubernetes.io label scheme. As far as the zone and region keys go, refer to the valid value set I outlined above.

Now, the difficult part is to NOT need peerIP in the BGPPeer configuration. For that to happen, we need to create the fake node objects I was outlining above. In that case we would be limiting the bgpPeer setup to something like:

leafs:
  nodeSelector: topology.kubernetes.io/region == 'eqiad' && topology.kubernetes.io/zone == 'rack-e2'  # illustrative, don't copy-paste it
  asNumber: 14907
  peerSelector: node.kubernetes.io/instance-type == 'leaf-switch' && topology.kubernetes.io/region == 'eqiad' && topology.kubernetes.io/zone == 'rack-e2'  # some care to be taken that the leaf switch doesn't select itself, probably via a node.kubernetes.io/instance-type != 'leaf-switch' equivalent

That would reduce the complexity of the problem from keeping NxM mappings (nodes to leaf switches) to just populating M fake Kubernetes nodes alongside their IPs.
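An untested sketch of what one such fake node could look like as a Calico Node resource (object name and labels are illustrative; the IP is the lsw1-e2 gateway from the examples above):

apiVersion: projectcalico.org/v3
kind: Node
metadata:
  name: lsw1-e2-eqiad                 # hypothetical object name
  labels:
    node.kubernetes.io/instance-type: leaf-switch
    topology.kubernetes.io/region: eqiad
    topology.kubernetes.io/zone: rack-e2
spec:
  bgp:
    ipv4Address: 10.64.131.1/24       # ToR IP; prefix length illustrative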

Since I have never tried the above (although it seems supported), the alternative is to fall back to specifying the peerIPs and stick with a size-M mapping of leaf switches to a set of peerIP+nodeSelector stanzas. In this case, @cmooney's point about the leaf switch being the first IP in the subnet might be useful.

Note that this doesn't preclude you from proceeding with your testing. Proceed with it; you might uncover something that makes the entirety of the plans I outlined above unfeasible.

We should be aware that changing the topology annotations also changes scheduling behavior. As of now, the scheduler will try to schedule Pods of the same ReplicaSet across different rows (as zone contains just the row). Changing zone to include the rack will make each rack unique, so we lower the chance of a ReplicaSet's Pods spanning multiple rows. That might have unwanted implications in case of power or network issues in one row.

Regarding the "fake nodes": I think that could be done with adding the leafs as GlobalNetworkSet to the K8s/Calico API. That should make them easily selectable via peerSelectors without creating the confusion fake nodes would create.

That might have unwanted implications in case of power or network issues in one row.

That's fine; we're moving from a per-row redundancy model to a per-rack one, as the new network design (used in E/F) doesn't have the shortcomings of the older rows.

Added the proposed node labels to ml-serve-eqiad via T308418#7930118. At this point I'll wait to see what strategy is best to pick between GlobalNetworkSet and fake nodes, and then we'll be able to test on ml-serve.

In the meantime, should I do something like the following for the "global" BGPPeers configs?

BGPPeers:
  cr1-eqiad-ipv4:
    asNumber: 14907
    peerIP: "208.80.154.196"
    nodeSelector: (failure-domain.beta.kubernetes.io/region == eqiad) && (failure-domain.beta.kubernetes.io/zone == row-a || failure-domain.beta.kubernetes.io/zone == row-b || failure-domain.beta.kubernetes.io/zone == row-c || failure-domain.beta.kubernetes.io/zone == row-d)

The above would prevent hosts in rows E/F from trying to establish BGP sessions with the core routers.

Last but not least: should the control plane nodes, which now run Calico, be labeled with the failure-domain labels? AFAICS they are not labeled on the ServiceOps clusters (but I could be wrong), and the labels will be needed to match the restriction above for the sessions to the core routers. Thoughts?

Regarding the "fake nodes": I think that could be done with adding the leafs as GlobalNetworkSet to the K8s/Calico API. That should make them easily selectable via peerSelectors without creating the confusion fake nodes would create.

Reading that, I am under the impression it won't work because it only applies to network policies. Can't hurt to try though.

Added the proposed node labels to ml-serve-eqiad via T308418#7930118. At this point I'll wait to see what strategy is best to pick between GlobalNetworkSet and fake nodes, and then we'll be able to test on ml-serve.

In the meantime, should I do something like the following for the "global" BGPPeers configs?

BGPPeers:
  cr1-eqiad-ipv4:
    asNumber: 14907
    peerIP: "208.80.154.196"
    nodeSelector: (failure-domain.beta.kubernetes.io/region == eqiad) && (failure-domain.beta.kubernetes.io/zone == row-a || failure-domain.beta.kubernetes.io/zone == row-b || failure-domain.beta.kubernetes.io/zone == row-c || failure-domain.beta.kubernetes.io/zone == row-d)

The above would prevent hosts in rows E/F from trying to establish BGP sessions with the core routers.

+1

Last but not least: should the control plane nodes, which now run Calico, be labeled with the failure-domain labels? AFAICS they are not labeled on the ServiceOps clusters (but I could be wrong), and the labels will be needed to match the restriction above for the sessions to the core routers. Thoughts?

They have the node-role.kubernetes.io/master:NoSchedule taint, so nothing (aside from calico-node) will be scheduled on them. We can skip them for now while we test our setup; I expect (famous last words) it won't hurt (much) if they don't peer with cr{1,2}.

Plus, they are VMs, and we have the same problem we have with the kask dedicated nodes (also VMs): Netbox doesn't have (nor can it have, per my understanding) data about their placement. We need to figure out a strategy for having VMs as k8s nodes.

Regarding the "fake nodes": I think that could be done with adding the leafs as GlobalNetworkSet to the K8s/Calico API. That should make them easily selectable via peerSelectors without creating the confusion fake nodes would create.

Reading that, I am under the impression it won't work because it only applies to network policies. Can't hurt to try though.

Yeah. Some other docs suggest they can be used in selector fields in general, though.

They have the node-role.kubernetes.io/master:NoSchedule taint, so nothing (aside from calico-node) will be scheduled on them. We can skip them for now while we test our setup; I expect (famous last words) it won't hurt (much) if they don't peer with cr{1,2}.

Not 100% true currently (https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/777364). I guess we need to make sure/keep in mind not to run anything with a Pod IP on the masters then (e.g. only Pods in the host's network namespace), right?

Plus, they are VMs, and we have the same problem we have with the kask dedicated nodes (also VMs): Netbox doesn't have (nor can it have, per my understanding) data about their placement. We need to figure out a strategy for having VMs as k8s nodes.

It's on the TODO: T262446: Import row information into Netbox for Ganeti instances probably after Netbox is upgraded.

Regarding the "fake nodes": I think that could be done with adding the leafs as GlobalNetworkSet to the K8s/Calico API. That should make them easily selectable via peerSelectors without creating the confusion fake nodes would create.

Reading that, I am under the impression it won't work because it only applies to network policies. Can't hurt to try though.

Yeah. Some other docs suggest they can be used in selector fields in general, though.

They have the node-role.kubernetes.io/master:NoSchedule taint, so nothing (aside from calico-node) will be scheduled on them. We can skip them for now while we test our setup; I expect (famous last words) it won't hurt (much) if they don't peer with cr{1,2}.

Not 100% true currently (https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/777364).

Oh, good point. +1ed the change.

I guess we need to make sure/keep in mind not to run anything with a Pod IP on the masters then (e.g. only Pods in the host's network namespace), right?

Overall yes. There are few advantages and a lot of caveats in running workloads on masters. If we do create an exception, it should be well communicated and documented.

I merged two changes for the ml-serve-eqiad cluster, and now the concerns expressed in T306649#7881940 should be gone:

elukey@ml-serve1005:~$ ss -tulpna | grep ":179" | sort
tcp   ESTAB     0      0                       10.64.131.3:43807                   10.64.131.1:179         
tcp   ESTAB     0      0      [2620:0:861:10a:10:64:131:3]:60061           [2620:0:861:10a::1]:179         
tcp   LISTEN    0      8                           0.0.0.0:179                         0.0.0.0:*           
tcp   LISTEN    0      8                              [::]:179                            [::]:*

ml-serve1005 is in row E and it doesn't peer with cr{1,2} anymore. Of course the configuration needs to be improved and automated as described above, but for the moment:

  • ml-serve100[5-8] peer only with their ToRs in row E/F
  • ml-serve100[1-4] and ml-serve-ctrl100[1,2] peer with cr{1,2}-eqiad

I merged two changes for the ml-serve-eqiad cluster, and now the concerns expressed in T306649#7881940 should be gone:

elukey@ml-serve1005:~$ ss -tulpna | grep ":179" | sort
tcp   ESTAB     0      0                       10.64.131.3:43807                   10.64.131.1:179         
tcp   ESTAB     0      0      [2620:0:861:10a:10:64:131:3]:60061           [2620:0:861:10a::1]:179         
tcp   LISTEN    0      8                           0.0.0.0:179                         0.0.0.0:*           
tcp   LISTEN    0      8                              [::]:179                            [::]:*

ml-serve1005 is in row E and it doesn't peer with cr{1,2} anymore. Of course the configuration needs to be improved and automated as described above, but for the moment:

  • ml-serve100[5-8] peer only with their ToRs in row E/F
  • ml-serve100[1-4] and ml-serve-ctrl100[1,2] peer with cr{1,2}-eqiad

Super!

With T270191: Add kubernetes 1.17+ topology annotations I've changed the zone of the k8s Ganeti workers to their respective Ganeti cluster and group, like:

failure-domain.beta.kubernetes.io/region: eqiad
failure-domain.beta.kubernetes.io/zone: ganeti-eqiad-a

So I changed the default nodeSelector we use for the core router BGP peers in Calico to:

nodeSelector: (failure-domain.beta.kubernetes.io/region == 'eqiad' && failure-domain.beta.kubernetes.io/zone in { 'row-a', 'row-b', 'row-c', 'row-d', 'ganeti-eqiad-a', 'ganeti-eqiad-b', 'ganeti-eqiad-c', 'ganeti-eqiad-d' }) || has(node-role.kubernetes.io/master)

https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/868029/1
https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/868030/1

Change 887945 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/deployment-charts@master] [WIP] Refactor and centralize BGPpeer config

https://gerrit.wikimedia.org/r/887945

Change 889069 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] k8s FERM: allow gateway and infra ranges by default

https://gerrit.wikimedia.org/r/889069

Circling back on the network side config now that there are a few patches out to improve the server side.

Focusing on the first pain point: having to manually define the k8s cluster nodes either in Homer's sites.yaml or, more recently, for each relevant ToR switch, and then having to run Homer to push the configuration change. Time-consuming and error-prone.

I see two main paths forward (for k8s and server BGP in general).

The first is to leverage dynamic peers as demoed by @cmooney. The feature is supported on SONiC too (per the doc) and by other vendors, so there is no risk of vendor lock-in here. It's a great feature, but I wouldn't want to go all-in with it.
The suggestion is to use a mix of statically configured BGP peers for critical and infrequently changing services (e.g. LVS or anycast) and dynamic peers for any service willing to go that way.
The benefits I see are:

  • Reduced blast radius (a misconfiguration or a compromised server doesn't risk impacting static peers)
  • Keeps the improved monitoring (BGP down alerts) for statically configured peers
  • Flexibility for service owners needing such flexibility
  • Quick to implement (we could

However, dynamic peers do have limitations, and service owners should be aware of/OK with them:

  • Shared BGP configuration (in Cathal's example P27787, the policy-statement SERVER_BGP_IN is shared between all dynamic servers)
  • No export filtering possible on AS# for prefixes sent from routers to nodes (if needed, filters would have to happen on the server side, which is not possible with Calico)
    • This could cause some conflicts down the road if used by many different types of servers, but is fine for k8s
  • The filtering is done on AS# only (e.g. if a node claims to be from wikikube it will automatically have all the rights of that cluster)
    • Using a BGP key (supported by Calico but not tested) reduces the risk of an active attack, but not of server misconfiguration
  • No network side monitoring for BGP sessions status
  • BFD behavior still needs to be tested (+multi-hop)
  • Can't use Bird's external feature (see https://gerrit.wikimedia.org/r/c/operations/homer/public/+/827950), less an issue for service owners than for Netops

A possible rollout would be to implement it first on the row E/F ToR switches (simpler configuration, fewer nodes), then on the core routers.

The 2nd proposal, more complex to implement (and not incompatible with the first one), is to configure the (or most of the) statically configured peers from Netbox.
Based on a given criterion (e.g. hostname + status, or a custom field, TBD), Homer could pull the list of servers physically connected to a ToR (or in a given A-D row) and add the relevant BGP configuration to the router/ToR. This would add a safeguard (the Homer run), which can also be seen as a pain point.

Change 899563 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] admin_ng: Remove chart version pinning

https://gerrit.wikimedia.org/r/899563

Change 899563 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: Remove chart version pinning

https://gerrit.wikimedia.org/r/899563

Change 903174 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/cookbooks@master] Add cookbook to configure router's BGP sessions to k8s hosts

https://gerrit.wikimedia.org/r/903174

I pondered multiple options for the Netbox server_bgp custom field; feedback from ServiceOps is welcome, based on their workflows:

  • No custom field: only configure it based on the host's status. Simpler, but less flexibility to have a host up with no BGP sessions configured. I'd prefer to go that way if possible.
  • True/False: the current behavior; it decouples the host's status from its BGP status while still hiding the complexity from the user
  • Dropdown with the BGP group: would allow moving HOSTNAMES_TO_GROUPS from the cookbook to Netbox. More flexible (no need for a cookbook update), but also more risk of operator mistakes

Using this cookbook also means moving away from Homer for managing those BGP peers. It is possible to have both, but at an increased engineering cost for Junos sorting reasons, as well as duplicated code.
The main impact is that discrepancies between what's configured on the routers and what's set in Netbox (via server_bgp) won't be as visible as before. To me that's OK for narrow-scope items like these.
It will most likely lead to some Icinga BGP status alerting (e.g. sessions configured on the router while the host is being worked on), but also fewer "homer daily diff" emails for Netops. So still a win.
I also need to open a larger scope task to move the Icinga BGP status check to Alertmanager/Prometheus/Grafana for more fine grained alerting configuration and routing.

Once deemed production-ready, there are multiple possible improvements:

  • the cookbook could be called from other k8s related cookbooks
  • the cookbook (or a different one) could modify the server_bgp Netbox custom field
  • the server_bgp Netbox custom field could drive some of the servers' Bird config (e.g. a "standby" mode, where sessions are up but no prefixes are advertised).
  • the cookbook could be extended to other types of servers (e.g. DoH/anycast, /cc @ssingh), but LVS is out of the picture (at least for now) as it doesn't need to change regularly and is critical

Next steps here are:

  1. Get feedback/review on the above cookbook and logic
  2. Test the cookbook more
  3. Remove the management of groups Kubestage4/6 from Homer
  4. Deploy the cookbook

Example outputs:

debug output; server_bgp set to False; peer already configured
$ cookbook -d -c config.yaml sre.network.server-bgp kubestage1003
DRY-RUN: Executing cookbook sre.network.server-bgp with args: ['kubestage1003']
DRY-RUN: START - Cookbook sre.network.server-bgp for server kubestage1003
DRY-RUN: Configuration needed: False
DRY-RUN: Target router(s): ['cr1-eqiad.wikimedia.org', 'cr2-eqiad.wikimedia.org']
DRY-RUN: Fetching BGP status for peer 10.64.16.55
DRY-RUN: Executing commands ['show bgp neighbor 10.64.16.55 | display json'] on 1 hosts: cr1-eqiad.wikimedia.org
DRY-RUN: Fetching BGP status for peer 2620:0:861:102:10:64:16:55
DRY-RUN: Executing commands ['show bgp neighbor 2620:0:861:102:10:64:16:55 | display json'] on 1 hosts: cr1-eqiad.wikimedia.org
DRY-RUN: Executing commands ['configure exclusive;delete protocols bgp group Kubestage4 neighbor 10.64.16.55;delete protocols bgp group Kubestage6 neighbor 2620:0:861:102:10:64:16:55;show|compare;rollback;exit'] on 1 hosts: cr1-eqiad.wikimedia.org
----- OUTPUT of 'configure exclus...re;rollback;exit' -----
Entering configuration mode
[edit protocols bgp group Kubestage4]
-     neighbor 10.64.16.55 {
-         description kubestage1003;
-     }
[edit protocols bgp group Kubestage6]
-     neighbor 2620:0:861:102:10:64:16:55 {
-         description kubestage1003;
-     }
load complete
Exiting configuration mode
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'configure exclus...re;rollback;exit'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
==> Commit the above change?
Type "go" to proceed or "abort" to interrupt the execution
> abort
debug output; server_bgp set to True; peer already configured
$ cookbook -d -c config.yaml sre.network.server-bgp kubestage1003
DRY-RUN: Executing cookbook sre.network.server-bgp with args: ['kubestage1003']
DRY-RUN: START - Cookbook sre.network.server-bgp for server kubestage1003
DRY-RUN: Configuration needed: {'group': 'Kubestage', 'v4_only': False}
DRY-RUN: Target router(s): ['cr1-eqiad.wikimedia.org', 'cr2-eqiad.wikimedia.org']
DRY-RUN: Fetching BGP status for peer 10.64.16.55
DRY-RUN: Executing commands ['show bgp neighbor 10.64.16.55 | display json'] on 1 hosts: cr1-eqiad.wikimedia.org
DRY-RUN: Fetching BGP status for peer 2620:0:861:102:10:64:16:55
DRY-RUN: Executing commands ['show bgp neighbor 2620:0:861:102:10:64:16:55 | display json'] on 1 hosts: cr1-eqiad.wikimedia.org
DRY-RUN: Nothing to do on cr1-eqiad.wikimedia.org 🎉
DRY-RUN: Fetching BGP status for peer 10.64.16.55
DRY-RUN: Executing commands ['show bgp neighbor 10.64.16.55 | display json'] on 1 hosts: cr2-eqiad.wikimedia.org
DRY-RUN: Fetching BGP status for peer 2620:0:861:102:10:64:16:55
DRY-RUN: Executing commands ['show bgp neighbor 2620:0:861:102:10:64:16:55 | display json'] on 1 hosts: cr2-eqiad.wikimedia.org
DRY-RUN: Nothing to do on cr2-eqiad.wikimedia.org 🎉
DRY-RUN: END (PASS) - Cookbook sre.network.server-bgp (exit_code=0) for server kubestage1003
no debug output; server_bgp set to True; peer already configured
$ cookbook -c config.yaml sre.network.server-bgp kubestage1003
START - Cookbook sre.network.server-bgp for server kubestage1003
Nothing to do on cr1-eqiad.wikimedia.org 🎉
Nothing to do on cr2-eqiad.wikimedia.org 🎉
END (PASS) - Cookbook sre.network.server-bgp (exit_code=0) for server kubestage1003

Personally I think it's a big conceptual change to introduce a second, separate automation pipeline for the network.

For instance now we have the cookbook for configuring interfaces, for convenience, but Homer will also add the interface config. Homer is still the "authoritative" source for configuration, and while we do have manual elements I was operating on the premise that the goal was to eventually have Homer do everything.

Aside from duplication of code what are the blockers to having the Kubernetes groups also in Homer? My worry would be fragmentation over time, where configuring a new or replacement router becomes, instead of just "run homer" against it, "run these 6 cookbooks, and homer". And the fact you mention that this makes it more difficult to check running versus desired state (as we do currently with the Homer diffs).

I'm not really opposed to using cookbooks, to augment Homer and provide an easy way to add/remove things. Or even ditching Homer and moving completely to cookbooks or a different workflow. But I do worry about changing our automation to "this system configures features X and Y", "this one does Z". Seems convoluted.

On the Netbox side I'm happy with the current status, or having it as a dropdown. I think it's good to keep it separate from the server lifecycle state anyway.

Aside from duplication of code what are the blockers to having the Kubernetes groups also in Homer?

The cookbook adds the new BGP group items (neighbors) after the already configured peers. Homer, on the other hand, keeps them ordered (either sorted or in YAML order). So to have them both work in parallel, we would need the cookbook to get all the neighbors of a given group and insert the new one at the right position (sketched below).
Not insurmountable, but extra engineering work I'm not sure is worth it.
Note that in Junos, interfaces are the only exception: Junos always sorts newly added interface items itself.
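
For illustration, the extra ordering logic would look roughly like this (a sketch only; it assumes Homer renders the neighbors of a group sorted by address, and the example addresses besides kubestage1003's are made up):

import ipaddress

def insertion_anchor(existing_neighbors, new_neighbor):
    """Return the existing neighbor the new one must be inserted before so the
    result matches a sorted (Homer-style) rendering, or None if appending is fine."""
    new_ip = ipaddress.ip_address(new_neighbor)
    for neighbor in sorted(existing_neighbors, key=ipaddress.ip_address):
        if ipaddress.ip_address(neighbor) > new_ip:
            return neighbor
    return None

# e.g. neighbors already present in group Kubestage4, as ordered on the router:
existing = ["10.64.0.12", "10.64.16.55", "10.64.32.18"]
print(insertion_anchor(existing, "10.64.16.90"))  # -> 10.64.32.18

So instead of blindly appending, the cookbook would have to place the new neighbor before 10.64.32.18 (and handle the v4 and v6 groups separately) to avoid leaving a permanent Homer diff behind.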

I basically see this BGP peer management more as a server operation than a network operation. A cookbook will more easily integrate into ServiceOps (and other server teams') workflows and bring multiple advantages, especially speed and a limited blast radius.

And the fact you mention that this makes it more difficult to check running versus desired state (as we do currently with the Homer diffs).

The flip side here is SREs changing the custom field or the server's status and not running Homer (as already happens in other places), causing at best daily diffs for Netops to clean up, or blocking a different team that doesn't want to push a change unrelated to them.
That's why I think a cookbook plus better alerting is better suited to this use case.

My worry would be fragmentation over time, where configuring a new or replacement router becomes, instead of just "run homer" against it,

That's a valid concern. For this specific use case it's a tradeoff between day-to-day operations and infrequent changes, so I'm not worried: we can one-off copy existing configuration, or run the cookbook for all servers in the impacted row/rack (as it supports Cumin selectors).
More globally there is a risk that we "lose track" of what's configured on the routers (e.g. a host is decommissioned, the cookbook isn't run, and down sessions are left behind). I think this will be solved, or at least mitigated, with improved alerting (Netbox report and/or Prometheus).

Personally I think it's a big conceptual change to introduce a second, separate automation pipeline for the network.
But I do worry about changing our automation to "this system configures features X and Y", "this one does Z".

I agree. Here, as the scope is narrow enough (BGP peers of selected groups), I think the benefits outweigh the disadvantages of the alternative (having it implemented in Homer only), but I agree that there is a risk of drift.

Also keep in mind that the current cookbook-only approach doesn't lock us in: we can implement it in Homer (in addition to, or instead of, the cookbook) later on if needed.

A cookbook also fits better with a possible future gNMI/RESTCONF approach to automation.

Change 887945 merged by jenkins-bot:

[operations/deployment-charts@master] Refactor and centralize BGPpeer config

https://gerrit.wikimedia.org/r/887945

@ayounsi thanks for the response.

Overall I've no objection so let's proceed. I agree in terms of adding/removing the peerings it's more of a service-ops task, ideally there is no need for us to get involved.

As a general principle I think we should try to avoid having too many different systems that apply config. But this one change isn't that big a deal. Let's see how things evolve. If we have more cases like this as time goes on it might be worth investigating what the pros/cons of each method is, and investigate if there is a single tool that could cover all use cases.

Change 889069 merged by Ayounsi:

[operations/puppet@production] k8s FERM: allow gateway and infra ranges by default

https://gerrit.wikimedia.org/r/889069

ayounsi changed the task status from Open to Stalled. Apr 5 2023, 7:59 AM

Marking it as stalled until the cookbook is reviewed/merged.

Change 976749 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/software/homer/deploy@master] Expose Netbox's BGP servers to Homer

https://gerrit.wikimedia.org/r/976749

Change 903174 abandoned by Ayounsi:

[operations/cookbooks@master] Add cookbook to configure router's BGP sessions to k8s hosts

Reason:

Replaced by I600494399bd08c90840c823ca37dfbb9081eaa82

https://gerrit.wikimedia.org/r/903174

Mentioned in SAL (#wikimedia-operations) [2023-12-01T12:17:57Z] <XioNoX> add BGP custom field to Netbox - T306649

Change 979381 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] Get the server's BGP peer info from Netbox

https://gerrit.wikimedia.org/r/979381

Change 976749 merged by Ayounsi:

[operations/software/homer/deploy@master] Expose Netbox's BGP servers to Homer

https://gerrit.wikimedia.org/r/976749

Change 979381 merged by jenkins-bot:

[operations/homer/public@master] Get the server's BGP peer info from Netbox

https://gerrit.wikimedia.org/r/979381

Change 989136 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/software/homer/deploy@master] Add other vairant of QFX5120 to L3_SWITCHES_MODELS

https://gerrit.wikimedia.org/r/989136

Change 989136 merged by Cathal Mooney:

[operations/software/homer/deploy@master] Add other vairant of QFX5120 to L3_SWITCHES_MODELS

https://gerrit.wikimedia.org/r/989136