
neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet
Closed, Declined · Public

Description

While doing T316284: Replace cloudnet100[34] with cloudnet100[56], we discovered that the neutron hosts running the l3 agent, as configured today, communicate using VRRP over VXLAN to implement the HA.

The setup, as is, forces them to be on the same Vlan and subnet.

In the config, this seems to be:

/etc/neutron/neutron.conf
# allow vxlan use for VRRP without
# enabling tenant created networks
l3_ha_network_type = vxlan

Evaluate if this is desirable or not.

Event Timeline


Ok yeah I see what is going on. Cloudnet1005 is running VXLAN over UDP port 8472 (the IANA-assigned port for Cisco's OTV protocol, a precursor to VXLAN). And yes, it's wrapping VRRP frames within that.

cmooney@wikilap:~$ tshark -r cloudnet1005.pcap -c 1 -V -d udp.port==8472,vxlan | egrep "^[A-Z]"
Frame 1: 104 bytes on wire (832 bits), 104 bytes captured (832 bits)
Ethernet II, Src: Broadcom_cd:51:e0 (5c:6f:69:cd:51:e0), Dst: IPv4mcast_01 (01:00:5e:00:00:01)
Internet Protocol Version 4, Src: 10.64.151.3, Dst: 224.0.0.1
User Datagram Protocol, Src Port: 38837, Dst Port: 8472
Virtual eXtensible Local Area Network
Ethernet II, Src: fa:16:3e:8b:e0:e9 (fa:16:3e:8b:e0:e9), Dst: IPv4mcast_12 (01:00:5e:00:00:12)
Internet Protocol Version 4, Src: 169.254.192.66, Dst: 224.0.0.18
Virtual Router Redundancy Protocol

It's using VXLAN in multicast mode, however, so the destination IP is 224.0.0.1. In theory that could work across subnets if we implemented routed multicast, but that really is a lot of work to set up and support. Without it, participating nodes need to be on the same Vlan, in which case the multicasts get treated as broadcasts (in the absence of IGMP snooping) and get sent to every host.
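
For anyone reading along, here is roughly that structure rebuilt with scapy (the addresses and ports are the ones from the capture above; the VNI, VRID and priority are made-up placeholders, I haven't checked the real values):

#!/usr/bin/env python3
# Rough scapy sketch of the VXLAN-encapsulated VRRP keepalive seen above.
from scapy.all import Ether, IP, UDP
from scapy.layers.vrrp import VRRP
from scapy.layers.vxlan import VXLAN

# Outer packet: cloudnet1005's prod-realm address to the all-hosts
# multicast group, UDP dst 8472 (the Linux kernel default for VXLAN).
outer = (
    Ether(dst="01:00:5e:00:00:01")
    / IP(src="10.64.151.3", dst="224.0.0.1")
    / UDP(sport=38837, dport=8472)
)

# Inner packet: the keepalived advertisement, itself multicast to the
# VRRP group 224.0.0.18 from the HA network's link-local address.
inner = (
    Ether(src="fa:16:3e:8b:e0:e9", dst="01:00:5e:00:00:12")
    / IP(src="169.254.192.66", dst="224.0.0.18")
    / VRRP(vrid=1, priority=100)  # vrid/priority are placeholders
)

pkt = outer / VXLAN(vni=42) / inner  # vni=42 is a placeholder
pkt.show()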

That's not terribly efficient. @arturo how many nodes are participating in this, i.e. sending VXLAN packets to each other? If it's just the two cloudnet hosts then ideally they could just send unicast packets to the fixed IP of the other, and thus not need to be on the same vlan. I'll try to dig into the OpenStack/Neutron docs to see what the options / typical deployment solutions are here.
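
To make that question concrete, a quick scapy sketch like the one below would show which hosts are actually sourcing the VXLAN-encap traffic in a capture like the one above (the filename and port are taken from my tshark run; just an illustration):

#!/usr/bin/env python3
# Tally the source IPs of VXLAN-encapsulated packets in the capture,
# to see how many nodes are actually participating.
from collections import Counter

from scapy.all import IP, UDP, rdpcap

senders = Counter()
for pkt in rdpcap("cloudnet1005.pcap"):
    if IP in pkt and UDP in pkt and pkt[UDP].dport == 8472:
        senders[pkt[IP].src] += 1

for src, count in senders.most_common():
    print(f"{src}: {count} VXLAN-encap packets")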

On the face of it the VXLAN does not seem to be adding much, i.e. it's not reducing the requirement for L2 adjacency versus just running plain VRRP.

@aborrero thanks.

Reading briefly through the docs I have a better understanding of what's going on. Looking through the options for the linuxbridge_agent I don't think there is any other way to run the VXLAN without using multicast.

Overall we'd be fairly reluctant to have to introduce routed multicast on the physical network just to support this (after having been able to remove it everywhere else on the network in recent years). So the only option in the short term is to keep the stretched Vlan1118 in place and the cloudnet hosts on it.

One thing that will help is that, as we move cloud hosts from the stretched prod-realm Vlan (1118 / cloud-hosts1-eqiad) to the rack-specific ones, there will be fewer ports connected to it, and thus the impact of sending the VXLAN packets everywhere will be reduced over time. I gather the hypervisor hosts do not send VXLAN-encap packets, or need to receive them? It's only the cloudnets that do that for the failure detection?

If that's the case, and in future Vlan1118 only has the cloudnet hosts on it, then the issue is largely mitigated, although longer term we should still explore options.

> I gather the hypervisor hosts do not send VXLAN-encap packets, or need to receive them? It's only the cloudnets that do that for the failure detection?
>
> If that's the case, and in future Vlan1118 only has the cloudnet hosts on it, then the issue is largely mitigated, although longer term we should still explore options.

I think this is correct. Hypervisors don't have keepalived or VRRP running.

But we do have keepalived running on cloudgw servers. So we may want to review them as well.

> But we do have keepalived running on cloudgw servers. So we may want to review them as well.

Ok thanks. Ultimately if it's only a small number of hosts that need this then leaving them on the shared Vlan, and moving the rest to the per-rack subnets, gets us most of the way there.

The main inefficiency with the current setup is that just a few hosts need to receive the VXLAN multicasts, but most of the cloud hosts are getting them, as they all sit on the same prod-realm Vlan.

As you've seen, the provisioning logic will add new hosts to the per-rack vlans by default (if a specific vlan is not requested). So as we move / refresh nodes they'll be taken out of this Vlan, and won't get the multicasts.

@aborrero just going through some tasks. I think perhaps we can close this.

How I'd sum it up:

  • If we could disable this and use a simple unicast ICMP ping or something, it would definitely be better.
  • But it's not a disaster if we keep the current setup
    • The VXLAN encap is a quirk for sure, but ultimately it's multicast frames over a layer-2 Vlan, so no different than "raw" VRRP to the network
    • As time goes on we'll continue to move the prod realm connection of cloud hosts from the stretched 'cloud-hosts1-eqiad' vlan to the rack-specific ones (cloud-hosts1-d5 etc).
      • This means eventually we can have a scenario where only the cloudnets are on Vlan1118, and thus the "broadcast" of these keepalives is constrained to those two hosts.

TL;DR not desirable, but not a total disaster. If there is an easy alternative then great, otherwise it can be supported (at least medium-term).

@aborrero feel free to close this one if it's not being worked on, the status quo is not perfect but is workable.

To sum up, the cloudnet hosts use VRRP to provide a VIP, which they need to be on the same Vlan to do. The cloudgw has a static route, with this VIP as the next-hop, for all the ranges it needs to send via neutron/cloudnet.

The way to remove the need for both cloudnets to be on the same vlan is to use BGP between cloudnet and cloudgw. Both cloudnets could announce the ranges they control, and if one died the cloudgw would stop using it as the BGP session drops.

FWIW we have exactly the same scenario between cloudsw and cloudgw on the other side. The cloudgws share a VIP on the same vlan to provide a next-hop for static routes the cloudsw has. BGP is again the way to improve on that.
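
Purely as an illustration of the sort of thing I mean (not a worked-out design), each cloudnet could run a small health-check process against something like ExaBGP's process API, announcing its ranges only while the local l3 agent looks healthy. The prefix and the health check below are placeholders:

#!/usr/bin/env python3
# Hypothetical sketch only: a health-check process for ExaBGP's "process"
# API on a cloudnet. It announces a placeholder prefix while the local
# neutron-l3-agent looks healthy and withdraws it otherwise, so the
# cloudgw would route via whichever cloudnet is still announcing.
import subprocess
import time

PREFIX = "172.16.0.0/21"  # placeholder, not a real range from our setup


def l3_agent_healthy() -> bool:
    # Placeholder check: is the neutron-l3-agent systemd unit active?
    return subprocess.run(
        ["systemctl", "is-active", "--quiet", "neutron-l3-agent"]
    ).returncode == 0


announced = False
while True:
    healthy = l3_agent_healthy()
    if healthy and not announced:
        # ExaBGP reads these commands from this process's stdout
        print(f"announce route {PREFIX} next-hop self", flush=True)
        announced = True
    elif not healthy and announced:
        print(f"withdraw route {PREFIX} next-hop self", flush=True)
        announced = False
    time.sleep(5)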

OK, closing for now and hoping some more modern BGP-based approach is introduced in the future.