cloudvps: eqiad1: review floating IP mechanisms
Closed, Resolved · Public

Description

It has been detected that allocating a new floating IP address using Horizon for the new eqiad1 deployment results in neutron allocating 3 different addresses, for example 185.15.56.13, 10.64.22.2 and 10.64.22.3.
Only the first address belongs to the proper pool; the other two belong to the transport subnet.

So we suspect there is a misconfiguration somewhere and neutron doesn't know which pool to use for floating IPs.
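
A quick way to confirm where the extra addresses live (a sketch, not captured output):

root@cloudcontrol1003:~# neutron floatingip-list
root@cloudcontrol1003:~# neutron port-list

The floating IP list shows the 185.15.56.x addresses, and the port list should reveal the spurious ports holding fixed IPs in the 10.64.22.0/24 transport range.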

Event Timeline

My initial thinking follows.

1-- architecture:
The high-level neutron network object wan-transport-eqiad contains 2 subnets:

  • one for transport 10.64.22.0/24 (subnet object name: cloud-instances-transport1-b-eqiad)
  • one for floating IPs 185.15.56.0/25 (subnet object name: cloud-eqiad1-floating)

Floating IPs should only be allocated from the public range subnet.
I can't find a way to tell neutron what the difference is between the 2 subnets (the transport WAN and the floating WAN), or to forbid allocation from the transport one.
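For reference, this is how to inspect the two subnet objects hanging off the network object (a sketch using the names from above; output omitted):

root@cloudcontrol1003:~# neutron net-show wan-transport-eqiad
root@cloudcontrol1003:~# neutron subnet-show cloud-instances-transport1-b-eqiad
root@cloudcontrol1003:~# neutron subnet-show cloud-eqiad1-floating

The net-show output lists both subnet UUIDs in its subnets field, but nothing in either subnet object marks one as transport-only and the other as the floating pool.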
Some ideas:

  • re-create the transport WAN subnet hacking the CIDR so neutron thinks it is already full and no allocation is possible (i.e., use something like 10.64.22.0/30)
  • don't have this transport WAN subnet object at all. This may impact the HA of the ingress, since the main IP of the router lives in this subnet and is managed by neutron for HA. It may require changes in the core routers as well.

2-- horizon
Horizon doesn't allow specifying a concrete subnet. Rather, it uses the higher-level network object, which in our case contains 2 subnet objects:

alloc.png (370×757 px, 21 KB)

In fact, the command line client has this option: neutron floatingip-create [...] [--subnet SUBNET_ID] FLOATING_NETWORK
(note: the network object is mandatory, the subnet object is optional)
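
So, from the CLI, the allocation could be pinned to the public range like this (a sketch; the placeholder stands for the cloud-eqiad1-floating subnet ID):

root@cloudcontrol1003:~# neutron floatingip-create --subnet <cloud-eqiad1-floating-subnet-id> wan-transport-eqiad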

In the future, we may have more public ranges available for floating IP allocations, so I think horizon should let users choose between subnet objects rather than high-level network objects.

(related docs: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Neutron#example)

I'm not sure what the impact of changing cloud-instances-transport1-b-eqiad would be. In theory there are no available floating IPs in the 10.64.22.0/24 pool, AFAIR. My understanding was that you had to not only have available IPs but also mark them as part of an available floating pool, so I'm confused about what's getting allocated there. Either I misunderstand, or horizon and neutron are doing something really weird, or both. Getting rid of cloud-instances-transport1-b-eqiad entirely probably has a whole other host of issues, but isn't necessarily impossible.

My vote would be hardcoding available SUBNET_IDs in horizon somehow and/or figuring out how to tell Neutron that only certain subnets are user-allocatable. It feels like that functionality should exist natively, but maybe it's not yet present in Mitaka.

> so I think horizon should let users choose between subnet objects rather than high-level network objects.

Hardcoding in horizon which subnets users are allowed to use for floating IPs seems like the best current hack, if it's easy enough to mimic the CLI calls.

aborrero triaged this task as Medium priority. Aug 21 2018, 4:15 PM

Mentioned in SAL (#wikimedia-cloud) [2018-08-23T13:08:06Z] <arturo> T202115 root@cloudcontrol1003:~# neutron subnet-update --allocation-pool start=10.64.22.254,end=10.64.22.254 e4fb2771-a361-4add-ac4e-280cc300c59f

Mentioned in SAL (#wikimedia-cloud) [2018-08-23T13:10:28Z] <arturo> T202115 (was {"start": "10.64.22.2", "end": "10.64.22.254"} )

Mentioned in SAL (#wikimedia-cloud) [2018-08-23T13:15:13Z] <arturo> T202115 root@cloudcontrol1003:~# neutron subnet-update --allocation-pool start=10.64.22.4,end=10.64.22.4 e4fb2771-a361-4add-ac4e-280cc300c59f

It seems that restricting the allocation pool to 1 IP address (already assigned to the virtual router itself) prevents Neutron from creating floating IPs from this subnet.
What a hack, but it may work...
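
To double-check the hack, allocating a new floating IP should now only draw from the public range (a sketch; expected behaviour, not captured output):

root@cloudcontrol1003:~# neutron floatingip-create wan-transport-eqiad

Expected: the floating_ip_address field falls in 185.15.56.0/25 and no extra 10.64.22.x addresses get allocated.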

@Andrew could we consider this issue solved?

For the record, the solution was to restrict the allocation pool to a single IP which is already assigned (to the virtual router).

Feel free to reopen if more issues or questions appear.