When the cloudgw was first introduced, it was decided to use a /30 IPv4 subnet [[ https://netbox.wikimedia.org/search/?q=cloud-instance-transport&obj_type= | between the cloudgw and cloudnet (neutron) ]] servers, mostly to save public IPv4 space. This contrasts with the /29 used on the cloudgw-transport vlan (between cloudsw and cloudgw), which allows the cloudsw to run VRRP with a dedicated IP on each switch.
When we hit certain Netbox admin discrepancies (see T295774), the config on this vlan was modified further to use a /32 IP on the Ethernet link on the cloudgw side. As a result, the cloudgw no longer saw the cloudnet next-hop IP as connected, and a workaround using static "onlink" routes was deployed. This was complicated by the choice of a /30 subnet: the non-active cloudgw had no IP on the subnet at all, so the next-hop workaround route could not be applied there. That was solved by making keepalived manage the routes, so they are only added when a cloudgw becomes active.
Ultimately this shouldn't be needed. This is a normal Ethernet subnet, and ARP should be used as specified. With regard to T295774, the VIPs in question are not /32s on a loopback interface, and should be configured with the appropriate netmask on the host.
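To illustrate why the /32 configuration needed the "onlink" workaround, here is a minimal sketch using Python's standard `ipaddress` module. The addresses are from the 192.0.2.0/24 documentation range and are purely hypothetical, not the real allocations; the helper name `is_on_link` is also just for illustration.

```python
import ipaddress


def is_on_link(local_interface: str, next_hop: str) -> bool:
    """Return True if next_hop falls inside the connected network
    implied by the address/netmask configured on the interface."""
    iface = ipaddress.ip_interface(local_interface)
    return ipaddress.ip_address(next_hop) in iface.network


# Hypothetical addresses (documentation range), not the real allocations.
NEXT_HOP = "192.0.2.2"

# With a /32 on the interface, the next hop is outside the connected
# network, so a plain static route via it needs an "onlink" hint.
with_slash32 = is_on_link("192.0.2.1/32", NEXT_HOP)

# With the real subnet mask configured, the next hop is directly connected.
with_slash30 = is_on_link("192.0.2.1/30", NEXT_HOP)

print(with_slash32, with_slash30)
```

With the appropriate netmask on the host the next hop is seen as connected, which is the behaviour argued for above.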
Further, the non-active node currently has no IP in the subnet, which makes it impossible to have normal static routes for the networks behind neutron. To avoid this, the subnet on the vlan should be widened from /30 to /29, so a dedicated IP can be allocated to each cloudgw at all times. Keepalived still takes care of the VIP and moves it from one cloudgw to the other.
Luckily we could widen the existing subnets to /29 in both cases, so the existing elements can keep their current IPs, and we have assigned new per-device permanent IPs for the cloudgw.
https://netbox.wikimedia.org/ipam/prefixes/353/ip-addresses/
https://netbox.wikimedia.org/ipam/prefixes/393/ip-addresses/
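The widening described above can be sanity-checked with Python's standard `ipaddress` module. This is only a sketch: it assumes the port address 185.15.57.10 mentioned in this task currently sits in a /30, and derives both prefixes from that address rather than from Netbox.

```python
import ipaddress

# Port address taken from this task; its current /30 is an assumption.
port_ip = ipaddress.ip_address("185.15.57.10")

# Network containing the port when a /30 mask is applied.
old_net = ipaddress.ip_interface(f"{port_ip}/30").network

# Widen the prefix by one bit to get the enclosing /29.
new_net = old_net.supernet(new_prefix=29)

old_hosts = list(old_net.hosts())
new_hosts = list(new_net.hosts())

print(f"old: {old_net} ({len(old_hosts)} usable hosts)")
print(f"new: {new_net} ({len(new_hosts)} usable hosts)")

# Existing addresses stay valid in the wider prefix.
print(all(host in new_net for host in old_hosts))
```

A /30 has only two usable host addresses, while a /29 has six, which leaves room for the VIP, a permanent IP per cloudgw, and the neutron side.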
Creating this task to discuss and track progress on rolling out these changes on the cloudgws themselves.
Changes needed:
- Change the definition in hiera for keepalived to pick up
- Recreate the cloud-gw-transport-codfw subnet in openstack with a /29 CIDR
  - This means having to remove and then create the port 185.15.57.10
  - Remove the network from the router cloudinstances2b-gw (unsure on this - cm)
  - Delete the cloud-gw-transport-codfw subnet
  - Recreate the cloud-gw-transport-codfw subnet with the new /29
  - Update the router cloudinstances2b-gw with the new subnet
  - Create the port 185.15.57.10
Similar steps are needed for eqiad.
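The codfw steps above could be sketched as a dry-run plan that only prints `openstack` CLI commands and executes nothing. Several things here are assumptions: `NETWORK_NAME` and `PORT_NAME` are placeholders, the /29 CIDR is derived from the port address, and mapping "remove the network from the router" to `router remove subnet` is a guess (the step is marked as unsure above; if the network is the router's external gateway, a different command would be needed).

```python
def recreate_subnet_plan(router, subnet, network, new_cidr, port_name, port_ip):
    """Build the ordered list of openstack CLI commands (dry run only)."""
    return [
        # Remove the existing port before touching the subnet.
        ["openstack", "port", "delete", port_name],
        # Assumption: the network is attached as a router subnet interface.
        ["openstack", "router", "remove", "subnet", router, subnet],
        ["openstack", "subnet", "delete", subnet],
        # Recreate the subnet with the wider prefix.
        ["openstack", "subnet", "create", "--network", network,
         "--subnet-range", new_cidr, subnet],
        ["openstack", "router", "add", "subnet", router, subnet],
        # Recreate the port with its original address.
        ["openstack", "port", "create", "--network", network,
         "--fixed-ip", f"subnet={subnet},ip-address={port_ip}", port_name],
    ]


plan = recreate_subnet_plan(
    router="cloudinstances2b-gw",
    subnet="cloud-gw-transport-codfw",
    network="NETWORK_NAME",        # placeholder, not the real network name
    new_cidr="185.15.57.8/29",     # assumed: the /29 containing the port IP
    port_name="PORT_NAME",         # placeholder, not the real port name
    port_ip="185.15.57.10",
)

for command in plan:
    print(" ".join(command))
```

Printing the plan first makes it easy to review the order of operations on the task before anyone runs them by hand.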