Change cloud-instance-transport vlan subnets from /30 to /29
Closed, Resolved (Public)

Description

When the cloudgw was first introduced, it was decided to use a /30 IPv4 subnet [[ https://netbox.wikimedia.org/search/?q=cloud-instance-transport&obj_type= | between the cloudgw and cloudnet (neutron) ]] servers, mostly to save on public IPv4 space. That contrasts with the /29 used on the cloudgw-transport vlan (between cloudsw and cloudgw), which allows the cloudsw to run VRRP with a dedicated IP on each switch.

When we hit certain Netbox admin discrepancies (see T295774), the config on this vlan was modified further to use a /32 IP on the Ethernet link on the cloudgw side. That change meant the cloudgw no longer saw the cloudnet next-hop IP as connected, so a workaround was deployed using static "onlink" routes. This was complicated by the choice of a /30 subnet, which meant that the non-active cloudgw had no IP on the subnet at all, and thus the next-hop workaround route could not be applied there (solved by making keepalived manage the routes, so they were only added when a cloudgw became active).

Ultimately this shouldn't be needed. There is a normal Ethernet subnet here and ARP should be used as specified. In relation to T295774, the VIPs in question here are not /32s on a loopback interface, and should be configured with the appropriate netmask on the host.

Further, the non-active node currently has no IP in the subnet, which makes it impossible to have normal static routes for networks behind neutron. To avoid that, the subnet on the vlan should be widened from /30 to /29, so a dedicated IP can be allocated to each cloudgw at all times. Keepalived still takes care of the VIP and moves it from one node to the other.

Luckily we could widen the existing subnets to /29 in both cases, so the existing elements can keep their current IPs, and we have assigned new per-device permanent IPs for the cloudgw.
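
The widening can be checked with a little subnet arithmetic. The following is a minimal sketch using Python's standard `ipaddress` module; the /30 prefixes are the ones shown in the `subnet show` output later in this task, and the helper names (`widen`, `usable_hosts`) are illustrative only, not part of any tooling used here.

```python
import ipaddress

# Existing /30 transport subnets, as reported by "subnet show" in this task.
OLD_SUBNETS = {
    "eqiad": ipaddress.ip_network("185.15.56.236/30"),
    "codfw": ipaddress.ip_network("185.15.57.8/30"),
}


def widen(net, new_prefix=29):
    """Return the enclosing network with the given (shorter) prefix length."""
    return net.supernet(new_prefix=new_prefix)


def usable_hosts(net):
    """Usable host addresses, excluding the network and broadcast addresses."""
    return list(net.hosts())


for site, old in OLD_SUBNETS.items():
    new = widen(old)
    # Every address usable in the old /30 must remain valid in the new /29.
    kept = all(ip in new for ip in usable_hosts(old))
    print(
        f"{site}: {old} ({len(usable_hosts(old))} usable) -> "
        f"{new} ({len(usable_hosts(new))} usable), existing IPs kept: {kept}"
    )
```

A /30 offers only two usable addresses, which the VIP and the cloudnet address already occupy; the enclosing /29 offers six, leaving room for a permanent address on each cloudgw.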

https://netbox.wikimedia.org/ipam/prefixes/353/ip-addresses/
https://netbox.wikimedia.org/ipam/prefixes/393/ip-addresses/

Creating this task to discuss and track progress on rolling out these changes on the cloudgws themselves.

Changes needed

  • Change the definition in hiera for keepalived to pick up
  • Recreate the cloud-gw-transport-codfw subnet in openstack with /29 cidr
    • This means having to remove and then create the port 185.15.57.10
    • Remove the network from the router cloudinstances2b-gw (unsure on this - cm)
    • Delete the cloud-gw-transport-codfw subnet
    • Re-Create the cloud-gw-transport-codfw subnet with the new /29
    • Update the router cloudinstances2b-gw with the new subnet
    • Create the port 185.15.57.10

Similar for eqiad

Event Timeline

cmooney triaged this task as Low priority.
Restricted Application added a subscriber: Aklapper.

Change 963298 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw: add an IPv4 address for each node in the cloudgw <-> neutron subnet

https://gerrit.wikimedia.org/r/963298

Change 963298 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudgw: add an IPv4 address for each node in the cloudgw <-> neutron subnet

https://gerrit.wikimedia.org/r/963298

The cloudgw side is now completed. We may want to refresh the neutron side as well:

aborrero@cloudcontrol1005:~$ sudo wmcs-openstack subnet show 77dba34f-c8f2-4706-a0b6-2a8ed4d91f51
+----------------------+--------------------------------------+
| Field                | Value                                |
+----------------------+--------------------------------------+
| allocation_pools     | 185.15.56.238-185.15.56.238          |
| cidr                 | 185.15.56.236/30                     |
| created_at           | 2021-05-06T15:35:19Z                 |
| description          |                                      |
| dns_nameservers      |                                      |
| dns_publish_fixed_ip | None                                 |
| enable_dhcp          | False                                |
| gateway_ip           | 185.15.56.237                        |
| host_routes          |                                      |
| id                   | 77dba34f-c8f2-4706-a0b6-2a8ed4d91f51 |
| ip_version           | 4                                    |
| ipv6_address_mode    | None                                 |
| ipv6_ra_mode         | None                                 |
| name                 | cloud-gw-transport-eqiad             |
| network_id           | 5c9ee953-3a19-4e84-be0f-069b5da75123 |
| project_id           | admin                                |
| revision_number      | 0                                    |
| segment_id           | None                                 |
| service_types        |                                      |
| subnetpool_id        | None                                 |
| tags                 |                                      |
| updated_at           | 2021-05-06T15:35:19Z                 |
+----------------------+--------------------------------------+

> The cloudgw side is now completed. We may want to refresh the neutron side as well:

It's for the best if we can, yes. It should just be the same IP with an updated netmask.

Things are working now, including ARP from the wider subnet to the cloudnet IP, but there is no point leaving them inconsistent.

cmooney@cloudgw1002:~$ sudo tcpdump -i vlan1107 -l -p -nn -e | grep 185.15.56.238
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on vlan1107, link-type EN10MB (Ethernet), snapshot length 262144 bytes
15:34:53.534081 bc:97:e1:e2:86:30 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 185.15.56.238 tell 185.15.56.234, length 28
15:34:53.534150 fa:16:3e:93:02:b2 > bc:97:e1:e2:86:30, ethertype ARP (0x0806), length 56: Reply 185.15.56.238 is-at fa:16:3e:93:02:b2, length 42
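
To make the inconsistency concrete: the ARP request in the capture comes from 185.15.56.234, which sits outside the /30 that neutron still has configured but inside the widened /29. A minimal sketch with Python's standard `ipaddress` module, using only addresses that appear in this task (the variable and function names are illustrative):

```python
import ipaddress

requester = ipaddress.ip_address("185.15.56.234")  # "tell" address in the capture
target = ipaddress.ip_address("185.15.56.238")     # cloudnet IP being resolved

neutron_net = ipaddress.ip_network("185.15.56.236/30")  # cidr still set in neutron
cloudgw_net = ipaddress.ip_network("185.15.56.232/29")  # widened subnet


def on_link(ip, net):
    """True if the address falls inside the given subnet."""
    return ip in net


for label, net in (("neutron /30", neutron_net), ("cloudgw /29", cloudgw_net)):
    print(
        f"{label}: requester on-link={on_link(requester, net)}, "
        f"target on-link={on_link(target, net)}"
    )
```

The target is on-link under both prefixes, while the requester is on-link only under the /29, so the two ends currently disagree about which addresses belong to the subnet.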

The equivalent codfw subnet also needs changing:

cmooney@cloudcontrol2005-dev:~$ sudo wmcs-openstack subnet show 2596edb4-5a40-41b9-9e67-f1f9e40e329c
+----------------------+--------------------------------------+
| Field                | Value                                |
+----------------------+--------------------------------------+
| allocation_pools     | 185.15.57.10-185.15.57.10            |
| cidr                 | 185.15.57.8/30                       |
| created_at           | 2020-10-09T08:48:11Z                 |
| description          |                                      |
| dns_nameservers      |                                      |
| dns_publish_fixed_ip | None                                 |
| enable_dhcp          | False                                |
| gateway_ip           | 185.15.57.9                          |
| host_routes          |                                      |
| id                   | 2596edb4-5a40-41b9-9e67-f1f9e40e329c |
| ip_version           | 4                                    |
| ipv6_address_mode    | None                                 |
| ipv6_ra_mode         | None                                 |
| name                 | cloud-gw-transport-codfw             |
| network_id           | 57017d7c-3817-429a-8aa3-b028de82cdcc |
| project_id           | admin                                |
| revision_number      | 0                                    |
| segment_id           | None                                 |
| service_types        |                                      |
| subnetpool_id        | None                                 |
| tags                 |                                      |
| updated_at           | 2020-10-09T08:48:11Z                 |
+----------------------+--------------------------------------+

Change 965708 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Change cloudgw VIPs to /29 so system can ARP from them

https://gerrit.wikimedia.org/r/965708

Change 965712 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Change codfw cloudgw VIPs to /29 so system can ARP from them

https://gerrit.wikimedia.org/r/965712

Change 965712 merged by Cathal Mooney:

[operations/puppet@production] Change codfw cloudgw VIPs to /29 so system can ARP from them

https://gerrit.wikimedia.org/r/965712

Change 965720 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] Add 'src' to ip route statement on cloudgw to ensure VIP used for ARP

https://gerrit.wikimedia.org/r/965720

Change 965720 abandoned by Cathal Mooney:

[operations/puppet@production] Add 'src' to ip route statement on cloudgw to ensure VIP used for ARP

Reason:

Cannot have route with VIP 'src' on backup cloudgw so won't work

https://gerrit.wikimedia.org/r/965720

dcaro raised the priority of this task from Low to High. Oct 13 2023, 12:42 PM
dcaro subscribed.

This is causing some issues and should be fixed sooner rather than later, so bumping the priority.

A stab in the dark at the commands needed in codfw, based on the man page and some guides (including the info Arturo had here).

wmcs-openstack port unset 1290224c-b1b4-4120-a1fe-70d25b28a3bf --fixed-ip
wmcs-openstack subnet delete 2596edb4-5a40-41b9-9e67-f1f9e40e329c
wmcs-openstack subnet create --network wan-transport-codfw --gateway 185.15.57.9 --no-dhcp --subnet-range 185.15.57.8/29 \
    --allocation-pool start=185.15.57.10,end=185.15.57.10 cloud-gw-transport-codfw
wmcs-openstack port set 1290224c-b1b4-4120-a1fe-70d25b28a3bf --fixed-ip subnet=cloud-gw-transport-codfw,ip-address=185.15.57.10
wmcs-openstack router set --external-gateway wan-transport-codfw --fixed-ip subnet=cloud-gw-transport-codfw,ip-address=185.15.57.10 cloudinstances2b-gw

What I'm not 100% sure about is the "port" stuff, and how that relates to the router's 'external gateway'. I suspect the router may create the port itself, so the second-to-last line may not be needed. That said, the first line is likely needed, as the "openstack router unset" man page doesn't show it allowing the external-gateway to be unset.

OK, we seem to have muddled through. For the record, the commands needed were as follows:

wmcs-openstack port unset 1290224c-b1b4-4120-a1fe-70d25b28a3bf --fixed-ip subnet=cloud-gw-transport-codfw,ip-address=185.15.57.10
wmcs-openstack subnet delete cloud-gw-transport-codfw
wmcs-openstack subnet create --network wan-transport-codfw --gateway 185.15.57.9 --no-dhcp --subnet-range 185.15.57.8/29 --allocation-pool start=185.15.57.10,end=185.15.57.10 cloud-gw-transport-codfw
wmcs-openstack router set --external-gateway wan-transport-codfw --fixed-ip subnet=cloud-gw-transport-codfw,ip-address=185.15.57.10 cloudinstances2b-gw
wmcs-openstack router set --disable-snat cloudinstances2b-gw --external-gateway wan-transport-codfw

dcaro changed the task status from Open to In Progress. Oct 20 2023, 8:11 AM
dcaro moved this task from To refine to Doing on the User-dcaro board.

Commands for the eqiad change later on:

wmcs-openstack port unset ca4cb8c7-bfb8-440b-8e41-74bb8e834717 --fixed-ip subnet=cloud-gw-transport-eqiad,ip-address=185.15.56.238
wmcs-openstack subnet delete cloud-gw-transport-eqiad
wmcs-openstack subnet create --network wan-transport-eqiad --gateway 185.15.56.237 --no-dhcp --subnet-range 185.15.56.232/29 --allocation-pool start=185.15.56.238,end=185.15.56.238 cloud-gw-transport-eqiad
wmcs-openstack router set --external-gateway wan-transport-eqiad --fixed-ip subnet=cloud-gw-transport-eqiad,ip-address=185.15.56.238 cloudinstances2b-gw
wmcs-openstack router set --disable-snat cloudinstances2b-gw --external-gateway wan-transport-eqiad
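
The codfw and eqiad sequences recorded in this task differ only in site-specific values, so they can be expressed as one template. The sketch below only builds and prints the command strings; it does not run anything against OpenStack. The `Site` dataclass and `commands` function are hypothetical helpers written for illustration, and every value is copied from the commands in this task.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    name: str          # datacenter suffix used in the subnet/network names
    port_id: str       # neutron port holding the cloudnet address
    subnet_range: str  # new /29 prefix
    gateway: str       # gateway address given to "subnet create"
    cloudnet_ip: str   # single address in the allocation pool
    router: str = "cloudinstances2b-gw"


def commands(site):
    """Build the command sequence for one site (strings only, nothing is run)."""
    subnet = f"cloud-gw-transport-{site.name}"
    network = f"wan-transport-{site.name}"
    fixed_ip = f"subnet={subnet},ip-address={site.cloudnet_ip}"

    # Sanity check: both addresses must fall inside the new prefix.
    net = ipaddress.ip_network(site.subnet_range)
    for ip in (site.gateway, site.cloudnet_ip):
        if ipaddress.ip_address(ip) not in net:
            raise ValueError(f"{ip} is not inside {net}")

    return [
        f"wmcs-openstack port unset {site.port_id} --fixed-ip {fixed_ip}",
        f"wmcs-openstack subnet delete {subnet}",
        f"wmcs-openstack subnet create --network {network} --gateway {site.gateway} "
        f"--no-dhcp --subnet-range {site.subnet_range} "
        f"--allocation-pool start={site.cloudnet_ip},end={site.cloudnet_ip} {subnet}",
        f"wmcs-openstack router set --external-gateway {network} "
        f"--fixed-ip {fixed_ip} {site.router}",
        f"wmcs-openstack router set --disable-snat {site.router} "
        f"--external-gateway {network}",
    ]


CODFW = Site("codfw", "1290224c-b1b4-4120-a1fe-70d25b28a3bf",
             "185.15.57.8/29", "185.15.57.9", "185.15.57.10")
EQIAD = Site("eqiad", "ca4cb8c7-bfb8-440b-8e41-74bb8e834717",
             "185.15.56.232/29", "185.15.56.237", "185.15.56.238")

for site in (CODFW, EQIAD):
    print("\n".join(commands(site)), end="\n\n")
```

The containment check is an addition for safety; it would catch a mistyped prefix before any subnet is deleted.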

Change 965708 merged by Cathal Mooney:

[operations/puppet@production] Change eqiad cloudgw VIPs to /29 so system can ARP from them

https://gerrit.wikimedia.org/r/965708

dcaro updated the task description.

This went as expected, and all the changes have been applied :)
Thanks a lot @cmooney !