
codfw: more vlans setup changes in the cloudgw PoC
Closed, ResolvedPublic

Description

Enable the following switch setup for cloudgw (labtestvirt2003). Specifically, the end state should be:

  • 2118 - cloud-hosts1-codfw (10.192.20.0/24) for eno1, untagged port.
  • 2107 - cloud-gw-transport-codfw (185.15.57.144/31) [new allocation], for eno2, tagged in a trunk with vlan 2120
  • 2120 - cloud-instances-transport1-codfw (208.80.153.184/29) [in the future 185.15.57.128/29] for eno2, tagged in a trunk with vlan 2107

On the cloudnet side, we would need to:

  • drop vlan 2120 - cloud-instance-transport1-b-codfw
  • add vlan 2107 - cloud-gw-transport-codfw
NOTE: we are not doing the bonding experiment for now.

The PoC consists of the labtestvirt2003 server running with the puppet role for cloudgw.

image.png (276×900 px, 61 KB)

More info: https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/2020_Network_refresh/Implementation_details#codfw

Event Timeline

aborrero renamed this task from cofdw: enable more vlans in the cloudgw PoC to cofdw: more vlans setup changes in the cloudgw PoC.Sep 29 2020, 11:35 AM
aborrero updated the task description.

I updated the task description to include info on the native vlan we need in order to install the server.

Change 630812 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: cloudgw: introduce native vlan for easier reimaging

https://gerrit.wikimedia.org/r/630812

cloud-hosts1-codfw is now trunked as the native vlan, along with cloud-instances-transport1-codfw.

About vlan 2107, how many IPs do you need? If similar to the diagram a /31 should be enough.

Change 630812 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: cloudgw: introduce native vlan for easier reimaging

https://gerrit.wikimedia.org/r/630812

When I try to reimage the server, the DHCP boot won't work. I believe the vlan configuration is right, and therefore suspect there is something going on in the link aggregation config.

I see we are using the LACP active mode, and I wonder if we should use passive instead, so the bonding is not seen as created until the server has come online and configured the bonding on its side:

--- old.txt	2020-10-02 11:39:08.757002938 +0200
+++ new.txt	2020-10-02 11:42:35.365170394 +0200
@@ -4,7 +4,7 @@
         mtu 9192;
         aggregated-ether-options {
             lacp {
-                active;
+                passive;
                 periodic fast;
             }
         }

If this doesn't work, an alternative to explore would be to leave eth0 (or whatever the name is) out of the port aggregation, keep it only for SSH/DHCP, and use the other NICs to build the bonding just for the dataplane.

The setup above was declined by @ayounsi.

I've sent an email to @Papaul to ask whether it would be possible to let the HP iLO know that the 2 NICs in the server are working in LACP mode, which is another option to explore.

The keyword you're looking for is most likely force-up, which would mean introducing a snowflake in our config and automation while we're moving away from manual changes.
It also defeats the purpose of LACP if left in place in the long run, as it would keep a faulty interface up.

Reserved 185.15.57.8/31. I'll create the vlan early next week.

Vlan cloud-gw-transport-codfw created and trunked to labtestvirt2003

New update: in order to work around the constraints we found when working with the bonding+trunking setup, and per suggestion by @ayounsi, I think we should:

  • separate control plane and data plane interfaces for now, at least for the PoC
  • that means using eth0 (or whatever the name is on the Linux server) for control plane networking (ssh, the cloud-hosts1-codfw 10.192.20.0/24 subnet)
  • using eth1 (or whatever the name is on the Linux server) for data plane (a tagged vlan trunk with vlans 2107 and 2120)

These changes can be made anytime, hopefully as soon as possible. I will prepare the corresponding puppet patch to update the network config on the server side.

Change 632659 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: cloudgw: refresh network setup

https://gerrit.wikimedia.org/r/632659

Change 632659 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: cloudgw: refresh network setup

https://gerrit.wikimedia.org/r/632659

aborrero updated the task description.
ayounsi renamed this task from cofdw: more vlans setup changes in the cloudgw PoC to codfw: more vlans setup changes in the cloudgw PoC.Oct 7 2020, 12:30 PM

Pushed the following change, you should be good to go!

[edit interfaces interface-range vlan-cloud-hosts1-b-codfw]
     member ge-8/0/5 { ... }
+    member ge-1/0/11;
[edit interfaces ge-1/0/11]
-    ether-options {
-        802.3ad ae3;
-    }
[edit interfaces ge-1/0/12]
-    ether-options {
-        802.3ad ae3;
-    }
[edit interfaces ge-1/0/12]
+    unit 0 {
+        family ethernet-switching {
+            interface-mode trunk;
+            vlan {
+                members [ cloud-instance-transport1-b-codfw cloud-gw-transport-codfw ];
+            }
+        }
+    }
[edit interfaces]
-   ae3 {
-       description labtestvirt2003;
-       native-vlan-id 2118;
-       mtu 9192;
-       aggregated-ether-options {
-           lacp {
-               active;
-               periodic fast;          
-           }
-       }
-       unit 0 {
-           family ethernet-switching {
-               interface-mode trunk;
-               vlan {
-                   members [ cloud-hosts1-b-codfw cloud-instance-transport1-b-codfw cloud-gw-transport-codfw ];
-               }
-           }
-       }
-   }

ok thanks! It works. I'm now able to reimage labtestvirt2003 (cloudgw).

One last change (hopefully) we need in order to see traffic flowing:

On the cloudnet side, we would need to:

  • drop vlan 2120 - cloud-instance-transport1-b-codfw
  • add vlan 2107 - cloud-gw-transport-codfw

Or perhaps simply add 2107 and leave 2120 in place, so there is less work to do in case we need to unwind the work in the future.

Pushed:

[edit interfaces interface-range cloud-net-trunk unit 0 family ethernet-switching vlan]
-       members [ cloud-instances2-b-codfw cloud-instance-transport1-b-codfw ];
+       members [ cloud-instances2-b-codfw cloud-instance-transport1-b-codfw cloud-gw-transport-codfw ];

Change 632904 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] hieradata: labtestvirt2003: refresh network data for cloudgw PoC with latest allocations

https://gerrit.wikimedia.org/r/632904

Change 632904 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloudgw: refresh network config for the PoC

https://gerrit.wikimedia.org/r/632904

About vlan 2107, how many IPs do you need? If similar to the diagram a /31 should be enough.

It turns out we may need more after all.

I think neutron is being a bit of a smart-ass and prevents a setup in which we use the subnet broadcast address as the address of an actual host.

When defining the new subnet in neutron:

root@cloudcontrol2001-dev:~# openstack subnet create --network wan-transport-codfw --gateway 185.15.57.8 --no-dhcp --subnet-range 185.15.57.8/31 cloud-gw-transport-codfw
BadRequestException: 400: Client Error for url: http://openstack.codfw1dev.wikimediacloud.org:9696/v2.0/subnets, {"NeutronError": {"type": "InvalidInput", "message": "Invalid input for operation: Gateway is not valid on subnet.", "detail": ""}}

I checked the source code, and this is the validation function:

/usr/lib/python3/dist-packages/neutron/ipam/utils.py
def check_gateway_invalid_in_subnet(cidr, gateway):
    """Check whether the gw IP address is invalid on the subnet."""
    ip = netaddr.IPAddress(gateway)
    net = netaddr.IPNetwork(cidr)
    # Check whether the gw IP is in-valid on subnet.
    # If gateway is in the subnet, it cannot be the
    # 'network' or the 'broadcast address (only in IPv4)'.
    # If gateway is out of subnet, there is no way to
    # check since we don't have gateway's subnet cidr.
    return (ip in net and
            (net.version == constants.IP_VERSION_4 and
            ip in (net.network, net[-1])))

Manually reproducing this check:

>>> import netaddr
>>> net = netaddr.IPNetwork('185.15.57.8/31')
>>> ip = netaddr.IPAddress('185.15.57.8')
>>> ip in net and ip in (net.network, net[-1])
True

Whereas the old setup using a /29 works (the old setup is vlan 2120 cloud-instances-transport1-b-codfw 208.80.153.184/29):

>>> net = netaddr.IPNetwork('208.80.153.184/29')
>>> ip = netaddr.IPAddress('208.80.153.185')
>>> ip in net and ip in (net.network, net[-1])
False

Trying with a /30 seems to work:

>>> import netaddr
>>> net = netaddr.IPNetwork('185.15.57.8/30')
>>> ip = netaddr.IPAddress('185.15.57.9')
>>> ip in net and ip in (net.network, net[-1])
False

I'm willing to try finding a workaround for this, but perhaps the simplest and most elegant way to move forward is to just allocate a /30 instead of a /31.
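
The three manual checks above can be compared side by side. The following is a minimal sketch that mirrors the logic of the validation function using only the Python standard library `ipaddress` module instead of `netaddr`. The helper name `gateway_rejected` is hypothetical and this is not neutron's actual code.

```python
import ipaddress


def gateway_rejected(cidr, gateway):
    """Hypothetical stand-in for neutron's gateway validation.

    Returns True when the gateway is inside an IPv4 subnet and is
    either the network address or the last (broadcast) address.
    """
    net = ipaddress.ip_network(cidr)
    ip = ipaddress.ip_address(gateway)
    if ip not in net:
        # A gateway outside the subnet cannot be checked this way.
        return False
    return net.version == 4 and ip in (net.network_address, net[-1])


cases = [
    ("185.15.57.8/31", "185.15.57.8"),       # first address of the /31
    ("185.15.57.8/31", "185.15.57.9"),       # second address of the /31
    ("185.15.57.8/30", "185.15.57.9"),       # proposed /30
    ("208.80.153.184/29", "208.80.153.185"), # old setup on vlan 2120
]
for cidr, gateway in cases:
    print(cidr, gateway, "rejected" if gateway_rejected(cidr, gateway) else "accepted")
```

Under this logic a /31 has no gateway candidate at all, because its two addresses are exactly the network address and the last address of the subnet.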

Mentioned in SAL (#wikimedia-cloud) [2020-10-08T16:03:51Z] <arturo> [codfw1dev] briefly live-hacked python3-neutron source code in all 3 cloudcontrol2xxx-dev servers to workaround /31 network definition issue (T263622)

Moreover, if I work around the first validation, when I try to assign the address to the virtual router, I get:

root@cloudcontrol2001-dev:~# openstack router set --external-gateway wan-transport-codfw --fixed-ip subnet=cloud-gw-transport-codfw,ip-address=185.15.57.9 cloudinstances2b-gw
BadRequestException: 400: Client Error for url: http://openstack.codfw1dev.wikimediacloud.org:9696/v2.0/routers/5712e22e-134a-40d3-a75a-1c9b441717ad, {"NeutronError": {"type": "InvalidIpForSubnet", "message": "IP address 185.15.57.9 is not a valid IP for the specified subnet.", "detail": ""}}

I'm more convinced now that the /30 should do it.
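
To illustrate why a /30 would be enough, here is a rough sketch that counts the addresses left in each candidate subnet once the network and broadcast addresses are both excluded. That exclusion rule is an assumption inferred from the two errors above, not something confirmed from the neutron source, and the helper name `usable_addresses` is hypothetical.

```python
import ipaddress


def usable_addresses(cidr):
    """List addresses left after excluding network and broadcast.

    Assumption: both addresses are treated as unusable, even on a /31.
    """
    net = ipaddress.ip_network(cidr)
    excluded = {net.network_address, net.broadcast_address}
    return [str(ip) for ip in net if ip not in excluded]


for cidr in ("185.15.57.8/31", "185.15.57.8/30", "208.80.153.184/29"):
    addrs = usable_addresses(cidr)
    print(cidr, len(addrs), addrs)
```

With this rule the /31 leaves no usable address, while the /30 leaves 185.15.57.9 and 185.15.57.10, which is enough for the two ends of a point-to-point transport link.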

Mentioned in SAL (#wikimedia-cloud) [2020-10-08T16:17:09Z] <arturo> [codfw1dev] root@cloudcontrol2001-dev:~# openstack subnet create --network wan-transport-codfw --gateway 185.15.57.8 --no-dhcp --subnet-range 185.15.57.8/31 cloud-gw-transport-codfw (with a hack -- see task) (T263622)

Change 633147 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudgw: refresh CIDR for vlan 2107 - cloud-gw-transport-codfw

https://gerrit.wikimedia.org/r/633147

Change 633147 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloudgw: refresh CIDR for vlan 2107 - cloud-gw-transport-codfw

https://gerrit.wikimedia.org/r/633147

I believe everything here is done!