
Openstack Magnum network setup
Closed, Resolved · Public

Description

Magnum likes to give out floating IPs to access a cluster. This is how we tested in codfw1dev. This wouldn't work for us as an offering, so we're going to try another route.

Namely creating an internal subnet for magnum to use.

If the network is not external enough the cluster will fail to build and complain. According to https://docs.openstack.org/magnum/latest/user/#clustertemplate we need the following on the network for it to be accepted:
‘router:external’ must be ‘True’
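For reference, a rough sketch of how a network normally gets that attribute (the network name below is purely illustrative, not something that exists in our deployment):

openstack network create magnum-external --external
# or flag an existing network as external:
openstack network set --external magnum-external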

https://docs.openstack.org/magnum/latest/user/#network-for-vms
Might suggest that this is all a fool's errand

Event Timeline

If I understand correctly here, our options are:

  1. Just point Magnum to our existing floating-ip pool, and regard this as a good use of our still-fairly-abundant ipv4s
  2. Create a new floating IP pool with whatever flags Magnum expects, but use IPs in the same subnet as lan-flat-cloudinstances2b

Option 1 is obvious, and not necessarily terrible; it would make our use case more consistent with other public clouds. Right now 82 out of 127 IPv4s are allocated to projects, but 25 or so are not actually in use and could probably be clawed back.

To explain option 2:

My hope is that we don't need floating IPs for our Magnum clusters at all, because people can tunnel/proxy/whatever to the cluster via a bastion. I wouldn't expect the upstream docs to take that into account, since the idea of a cloud-wide bastion is a weird, wmcs-only thing that no one else does.

That said, if each cluster is on a private subnet then we DO need an IP for ingress that's on our internal-but-cloud-wide network so that other VMs (e.g. bastions) in cloud-vps can get into that subnet at all.

I don't really feel like option 1 is serviceable, unless we only wanted to keep this as an on-request-only kind of thing. I would expect a cloud provider to give me a URL that points to my cluster, rather than an IP. This would be used for kubectl access, but not service access. For kubectl access I see no reason that it has to be publicly available; going through the bastion would be fine.

As for accessing a service: I suspect the assumed way to do this is with a k8s LoadBalancer, which I assume would require something like Octavia to function. If we want to go this route, that seems fine. If not, we can probably use keepalived, though that isn't load balancing, so one node will end up with all the traffic; I'm not sure if we have any projects where this would become a problem. We try to get around this today with "ingress" nodes for k8s, which aren't really a thing; they're just there to absorb traffic, since we don't distribute it across a group of nodes.

https://docs.openstack.org/magnum/latest/user/#network-for-vms implies that, in option 1, we might not have to engage in networking silliness to get clusters deployed, though it doesn't explain how, so far as I understand it.

No idea if it helps, but searching around there seems to be a 'floating_ip_enabled' parameter for the cluster template: https://opendev.org/openstack/magnum/src/commit/cd113dfc0c026b2a94e3e347f2ed4c114e5478d1/api-ref/source/clustertemplates.inc#L58

I'm afraid this is a rather typical openstack setback :-( it wasn't designed for our specific use case or setup.

Option #1 could work in the short term, if we wanted to quickly unblock the project and get magnum working. It won't last long, though, because it means we will only be able to create 25 k8s clusters :-P

Option #2 has several implications: it may mean deploying openstack octavia, as Vivian mentioned, and even worse, it involves introducing Neutron tenant networks, something we've been explicitly delaying.

I think we should perhaps explore an option #3 (to see if that even exists). On a quick read, it seems the cluster template accepts some fixed-network and fixed-subnet parameters. I'm not sure of the semantics of those, however.

No idea if it helps, but searching around there seems to be a 'floating_ip_enabled' parameter for the cluster template: https://opendev.org/openstack/magnum/src/commit/cd113dfc0c026b2a94e3e347f2ed4c114e5478d1/api-ref/source/clustertemplates.inc#L58

I saw that, and have been tinkering with it. The docs only seem to mention it in reference to master_lb_floating_ip_enabled, which is a label; they are not explicit on whether floating_ip_enabled is a label or not. At any rate my tinkering with it has, thus far, still resulted in clusters giving out floating IPs.

I'm afraid this is a rather typical openstack setback :-( it wasn't designed for our specific use case or setup.

What was it designed for? What I was hoping for, at least, was a k8s cluster with some way to access it.

Option #1 could work in the short term, if we wanted to quickly unblock the project and get magnum working. It won't last long, though, because it means we will only be able to create 25 k8s clusters :-P

Better yet, it likes to give the worker nodes floating IPs too, so it won't even get us to 25 clusters.

I think we should perhaps explore an option #3 (to see if that even exists). On a quick read, it seems the cluster template accepts some fixed-network and fixed-subnet parameters. I'm not sure of the semantics of those, however.

This is mostly the direction that things have been going. Those two seem to select which network, and redundantly which subnet, it will give out internal IPs from.

The template build command looks like this:

openstack coe cluster template create core-34-k8s21-100g-no-ip \
--image magnum-fedora-coreos-34 \
--external-network wan-transport-eqiad \
--fixed-network lan-flat-cloudinstances2b \
--fixed-subnet cloud-instances2-b-eqiad \
--dns-nameserver 8.8.8.8 --network-driver flannel \
--docker-storage-driver overlay2 --docker-volume-size 100 \
--master-flavor g3.cores1.ram2.disk20 --flavor g3.cores1.ram2.disk20 \
--coe kubernetes \
--labels kube_tag=v1.21.8-rancher1-linux-amd64,hyperkube_prefix=docker.io/rancher/,cloud_provider_enabled=true \
--public

There is a tempting --external-network option, but leaving it off will give an error that it is required.

Meanwhile setting external-network to lan-flat-cloudinstances2b gives:
Unable to find external network lan-flat-cloudinstances2b (HTTP 400) (Request-ID: req-95610e3a-52db-4992-87b0-33aa44a04546)
I'm not clear on whether this is because lan-flat-cloudinstances2b lacks the attribute ‘router:external’ set to ‘True’ (https://docs.openstack.org/magnum/latest/user/#clustertemplate).
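A quick way to double-check that attribute (the field name may vary slightly by client version):

openstack network show lan-flat-cloudinstances2b -c "router:external"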

The last bit of the above is what this ticket is meant for: to set up a subnet with private IPs whose network has the attribute ‘router:external’ set to ‘True’, and see if that allows magnum to build a template and cluster with just private IPs. Assuming it does, we can go from there to figure out how to route to it.

Mentioned in SAL (#wikimedia-cloud) [2022-10-25T16:03:37Z] <arturo> [codfw1dev] T321220 root@cloudcontrol2001-dev:~# openstack subnet create magnum --no-dhcp --network 57017d7c-3817-429a-8aa3-b028de82cdcc --ip-version 4 --gateway auto --subnet-range 192.168.0.0/24

For the sake of research / PoC, I created a 192.168.0.0/24 subnet in codfw1dev attached to the external network.

Try with this (in codfw1dev):

$ openstack coe cluster template create \
--external-network wan-transport-codfw \
--fixed-subnet magnum \
[...]

I would try leaving the other network parameters at their defaults (specifically --fixed-network) and see what happens.

I'm afraid this is a rather typical openstack setback :-( it wasn't designed for our specific use case or setup.

What was it designed for? What I was hoping for, at least, was a k8s cluster with some way to access it.

I mean, they assume we have stuff in our setup that we don't have, or that requires us to bend the setup in unexpected ways. This already happened with openstack manila, with cinder backups, with openstack trove...

Take this for example:

--external-network <external-network>

    The name or network ID of a Neutron network to provide connectivity to the external internet for the cluster. This network must be an external network, i.e. its attribute ‘router:external’ must be ‘True’. The servers in the cluster will be connected to a private network and Magnum will create a router between this private network and the external network. This will allow the servers to download images, access discovery service, etc, and the containers to install packages, etc. In the opposite direction, floating IP’s will be allocated from the external network to provide access from the external internet to servers and the container services hosted in the cluster. This is a mandatory parameter and there is no default value.

I'm not sure this will work in our edge network without changes to the egress NAT etc. But I may be overly defensive. Let's hope for the best!

I'd be happy to schedule a bit more time for additional research on this if the magnum subnet that I created doesn't work.

For the sake of research / PoC, I created a 192.168.0.0/24 subnet in codfw1dev attached to the external network.

Try with this (in codfw1dev):

$ openstack coe cluster template create \
--external-network wan-transport-codfw \
--fixed-subnet magnum \
[...]

I would try leaving the other network parameters at their defaults (specifically --fixed-network) and see what happens.

Not too much luck with this. We can create the template but not the cluster itself. Eventually the stack fails with:
Resource CREATE failed: ServiceUnavailable: resources.network.resources.private_network: Unable to create the network. No tenant network is available for allocation.
I'm guessing that neutron is not allowing/configured to do this?

https://docs.openstack.org/magnum/latest/user/#networking suggests:
If not specified, a new Neutron private network will be created

The template used:

openstack coe cluster template create core-34-k8s21-100g-no-ip \
--image Fedora-CoreOS-34 \
--external-network wan-transport-codfw \
--fixed-subnet magnum \
--dns-nameserver 8.8.8.8 --network-driver flannel \
--docker-storage-driver overlay2 --docker-volume-size 100 \
--master-flavor g2.cores1.ram2.disk20 --flavor g2.cores1.ram2.disk20 \
--coe kubernetes \
--labels kube_tag=v1.21.8-rancher1-linux-amd64,hyperkube_prefix=docker.io/rancher/,cloud_provider_enabled=true \
--public

Not too much luck with this. We can create the template but not the cluster itself. Eventually the stack fails with:
Resource CREATE failed: ServiceUnavailable: resources.network.resources.private_network: Unable to create the network. No tenant network is available for allocation.
I'm guessing that neutron is not allowing/configured to do this?

Yes :-( Our neutron setup does not allow tenant networks.

Enabling them may involve rethinking how we do virtual networking cloud-wide. But we should do it. Having tenant networks has been one of our goals with neutron for years. The only reason we don't have them yet is that we haven't had the human-power.
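For context, a generic upstream-style sketch of what allowing tenant networks means at the ml2 plugin level; this is not our actual configuration, just the shape of the change (a self-service segmentation type plus a range to allocate from):

[ml2]
type_drivers = flat,vxlan
tenant_network_types = vxlan

[ml2_type_vxlan]
vni_ranges = 1:1000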

Can someone catch me up about why we're talking about tenant subnets? Is this something we need in eqiad that we didn't need in codfw1dev? Or /are/ tenant subnets enabled in codfw1dev?

@Andrew we're trying to figure out how we can get magnum installed in a way that does not require providing it with floating IP addresses. Arturo set up a subnet in codfw1dev to try this, though the result of the template above was the cluster wanting a tenant network, which isn't configured anywhere, so far as I know.

Adding the fixed-network back in ultimately errors out with:

Resource CREATE failed: ResourceInError: resources.kube_masters.resources[0].resources.kube-master: Went to status ERROR due to "Message: Exceeded maximum number of retries. Exceeded max scheduling attempts 3 for instance 334ac559-d298-4b7d-b99e-cd1cecf1b141. Last exception: Binding failed for port 85d8ac28-65f8-4a70-96a6-68029530d81e, please check neutron logs for more information., Code: 500

openstack coe cluster template create core-34-k8s21-100g-no-ip-fixed \
--image Fedora-CoreOS-34 \
--external-network wan-transport-codfw \
--fixed-subnet magnum \
--fixed-network wan-transport-codfw \
--dns-nameserver 8.8.8.8 --network-driver flannel \
--docker-storage-driver overlay2 --docker-volume-size 100 \
--master-flavor g2.cores1.ram2.disk20 --flavor g2.cores1.ram2.disk20 \
--coe kubernetes \
--labels kube_tag=v1.21.8-rancher1-linux-amd64,hyperkube_prefix=docker.io/rancher/,cloud_provider_enabled=true \
--public

That one is interesting. I would like to check the neutron logs and see what happened. It seems we passed most of the logic validation, so we may now be hitting some other misconfiguration somewhere?

That one is interesting. I would like to check the neutron logs and see what happened. It seems we passed most of the logic validation, so we may now be hitting some other misconfiguration somewhere?

I was able to rescue the neutron error message:

Failed to bind port 85d8ac28-65f8-4a70-96a6-68029530d81e on host cloudvirt2002-dev for vnic_type normal using segments [{'id': '980ffcbe-fd1c-480d-b8cd-b0990d249a72', 'network_type': 'flat', 'physical_network': 'br-external', 'segmentation_id': None, 'network_id': '57017d7c-3817-429a-8aa3-b028de82cdcc'}]

Not very meaningful. We don't have br-external on cloudvirts; that's something on the cloudnet servers.
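One way to double-check where that segment definition comes from (hedged; the exact field names may vary by client version) is to read the provider attributes of the network from the error:

openstack network show 57017d7c-3817-429a-8aa3-b028de82cdcc -c provider:network_type -c provider:physical_network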

Not sure what will happen with it, but it looks like floating_ip_enabled doesn't go in as a label, but rather as a command line option; --floating_ip_enabled false might do something at cluster template creation. Though at the moment the same template in dev has stopped using floating IPs regardless: the cluster doesn't finish building, but it doesn't have a floating IP.
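If I'm reading the client right (an assumption worth verifying with openstack coe cluster template create --help), the flag is spelled with dashes and takes no value, something like:

openstack coe cluster template create core-34-k8s21-100g-no-fip \
--floating-ip-disabled \
[...]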

For anyone interested, more notes are at https://etherpad.wikimedia.org/p/xena-dev-magnum-install and https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Devstack_magnum/Stable_xena

As requested, a summary of how cluster deployment was attempted in codfw1dev. The template was created as follows:

openstack coe cluster template create core-34-k8s21-100g \
--image Fedora-CoreOS-34 \
--external-network wan-transport-codfw \
--fixed-network 05a5494a-184f-4d5c-9e98-77ae61c56daa \
--fixed-subnet cloud-instances2-b-codfw \
--dns-nameserver 8.8.8.8 --network-driver flannel \
--docker-storage-driver overlay2 --docker-volume-size 100 \
--master-flavor g2.cores1.ram2.disk20 --flavor g2.cores1.ram2.disk20 \
--coe kubernetes \
--labels kube_tag=v1.21.8-rancher1-linux-amd64,hyperkube_prefix=docker.io/rancher/,cloud_provider_enabled=true

Cluster is deployed with:
openstack coe cluster create network-test4 --cluster-template core-34-k8s21-100g --master-count 1 --node-count 1 --keypair rookskey
This creates a cluster using internal IP addresses. It is unclear why this is the case, as this previously created a cluster with floating IP addresses. The cluster, however, does not complete: the control VM builds but does not have internet access and thus fails. One can access the control node at 172.16.128.101 from the codfw1dev bastion; the user is 'core', using my codfw1dev test key (rookskey) in my homedir.

I traced a ping from the VM in the neutron virtual router:

10:23:32.436709 qr-21e10025-d4 In  IP 172.16.128.101 > 8.8.8.8: ICMP echo request, id 8, seq 1, length 64
10:23:32.436730 qg-1290224c-b1 Out IP 192.168.0.236 > 8.8.8.8: ICMP echo request, id 63141, seq 1, length 64

10:32:40.116592 qr-21e10025-d4 In  IP 172.16.128.101 > 8.8.8.8: ICMP echo request, id 9, seq 1, length 64
10:32:40.116624 qg-1290224c-b1 Out IP 192.168.0.236 > 8.8.8.8: ICMP echo request, id 36128, seq 1, length 64

You can see the packet from the VM 172.16.128.101 is NATed to 192.168.0.236.
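For the record, roughly how a trace like this can be reproduced on the cloudnet host (the router namespace name below is a placeholder):

ip netns list | grep qrouter
ip netns exec qrouter-<router-id> tcpdump -ni any icmp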

This NAT address is from the magnum-dedicated subnet we created earlier:

aborrero@cloudcontrol2001-dev:~ $ sudo wmcs-openstack subnet show 94c22e79-bd19-43d6-89da-e7a9e7b07e1d
+----------------------+--------------------------------------+
| Field                | Value                                |
+----------------------+--------------------------------------+
| allocation_pools     | 192.168.0.2-192.168.0.254            |
| cidr                 | 192.168.0.0/24                       |
| created_at           | 2022-10-25T16:02:54Z                 |
| description          |                                      |
| dns_nameservers      |                                      |
| dns_publish_fixed_ip | None                                 |
| enable_dhcp          | False                                |
| gateway_ip           | 192.168.0.1                          |
| host_routes          |                                      |
| id                   | 94c22e79-bd19-43d6-89da-e7a9e7b07e1d |
| ip_version           | 4                                    |
| ipv6_address_mode    | None                                 |
| ipv6_ra_mode         | None                                 |
| name                 | magnum                               |
| network_id           | 57017d7c-3817-429a-8aa3-b028de82cdcc |
| project_id           | admin                                |
| revision_number      | 0                                    |
| segment_id           | None                                 |
| service_types        |                                      |
| subnetpool_id        | None                                 |
| tags                 |                                      |
| updated_at           | 2022-10-25T16:02:54Z                 |
+----------------------+--------------------------------------+

I think this subnet is being used as a 'floating IP' pool, which is exactly what we were trying. Please note that no NAT was set up for this address, so the traffic is dropped in cloudgw as expected.

However, from the command in the previous comment, I'm not sure where that was even specified at cluster template creation time?

In general I think we're moving in the right direction. Once it is clear how we can control which subnets are being used by magnum, we can then choose the right addressing and set up the required NATs.
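To make that concrete: the missing piece would be something along these lines on the cloudgw side. This is a generic nftables-style sketch, assuming a nat table/postrouting chain and an uplink interface already exist there; it is not our actual ruleset, and the interface name is illustrative:

nft add rule ip nat postrouting ip saddr 192.168.0.0/24 oifname eno2 masquerade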

Ah yes! You added a subnet! This may explain much! In particular why when I ran the same thing I got a different result. Both the 'magnum' and the 'cloud-codfw1dev-floating' subnet segments are part of the wan-transport-codfw network, which is fed in as part of the template (above) for a magnum cluster. As such where previously there were only floating IPs on that network, now there are some local IPs as well, and magnum found them, randomly? Can we set the magnum subnet such that traffic coming from it can get out to the internet?

Ah yes! You added a subnet! This may explain much! In particular why when I ran the same thing I got a different result. Both the 'magnum' and the 'cloud-codfw1dev-floating' subnet segments are part of the wan-transport-codfw network, which is fed in as part of the template (above) for a magnum cluster. As such where previously there were only floating IPs on that network, now there are some local IPs as well, and magnum found them, randomly? Can we set the magnum subnet such that traffic coming from it can get out to the internet?

Yes, we can route the magnum-dedicated subnet. But first we would need to understand how magnum is using/selecting it. We need a "deterministic" setup for it to work consistently (especially if we will later do the same in eqiad1).

Also, I used kind of an arbitrary CIDR 192.168.0.0/24 for testing purposes. We should probably go with 172.16.x.x once we understand what's going on with the subnet.

Note to self: the VM gets a normal IP address PLUS the 192.168.x.x address as floating IP.

So, I think action items are:

  • get to an understanding of why/how magnum uses the custom-created magnum subnet. This cannot be random. We need this to be deterministic.
  • allocate a proper CIDR in the 172.16.x.x range and see how to hook it into our edge network. The cloudgw boxes may need to know about this new CIDR.
  • the cluster templating and magnum VMs seem to belong to the admin project. We need to better shape this for multi-tenancy purposes. Would the subnet thingy work when using an arbitrary project? Mind that the admin project has some special privileges. I suggest creating a project magnum-tests or something, to emulate a standard user project, and doing all tests there.

Yes, we can route the magnum-dedicated subnet. But first we would need to understand how magnum is using/selecting it. We need a "deterministic" setup for it to work consistently (especially if we will later do the same in eqiad1).

Empirical, not theoretical, evidence points to magnum using anything on the network specified. Can we set up a magnum network/subnet pair rather than a magnum subnet that is part of another network?
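Something like this is what I mean; all names and the CIDR are placeholders, and whether magnum would accept such a network is exactly the open question:

openstack network create magnum --external
openstack subnet create magnum-v4 --network magnum --ip-version 4 --subnet-range 172.16.200.0/24 --no-dhcp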

Also, I used kind of an arbitrary CIDR 192.168.0.0/24 for testing purposes. We should probably go with 172.16.x.x once we understand what's going on with the subnet.

Yes, though for codfw1dev a /24 is probably fine

Note to self: the VM gets a normal IP address PLUS the 192.168.x.x address as floating IP.

Yes, this is what I've observed previously. One is for internal communication among nodes if I understand correctly.

  • the cluster templating and magnum VMs seem to belong to the admin project. We need to better shape this for multi-tenancy purposes. Would the subnet thingy work when using an arbitrary project? Mind that the admin project has some special privileges. I suggest creating a project magnum-tests or something, to emulate a standard user project, and doing all tests there.

So far the only place that I've run into a problem with this has been in eqiad1. Adding a --public flag to the template at creation gets around it. There might be weird things that can happen, however, if we cannot restrict who can make templates: unchecked abilities could lead to things like making an arbitrary template that uses floating IPs.

Change 853374 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/homer/public@master] cr-cloud: enable openstack heat API TCP port

https://gerrit.wikimedia.org/r/853374

Change 853374 merged by Arturo Borrero Gonzalez:

[operations/homer/public@master] cr-cloud: enable openstack heat API TCP port

https://gerrit.wikimedia.org/r/853374

Change 853947 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/homer/public@master] cr-cloud: enable openstack magnum API TCP port

https://gerrit.wikimedia.org/r/853947

aborrero renamed this task from "Subnet for magnum" to "Openstack Magnum network setup". Nov 7 2022, 11:31 AM
aborrero triaged this task as Medium priority.
aborrero moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

Change 853947 merged by Arturo Borrero Gonzalez:

[operations/homer/public@master] cr-cloud: enable openstack magnum API TCP port

https://gerrit.wikimedia.org/r/853947

After opening the required ports in the firewall, magnum can create a working kubernetes cluster:

aborrero@cloudcontrol2001-dev:~ $ sudo wmcs-openstack coe cluster show a828c5f2-cde0-4ae7-88f0-0dfec184cf00
+----------------------+------------------------------------------------------------+
| Field                | Value                                                      |
+----------------------+------------------------------------------------------------+
| status               | CREATE_COMPLETE                                            |
| health_status        | UNKNOWN                                                    |
| cluster_template_id  | dec2e4cf-596e-4867-a88a-d99e571a2ada                       |
| node_addresses       | ['172.16.128.65']                                          |
| uuid                 | a828c5f2-cde0-4ae7-88f0-0dfec184cf00                       |
| stack_id             | 868583b6-67fb-4571-a6e6-e36b47013fe8                       |
| status_reason        | None                                                       |
| created_at           | 2022-11-07T11:39:19+00:00                                  |
| updated_at           | 2022-11-07T11:48:35+00:00                                  |
| coe_version          | v1.18.16                                                   |
| labels               | {}                                                         |
| labels_overridden    | {}                                                         |
| labels_skipped       | {}                                                         |
| labels_added         | {}                                                         |
| fixed_network        | lan-flat-cloudinstances2b                                  |
| fixed_subnet         | cloud-instances2-b-codfw                                   |
| floating_ip_enabled  | False                                                      |
| faults               |                                                            |
| keypair              | rookskey                                                   |
| api_address          | https://172.16.128.188:6443                                |
| master_addresses     | ['172.16.128.188']                                         |
| master_lb_enabled    | False                                                      |
| create_timeout       | 60                                                         |
| node_count           | 1                                                          |
| discovery_url        | https://discovery.etcd.io/3ec56601fd767c3ccc331843cc9536cd |
| docker_volume_size   | None                                                       |
| master_count         | 1                                                          |
| container_version    | 1.12.6                                                     |
| name                 | arturo-test-2                                              |
| master_flavor_id     | g2.cores1.ram2.disk20                                      |
| flavor_id            | g2.cores1.ram2.disk20                                      |
| health_status_reason | {'api': 'The cluster arturo-test-2 is not accessible.'}    |
| project_id           | admin                                                      |
+----------------------+------------------------------------------------------------+
aborrero@cloudcontrol2001-dev:~ $ sudo wmcs-openstack server list --all-project | grep arturo
| 0e290d47-32fa-4494-8167-debe8a2d2eef | arturo-test-2-audyvoa5ourm-node-0            | ACTIVE  | lan-flat-cloudinstances2b=172.16.128.65                 | Fedora-CoreOS-34                                                                                                                 | g2.cores1.ram2.disk20 |
| dcb80633-2b81-4564-a659-1a295c4a7b28 | arturo-test-2-audyvoa5ourm-master-0          | ACTIVE  | lan-flat-cloudinstances2b=172.16.128.188                | Fedora-CoreOS-34                                                                                                                 | g2.cores1.ram2.disk20 |

Then, inside the virtual machines:

[root@arturo-test-2-audyvoa5ourm-master-0 core]# kubectl get namespaces
NAME              STATUS   AGE
default           Active   5m7s
kube-node-lease   Active   5m9s
kube-public       Active   5m9s
kube-system       Active   5m9s
[root@arturo-test-2-audyvoa5ourm-master-0 core]# kubectl get nodes
NAME                                  STATUS   ROLES    AGE     VERSION
arturo-test-2-audyvoa5ourm-master-0   Ready    master   5m50s   v1.23.3
arturo-test-2-audyvoa5ourm-node-0     Ready    <none>   104s    v1.23.3
aborrero@bastion-codfw1dev-02:~$ curl https://172.16.128.188:6443  -k
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {},
  "code": 403
}

I think magnum reports the k8s API as down because it is not accessible from the internet (from the magnum API itself). For that we need to go back and re-think the floating IP model, or follow the upstream recommendation and introduce Openstack Octavia LBaaS.

For those interested, this can also be accessed via the bastion by using the config generated with
openstack coe cluster config arturo-test-2 --dir arturo-test-2
which ends up in /root/arturo-test-2/config.
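e.g. (illustrative; assumes the config file is available wherever kubectl is run):

export KUBECONFIG=/root/arturo-test-2/config
kubectl get nodes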

I think magnum reports the k8s API as down because it is not accessible from the internet (from the magnum API itself).

Will https://gerrit.wikimedia.org/r/c/operations/puppet/+/854092 help with that, or are we talking about access in the other direction?

I think magnum reports the k8s API as down because it is not accessible from the internet (from the magnum API itself).

Will https://gerrit.wikimedia.org/r/c/operations/puppet/+/854092 help with that, or are we talking about access in the other direction?

I don't think it will :-( I suspect the network flow is in the other direction.

The upstream docs suggest running Openstack Octavia as the load balancing solution for Magnum-created kubernetes clusters. We have been discussing it over IRC.

However, I see a problem that octavia wouldn't solve: IPv4 scarcity. Octavia would be used to proxy the private IPv4 of the k8s API plus any other k8s Service resource of type LoadBalancer that users create in their clusters. But we don't have enough IPv4 addresses for that either.

Moreover, we already have a shared proxy setup in our cloud: the nova-proxy project. I'm now wondering if we could somehow fork this nova-proxy into a shared k8s-proxy that we can hook magnum into. This hook would need to happen early in the cluster creation process so k8s certs are generated for the public FQDN, imagine something like api.mycluster.k8s-proxy.eqiad1.wmcloud.org.

In general I think we have 3 options here:

Option 1: explore a novaproxy-like solution

  • This solution is elegant and means all k8s clusters would share the same public IPv4.
  • We create and maintain a shared proxy layer for all magnum clusters, in a similar fashion to what we do today with novaproxy
  • We may have a proxy for the k8s-API and another for any Service of type LoadBalancer. So 2 proxies.
  • This won't be trivial: we may need to create our own LB driver for magnum to use at cluster creation time AND/OR an octavia-like API implementation. Is this realistic from the human-power POV?
  • BUT, perhaps we can hack our way directly inside the openstack heat engine that magnum is using, and hook the creation of the entry in our k8s-proxy in there.

Option 2: let k8s cluster owners deploy their own external proxy

  • Exactly what we do in tools and paws etc.
  • This option doesn't require us to code or engineer any driver or the like. Just good docs on how to do it. We could easily automate part of the deployment via puppet. Further automation may feel like we're reinventing Openstack Octavia (see option #3).
  • The main challenge here is TLS for the k8s API: the FQDN needs to be known at cluster creation time for x509 certs to be generated correctly for that FQDN (see the sketch after this list).
  • Users needing to expose the cluster to the internet will need a floating IP. We can handle that via standard quota requests.
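A minimal sketch of what the per-cluster proxy in option #2 could look like, assuming an haproxy-based proxy VM holding a floating IP (names and addresses are illustrative). TCP passthrough keeps TLS end-to-end, but kubectl would still refuse the connection unless the public FQDN was baked into the API server cert at cluster creation time, which is the challenge mentioned above:

frontend k8s_api
    bind *:6443
    mode tcp
    default_backend k8s_masters

backend k8s_masters
    mode tcp
    server master0 172.16.128.188:6443 check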

Option 3: explore openstack octavia

  • this may or may not fit into how we do things, main questions being:
    • How much control can we have over octavia LBs?
    • Can a single LB be shared across different projects so we keep public IPv4 usage low? (Kind of like option #1.)
    • Would it work just like an automation of option #2?
    • Can octavia LBs integrate with designate, so public FQDNs are under our control?

Perhaps the root question we should ask ourselves at this point is what level of services we want to offer and how much more effort we want to put in here.

Worth noting that if we don't do any of this, openstack magnum would still work; the clusters just won't be accessible from the internet until one of the options is implemented.

See also:

My opinion is that we should decide on one of the options before moving forward with magnum. I'd go with either option #1 or #2 (#2 implying less work in the short term).

I'll let @rook and possibly @nskaggs decide what next.

I set up something like option 1 in AWS, though only on the backend: I set up a lambda that would add newly autoscaled nodes to a target group, so that I could integrate that growing and shrinking target group into an existing LB that had other things in it. This is basically what we would need here. Every cluster would need some kind of watcher that pays attention to what worker nodes there are and adds them all to a pool that is part of a larger LB; in our case we might be able to get away with a static watcher (though it would be kind of ugly). If static, the "watcher" would just be a list of servers. If dynamic, it would be a script ideally triggered by a scale-out (scale-in as well? I suspect whatever LB we used would identify and could probably drop nodes that stop responding), though in our case we could perhaps get away with just running it occasionally; it probably wouldn't be an expensive script, so it could just run every minute. The LB in front of all of this would then just have a list, which also ideally would be dynamic but could be static:
cluster1.cloud
cluster2.cloud
awesomecluster.cloud
And all the traffic intended for a given DNS entry would be directed to the pool of workers for that cluster. It wouldn't matter which worker the request got to; k8s would deal with the rest once it reaches one of the workers.
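For the "watcher" part, the node list can come straight from magnum, so the periodic script could be close to a one-liner (a hedged sketch; cluster name taken from the earlier test):

openstack coe cluster show network-test4 -c node_addresses -f value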

Option 2 already exists. It has limitations: as I understand it, it isn't really load balancing but live failover. I don't know if we have any high-traffic projects that would experience a problem in this regard, as we would end up with a one-node network bottleneck, but it might not matter. It also requires puppet knowledge, in particular of how we set up puppet, to use. This raises the barrier to entry to an extent.

I'm sorry I don't have an opinion.

Well, I do have an opinion, though it is outside of the three paths suggested above. On that front I have no real opinion.

In my view we should not deploy magnum. We are not spec'd for it. Magnum does not go into our infra cleanly, and we would be stretching ourselves thinner trying to support it. I've been working on this for about ten months and have been trying to get it into shape for maybe 8. That this is still an open project at this point says something, to me, about whether we should continue trying to introduce it.
Additionally I find magnum itself to be underwhelming. It is somewhat behind on both the k8s and Fedora CoreOS versions, and I find the magnum documentation to be of poor quality. While some of this is less about magnum itself and more about how it fits (or doesn't) into our openstack deploy, magnum takes a lot of tinkering to get it to do anything. I'm not left with a sense of confidence when I'm digging through source, or on underlying k8s nodes, to figure out what is going on.