
Investigate Openstack Magnum
Closed, Resolved, Public

Description

There are multiple requests for k8s hosting services; we can investigate how difficult it would be to provide this in WMCS.

Event Timeline

Magnum is an API and a conductor that interacts with Heat which is, itself, another API and another conductor. At first glance I don't think these will be a lot harder to set up than Trove, but I need to confirm package availability.

The packages appear to be available. Next step would be to try it out.

Andrew changed the task status from Open to Stalled. May 14 2021, 2:12 PM
Andrew removed Andrew as the assignee of this task.
rook changed the task status from Stalled to In Progress. Jan 14 2022, 10:38 AM
rook claimed this task.

Change 800868 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Rough in manifest and files for OpenStack Magnum

https://gerrit.wikimedia.org/r/800868

Change 800868 merged by Andrew Bogott:

[operations/puppet@production] Rough in manifest and files for OpenStack Magnum

https://gerrit.wikimedia.org/r/800868

Change 801011 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Magnum: use internal keystone url rather than admin

https://gerrit.wikimedia.org/r/801011

Change 801011 merged by Andrew Bogott:

[operations/puppet@production] Magnum: use internal keystone url rather than admin

https://gerrit.wikimedia.org/r/801011

Change 801012 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Magnum: add haproxy in codfw1dev

https://gerrit.wikimedia.org/r/801012

Change 801013 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Heat: include transport_url for the notification section

https://gerrit.wikimedia.org/r/801013

Change 801012 merged by Andrew Bogott:

[operations/puppet@production] Magnum: add haproxy in codfw1dev

https://gerrit.wikimedia.org/r/801012

Change 801013 merged by Andrew Bogott:

[operations/puppet@production] Heat: include transport_url for the notification section

https://gerrit.wikimedia.org/r/801013

Change 801014 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Magnum: move api listening port away from the haproxy port

https://gerrit.wikimedia.org/r/801014

Change 801014 merged by Andrew Bogott:

[operations/puppet@production] Magnum: move api listening port away from the haproxy port

https://gerrit.wikimedia.org/r/801014

Change 803567 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Openstack Keystone: support creation of additional domains

https://gerrit.wikimedia.org/r/803567

Change 803568 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Keystone: support config for arbitrary sql-based service domains

https://gerrit.wikimedia.org/r/803568

Change 803567 merged by Andrew Bogott:

[operations/puppet@production] Openstack Keystone: support creation of additional domains

https://gerrit.wikimedia.org/r/803567

Change 803568 merged by Andrew Bogott:

[operations/puppet@production] Keystone: Include config for 'magnum' service domain in codfw1dev

https://gerrit.wikimedia.org/r/803568

Change 803593 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] OpenStack Magnum: use a service user 'magnum' as admin of 'magnum' domain

https://gerrit.wikimedia.org/r/803593

Change 803593 merged by Andrew Bogott:

[operations/puppet@production] OpenStack Magnum: use a service user 'magnum' as admin of 'magnum' domain

https://gerrit.wikimedia.org/r/803593

Just a few details about troubleshooting:

Magnum is mostly Heat under the hood, which means that the first thing to look at is 'openstack stack list':

# openstack stack list
+--------------------------------------+------------------------+---------+---------------+----------------------+--------------+
| ID                                   | Stack Name             | Project | Stack Status  | Creation Time        | Updated Time |
+--------------------------------------+------------------------+---------+---------------+----------------------+--------------+
| e34905e9-4716-4355-90cf-42bd41441554 | cluster21-isg2rijr265j | admin   | CREATE_FAILED | 2022-07-04T00:15:39Z | None         |
+--------------------------------------+------------------------+---------+---------------+----------------------+--------------+

And then 'openstack stack resource list e34905e9-4716-4355-90cf-42bd41441554':

root@cloudcontrol2001-dev:~# openstack stack resource list e34905e9-4716-4355-90cf-42bd41441554 
+-------------------------------+--------------------------------------+------------------------------------------------------------------------------------+-----------------+----------------------+
| resource_name                 | physical_resource_id                 | resource_type                                                                      | resource_status | updated_time         |
+-------------------------------+--------------------------------------+------------------------------------------------------------------------------------+-----------------+----------------------+
| kube_cluster_deploy           |                                      | OS::Heat::SoftwareDeployment                                                       | INIT_COMPLETE   | 2022-07-04T00:15:39Z |
| kube_cluster_config           |                                      | OS::Heat::SoftwareConfig                                                           | INIT_COMPLETE   | 2022-07-04T00:15:39Z |
| secgroup_rule_tcp_kube_minion | d64a7dc1-f58c-46f3-b55f-2b305220d803 | OS::Neutron::SecurityGroupRule                                                     | CREATE_COMPLETE | 2022-07-04T00:15:39Z |
| secgroup_rule_udp_kube_minion | 3518834d-cd61-46da-a467-d23642c1b376 | OS::Neutron::SecurityGroupRule                                                     | CREATE_COMPLETE | 2022-07-04T00:15:39Z |
| kube_minions                  |                                      | OS::Heat::ResourceGroup                                                            | INIT_COMPLETE   | 2022-07-04T00:15:39Z |
| secgroup_kube_minion          | defde9ab-762d-4736-a14b-77840bac88d9 | OS::Neutron::SecurityGroup                                                         | CREATE_COMPLETE | 2022-07-04T00:15:39Z |
| etcd_address_lb_switch        |                                      | Magnum::ApiGatewaySwitcher                                                         | INIT_COMPLETE   | 2022-07-04T00:15:39Z |
| worker_nodes_server_group     | 9b0039af-59f4-4a1b-bd88-4e3075ceb6c3 | OS::Nova::ServerGroup                                                              | CREATE_COMPLETE | 2022-07-04T00:15:39Z |
| api_address_floating_switch   |                                      | Magnum::FloatingIPAddressSwitcher                                                  | INIT_COMPLETE   | 2022-07-04T00:15:39Z |
| api_address_lb_switch         |                                      | Magnum::ApiGatewaySwitcher                                                         | INIT_COMPLETE   | 2022-07-04T00:15:39Z |
| kube_masters                  | 58396b3e-17a8-4970-acd6-7b57d61b0b1e | OS::Heat::ResourceGroup                                                            | CREATE_FAILED   | 2022-07-04T00:15:39Z |
| master_nodes_server_group     | 2b93996b-5147-4e4f-94d0-5127f5d3210e | OS::Nova::ServerGroup                                                              | CREATE_COMPLETE | 2022-07-04T00:15:39Z |
| secgroup_kube_master          | 4f7779eb-64ec-4456-8ceb-5f220adbffdd | OS::Neutron::SecurityGroup                                                         | CREATE_COMPLETE | 2022-07-04T00:15:39Z |
| api_lb                        | 68f370b2-66e5-4e05-94c9-9851bff53e2b | file:///usr/lib/python3/dist-packages/magnum/drivers/common/templates/lb_api.yaml  | CREATE_COMPLETE | 2022-07-04T00:15:39Z |
| etcd_lb                       | c0c8fd4d-33b8-46f8-a09c-523645327a8b | file:///usr/lib/python3/dist-packages/magnum/drivers/common/templates/lb_etcd.yaml | CREATE_COMPLETE | 2022-07-04T00:15:39Z |
| network                       | 7c976038-eeff-44be-87a9-362b69111cfe | file:///usr/lib/python3/dist-packages/magnum/drivers/common/templates/network.yaml | CREATE_COMPLETE | 2022-07-04T00:15:39Z |
+-------------------------------+--------------------------------------+------------------------------------------------------------------------------------+-----------------+----------------------+

And then you can 'openstack stack resource show' the failed resource. That hasn't led me to a solution yet, but it at least narrows things down.

In this case the resource that's failing is a nested resource, with physical resource id of '58396b3e-17a8-4970-acd6-7b57d61b0b1e'. We can repeat our process and drill down:

root@cloudcontrol2001-dev:~# openstack stack resource list 58396b3e-17a8-4970-acd6-7b57d61b0b1e
+---------------+--------------------------------------+-----------------------------------------------------------------------------------------------------+-----------------+----------------------+
| resource_name | physical_resource_id                 | resource_type                                                                                       | resource_status | updated_time         |
+---------------+--------------------------------------+-----------------------------------------------------------------------------------------------------+-----------------+----------------------+
| 0             | f769a4cf-c77f-42b2-acda-a948cf654b70 | file:///usr/lib/python3/dist-packages/magnum/drivers/k8s_fedora_coreos_v1/templates/kubemaster.yaml | CREATE_FAILED   | 2022-07-04T00:15:45Z |
+---------------+--------------------------------------+-----------------------------------------------------------------------------------------------------+-----------------+----------------------+
root@cloudcontrol2001-dev:~# openstack stack resource list f769a4cf-c77f-42b2-acda-a948cf654b70
+-------------------------------+--------------------------------------+---------------------------------------------------+-----------------+----------------------+
| resource_name                 | physical_resource_id                 | resource_type                                     | resource_status | updated_time         |
+-------------------------------+--------------------------------------+---------------------------------------------------+-----------------+----------------------+
| etcd_pool_member              | 1f2c3c1e-3496-4261-9255-d44011dafbc1 | Magnum::Optional::Neutron::LBaaS::PoolMember      | CREATE_COMPLETE | 2022-07-04T00:15:47Z |
| docker_volume_attach          |                                      | Magnum::Optional::Cinder::VolumeAttachment        | INIT_COMPLETE   | 2022-07-04T00:15:47Z |
| etcd_volume_attach            |                                      | Magnum::Optional::Etcd::VolumeAttachment          | INIT_COMPLETE   | 2022-07-04T00:15:47Z |
| master_config_deployment      |                                      | OS::Heat::SoftwareDeployment                      | INIT_COMPLETE   | 2022-07-04T00:15:47Z |
| master_config                 |                                      | OS::Heat::SoftwareConfig                          | INIT_COMPLETE   | 2022-07-04T00:15:47Z |
| docker_volume                 | 01fe6ef0-41c5-4a75-8af7-d586926d0da6 | Magnum::Optional::Cinder::Volume                  | CREATE_COMPLETE | 2022-07-04T00:15:47Z |
| etcd_volume                   | 66bca470-d08c-41ae-acff-40ec1c305113 | Magnum::Optional::Etcd::Volume                    | CREATE_COMPLETE | 2022-07-04T00:15:47Z |
| api_address_switch            |                                      | Magnum::ApiGatewaySwitcher                        | INIT_COMPLETE   | 2022-07-04T00:15:47Z |
| kube_master_floating          |                                      | Magnum::Optional::KubeMaster::Neutron::FloatingIP | INIT_COMPLETE   | 2022-07-04T00:15:47Z |
| upgrade_kubernetes_deployment |                                      | OS::Heat::SoftwareDeployment                      | INIT_COMPLETE   | 2022-07-04T00:15:47Z |
| upgrade_kubernetes            |                                      | OS::Heat::SoftwareConfig                          | INIT_COMPLETE   | 2022-07-04T00:15:47Z |
| kube-master                   |                                      | OS::Nova::Server                                  | CREATE_FAILED   | 2022-07-04T00:15:47Z |
| agent_config                  | 9bb76102-105f-484c-a1f3-6977264cc4d9 | OS::Heat::SoftwareConfig                          | CREATE_COMPLETE | 2022-07-04T00:15:47Z |
| api_pool_member               | 4626aef7-62d3-470b-a672-34791262f9f3 | Magnum::Optional::Neutron::LBaaS::PoolMember      | CREATE_COMPLETE | 2022-07-04T00:15:47Z |
| kube_master_eth0              | 1f234eb0-be4e-49a3-b3dc-1e1f46a3878c | OS::Neutron::Port                                 | CREATE_COMPLETE | 2022-07-04T00:15:47Z |
+-------------------------------+--------------------------------------+---------------------------------------------------+-----------------+----------------------+

So now we know what is failing to get created: a Nova server.
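The drill-down above can be scripted. The following is a minimal sketch, assuming the default table output of 'openstack stack resource list' as pasted in this task. The helper names failed_rows and drill_down are made up for illustration. The rule for deciding which failed resources are nested stacks (a physical resource id plus a type of OS::Heat::ResourceGroup or a file:// template) is a heuristic read off the tables above, not something Heat guarantees.

```shell
# Sketch only: helper names are hypothetical, not part of any OpenStack tooling.

# Read an `openstack stack resource list` table on stdin and print
# "name|physical_id|type" for every row whose status ends in _FAILED.
failed_rows() {
  awk -F'|' '
    NF >= 6 {
      # Trim the padding around the four columns we care about.
      for (i = 2; i <= 5; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
      if ($5 ~ /_FAILED$/) print $2 "|" $3 "|" $4
    }'
}

# Print the chain of failed resources below stack $1, indented by depth.
# Assumption: a failed ResourceGroup or file:// template that has a
# physical resource id is a nested stack and can be listed in turn.
drill_down() {
  openstack stack resource list "$1" < /dev/null | failed_rows |
  while IFS='|' read -r name phys type; do
    printf '%s%s (%s)\n' "${2:-}" "$name" "$type"
    case "$type" in
      OS::Heat::ResourceGroup|file://*)
        if [ -n "$phys" ]; then
          drill_down "$phys" "${2:-}  "
        fi
        ;;
    esac
  done
}

# Example (stack ID taken from the listing above):
#   drill_down e34905e9-4716-4355-90cf-42bd41441554
```

If the installed client supports them, 'openstack stack failures list <stack>' and the '--nested-depth' option of 'openstack stack resource list' may get to the same answer with less typing.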

FYI, the codfw setup is currently broken:

root@cloudcontrol2001-dev:~# neutron agent-list
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+--------------------------------------+--------------------+-------------------+-------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host              | availability_zone | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+-------------------+-------------------+-------+----------------+---------------------------+
| 06d461b8-b9ec-45a3-8c6e-ef56f22c721b | DHCP agent         | cloudnet2006-dev  | nova              | xxx   | True           | neutron-dhcp-agent        |
| 46573e30-a4f0-4424-84c5-e18d7a1d0902 | Linux bridge agent | cloudvirt2003-dev |                   | xxx   | True           | neutron-linuxbridge-agent |
| 4ce9e60e-797d-47db-8e60-5d01405799eb | L3 agent           | cloudnet2006-dev  | nova              | xxx   | True           | neutron-l3-agent          |
| 503a6978-1545-47e7-9272-8be3e1140825 | Metadata agent     | cloudnet2005-dev  |                   | xxx   | True           | neutron-metadata-agent    |
| 59bc1a4d-5bbe-4035-a1cc-5e9a0cc790b2 | DHCP agent         | cloudnet2005-dev  | nova              | xxx   | True           | neutron-dhcp-agent        |
| 73206678-6394-4d0e-9668-2c6cdf28b595 | Linux bridge agent | cloudvirt2002-dev |                   | xxx   | True           | neutron-linuxbridge-agent |
| 73361b68-276d-45a6-87a4-2b704a56dedb | L3 agent           | cloudnet2005-dev  | nova              | xxx   | True           | neutron-l3-agent          |
| 905782d2-fcd7-49ac-b499-8c068057c0a5 | Linux bridge agent | cloudnet2005-dev  |                   | xxx   | True           | neutron-linuxbridge-agent |
| 98f75540-ec40-4b32-be19-33dd3c24c5b5 | Linux bridge agent | cloudvirt2001-dev |                   | xxx   | True           | neutron-linuxbridge-agent |
| ac55fc68-6811-43eb-9d1c-f0a22f42eb18 | Metadata agent     | cloudnet2006-dev  |                   | :-)   | True           | neutron-metadata-agent    |
| e9cde754-b603-47c1-97b9-9ac2d74d043a | Linux bridge agent | cloudnet2006-dev  |                   | xxx   | True           | neutron-linuxbridge-agent |
+--------------------------------------+--------------------+-------------------+-------------------+-------+----------------+---------------------------+

I'll try to get it back up again, but it's not yet clear to me what happened (I may be stepping on your toes).
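For a table this size it can help to print only the agents that are not reporting alive. A rough sketch, assuming the column layout of the 'neutron agent-list' output pasted above; dead_agents is a made-up helper name. The deprecation notice points at the openstack CLI ('openstack network agent list' is the equivalent listing), whose columns may differ, so the field numbers would need checking before reusing this there.

```shell
# Hypothetical helper: read a `neutron agent-list` table on stdin and print
# "host binary" for every agent whose alive column is not ":-)".
dead_agents() {
  awk -F'|' '
    NF >= 8 {
      # Columns: $2 id, $3 agent_type, $4 host, $5 availability_zone,
      #          $6 alive, $7 admin_state_up, $8 binary
      for (i = 2; i <= 8; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
      # Skip the header row and agents that report alive.
      if ($6 != "alive" && $6 != ":-)") print $4, $8
    }'
}

# Example:
#   neutron agent-list | dead_agents | sort
```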

Change 825676 had a related patch set uploaded (by Vivian Rook; author: Vivian Rook):

[operations/puppet@production] Allow cloud_provider_enabled

https://gerrit.wikimedia.org/r/825676

Change 842863 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Magnum: use magnum-specific rabbitmq user rather than the generic 'rabbit'

https://gerrit.wikimedia.org/r/842863

Change 842864 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[labs/private@master] Add dummy rabbitmq passwords for Magnum

https://gerrit.wikimedia.org/r/842864

Change 842864 merged by Andrew Bogott:

[labs/private@master] Add dummy rabbitmq passwords for Magnum

https://gerrit.wikimedia.org/r/842864

Change 842863 merged by Andrew Bogott:

[operations/puppet@production] Magnum: use magnum-specific rabbitmq user rather than the shared 'nova'

https://gerrit.wikimedia.org/r/842863

Change 842865 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Add OpenStack Magnum to eqiad1

https://gerrit.wikimedia.org/r/842865

Change 842865 merged by Andrew Bogott:

[operations/puppet@production] Add OpenStack Magnum to eqiad1

https://gerrit.wikimedia.org/r/842865

Change 842869 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Add haproxy entry for magnum on eqiad1

https://gerrit.wikimedia.org/r/842869

Change 842869 merged by Andrew Bogott:

[operations/puppet@production] Add haproxy entry for magnum on eqiad1

https://gerrit.wikimedia.org/r/842869

In order to deploy Magnum we're going to need a Fedora CoreOS image (the container-focused OS, not plain Fedora): https://getfedora.org/en/coreos. I suspect that if we ran something like
openstack image create Fedora-CoreOS-34 --file=fedora-coreos-34.20210518.3.0-openstack.x86_64.qcow2 --disk-format=qcow2 --container-format=bare --property os_distro='fedora-coreos' --public
anyone launching an image would start seeing Fedora-CoreOS-34, as this image is not meant to be launched by anything but a magnum cluster, I suspect this would cause some confusion. Can it be obscured?

Also I think we're going to need https://gerrit.wikimedia.org/r/c/operations/puppet/+/825676 before we can deploy beyond k8s 1.18

"anyone launching an image would start seeing Fedora-CoreOS-34, as this image is not meant to be launched by anything but a magnum cluster, I suspect this would cause some confusion. Can it be obscured?"

https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/VM_images#Restricted_images is probably the most straightforward way. Historically we have used restricted images for a number of "weird" images that should only be seen by a few projects.
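For reference, a sketch of what restricting an image to one project can look like with generic Glance image sharing. This is an assumption about the general mechanism, not a copy of the wikitech procedure, and share_image_with_project is a made-up helper name. It uses 'openstack image set --shared' and 'openstack image add project'.

```shell
# Hypothetical helper, not the wikitech procedure verbatim.
# $1 = image name or ID, $2 = project that should be able to use it.
share_image_with_project() {
  # Switch the image to "shared" visibility so only members can use it ...
  openstack image set --shared "$1" &&
  # ... then add the project as a member of the image.
  openstack image add project "$1" "$2"
}

# Example:
#   share_image_with_project magnum-fedora-coreos-34 testlabs
```

As far as I know, a newly added member starts out in "pending" state, so the image may not show up in that project's image list until the membership is accepted, though it should still be usable by ID.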

Thanks @bd808, that should get things moving.

Some notes on what has been done so far in prod:

wget https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/34.20210518.3.0/x86_64/fedora-coreos-34.20210518.3.0-openstack.x86_64.qcow2.xz
unxz fedora-coreos-34.20210518.3.0-openstack.x86_64.qcow2.xz
openstack image create magnum-fedora-coreos-34 --file=fedora-coreos-34.20210518.3.0-openstack.x86_64.qcow2 --disk-format=qcow2 --container-format=bare --property os_distro='fedora-coreos' --public
openstack image set --property visibility=shared --project testlabs magnum-fedora-coreos-34
openstack image set --activate magnum-fedora-coreos-34
openstack coe cluster template create core-34-k8s21-100g --image magnum-fedora-coreos-34 --external-network wan-transport-eqiad --fixed-network lan-flat-cloudinstances2b --fixed-subnet lan-flat-cloudinstances2b --dns-nameserver 8.8.8.8 --network-driver flannel --docker-storage-driver overlay2 --docker-volume-size 100 --master-flavor g3.cores1.ram2.disk20 --flavor g3.cores1.ram2.disk20 --coe kubernetes --labels kube_tag=v1.21.8-rancher1-linux-amd64,hyperkube_prefix=docker.io/rancher/,cloud_provider_enabled=true

Change 825676 merged by Andrew Bogott:

[operations/puppet@production] Allow cloud_provider_enabled

https://gerrit.wikimedia.org/r/825676

Testlabs secgroup-rules quota bumped to 200 (from 100) as magnum was hitting a quota limit.

rook changed the status of subtask T321222: Storage for testlabs from Open to In Progress. Oct 20 2022, 10:13 AM