There are multiple requests for k8s hosting services; we can investigate how difficult it would be to provide this in wmcs.
Description
Details
Event Timeline
Magnum is an API and a conductor that interacts with Heat, which is itself another API and another conductor. At first glance I don't think these will be much harder to set up than Trove, but I need to confirm package availability.
Change 800868 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):
[operations/puppet@production] Rough in manifest and files for OpenStack Magnum
Change 800868 merged by Andrew Bogott:
[operations/puppet@production] Rough in manifest and files for OpenStack Magnum
Change 801011 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):
[operations/puppet@production] Magnum: use internal keystone url rather than admin
Change 801011 merged by Andrew Bogott:
[operations/puppet@production] Magnum: use internal keystone url rather than admin
Change 801012 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):
[operations/puppet@production] Magnum: add haproxy in codfw1dev
Change 801013 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):
[operations/puppet@production] Heat: include transport_url for the notification section
Change 801012 merged by Andrew Bogott:
[operations/puppet@production] Magnum: add haproxy in codfw1dev
Change 801013 merged by Andrew Bogott:
[operations/puppet@production] Heat: include transport_url for the notification section
Change 801014 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):
[operations/puppet@production] Magnum: move api listening port away from the haproxy port
Change 801014 merged by Andrew Bogott:
[operations/puppet@production] Magnum: move api listening port away from the haproxy port
Change 803567 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):
[operations/puppet@production] Openstack Keystone: support creation of additional domains
Change 803568 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):
[operations/puppet@production] Keystone: support config for arbitrary sql-based service domains
Change 803567 merged by Andrew Bogott:
[operations/puppet@production] Openstack Keystone: support creation of additional domains
Change 803568 merged by Andrew Bogott:
[operations/puppet@production] Keystone: Include config for 'magnum' service domain in codfw1dev
Change 803593 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):
[operations/puppet@production] OpenStack Magnum: use a service user 'magnum' as admin of 'magnum' domain
Change 803593 merged by Andrew Bogott:
[operations/puppet@production] OpenStack Magnum: use a service user 'magnum' as admin of 'magnum' domain
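For context, the 'magnum' domain and service user from the changes above map onto Magnum's [trust] settings. A minimal sketch of the relevant magnum.conf section, assuming the standard upstream option names (the values are placeholders, not the real codfw1dev configuration):
# /etc/magnum/magnum.conf (sketch only)
[trust]
trustee_domain_name = magnum
trustee_domain_admin_name = magnum
trustee_domain_admin_password = REDACTED
trustee_keystone_interface = public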
Just a few details about troubleshooting:
Magnum is mostly Heat, which means that the first thing to look at is 'openstack stack list':
# openstack stack list
+--------------------------------------+------------------------+---------+---------------+----------------------+--------------+
| ID                                   | Stack Name             | Project | Stack Status  | Creation Time        | Updated Time |
+--------------------------------------+------------------------+---------+---------------+----------------------+--------------+
| e34905e9-4716-4355-90cf-42bd41441554 | cluster21-isg2rijr265j | admin   | CREATE_FAILED | 2022-07-04T00:15:39Z | None         |
+--------------------------------------+------------------------+---------+---------------+----------------------+--------------+
And then 'openstack stack resource list e34905e9-4716-4355-90cf-42bd41441554':
root@cloudcontrol2001-dev:~# openstack stack resource list e34905e9-4716-4355-90cf-42bd41441554
+-------------------------------+--------------------------------------+------------------------------------------------------------------------------------+-----------------+----------------------+
| resource_name                 | physical_resource_id                 | resource_type                                                                        | resource_status | updated_time         |
+-------------------------------+--------------------------------------+------------------------------------------------------------------------------------+-----------------+----------------------+
| kube_cluster_deploy           |                                      | OS::Heat::SoftwareDeployment                                                         | INIT_COMPLETE   | 2022-07-04T00:15:39Z |
| kube_cluster_config           |                                      | OS::Heat::SoftwareConfig                                                             | INIT_COMPLETE   | 2022-07-04T00:15:39Z |
| secgroup_rule_tcp_kube_minion | d64a7dc1-f58c-46f3-b55f-2b305220d803 | OS::Neutron::SecurityGroupRule                                                       | CREATE_COMPLETE | 2022-07-04T00:15:39Z |
| secgroup_rule_udp_kube_minion | 3518834d-cd61-46da-a467-d23642c1b376 | OS::Neutron::SecurityGroupRule                                                       | CREATE_COMPLETE | 2022-07-04T00:15:39Z |
| kube_minions                  |                                      | OS::Heat::ResourceGroup                                                              | INIT_COMPLETE   | 2022-07-04T00:15:39Z |
| secgroup_kube_minion          | defde9ab-762d-4736-a14b-77840bac88d9 | OS::Neutron::SecurityGroup                                                           | CREATE_COMPLETE | 2022-07-04T00:15:39Z |
| etcd_address_lb_switch        |                                      | Magnum::ApiGatewaySwitcher                                                           | INIT_COMPLETE   | 2022-07-04T00:15:39Z |
| worker_nodes_server_group     | 9b0039af-59f4-4a1b-bd88-4e3075ceb6c3 | OS::Nova::ServerGroup                                                                | CREATE_COMPLETE | 2022-07-04T00:15:39Z |
| api_address_floating_switch   |                                      | Magnum::FloatingIPAddressSwitcher                                                    | INIT_COMPLETE   | 2022-07-04T00:15:39Z |
| api_address_lb_switch         |                                      | Magnum::ApiGatewaySwitcher                                                           | INIT_COMPLETE   | 2022-07-04T00:15:39Z |
| kube_masters                  | 58396b3e-17a8-4970-acd6-7b57d61b0b1e | OS::Heat::ResourceGroup                                                              | CREATE_FAILED   | 2022-07-04T00:15:39Z |
| master_nodes_server_group     | 2b93996b-5147-4e4f-94d0-5127f5d3210e | OS::Nova::ServerGroup                                                                | CREATE_COMPLETE | 2022-07-04T00:15:39Z |
| secgroup_kube_master          | 4f7779eb-64ec-4456-8ceb-5f220adbffdd | OS::Neutron::SecurityGroup                                                           | CREATE_COMPLETE | 2022-07-04T00:15:39Z |
| api_lb                        | 68f370b2-66e5-4e05-94c9-9851bff53e2b | file:///usr/lib/python3/dist-packages/magnum/drivers/common/templates/lb_api.yaml    | CREATE_COMPLETE | 2022-07-04T00:15:39Z |
| etcd_lb                       | c0c8fd4d-33b8-46f8-a09c-523645327a8b | file:///usr/lib/python3/dist-packages/magnum/drivers/common/templates/lb_etcd.yaml   | CREATE_COMPLETE | 2022-07-04T00:15:39Z |
| network                       | 7c976038-eeff-44be-87a9-362b69111cfe | file:///usr/lib/python3/dist-packages/magnum/drivers/common/templates/network.yaml   | CREATE_COMPLETE | 2022-07-04T00:15:39Z |
+-------------------------------+--------------------------------------+------------------------------------------------------------------------------------+-----------------+----------------------+
And then you can 'openstack stack resource show' the failed resource. That hasn't found me a solution yet, but it's at least narrowing things down.
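For example, using the stack ID and the failed resource name from the listing above (the -c flag just trims the output down to the failure reason):
openstack stack resource show e34905e9-4716-4355-90cf-42bd41441554 kube_masters -c resource_status_reason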
In this case the resource that's failing is a nested resource, with physical resource id of '58396b3e-17a8-4970-acd6-7b57d61b0b1e'. We can repeat our process and drill down:
root@cloudcontrol2001-dev:~# openstack stack resource list 58396b3e-17a8-4970-acd6-7b57d61b0b1e
+---------------+--------------------------------------+------------------------------------------------------------------------------------------------------+-----------------+----------------------+
| resource_name | physical_resource_id                 | resource_type                                                                                        | resource_status | updated_time         |
+---------------+--------------------------------------+------------------------------------------------------------------------------------------------------+-----------------+----------------------+
| 0             | f769a4cf-c77f-42b2-acda-a948cf654b70 | file:///usr/lib/python3/dist-packages/magnum/drivers/k8s_fedora_coreos_v1/templates/kubemaster.yaml | CREATE_FAILED   | 2022-07-04T00:15:45Z |
+---------------+--------------------------------------+------------------------------------------------------------------------------------------------------+-----------------+----------------------+
root@cloudcontrol2001-dev:~# openstack stack resource list f769a4cf-c77f-42b2-acda-a948cf654b70
+-------------------------------+--------------------------------------+---------------------------------------------------+-----------------+----------------------+
| resource_name                 | physical_resource_id                 | resource_type                                     | resource_status | updated_time         |
+-------------------------------+--------------------------------------+---------------------------------------------------+-----------------+----------------------+
| etcd_pool_member              | 1f2c3c1e-3496-4261-9255-d44011dafbc1 | Magnum::Optional::Neutron::LBaaS::PoolMember      | CREATE_COMPLETE | 2022-07-04T00:15:47Z |
| docker_volume_attach          |                                      | Magnum::Optional::Cinder::VolumeAttachment        | INIT_COMPLETE   | 2022-07-04T00:15:47Z |
| etcd_volume_attach            |                                      | Magnum::Optional::Etcd::VolumeAttachment          | INIT_COMPLETE   | 2022-07-04T00:15:47Z |
| master_config_deployment      |                                      | OS::Heat::SoftwareDeployment                      | INIT_COMPLETE   | 2022-07-04T00:15:47Z |
| master_config                 |                                      | OS::Heat::SoftwareConfig                          | INIT_COMPLETE   | 2022-07-04T00:15:47Z |
| docker_volume                 | 01fe6ef0-41c5-4a75-8af7-d586926d0da6 | Magnum::Optional::Cinder::Volume                  | CREATE_COMPLETE | 2022-07-04T00:15:47Z |
| etcd_volume                   | 66bca470-d08c-41ae-acff-40ec1c305113 | Magnum::Optional::Etcd::Volume                    | CREATE_COMPLETE | 2022-07-04T00:15:47Z |
| api_address_switch            |                                      | Magnum::ApiGatewaySwitcher                        | INIT_COMPLETE   | 2022-07-04T00:15:47Z |
| kube_master_floating          |                                      | Magnum::Optional::KubeMaster::Neutron::FloatingIP | INIT_COMPLETE   | 2022-07-04T00:15:47Z |
| upgrade_kubernetes_deployment |                                      | OS::Heat::SoftwareDeployment                      | INIT_COMPLETE   | 2022-07-04T00:15:47Z |
| upgrade_kubernetes            |                                      | OS::Heat::SoftwareConfig                          | INIT_COMPLETE   | 2022-07-04T00:15:47Z |
| kube-master                   |                                      | OS::Nova::Server                                  | CREATE_FAILED   | 2022-07-04T00:15:47Z |
| agent_config                  | 9bb76102-105f-484c-a1f3-6977264cc4d9 | OS::Heat::SoftwareConfig                          | CREATE_COMPLETE | 2022-07-04T00:15:47Z |
| api_pool_member               | 4626aef7-62d3-470b-a672-34791262f9f3 | Magnum::Optional::Neutron::LBaaS::PoolMember      | CREATE_COMPLETE | 2022-07-04T00:15:47Z |
| kube_master_eth0              | 1f234eb0-be4e-49a3-b3dc-1e1f46a3878c | OS::Neutron::Port                                 | CREATE_COMPLETE | 2022-07-04T00:15:47Z |
+-------------------------------+--------------------------------------+---------------------------------------------------+-----------------+----------------------+
So now we know what's failing to be created: a Nova server.
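To see why, a reasonable next step is to ask Heat for that resource's failure reason directly; a sketch using the nested-stack ID and resource name from the listing above:
openstack stack resource show f769a4cf-c77f-42b2-acda-a948cf654b70 kube-master -c resource_status_reason
# If a server had actually been created before failing, 'openstack server show <physical_resource_id>' would be the natural follow-up; here there is no physical_resource_id, so the reason string is all we have.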
FYI, the codfw setup is currently broken:
root@cloudcontrol2001-dev:~# neutron agent-list
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+--------------------------------------+--------------------+-------------------+-------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host              | availability_zone | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+-------------------+-------------------+-------+----------------+---------------------------+
| 06d461b8-b9ec-45a3-8c6e-ef56f22c721b | DHCP agent         | cloudnet2006-dev  | nova              | xxx   | True           | neutron-dhcp-agent        |
| 46573e30-a4f0-4424-84c5-e18d7a1d0902 | Linux bridge agent | cloudvirt2003-dev |                   | xxx   | True           | neutron-linuxbridge-agent |
| 4ce9e60e-797d-47db-8e60-5d01405799eb | L3 agent           | cloudnet2006-dev  | nova              | xxx   | True           | neutron-l3-agent          |
| 503a6978-1545-47e7-9272-8be3e1140825 | Metadata agent     | cloudnet2005-dev  |                   | xxx   | True           | neutron-metadata-agent    |
| 59bc1a4d-5bbe-4035-a1cc-5e9a0cc790b2 | DHCP agent         | cloudnet2005-dev  | nova              | xxx   | True           | neutron-dhcp-agent        |
| 73206678-6394-4d0e-9668-2c6cdf28b595 | Linux bridge agent | cloudvirt2002-dev |                   | xxx   | True           | neutron-linuxbridge-agent |
| 73361b68-276d-45a6-87a4-2b704a56dedb | L3 agent           | cloudnet2005-dev  | nova              | xxx   | True           | neutron-l3-agent          |
| 905782d2-fcd7-49ac-b499-8c068057c0a5 | Linux bridge agent | cloudnet2005-dev  |                   | xxx   | True           | neutron-linuxbridge-agent |
| 98f75540-ec40-4b32-be19-33dd3c24c5b5 | Linux bridge agent | cloudvirt2001-dev |                   | xxx   | True           | neutron-linuxbridge-agent |
| ac55fc68-6811-43eb-9d1c-f0a22f42eb18 | Metadata agent     | cloudnet2006-dev  |                   | :-)   | True           | neutron-metadata-agent    |
| e9cde754-b603-47c1-97b9-9ac2d74d043a | Linux bridge agent | cloudnet2006-dev  |                   | xxx   | True           | neutron-linuxbridge-agent |
+--------------------------------------+--------------------+-------------------+-------------------+-------+----------------+---------------------------+
I'll give getting it up again a go, but it's not clear to me yet what happened (and maybe I'm stepping on your toes).
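A minimal recovery sketch, assuming the agents marked 'xxx' are simply down and only need a restart (host/service mapping taken from the table above; verify on each host before running):
# on the cloudvirt hosts:
sudo systemctl restart neutron-linuxbridge-agent
# on the cloudnet hosts:
sudo systemctl restart neutron-linuxbridge-agent neutron-l3-agent neutron-dhcp-agent neutron-metadata-agent
# then re-check liveness from a cloudcontrol:
openstack network agent list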
Change 825676 had a related patch set uploaded (by Vivian Rook; author: Vivian Rook):
[operations/puppet@production] Allow cloud_provider_enabled
Change 842863 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):
[operations/puppet@production] Magnum: use magnum-specific rabbitmq user rather than the generic 'rabbit'
Change 842864 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):
[labs/private@master] Add dummy rabbitmq passwords for Magnum
Change 842864 merged by Andrew Bogott:
[labs/private@master] Add dummy rabbitmq passwords for Magnum
Change 842863 merged by Andrew Bogott:
[operations/puppet@production] Magnum: use magnum-specific rabbitmq user rather than the shared 'nova'
Change 842865 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):
[operations/puppet@production] Add OpenStack Magnum to eqiad1
Change 842865 merged by Andrew Bogott:
[operations/puppet@production] Add OpenStack Magnum to eqiad1
Change 842869 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):
[operations/puppet@production] Add haproxy entry for magnum on eqiad1
Change 842869 merged by Andrew Bogott:
[operations/puppet@production] Add haproxy entry for magnum on eqiad1
In order to deploy Magnum we're going to need a Fedora CoreOS image (the container-focused OS, not regular Fedora): https://getfedora.org/en/coreos. I suspect that if we ran something like
openstack image create Fedora-CoreOS-34 --file=fedora-coreos-34.20210518.3.0-openstack.x86_64.qcow2 --disk-format=qcow2 --container-format=bare --property os_distro='fedora-coreos' --public
anyone launching an instance would start seeing Fedora-CoreOS-34. As this image is not meant to be launched by anything but a Magnum cluster, I suspect this would cause some confusion. Can it be obscured?
Also I think we're going to need https://gerrit.wikimedia.org/r/c/operations/puppet/+/825676 before we can deploy beyond k8s 1.18
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/VM_images#Restricted_images is probably the most straightforward way. Historically we have used restricted images for a number of "weird" images that should only be seen by a few projects.
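For completeness, a hedged sketch of the share-with-specific-projects variant (image and project names here are just the ones used elsewhere in this task; the exact restricted-image procedure is the one documented at the link above):
# make the image non-public and grant access to a specific project
openstack image set --shared magnum-fedora-coreos-34
openstack image add project magnum-fedora-coreos-34 testlabs
# the receiving project may also need to accept the membership before the image appears in its catalog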
Thanks @bd808 that should get things moving.
Some notes on what has been done so far in prod:
wget https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/34.20210518.3.0/x86_64/fedora-coreos-34.20210518.3.0-openstack.x86_64.qcow2.xz
unxz fedora-coreos-34.20210518.3.0-openstack.x86_64.qcow2.xz
openstack image create magnum-fedora-coreos-34 --file=fedora-coreos-34.20210518.3.0-openstack.x86_64.qcow2 --disk-format=qcow2 --container-format=bare --property os_distro='fedora-coreos' --public
openstack image set --property visibility=shared --project testlabs magnum-fedora-coreos-34
openstack image set --activate magnum-fedora-coreos-34
openstack coe cluster template create core-34-k8s21-100g --image magnum-fedora-coreos-34 --external-network wan-transport-eqiad --fixed-network lan-flat-cloudinstances2b --fixed-subnet lan-flat-cloudinstances2b --dns-nameserver 8.8.8.8 --network-driver flannel --docker-storage-driver overlay2 --docker-volume-size 100 --master-flavor g3.cores1.ram2.disk20 --flavor g3.cores1.ram2.disk20 --coe kubernetes --labels kube_tag=v1.21.8-rancher1-linux-amd64,hyperkube_prefix=docker.io/rancher/,cloud_provider_enabled=true
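With that template in place, a user-side smoke test would look roughly like the following (the cluster name, node counts and kubectl step are illustrative, not something recorded in this task):
openstack coe cluster create test-cluster --cluster-template core-34-k8s21-100g --master-count 1 --node-count 2
# once the cluster reaches CREATE_COMPLETE, fetch a kubeconfig for it:
openstack coe cluster config test-cluster --dir .
export KUBECONFIG=$(pwd)/config
kubectl get nodes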
Change 825676 merged by Andrew Bogott:
[operations/puppet@production] Allow cloud_provider_enabled
Testlabs secgroup-rules quota bumped to 200 (from 100) as magnum was hitting a quota limit.
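For reference, the bump would have been done with something along these lines (assuming the standard openstack network quota flag):
openstack quota set --secgroup-rules 200 testlabs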