
Research Openstack Deployment Paradigms
Closed, ResolvedPublic

Description

Our current openstack deployment utilizes a simple flat networking setup that depends on a technology now seemingly 'deprecated' / 'experimental' in upstream Openstack (T326373: Neutron linuxbridge 'experimental' in Zed). It also prevents us from adopting useful features such as tenant networking (T270694: CloudVPS: introduce tenant networks) and Octavia (load balancers), and causes issues or limitations when adopting new features, for example T321220: Openstack Magnum network setup.

Therefore, as noted in T326373#8508661, we should plan a migration off a linuxbridge agent. To do so, we need to redeploy Openstack. Given the age of our Openstack deployment, and the fact we do not currently have a repeatable deployment (that is, the ability to redeploy our cloud on demand), let's explore current methods utilized by upstream and other cloud operators.
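For context, a quick way to confirm which L2 agent a deployment is running (a sketch only; it assumes admin credentials are already loaded in the shell):

# List neutron agents; a linuxbridge-based deployment shows
# "Linux bridge agent" entries per hypervisor.
openstack network agent list | grep -i 'linux bridge'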

Consider:

Others welcome! See also: https://www.openstack.org/software/project-navigator/deployment-tools

Event Timeline

nskaggs updated the task description.

Started evaluating kolla-ansible, found a blocker with nova-compute not starting due to it being unable to auth to libvirtd, and opened an upstream bug https://bugs.launchpad.net/kolla-ansible/+bug/2004579
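For anyone reproducing this, a rough way to inspect the failure under kolla-ansible (container names are kolla-ansible's defaults; the exact log strings will vary):

# nova-compute and libvirtd run in separate containers under kolla-ansible.
docker ps --filter name=nova          # is nova_compute restart-looping?
docker logs --tail 50 nova_compute    # look for the libvirt auth error
docker logs --tail 50 nova_libvirt    # the libvirtd side of the handshake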

aborrero triaged this task as High priority. Feb 6 2023, 1:40 PM

Tried running kolla-ansible locally on my laptop with a KVM/libvirt VM but found (and reported upstream) two bugs (not related to openstack this time):

> Started evaluating kolla-ansible, found a blocker with nova-compute not starting due to it being unable to auth to libvirtd, and opened an upstream bug https://bugs.launchpad.net/kolla-ansible/+bug/2004579

I can't reproduce this problem on a VM on my laptop. There is some kind of bad interaction when running kolla-ansible inside Cloud VPS :-(

  • zed @ Cloud VPS: libvirt <-> nova-compute auth failure
  • yoga @ Cloud VPS: libvirt <-> nova-compute auth failure
  • zed @ my laptop: all fine regarding libvirt & nova-compute

Have you checked whether the image you use for the VM is the same? Maybe the one on Cloud VPS is missing something (user setup, etc.).

News just in: TripleO is being discontinued upstream.

> Started evaluating kolla-ansible, found a blocker with nova-compute not starting due to it being unable to auth to libvirtd, and opened an upstream bug https://bugs.launchpad.net/kolla-ansible/+bug/2004579
>
> I can't reproduce this problem on a VM on my laptop. There is some kind of bad interaction when running kolla-ansible inside Cloud VPS :-(
>
>   • zed @ Cloud VPS: libvirt <-> nova-compute auth failure
>   • yoga @ Cloud VPS: libvirt <-> nova-compute auth failure
>   • zed @ my laptop: all fine regarding libvirt & nova-compute

I found the problem. We're affected by https://bugs.launchpad.net/kolla-ansible/+bug/1989791 The bug I reported is a duplicate, but I was unaware that the hostname was the root problem.

The workaround is simple. In /etc/hosts:

--- hosts_old	2023-02-10 12:22:40.241881291 +0000
+++ hosts_new	2023-02-10 12:22:05.537866714 +0000
@@ -5,7 +5,7 @@
 # b.) change or remove the value of 'manage_etc_hosts' in
 #     /etc/cloud/cloud.cfg or cloud-config from user-data
 #
-172.16.3.208 kolla-test.testlabs.eqiad1.wikimedia.cloud kolla-test
+172.16.3.208 kolla-test kolla-test.testlabs.eqiad1.wikimedia.cloud
 127.0.0.1 localhost
 
 # The following lines are desirable for IPv6 capable hosts

Before the change:

root@kolla-test:~# hostname -f
kolla-test.testlabs.eqiad1.wikimedia.cloud

After the change:

root@kolla-test:~# hostname -f
kolla-test

With the change libvirt and nova-compute daemons can communicate.

This /etc/hosts file is most likely generated by us via cloud-init.
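If that's the case, the workaround could be made persistent by telling cloud-init to stop managing the file, as the generated header itself suggests (an untested sketch; the drop-in filename is arbitrary):

# Stop cloud-init from regenerating /etc/hosts on reboot, so the manual
# reordering survives. The drop-in name is made up; any .cfg file works.
echo 'manage_etc_hosts: false' > /etc/cloud/cloud.cfg.d/99-keep-etc-hosts.cfg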

Change 888222 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloud: openstack: add kolla-ansible evaluation recipe

https://gerrit.wikimedia.org/r/888222

Change 888222 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloud: openstack: add kolla-ansible evaluation recipe

https://gerrit.wikimedia.org/r/888222

NOTE: you can now evaluate openstack @ Cloud VPS yourself using kolla-ansible

Steps are:

  • start from a VM with the puppet role role::wmcs::openstack::kolla_ansible_evaluation applied
  • then run wmcs-kolla-ansible-evaluation.sh

Wait and enjoy.
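For reference, the wrapper presumably automates something close to the standard upstream all-in-one flow (this is a sketch of the stock kolla-ansible quickstart, not the exact contents of the script):

# Stock kolla-ansible all-in-one deploy, per the upstream quickstart.
kolla-genpwd                                     # generate /etc/kolla/passwords.yml
kolla-ansible -i ./all-in-one bootstrap-servers  # prepare the host
kolla-ansible -i ./all-in-one prechecks          # sanity checks
kolla-ansible -i ./all-in-one deploy             # deploy the containers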

Note: by default the openstack-ansible all-in-one setup renames the VM hostname, introducing severe drift with respect to the rest of the Cloud VPS context (puppet, etc.), which makes it difficult to operate inside Cloud VPS for evaluation & testing purposes. I will investigate next whether this renaming can be disabled.

> Note: by default the openstack-ansible all-in-one setup renames the VM hostname, introducing severe drift with respect to the rest of the Cloud VPS context (puppet, etc.), which makes it difficult to operate inside Cloud VPS for evaluation & testing purposes. I will investigate next whether this renaming can be disabled.

I know now why this happens.

The task is guarded by a prepare-hostname tag, which in theory makes it possible to skip it with ansible-playbook --skip-tags prepare-hostname; however:

  • the wrapper script bootstrap-aio.sh doesn't seem to have an option to pass the --skip-tags argument through to the inner ansible-playbook call. We would need to patch that ourselves in the script, or invoke ansible-playbook directly (see the sketch after this list). No big deal, but it means carrying a patch at this early stage...
  • there is a comment in the code hinting that the target hostname aio1 is expected per the inventory.
  • I understand the use case for the hard hostname, since they are using this AIO mode to test openstack itself, and they need strict control of all the environment.
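For illustration, bypassing the wrapper might look roughly like this; the playbook path is an assumption based on what upstream's bootstrap-aio.sh calls, so verify it in your checkout before relying on it:

# Hypothetical direct invocation, skipping the hostname-rename task.
# The playbook path below is assumed; double-check it locally.
cd /opt/openstack-ansible
ansible-playbook tests/bootstrap-aio.yml --skip-tags prepare-hostname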

Therefore I don't see an easy way to play with openstack-ansible in AIO mode within Cloud VPS VMs.
I'll run another test, not using the AIO mode and will see if things behave differently.

> Therefore I don't see an easy way to play with openstack-ansible in AIO mode within Cloud VPS VMs.

To be clear, the AIO mode works just great. The problem with the hostname rename is that it effectively removes the VM from puppet control, which makes me nervous.

Another option is to make sure we're using an aio1 hostname from the beginning. Will try that!
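For reference, the rough idea would be something like this on a fresh VM (an untested sketch; the domain-drift concern above still applies):

# Pre-set the hostname so openstack-ansible's rename becomes a no-op.
OLD_NAME=$(hostname)
hostnamectl set-hostname aio1
sed -i "s/${OLD_NAME}/aio1/g" /etc/hosts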

Change 895789 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: create openstack-ansible evaluation role

https://gerrit.wikimedia.org/r/895789

> Another option is to make sure we're using an aio1 hostname from the beginning. Will try that!

Still no luck. After a while, openstack-ansible tries to ssh using the aio1 name rather than localhost, which clashes with the SSH configuration we have for the host via puppet:

TASK [Ensure python is installed] ********************************************
fatal: [aio1]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Warning: Permanently added '172.29.236.100' (ECDSA) to the list of known hosts.\r\nroot@172.29.236.100: Permission denied (publickey).", "unreachable": true}

The hostname rename also involves changing the domain (from whatever.eqiad1.wikimedia.cloud to aio1.openstack.local), meaning the puppet agent certificate gets invalidated.
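For the record, one untested avenue would be to make sure root's own key stays authorized locally, since 172.29.236.100 is the host's own br-mgmt address and puppet may be rewriting the SSH configuration:

# Untested: let root ssh to itself on the br-mgmt address. If puppet
# manages authorized_keys, it may well revert this on the next agent run.
test -f /root/.ssh/id_rsa || ssh-keygen -t rsa -N '' -f /root/.ssh/id_rsa
cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys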

I'm starting to doubt whether it is worth continuing to try to run openstack-ansible inside Cloud VPS.

Change 895789 abandoned by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: create openstack-ansible evaluation role

Reason:

Not merging this as the script doesn't work as expected.

https://gerrit.wikimedia.org/r/895789

We should also evaluate https://docs.airshipit.org/

Regarding Airship: the community has been very quiet for a while now. There is mainly maintenance activity going on for the components running in production at a few companies, carried out by only a handful of people.
Probably not the level of activity/adoption/support we're looking for.

After the research was completed, my recommendation was to try going with kolla-ansible.

However, I recently attended KubeCon EU, and now I'm inclined to re-evaluate my conclusion and suggest openstack-helm instead.
I think that having an undercloud kubernetes deployment as bedrock would be very beneficial, even if we have to pay the price of the extra abstraction layer (k8s):

  • I believe openstack-helm deploys everything inside a k8s namespace. We could simply have an IaaS deployment per kubernetes namespace, one of them being the Cloud VPS service, while opening the door for others in other namespaces (see the sketch after this list).
  • we already have puppet code to install a kubernetes cluster on bare metal (kubeadm, which is what toolforge uses).
  • we could leverage lessons learned from toolforge lima-kilo and develop a local development cloud from day 0 based on kubernetes. Imagine running Cloud VPS on your laptop for testing / devel purposes.
  • the architecture is cloud-native, and in that regard it should be better suited for the next 10 years of industry evolution.
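For illustration, the per-namespace idea might look roughly like this (a sketch only: the chart name comes from the upstream openstack-helm repo, while the namespace and release names here are made up):

# Build and install one openstack-helm chart into a dedicated namespace.
# "make keystone" packages the chart with its helm-toolkit dependency.
git clone https://opendev.org/openstack/openstack-helm.git
cd openstack-helm
make keystone
kubectl create namespace cloudvps
helm upgrade --install keystone ./keystone --namespace cloudvps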