CloudVPS: upgrade: jessie -> stretch & mitaka -> newton
Open, Normal, Public

Description

We need to upgrade our Cloud VPS infra:

  • from Debian Jessie to Debian Stretch
  • from Mitaka to Newton

Since there is no way to install Newton on Jessie, we need to try installing Mitaka on Stretch and then upgrading Mitaka -> Newton.

All Mitaka packages are in the jessie-backports Debian repository, so we may try the trick of using that repo on Stretch.
My proposal is to try something like this (a rough sketch of the apt setup follows the list):

  1. Image a server with Debian Stretch
  2. Enable the jessie-backports repo
  3. Install Mitaka packages from that repo
  4. Upgrade Mitaka (jessie-backports) to Newton (stretch)
  5. Clean up any remaining jessie-backports packages and use only what's provided in stretch
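
To make steps 2-4 concrete, here is a rough sketch of the apt setup involved (mirror URL and pin priority are illustrative, not our actual config):

# enable jessie-backports on a Stretch host
echo 'deb http://mirrors.wikimedia.org/debian/ jessie-backports main' \
    > /etc/apt/sources.list.d/jessie-backports.list

# pin the repo low so nothing is pulled from it implicitly
cat > /etc/apt/preferences.d/jessie-backports <<'EOF'
Package: *
Pin: release a=jessie-backports
Pin-Priority: 100
EOF

apt-get update
# backports are never selected automatically; request the release explicitly
apt-get install -t jessie-backports nova-compute

Step 4 should then be a regular apt-get dist-upgrade, once Stretch's Newton packages are newer than the jessie-backports Mitaka ones.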

PS: There is a small summary of the versioning matrix in T169099#4676842

aborrero created this task. Wed, Dec 19, 1:38 PM
Restricted Application added a subscriber: Aklapper. Wed, Dec 19, 1:38 PM
aborrero triaged this task as Normal priority. Wed, Dec 19, 1:38 PM
aborrero moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.
aborrero updated the task description.
GTirloni added a subscriber: GTirloni. (Edited) Wed, Dec 19, 2:29 PM

If we are going to upgrade to Newton anyway, should we worry about whether the base OS is Jessie or Stretch? I feel like going to Stretch is secondary and that the Mitaka/Newton effort is going to be the more problematic part. Am I underestimating the effort to switch to Stretch? Is that why we want to go Stretch/Mitaka first?

I think T169099 is probably relevant for matching Release/OpenStack

I read the comments there but still don't understand. If we're planning to skip Ocata and go to Newton, then Stretch having Newton makes going to Jessie/Newton not as useful? Are we planning to go to Jessie/Ocata?

If we're planning to skip Ocata and go to Newton

Maybe I'm misunderstanding something but:
Mitaka -> Newton -> Ocata
Ocata doesn't need to be skipped to go from Mitaka to Newton, it's still a future version :)

I think this may help:

In T169099 we were not talking about wanting to skip Ocata; rather, we were considering matching our openstack upgrade cadence to the Debian upgrade cadence, and Ocata was never packaged by Debian. In https://phabricator.wikimedia.org/T169099#4407485 it's noted that a viable package source now exists for Ocata, so that isn't a current issue as long as that source is OK. The Release/Openstack version ladder is laid out there as M => N => O => P, with O coming from outside Debian upstream.

GTirloni added a comment. (Edited) Wed, Dec 19, 3:47 PM

My bad, I understand now the comment about Stretch "missing Ocata", thanks.

I'm sorry if I'm being too dense here but why do we need to worry about Mitaka on Stretch then?

I'm not sure; generally you can go +1 for compat, though it can come with caveats, e.g. nova-api can be one release ahead of nova-compute but not vice versa. I wouldn't think you need Mitaka at all on Stretch.

Would you suggest just reimaging servers directly as stretch + newton?

I believe what we did for L->M was to reimage the standby of an HA pair to the new Release/Openstack version, fail over to it with a day for sanity checking, and then reimage the now-standby (originally active) node. I believe control components can seamlessly be N+1 from nova-compute at least. This /should/ hold true for Neutron as well (neutron-api at +1 from the l3-agent, for example), but I've never actually tested it. In theory this allows a straight stagger of the control plane to Stretch/Newton.

Bstorm added a subscriber: Bstorm. (Edited) Wed, Dec 19, 4:17 PM

The easy solution for now seems to me to be stretch+newton as the next step (especially as we upgrade all the clients to that anyway), but a nicely HA openstack cluster on k8s using helm could track releases that are actually actively supported upstream for The Future. http://superuser.openstack.org/articles/build-openstack-kubernetes/

We are bound to run into a wall with the debian packages eventually, since there was talk of other versions. Just throwing that out there.

chasemp added a subscriber: Andrew. Wed, Dec 19, 4:22 PM

+1, the debian packaging path is way too tied to the OS release for a sane long-term openstack plan. @Andrew and I looked at a few options over time, but it seems fairly popular to use some container solution.

From the Openstack docs, it looks like we are most likely to break designate, but that shouldn't fail if we do what @chasemp said: https://docs.openstack.org/designate/pike/admin/upgrades/newton.html

Those docs make it seem so easy 😅

[...] but a nicely HA openstack cluster on k8s using helm could track releases that are actually actively supported upstream for The Future. http://superuser.openstack.org/articles/build-openstack-kubernetes/

That is a massive model change. I suggest you put a note in T209460: CloudVPS: our ideal future model :-P and we can follow up there.

Mind that we have the exact same packaging issues with k8s as with openstack: debian being slow with packaging, jumps between releases, difficulties upgrading, etc.
Also, we don't seem to have enough capacity to keep upgrading openstack deployments every 6 months anyway (the openstack release cycle, last time I checked), even if we deployed from source directly.

We are bound to run into a wall with the debian packages eventually, since there was talk of other versions. Just throwing that out there.

I would stick to using debian packages as long as we can. Source-based deployments may have other implications from several points of view: dependency sanity, security-issue tracking, integration with the surrounding operating system, etc. Well, nothing new: all the benefits a well-maintained package management system provides.

BTW, Debian right now contains Openstack Rocky (the most current release of Openstack), so we have 5 versions ahead of us with an "easy" upgrade path.

But perhaps all this is a bit off topic here :-P

I believe what we did for L->M was to reimage the standby of an HA pair to new Release/Openstack version and then fail over to it with a day for sanity and then reimage the now-standby-originally-active.

I don't think that's right; I'm pretty sure we just did a standard in-place upgrade from Liberty to Mitaka -- those hosts were all running Trusty, and the Ubuntu cloud repos allow for a smooth change of version.

I would certainly not expect to couple re-imaging with version upgrades if at all possible.

I believe control components can seemlessly be N+1 from nova-compute at least. This /should/ hold true for Neutron as well (neutron-api as +1 from l3-agent for example) but I've never actually tested it. In theory this allows a straight stagger of control plane to Stretch/Newton.

I'm pretty sure this is correct.

I would certainly not expect to couple re-imaging with version upgrades if at all possible.

Would you suggest trying Jessie/Newton?

I would certainly not expect to couple re-imaging with version upgrades if at all possible.

Would you suggest trying Jessie/Newton?

I don't know. It would be nice if we had the option of either Jessie/Newton or Stretch/Mitaka -- is that last not already available?

I would certainly not expect to couple re-imaging with version upgrades if at all possible.

Would you suggest trying Jessie/Newton?

I don't know. It would be nice if we had the option of either Jessie/Newton or Stretch/Mitaka -- is that last not already available?

Stretch/Mitaka is available! It was my first proposal.

Stretch/Mitaka is available! It was my first proposal.

Great! Sorry I didn't read all the way to the top :)

So my preference would be to do this without any rebuilds at all:

  1. Re-image remaining Trusty hosts to Jessie/Mitaka
  2. Do an in-place OS upgrade of all Jessie/Mitaka hosts to Stretch/Mitaka
  3. Do a standard package upgrade to Stretch/Newton
  4. (soon) repeat step 3 for Stretch/Ocata, etc.

Step 2 is especially important for cloudvirts because it avoids (or at least minimizes) VM downtime. Is there any reason why an in-place Jessie->Stretch upgrade is a bad idea? I presume we'd need to reboot at some point for a kernel update, but I'm not clear on whether that has to be coupled to the OS version upgrade.

So my preference would be to do this without any rebuilds at all:

  1. Re-image remaining Trusty hosts to Jessie/Mitaka
  2. Do an in-place OS upgrade of all Jessie/Mitaka hosts to Stretch/Mitaka
  3. Do a standard package upgrade to Stretch/Newton
  4. (soon) repeat step 3 for Stretch/Ocata, etc.

Makes sense.

Step 2 is especially important for cloudvirts because it avoids (or at least minimizes) VM downtime. Is there any reason why an in-place Jessie->Stretch upgrade is a bad idea? I presume we'd need to reboot at some point for a kernel update, but I'm not clear on whether that has to be coupled to the OS version upgrade.

Will do some tests to see how the resolver behaves and get back to you.
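
For reference, the standard in-place procedure would be roughly the following (a sketch assuming stock Debian sources; untested on cloudvirts so far):

# point apt at the new release
sed -i 's/jessie/stretch/g' /etc/apt/sources.list /etc/apt/sources.list.d/*.list
apt-get update
# two-stage upgrade, as the Debian release notes recommend
apt-get upgrade
apt-get dist-upgrade
# the Stretch kernel only takes effect after a reboot; VMs keep running
# on the old kernel until then
apt-get autoremove --purge
reboot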

Change 480944 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: introduce basic support for running Mitaka on Stretch on virt nodes

https://gerrit.wikimedia.org/r/480944

Change 480944 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: introduce basic support for running Mitaka on Stretch on virt nodes

https://gerrit.wikimedia.org/r/480944

Change 480954 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: add more stretch/mitaka support

https://gerrit.wikimedia.org/r/480954

Change 480954 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: add more stretch/mitaka support

https://gerrit.wikimedia.org/r/480954

aborrero added a comment. (Edited) Thu, Dec 20, 1:50 PM

Today I tried installing cloudvirt1030 directly in Stretch. I found several puppet issues when doing it:

  • interfaces are no longer called ethX, but eno1, eno2, eno3. Our puppet code for openstack assumes ethX all over the place. I'm not sure whether this can be solved with basic hiera overriding or whether we need further puppet refactoring (see the kernel command line sketch after this list)
  • some packages were renamed/removed/relocated between jessie and stretch. When installing stuff from jessie-backports, I found problems with sqlite3 and libmysqlclient18 (now libmariadbclient18).
    • our current puppet codebase lacks support for these dependency games on Stretch/Mitaka; it could be added easily, but mind the next point as well.
  • we may reach the Mitaka/Stretch state from 2 very different paths, and supporting both of them in puppet may be really complex:
    • a brand new server install, directly with Stretch
    • an operating system upgrade, where the server was originally installed with Jessie and is now being upgraded to Stretch
  • I stopped digging further, since the challenge is big enough already. My feeling is that Mitaka/Stretch itself is not that difficult, but our puppet codebase makes it difficult.
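
For the interface naming issue, one hypothetical workaround (proper hiera/puppet refactoring may well be the better fix) is to disable predictable interface names on the kernel command line, so Stretch keeps the ethX scheme:

# revert to legacy ethX naming; takes effect on the next reboot
sed -i 's/^GRUB_CMDLINE_LINUX="/GRUB_CMDLINE_LINUX="net.ifnames=0 biosdevname=0 /' /etc/default/grub
update-grub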

Supporting both code paths in puppet may be overkill. The cleaner solution is to support only one path in our puppet codebase; there are two options:

  1. If we support Mitaka/Stretch for fresh installs, this is a very clean solution, but it involves a lot of downtime for VMs, since we have to migrate them to another virt node while reimaging the host.
  2. If we support Mitaka/Stretch via operating system upgrade, it may be easier to get new cloudvirts working right away (installing them with Jessie). Since all servers are Jessie anyway, this is the most common case. But this is a bit ugly, and the moment we rebuild an already-stretch host we may face problems, since we would be back in case 1 anyway (or back to having to do the jessie->stretch upgrade by hand).

Proposal:

I believe the option that leaves us the least technical debt is option 1, at the cost of downtime for our users.

In any case, we need further puppet refactoring to better support our matrix of deployment/operatingsystem/openstackversion. If we don't take immediate action, our puppet puzzle will grow bigger and bigger.
I can do this, no problem, but it will take me time and some painful changes in the puppet code.

To avoid stalling the eqiad1 migration, we should install new cloudvirts with Jessie now and put them into service ASAP, while developing option 1 for the short/mid-term future.

Change 481005 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudvps: switch cloudvirt1030 to openstack newton

https://gerrit.wikimedia.org/r/481005

Change 481005 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloudvps: switch cloudvirt1030 to openstack newton

https://gerrit.wikimedia.org/r/481005

Change 481006 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: introduce nova templates for newton

https://gerrit.wikimedia.org/r/481006

Change 481006 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: introduce templates for newton

https://gerrit.wikimedia.org/r/481006

Change 481155 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: nova: compute: the libvirt service in stretch depends on other pkgs

https://gerrit.wikimedia.org/r/481155

Change 481155 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: nova: compute: the libvirt service in stretch depends on other pkgs

https://gerrit.wikimedia.org/r/481155

Change 481158 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: introduce config files for newton

https://gerrit.wikimedia.org/r/481158

Change 481158 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: nova: introduce config files for newton

https://gerrit.wikimedia.org/r/481158

Change 481161 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] hiera: cloudvirt1030: override interface names for bridge mapping

https://gerrit.wikimedia.org/r/481161

Change 481161 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] hiera: cloudvirt1030: override interface names for bridge mapping

https://gerrit.wikimedia.org/r/481161

Errors trying to run nova-compute in newton:

[...]
2018-12-20 16:02:23.942 277846 INFO nova.virt.libvirt.driver [-] Connection event '1' reason 'None'
2018-12-20 16:02:23.985 277846 WARNING nova.virt.libvirt.driver [req-a4e6b38e-23ec-4944-9ad3-58a207418c86 - - - - -] Cannot update service status on host "cloudvirt1030" due to an unexpected exception.
2018-12-20 16:02:23.985 277846 ERROR nova.virt.libvirt.driver Traceback (most recent call last):
2018-12-20 16:02:23.985 277846 ERROR nova.virt.libvirt.driver   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 3275, in _set_host_enabled
2018-12-20 16:02:23.985 277846 ERROR nova.virt.libvirt.driver     service = objects.Service.get_by_compute_host(ctx, CONF.host)
2018-12-20 16:02:23.985 277846 ERROR nova.virt.libvirt.driver   File "/usr/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 177, in wrapper
2018-12-20 16:02:23.985 277846 ERROR nova.virt.libvirt.driver     args, kwargs)
2018-12-20 16:02:23.985 277846 ERROR nova.virt.libvirt.driver   File "/usr/lib/python2.7/dist-packages/nova/conductor/rpcapi.py", line 236, in object_class_action_versions
2018-12-20 16:02:23.985 277846 ERROR nova.virt.libvirt.driver     args=args, kwargs=kwargs)
2018-12-20 16:02:23.985 277846 ERROR nova.virt.libvirt.driver   File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 169, in call
2018-12-20 16:02:23.985 277846 ERROR nova.virt.libvirt.driver     retry=self.retry)
2018-12-20 16:02:23.985 277846 ERROR nova.virt.libvirt.driver   File "/usr/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 97, in _send
2018-12-20 16:02:23.985 277846 ERROR nova.virt.libvirt.driver     timeout=timeout, retry=retry)
2018-12-20 16:02:23.985 277846 ERROR nova.virt.libvirt.driver   File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 464, in send
2018-12-20 16:02:23.985 277846 ERROR nova.virt.libvirt.driver     retry=retry)
2018-12-20 16:02:23.985 277846 ERROR nova.virt.libvirt.driver   File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 455, in _send
2018-12-20 16:02:23.985 277846 ERROR nova.virt.libvirt.driver     raise result
2018-12-20 16:02:23.985 277846 ERROR nova.virt.libvirt.driver RemoteError: Remote error: IncompatibleObjectVersion Version 1.20 of Service is not supported
2018-12-20 16:02:23.985 277846 ERROR nova.virt.libvirt.driver [u'Traceback (most recent call last):\n', u'  File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 138, in _dispatch_and_reply\n    incoming.message))\n', u'  File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 185, in _dispatch\n    return self._do_dispatch(endpoint, method, ctxt, args)\n', u'  File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 127, in _do_dispatch\n    result = func(ctxt, **new_args)\n', u'  File "/usr/lib/python2.7/dist-packages/nova/conductor/manager.py", line 92, in object_class_action_versions\n    objname, object_versions[objname])\n', u'  File "/usr/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 374, in obj_class_from_name\n    supported=latest_ver)\n', u'IncompatibleObjectVersion: Version 1.20 of Service is not supported\n'].
2018-12-20 16:02:23.985 277846 ERROR nova.virt.libvirt.driver
[...]
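
This looks like the usual mixed-version constraint: the Newton nova-compute sends a Service object at version 1.20, which the still-Mitaka conductor cannot backport, so the control plane has to reach Newton before (or together with) the computes. For the supported direction (newer control plane, older computes), the documented rolling-upgrade knob is an RPC version pin on the Newton control services; a sketch, untested here:

# nova.conf fragment on the Newton control plane while Mitaka computes remain
[upgrade_levels]
compute = mitaka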

Mentioned in SAL (#wikimedia-operations) [2019-01-02T16:11:34Z] <arturo> T212302 disable puppet in all {cloud,lab}virt* servers to merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/481194/

Mentioned in SAL (#wikimedia-operations) [2019-01-02T16:23:09Z] <arturo> T212302 re-enable puppet in all {cloud,lab}virt* servers, all was fine

Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts:

['cloudvirt1030.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201901021629_aborrero_80784.log.

Completed auto-reimage of hosts:

['cloudvirt1030.eqiad.wmnet']

and were ALL successful.

Change 481887 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: nova: mitaka/stretch: require packages before installing nova-common

https://gerrit.wikimedia.org/r/481887

Change 481887 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: nova: mitaka/stretch: require packages before installing nova

https://gerrit.wikimedia.org/r/481887

Change 482009 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: linuxbridge_agent: typo in libosinfo package name

https://gerrit.wikimedia.org/r/482009

Change 482009 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: linuxbridge_agent: typo in libosinfo package name

https://gerrit.wikimedia.org/r/482009

Change 482013 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: nova: mitaka: stretch: install python-dogpile.core from jessie

https://gerrit.wikimedia.org/r/482013

Change 482013 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: nova: mitaka: stretch: install python-dogpile.core from jessie

https://gerrit.wikimedia.org/r/482013

Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts:

['cloudvirt1030.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201901031240_aborrero_79074.log.

Mentioned in SAL (#wikimedia-operations) [2019-01-03T12:41:26Z] <arturo> T212302 reimaging again cloudvirt1030 to test final puppet code

Completed auto-reimage of hosts:

['cloudvirt1030.eqiad.wmnet']

and were ALL successful.

It seems cloudvirt1030.eqiad.wmnet is happy now with our puppet code for mitaka/stretch. Will try now with cloudvirt1029.

Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts:

['cloudvirt1013.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201901031601_aborrero_125781.log.

I will be using the openstack CloudVPS project to try more stuff related to this, specifically the stretch/mitaka combo for cloudnet servers (and then cloudcontrol servers).

Mentioned in SAL (#wikimedia-cloud) [2019-01-04T14:05:36Z] <arturo> T212302 creating openstack-puppetmaster-01 and cloudvps-upgrade-test VM instances

aborrero added a comment. (Edited) Fri, Jan 4, 5:37 PM

I was able to begin testing installation of cloudnet nodes in a VM following these steps:

  • puppetmaster: openstack-puppetmaster-01.openstack.eqiad.wmflabs
  • vm: cloudvps-upgrade-test.openstack.eqiad.wmflabs (stretch)
  1. in the puppetmaster, apply this patch:
diff --git a/modules/role/manifests/labs/instance.pp b/modules/role/manifests/labs/instance.pp
index 91833b8c32..320492e25b 100644
--- a/modules/role/manifests/labs/instance.pp
+++ b/modules/role/manifests/labs/instance.pp
@@ -4,7 +4,7 @@ class role::labs::instance {
     include ::profile::base::labs
     include sudo
     include ::base::instance_upstarts
-    include ::profile::openstack::main::observerenv
+    #include ::profile::openstack::main::observerenv
     include ::profile::openstack::main::cumin::target
 
     sudo::group { 'ops':
  2. in horizon, apply this basic hiera config to the vm:

(this hiera config won't be useful for running neutron, but it is for checking package installation, which is what I'm looking for)

profile::openstack::base::neutron::db_user: x
profile::openstack::base::neutron::physical_interface_mappings: {}
profile::openstack::base::neutron::rabbit_user: x
profile::openstack::eqiad1::keystone_host: x.example.com
profile::openstack::eqiad1::ldap_user_pass: x
profile::openstack::eqiad1::neutron::agent_down_time: 2
profile::openstack::eqiad1::neutron::db_host: x.example.com
profile::openstack::eqiad1::neutron::db_pass: x
profile::openstack::eqiad1::neutron::dmz_cidr:
- 0.0.0.0
profile::openstack::eqiad1::neutron::l3_agent_bridge_mappings:
  br: x
profile::openstack::eqiad1::neutron::l3_agent_bridges:
  br:
    addif: eth1.0
profile::openstack::eqiad1::neutron::log_agent_heartbeats: x
profile::openstack::eqiad1::neutron::metadata_proxy_shared_secret: x
profile::openstack::eqiad1::neutron::network_compat_interface: eth1.0
profile::openstack::eqiad1::neutron::network_compat_interface_vlan: 0
profile::openstack::eqiad1::neutron::network_flat_interface: eth1.1
profile::openstack::eqiad1::neutron::network_flat_interface_external: eth1.2
profile::openstack::eqiad1::neutron::network_flat_interface_vlan: 1
profile::openstack::eqiad1::neutron::network_flat_interface_vlan_external: 2
profile::openstack::eqiad1::neutron::network_public_ip: 0.0.0.0
profile::openstack::eqiad1::neutron::rabbit_pass: x
profile::openstack::eqiad1::neutron::report_interval: x
profile::openstack::eqiad1::neutron::tld: x.x
profile::openstack::eqiad1::nova::dhcp_domain: x
profile::openstack::eqiad1::nova_controller: x.example.com
profile::openstack::eqiad1::observer_password: x
profile::openstack::eqiad1::region: test-r
profile::openstack::eqiad1::version: mitaka
puppetmaster: openstack-puppetmaster-01.openstack.eqiad.wmflabs
  3. in the vm, create a dummy eth1 interface: sudo ip link add eth1 type dummy (see the interface sketch after this list)
  4. in horizon, apply this role to the vm: role::wmcs::openstack::eqiad1::net
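
For reference, a hypothetical expansion of step 3 that also creates the VLAN subinterfaces referenced by the hiera config above (VLAN IDs match the hiera values; this is only enough for puppet to find the interfaces, not for real traffic):

sudo ip link add eth1 type dummy
sudo ip link set eth1 up
# subinterfaces matching network_compat (vlan 0) and flat (vlans 1 and 2)
sudo ip link add link eth1 name eth1.0 type vlan id 0
sudo ip link add link eth1 name eth1.1 type vlan id 1
sudo ip link add link eth1 name eth1.2 type vlan id 2
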
aborrero added a subscriber: bd808. Tue, Jan 8, 7:06 PM

For the record, @bd808 helped me install neutron-common in a CloudVPS VM instance by deleting the neutron user from the cloud LDAP: P7967

Change 483408 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: enable net nodes in the mitaka/stretch combination

https://gerrit.wikimedia.org/r/483408

Change 483408 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: enable net nodes in the mitaka/stretch combination

https://gerrit.wikimedia.org/r/483408

aborrero added a comment. (Edited) Thu, Jan 10, 1:36 PM

Change 483408 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: enable net nodes in the mitaka/stretch combination

https://gerrit.wikimedia.org/r/483408

After this patch, we should be able to rebuild cloudnet1003 and cloudnet1004 as mitaka/stretch. @Andrew let me know if this is OK:

  • I would select the inactive node in the HA pair
  • rebuild it with stretch
  • see if mitaka/stretch can work with mitaka/jessie, i.e., whether they can form an HA pair again, working as expected
  • if all is fine, switch the HA active node and do the same with the remaining mitaka/jessie node

This process can cause downtime:

  • while rebuilding the inactive node, we won't have HA support.
  • while switching the active node from mitaka/jessie to mitaka/stretch: that's the critical moment of truth, to see if the mitaka/stretch combo is in good shape for an actual workload
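
To judge whether the mixed mitaka/stretch + mitaka/jessie pair is healthy before and after the failover, something like this should do (router ID illustrative):

# both L3 agents should be listed as alive
neutron agent-list
# and the router should be hosted where we expect
neutron l3-agent-list-hosting-router <router-id>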

Also, please note that I have no idea yet how we will do mitaka/stretch -> newton/stretch. That would likely mean another reimage of the servers.

Mentioned in SAL (#wikimedia-operations) [2019-01-10T13:51:18Z] <arturo> T212302 icinga downtime for 2h cloudvirt[1013,1024,1026-1030].eqiad.wmnet bc wrong puppet code

Change 483416 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: remove redundant sqlite3 declaration in cloudvirt hosts

https://gerrit.wikimedia.org/r/483416

Change 483416 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: remove redundant sqlite3 declaration in cloudvirt hosts

https://gerrit.wikimedia.org/r/483416

After this patch, we should be able to rebuild cloudnet1003 and cloudnet1004 as mitaka/stretch. @Andrew let me know if this is OK

That all sounds good to me. I'm not sure I understand how the HA aspects of neutron are implemented here, but as long as switching from the passive to the active server is simple, it's a good plan. I would notify users ahead of time, though.

Also, regarding mitaka -> newton, I wouldn't expect that to require a rebuild (at least, version upgrades haven't in the past), but we can cross that bridge when we get to it :)