
CloudVPS: upgrade: jessie -> stretch & mitaka -> newton
Closed, Resolved, Public

Description

We need to upgrade our Cloud VPS infra:

  • from Debian Jessie to Debian Stretch
  • from Mitaka to Newton

Since Newton cannot be installed on Jessie, we need to try installing Mitaka on Stretch and then upgrading Mitaka -> Newton.

All Mitaka packages are in the jessie-backports Debian repository, so we may try the hack of using that repo on Stretch.
My proposal is to try something like this:

  1. Image a server with Debian Stretch
  2. Enable the jessie-backports repo
  3. Install Mitaka packages from that repo
  4. Upgrade Mitaka (jessie-backports) to Newton (stretch)
  5. Clean up any remaining jessie-backports packages and use only what stretch provides
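Steps 2, 3 and 5 above can be sketched in Python. This is a minimal, hypothetical sketch rather than what was actually deployed: the mirror URL, the package globs, the pin priority of 600 and the helper names (`render_backports_source`, `render_pin`, `leftover_backports`) are all assumptions. It relies on two Debian conventions: an apt_preferences(5) stanza with `Pin: release n=<codename>` selects a release by codename, and jessie (Debian 8) backports carry a `~bpo8` suffix in their version strings.

```python
"""Hypothetical sketch: use jessie-backports on a stretch host, then clean up."""
from typing import Dict, List

# Assumed mirror; substitute whatever mirror the hosts actually use.
DEFAULT_MIRROR = "http://deb.debian.org/debian"


def render_backports_source(mirror: str = DEFAULT_MIRROR) -> str:
    """Step 2: a sources.list line that enables jessie-backports."""
    return f"deb {mirror} jessie-backports main\n"


def render_pin(packages: List[str], codename: str = "jessie-backports",
               priority: int = 600) -> str:
    """Step 3: an apt_preferences(5) stanza pinning packages to a release.

    `packages` may contain glob patterns such as "nova-*". A priority above
    stretch's default of 500 makes these jessie-backports versions the apt
    candidates; staying below 1000 means apt never forces a downgrade.
    """
    return (
        f"Package: {' '.join(packages)}\n"
        f"Pin: release n={codename}\n"
        f"Pin-Priority: {priority}\n"
    )


def leftover_backports(installed: Dict[str, str]) -> List[str]:
    """Step 5: names of installed packages still at a jessie backport version."""
    return sorted(name for name, version in installed.items() if "~bpo8" in version)


if __name__ == "__main__":
    print(render_backports_source(), end="")
    print(render_pin(["nova-*", "neutron-*", "python-nova"]), end="")
    # Made-up version strings, for illustration only.
    print(leftover_backports({
        "nova-compute": "2:13.1.2-1~bpo8+1",
        "neutron-common": "2:9.1.1-3",
    }))
```

Once the pin is removed, jessie-backports falls back to its default low priority, so stretch's Newton packages should become the apt candidates again for step 4.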

PS: There is a small summary of the versioning matrix in T169099#4676842

Details

Repo | Branch | Lines +/-
operations/puppet | production | +33 -5
operations/puppet | production | +0 -177
operations/puppet | production | +4 -4
operations/puppet | production | +4 -0
operations/puppet | production | +7 -7
operations/puppet | production | +1 -1
operations/puppet | production | +79 -0
operations/puppet | production | +3K -0
operations/puppet | production | +4 -4
operations/puppet | production | +4 -0
operations/puppet | production | +1 -1
operations/puppet | production | +1K -0
operations/puppet | production | +5K -0
operations/puppet | production | +2 -2
operations/puppet | production | +3K -0
operations/puppet | production | +451 -0
operations/puppet | production | +19 -1
operations/puppet | production | +0 -2
operations/puppet | production | +8 -5
operations/puppet | production | +149 -72
operations/puppet | production | +1 -1
operations/puppet | production | +6 -1
operations/puppet | production | +2 -0
operations/puppet | production | +419 -0
operations/puppet | production | +1 -1
operations/puppet | production | +485 -0
operations/puppet | production | +14 -2
operations/puppet | production | +9 -6
operations/puppet | production | +14 -3

Related Objects

Status | Assigned
Resolved | Andrew
Resolved | Andrew
Open | None
Resolved | MoritzMuehlenhoff
Open | None
Resolved | aborrero
Resolved | aborrero
Resolved | aborrero
Resolved | Papaul
Resolved | JHedden
Resolved | aborrero
Resolved | aborrero
Resolved | Papaul
Resolved | aborrero
Resolved | aborrero
Resolved | aborrero
Resolved | aborrero
Resolved | Andrew
Resolved | aborrero
Resolved | aborrero
Resolved | Andrew
Resolved | aborrero
Resolved | aborrero
Resolved | Andrew
Resolved | Marostegui
Resolved | aborrero
Resolved | Andrew
Duplicate | None
Resolved | Andrew
Resolved | Andrew
Invalid | JHedden

Event Timeline


It seems cloudvirt1030.eqiad.wmnet is happy now with our puppet code for mitaka/stretch. Will try cloudvirt1029 next.

Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts:

['cloudvirt1013.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201901031601_aborrero_125781.log.

I will be using the openstack CloudVPS project to test more things related to this, specifically the stretch/mitaka combo for cloudnet servers (and then cloudcontrol servers).

Mentioned in SAL (#wikimedia-cloud) [2019-01-04T14:05:36Z] <arturo> T212302 creating openstack-puppetmaster-01 and cloudvps-upgrade-test VM instances

I was able to begin testing installation of cloudnet nodes in a VM following these steps:

  • puppetmaster: openstack-puppetmaster-01.openstack.eqiad.wmflabs
  • vm: cloudvps-upgrade-test.openstack.eqiad.wmflabs (stretch)
  1. in the puppetmaster, apply this patch:
diff --git a/modules/role/manifests/labs/instance.pp b/modules/role/manifests/labs/instance.pp
index 91833b8c32..320492e25b 100644
--- a/modules/role/manifests/labs/instance.pp
+++ b/modules/role/manifests/labs/instance.pp
@@ -4,7 +4,7 @@ class role::labs::instance {
     include ::profile::base::labs
     include sudo
     include ::base::instance_upstarts
-    include ::profile::openstack::main::observerenv
+    #include ::profile::openstack::main::observerenv
     include ::profile::openstack::main::cumin::target
 
     sudo::group { 'ops':
  2. in horizon, apply this basic hiera config to the vm:

(this hiera config won't be useful for running neutron, but it is useful for checking package installation, which is what I'm looking for)

profile::openstack::base::neutron::db_user: x
profile::openstack::base::neutron::physical_interface_mappings: {}
profile::openstack::base::neutron::rabbit_user: x
profile::openstack::eqiad1::keystone_host: x.example.com
profile::openstack::eqiad1::ldap_user_pass: x
profile::openstack::eqiad1::neutron::agent_down_time: 2
profile::openstack::eqiad1::neutron::db_host: x.example.com
profile::openstack::eqiad1::neutron::db_pass: x
profile::openstack::eqiad1::neutron::dmz_cidr:
- 0.0.0.0
profile::openstack::eqiad1::neutron::l3_agent_bridge_mappings:
  br: x
profile::openstack::eqiad1::neutron::l3_agent_bridges:
  br:
    addif: eth1.0
profile::openstack::eqiad1::neutron::log_agent_heartbeats: x
profile::openstack::eqiad1::neutron::metadata_proxy_shared_secret: x
profile::openstack::eqiad1::neutron::network_compat_interface: eth1.0
profile::openstack::eqiad1::neutron::network_compat_interface_vlan: 0
profile::openstack::eqiad1::neutron::network_flat_interface: eth1.1
profile::openstack::eqiad1::neutron::network_flat_interface_external: eth1.2
profile::openstack::eqiad1::neutron::network_flat_interface_vlan: 1
profile::openstack::eqiad1::neutron::network_flat_interface_vlan_external: 2
profile::openstack::eqiad1::neutron::network_public_ip: 0.0.0.0
profile::openstack::eqiad1::neutron::rabbit_pass: x
profile::openstack::eqiad1::neutron::report_interval: x
profile::openstack::eqiad1::neutron::tld: x.x
profile::openstack::eqiad1::nova::dhcp_domain: x
profile::openstack::eqiad1::nova_controller: x.example.com
profile::openstack::eqiad1::observer_password: x
profile::openstack::eqiad1::region: test-r
profile::openstack::eqiad1::version: mitaka
puppetmaster: openstack-puppetmaster-01.openstack.eqiad.wmflabs
  3. in the vm, create a dummy eth1 interface: sudo ip link add eth1 type dummy
  4. in horizon, apply this role to the vm: role::wmcs::openstack::eqiad1::net
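The dummy hiera values above follow a pattern: each `network_*_interface` is a VLAN subinterface of `eth1` named `eth1.<id>`, next to a key holding the matching VLAN id. A minimal Python sketch of a sanity check for that pattern follows. The key pairing is copied from the snippet, but treating the name suffix as the VLAN id (the usual Linux `dev.vlan` naming) and the helper `check_vlan_suffixes` are assumptions; the data is passed as an already-parsed dict so the sketch needs no YAML library.

```python
"""Hypothetical sketch: check eth1.<id> names against their VLAN id keys."""
from typing import Any, Dict, List

PREFIX = "profile::openstack::eqiad1::neutron::"

# Interface key -> VLAN id key, as paired in the hiera snippet above.
PAIRS = {
    "network_compat_interface": "network_compat_interface_vlan",
    "network_flat_interface": "network_flat_interface_vlan",
    "network_flat_interface_external": "network_flat_interface_vlan_external",
}


def check_vlan_suffixes(hiera: Dict[str, Any]) -> List[str]:
    """Return one message per mismatch; an empty list means consistent."""
    problems = []
    for iface_key, vlan_key in PAIRS.items():
        iface = str(hiera[PREFIX + iface_key])
        vlan = int(hiera[PREFIX + vlan_key])
        # "eth1.2" -> suffix "2"; a name without a dot yields the whole name.
        suffix = iface.rpartition(".")[2]
        if not suffix.isdigit() or int(suffix) != vlan:
            problems.append(f"{iface_key}={iface} but {vlan_key}={vlan}")
    return problems


if __name__ == "__main__":
    hiera = {
        PREFIX + "network_compat_interface": "eth1.0",
        PREFIX + "network_compat_interface_vlan": 0,
        PREFIX + "network_flat_interface": "eth1.1",
        PREFIX + "network_flat_interface_vlan": 1,
        PREFIX + "network_flat_interface_external": "eth1.2",
        PREFIX + "network_flat_interface_vlan_external": 2,
    }
    print(check_vlan_suffixes(hiera))  # prints [] for the values above
```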

For the record, @bd808 helped me install neutron-common in a CloudVPS VM instance by deleting the neutron user from the cloud LDAP: P7967

Change 483408 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: enable net nodes in the mitaka/stretch combination

https://gerrit.wikimedia.org/r/483408

Change 483408 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: enable net nodes in the mitaka/stretch combination

https://gerrit.wikimedia.org/r/483408

After this patch, we should be able to rebuild cloudnet1003 and cloudnet1004 as mitaka/stretch. @Andrew let me know if this is OK:

  • I would select the inactive node in the HA pair
  • rebuild it in stretch
  • see if mitaka/stretch can work with mitaka/jessie, i.e., whether they can form an HA pair again and work as expected
  • if all is fine, switch the HA active node and do the same with the remaining mitaka/jessie node

This process can cause downtime:

  • while rebuilding the inactive node, we won't have HA support.
  • switching the active node from mitaka/jessie to mitaka/stretch is the critical moment of truth, when we see whether the mitaka/stretch combo is in good shape for actual workloads
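The rebuild order proposed above can be written down as a short plan generator. This Python sketch is hypothetical: the function name and step wording are illustrative, and which of cloudnet1003/cloudnet1004 is active at the time is an input, not something taken from the task. Its one rule is that a node is only reimaged while it is the standby.

```python
"""Hypothetical sketch: rebuild a two-node active/standby pair, standby first."""
from typing import Dict, List


def rolling_rebuild(roles: Dict[str, str], target: str = "mitaka/stretch") -> List[str]:
    """Return ordered steps; a node is only reimaged while it is the standby.

    `roles` maps each node name to "active" or "standby".
    """
    if sorted(roles.values()) != ["active", "standby"]:
        raise ValueError("expected exactly one active and one standby node")
    active = next(name for name, role in roles.items() if role == "active")
    standby = next(name for name, role in roles.items() if role == "standby")
    return [
        f"reimage {standby} as {target} (no HA until it rejoins)",
        f"verify {standby} and {active} work again as an HA pair",
        f"fail over: {standby} becomes active (first real workload on {target})",
        f"reimage {active} as {target} (no HA until it rejoins)",
        f"verify {active} and {standby} work again as an HA pair",
    ]


if __name__ == "__main__":
    # Which node is active is an assumption here, not taken from the task.
    for step in rolling_rebuild({"cloudnet1003": "active", "cloudnet1004": "standby"}):
        print(step)
```

The two `reimage` steps are the windows without HA, and the fail-over step is the moment of truth described above.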

Also, please note that I have no idea yet how we will do mitaka/stretch -> newton/stretch. That would likely mean another reimage of the servers.

Mentioned in SAL (#wikimedia-operations) [2019-01-10T13:51:18Z] <arturo> T212302 icinga downtime for 2h cloudvirt[1013,1024,1026-1030].eqiad.wmnet bc wrong puppet code

Change 483416 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: remove redundant sqlite3 declaration in cloudvirt hosts

https://gerrit.wikimedia.org/r/483416

Change 483416 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: remove redundant sqlite3 declaration in cloudvirt hosts

https://gerrit.wikimedia.org/r/483416

> After this patch, we should be able to rebuild cloudnet1003 and cloudnet1004 as mitaka/stretch. @Andrew let me know if this is OK

That all sounds good to me. I'm not sure I understand how the HA aspects of neutron are implemented here, but as long as switching from the passive to the active server is simple then it's a good plan. I would notify users ahead of time though.

Also, regarding mitaka -> newton, I wouldn't expect that to require a rebuild (at least, version upgrades haven't in the past), but we can cross that bridge when we get to it :)

Change 485185 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] labtestneutron2001: reimage to stretch and rename to cloudnet2001-dev

https://gerrit.wikimedia.org/r/485185

Change 490619 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] cloudvirt1018: when we reimage, do so as Stretch

https://gerrit.wikimedia.org/r/490619

Change 490619 merged by Andrew Bogott:
[operations/puppet@production] cloudvirt1018: when we reimage, do so as Stretch

https://gerrit.wikimedia.org/r/490619

Just for reference, here's the current list of OpenStack releases:

[Screenshot from 2019-03-21: list of OpenStack releases]

If we started right now, it seems that even Queens would be a risky release, since Extended Maintenance does not seem to ensure security fixes (only that community members have decided to maintain it on a best-effort basis). With our current capacity, it seems that Rocky or even Stein would yield more future-proof results.

However, such a big jump means we would have to implement a new region and migrate VMs over (like we did for Neutron). That makes having shared storage critical (for faster migrations).

The alternative is to go Mitaka -> Newton -> Ocata -> Pike -> Queens -> Rocky -> Stein. Since we have had historical trouble keeping up with OpenStack releases, it seems improbable we will be able to handle 5-6 major upgrades in a short time to reach a release that is in Maintained status.
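The upgrade count in the paragraph above is easy to check. A minimal Python sketch, assuming each upgrade moves exactly one release forward along the path listed (no skip-level upgrades); the list covers only the releases named in this task:

```python
"""Sketch: count the one-release-at-a-time upgrades needed to reach a target."""
from typing import List

# OpenStack releases from Mitaka to Stein, in release order.
RELEASES = ["mitaka", "newton", "ocata", "pike", "queens", "rocky", "stein"]


def upgrade_path(current: str, target: str) -> List[str]:
    """Return the releases passed through, assuming no skip-level upgrades."""
    i, j = RELEASES.index(current), RELEASES.index(target)
    if j <= i:
        raise ValueError(f"{target} is not newer than {current}")
    return RELEASES[i + 1 : j + 1]


if __name__ == "__main__":
    for target in ("rocky", "stein"):
        path = upgrade_path("mitaka", target)
        print(f"mitaka -> {target}: {len(path)} upgrades ({' -> '.join(path)})")
```

Running it prints 5 upgrades to reach Rocky and 6 to reach Stein, which is where the 5-6 estimate comes from.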

And there's the question of aligning with Debian releases. Debian Buster seems to come with OpenStack Rocky.

> If we started right now, it seems that even Queens would be a risky release, since Extended Maintenance does not seem to ensure security fixes (only that community members have decided to maintain it on a best-effort basis). With our current capacity, it seems that Rocky or even Stein would yield more future-proof results.

One reason we had trouble (besides the fact that we use OS-based Debian packages, which is an obvious problem for this) is Neutron. We couldn't get out of our old version as long as we were on novanetwork. We might be able to move faster than in the past once novanetwork dies?

> And there's the question of aligning with Debian releases. Debian Buster seems to come with OpenStack Rocky.

I don't personally see any value in aligning with Debian releases for a core product, because it isn't Debian's core product. I've been thinking this is a cool thing; however, it also means we cannot easily upgrade while following the Debian release cycle. So maybe we'll *have* to stop following Debian packages eventually.

We have all this information at hand here: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Openstack_source

Not sure why you bring this up now though.

Change 533923 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] glance: add Newton config files

https://gerrit.wikimedia.org/r/533923

Change 533924 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] keystone: forward mitaka config to newton

https://gerrit.wikimedia.org/r/533924

Change 533925 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] keystone: update policy.json for Newton

https://gerrit.wikimedia.org/r/533925

Change 533926 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Designate: add Newton config files and resources

https://gerrit.wikimedia.org/r/533926

Change 533927 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Openstack Neutron: added config files and templates for version Newton

https://gerrit.wikimedia.org/r/533927

Change 533923 merged by Andrew Bogott:
[operations/puppet@production] glance: add Newton config files

https://gerrit.wikimedia.org/r/533923

Change 533924 merged by Andrew Bogott:
[operations/puppet@production] keystone: forward mitaka config to newton

https://gerrit.wikimedia.org/r/533924

Change 533925 merged by Andrew Bogott:
[operations/puppet@production] keystone: update policy.json for Newton

https://gerrit.wikimedia.org/r/533925

Change 533927 merged by Andrew Bogott:
[operations/puppet@production] Openstack Neutron: added config files and templates for version Newton

https://gerrit.wikimedia.org/r/533927

Change 533926 merged by Andrew Bogott:
[operations/puppet@production] Designate: add Newton config files and resources

https://gerrit.wikimedia.org/r/533926

Change 538027 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Horizon: put into maintenance mode during designate upgrade

https://gerrit.wikimedia.org/r/538027

Change 538028 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Designate: move to OpenStack version 'newton'

https://gerrit.wikimedia.org/r/538028

Change 538027 merged by Andrew Bogott:
[operations/puppet@production] Horizon: put into maintenance mode during designate upgrade

https://gerrit.wikimedia.org/r/538027

Change 538028 merged by Andrew Bogott:
[operations/puppet@production] Designate: move to OpenStack version 'newton'

https://gerrit.wikimedia.org/r/538028

Change 538031 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Designate: upgrade eqiad1 to Newton

https://gerrit.wikimedia.org/r/538031

Change 538031 merged by Andrew Bogott:
[operations/puppet@production] Designate: upgrade eqiad1 to Newton

https://gerrit.wikimedia.org/r/538031

cloudservices1003 and cloudservices1004 are now running Designate Newton. There are a few more steps we should take before we're ready for Ocata there, though: we need to move to the worker/producer model and also (probably) to pdns4.

Change 538085 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] designate: switch to the worker/producer model

https://gerrit.wikimedia.org/r/538085

Change 538430 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Openstack: add some missing files for Newton

https://gerrit.wikimedia.org/r/538430

Change 538430 merged by Andrew Bogott:
[operations/puppet@production] Openstack: add some missing files for Newton

https://gerrit.wikimedia.org/r/538430

Change 538431 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] nova: add a few more Newton files

https://gerrit.wikimedia.org/r/538431

Change 538431 merged by Andrew Bogott:
[operations/puppet@production] nova: add a few more Newton files

https://gerrit.wikimedia.org/r/538431

Change 539065 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: drop jessie code

https://gerrit.wikimedia.org/r/539065

aborrero added a subscriber: JHedden.

Un-claiming this task myself. @JHedden and @Andrew are actively working on this, probably more than me at this point.

Change 540643 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Openstack: move eqiad1 glance/keystone/nova/neutron to Newton

https://gerrit.wikimedia.org/r/540643

Change 541133 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Horizon: put in maintenance mode for the mitaka->newton upgrade

https://gerrit.wikimedia.org/r/541133

Change 541133 merged by Andrew Bogott:
[operations/puppet@production] Horizon: put in maintenance mode for the mitaka->newton upgrade

https://gerrit.wikimedia.org/r/541133

Change 540643 merged by Andrew Bogott:
[operations/puppet@production] Openstack: move eqiad1 glance/keystone/nova/neutron to Newton

https://gerrit.wikimedia.org/r/540643

Mentioned in SAL (#wikimedia-cloud) [2019-10-07T14:07:09Z] <arturo> horizon is disabled for maintenance (T212302)

Mentioned in SAL (#wikimedia-operations) [2019-10-07T14:25:16Z] <arturo> upgrading openstack in CloudVPS. Some IRC bots and related stuff may be unavailable (T212302)

Change 541310 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] openstack: pin mitaka version on jessie openstack clients

https://gerrit.wikimedia.org/r/541310

Change 541310 merged by Jhedden:
[operations/puppet@production] openstack: pin mitaka version on jessie openstack clients

https://gerrit.wikimedia.org/r/541310

Change 541342 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] openstack: Update jessie openstack clients in eqiad1

https://gerrit.wikimedia.org/r/541342

Change 541342 merged by Jhedden:
[operations/puppet@production] openstack: Update jessie openstack clients in eqiad1

https://gerrit.wikimedia.org/r/541342

Change 539065 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: drop jessie code

https://gerrit.wikimedia.org/r/539065

Mentioned in SAL (#wikimedia-cloud) [2019-10-17T14:41:28Z] <jeh> deleting failed stresstest VMs that have multiple designate records stresstest1024-16-[16,17,64] left over from newton upgrade T212302

aborrero claimed this task.

Closing the task now; this work is already done.

Change 550659 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: clientpackages: vms: refresh comments and messages

https://gerrit.wikimedia.org/r/550659

Change 550659 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: clientpackages: vms: refresh comments and messages

https://gerrit.wikimedia.org/r/550659