Start with cloudservices200[23]-dev.wikimedia.org (T304702). The current Horizon deploy is backwards-compatible with Wallaby (how this was verified is unclear), so that leaves the cloudcontrol, cloudnet, and cloudvirt nodes to upgrade.
- update IRC topic
- downtime everything in icinga through 14:00CDT
aborrero@cumin1001:~ $ sudo cookbook sre.hosts.downtime -r "upgrading openstack" --min 120 lab*
aborrero@cumin1001:~ $ sudo cookbook sre.hosts.downtime -r "upgrading openstack" --min 120 cloud*
- downtime Horizon with https://gerrit.wikimedia.org/r/c/operations/puppet/+/682948
- start an ssh session with a running VM so that you notice if/when the network goes down
- disable puppet on all cloud* hosts
$ sudo cumin 'cloud*dev*' "disable-puppet 'Upgrading to openstack Wallaby - T304694 - ${USER}'"
- dump databases on cloudcontrol2001-dev.wikimedia.org (nova, nova_api, nova_cell0, neutron, glance, placement, keystone, cinder):
- mysqldump -u root nova > /root/wallabydbbackups/nova.sql
- mysqldump -u root nova_api > /root/wallabydbbackups/nova_api.sql
- mysqldump -u root nova_cell0 > /root/wallabydbbackups/nova_cell0.sql
- mysqldump -u root neutron > /root/wallabydbbackups/neutron.sql
- mysqldump -u root glance > /root/wallabydbbackups/glance.sql
- mysqldump -u root placement > /root/wallabydbbackups/placement.sql
- mysqldump -u root keystone > /root/wallabydbbackups/keystone.sql
- mysqldump -u root cinder > /root/wallabydbbackups/cinder.sql
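The dump steps above can be driven from a single list so the databases and commands stay in sync. A sketch that prints the mysqldump commands for review rather than executing them (on the real host, run them as root after creating /root/wallabydbbackups):

```shell
# Databases to back up before the upgrade (list from the runbook step above).
DBS="nova nova_api nova_cell0 neutron glance placement keystone cinder"

# Print one mysqldump command per database for review; pipe to `sh` (as root)
# to actually run them.
for db in $DBS; do
  printf 'mysqldump -u root %s > /root/wallabydbbackups/%s.sql\n' "$db" "$db"
done
```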
- merge puppet patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/775278
Cloudcontrols:
Reference paste of open database connections post-upgrade: https://phabricator.wikimedia.org/P10999
Check haproxy status with: echo "show stat" | socat /var/run/haproxy/haproxy.sock stdio | grep DOWN
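The haproxy status check reads CSV rows from the stats socket; any backend whose status column is DOWN shows up in the grep output. A hypothetical sample (service and host names are made up for illustration) of what a failing row looks like:

```shell
# Stand-in for live output of:
#   echo "show stat" | socat /var/run/haproxy/haproxy.sock stdio
stats='# pxname,svname,qcur,...,status,...
keystone,cloudcontrol2001-dev,0,...,UP,...
glance-api,cloudcontrol2003-dev,0,...,DOWN,...'

# Same filter as the runbook command; a healthy host prints nothing.
printf '%s\n' "$stats" | grep DOWN
```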
cloudcontrol2001-dev.wikimedia.org:
- puppet agent --enable && puppet agent -tv
- apt-get update
- systemctl unmask keystone
- DEBIAN_FRONTEND=noninteractive apt-get install glance python3-eventlet=0.30.2-1 glance-api glance-common keystone nova-api nova-conductor nova-scheduler nova-common neutron-server python3-requests python3-urllib3 placement-api cinder-volume cinder-scheduler cinder-api python3-oslo.messaging python3-tooz -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
- DEBIAN_FRONTEND=noninteractive apt-get install python3-trove trove-api trove-common trove-conductor trove-taskmanager -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
- DEBIAN_FRONTEND=noninteractive apt-get upgrade -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
- systemctl mask keystone
- puppet agent -tv
- nova-manage api_db sync
- nova-manage db sync
- placement-manage db sync
- glance-manage db_sync
- keystone-manage db_sync
- cinder-manage db online_data_migrations
- cinder-manage db sync
- trove-manage db_sync
- puppet agent -tv
- nova-manage db online_data_migrations
- neutron-db-manage upgrade heads
- systemctl list-units --failed (should show nothing failed, or just keystone; if keystone failed, clear it with systemctl reset-failed)
cloudcontrol2003-dev.wikimedia.org:
- puppet agent --enable && puppet agent -tv
- apt-get update
- systemctl unmask keystone
- DEBIAN_FRONTEND=noninteractive apt-get install glance python3-eventlet=0.30.2-1 glance-api glance-common keystone nova-api nova-conductor nova-scheduler nova-common neutron-server python3-requests python3-urllib3 placement-api cinder-volume cinder-scheduler cinder-api python3-oslo.messaging python3-tooz -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
- DEBIAN_FRONTEND=noninteractive apt-get install python3-trove trove-api trove-common trove-conductor trove-taskmanager -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
- DEBIAN_FRONTEND=noninteractive apt-get upgrade -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
- systemctl mask keystone
- puppet agent -tv
- puppet agent -tv
- systemctl list-units --failed (should show nothing failed, or just keystone; if keystone failed, clear it with systemctl reset-failed)
cloudcontrol2004-dev.wikimedia.org:
- puppet agent --enable && puppet agent -tv
- apt-get update
- systemctl unmask keystone
- DEBIAN_FRONTEND=noninteractive apt-get install glance python3-eventlet=0.30.2-1 glance-api glance-common keystone nova-api nova-conductor nova-scheduler nova-common neutron-server python3-requests python3-urllib3 placement-api cinder-volume cinder-scheduler cinder-api python3-oslo.messaging python3-tooz -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
- DEBIAN_FRONTEND=noninteractive apt-get install python3-trove trove-api trove-common trove-conductor trove-taskmanager -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
- DEBIAN_FRONTEND=noninteractive apt-get upgrade -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
- systemctl mask keystone
- puppet agent -tv
- puppet agent -tv
- systemctl list-units --failed (should show nothing failed, or just keystone; if keystone failed, clear it with systemctl reset-failed)
cloudnets (one at a time please):
Begin with the standby node, as determined with:
$ neutron l3-agent-list-hosting-router cloudinstances2b-gw
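The standby node is the row whose ha_state column reads "standby" in that command's table output. A sketch that extracts it from a hypothetical sample (agent ids are made up; host names match this runbook):

```shell
# Stand-in for the table printed by:
#   neutron l3-agent-list-hosting-router cloudinstances2b-gw
table='| id | host | admin_state_up | ha_state |
| aaaa-1111 | cloudnet2002-dev | True | standby |
| bbbb-2222 | cloudnet2004-dev | True | active |'

# Field 3 is the host column; pick the row whose ha_state (field 5) is standby.
standby=$(printf '%s\n' "$table" | awk -F'|' '$5 ~ /standby/ {gsub(/ /,"",$3); print $3}')
echo "$standby"   # prints cloudnet2002-dev for this sample
```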
Standby node (cloudnet2002-dev.codfw.wmnet):
- puppet agent --enable && puppet agent -tv
- apt-get update
- DEBIAN_FRONTEND=noninteractive apt-get install -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold" neutron-l3-agent python3-oslo.messaging python3-neutronclient python3-glanceclient
- DEBIAN_FRONTEND=noninteractive apt-get upgrade -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
- puppet agent -tv
- run neutron-db-manage upgrade heads on cloudcontrol2001-dev.wikimedia.org
Active node (cloudnet2004-dev.codfw.wmnet):
- puppet agent --enable && puppet agent -tv
- apt-get update
- DEBIAN_FRONTEND=noninteractive apt-get install -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold" neutron-l3-agent python3-oslo.messaging python3-neutronclient python3-glanceclient
- DEBIAN_FRONTEND=noninteractive apt-get upgrade -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
- puppet agent -tv
- restore Horizon with https://gerrit.wikimedia.org/r/c/operations/puppet/+/682950
Break Time
Cloudvirts (cloudvirt2001-dev.codfw.wmnet, cloudvirt2002-dev.codfw.wmnet, cloudvirt2003-dev.codfw.wmnet) (start with one test host first):
- puppet agent --enable && puppet agent -tv
- apt-get update
- DEBIAN_FRONTEND=noninteractive apt-get install -y python3-libvirt python3-eventlet python3-os-brick python3-os-vif nova-compute neutron-common nova-compute-kvm neutron-linuxbridge-agent python3-neutron python3-oslo.messaging python3-taskflow python3-tooz python3-keystoneauth1 python3-requests python3-urllib3 -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
- DEBIAN_FRONTEND=noninteractive apt-get dist-upgrade -y --allow-downgrades -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
- puppet agent -tv
- service neutron-linuxbridge-agent restart
- service libvirtd restart
- service nova-compute restart
- update IRC topic
- enable puppet on all cloud* hosts
$ sudo cumin 'cloud*dev*' "enable-puppet 'Upgrading to openstack Wallaby - T304694 - ${USER}'"
update https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/openstack/files/victoria/cinder/hacks/backup/chunkeddriver.py.patch to match the current /usr/lib/python3/dist-packages/cinder/backup/chunkeddriver.py file:
- upstream file, matched to the current branch: https://github.com/openstack/cinder/blob/master/cinder/backup/chunkeddriver.py
- patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/777873
cloudbackup1001-dev.eqiad.wmnet:
- puppet agent --enable && puppet agent -tv
- apt-get update
- DEBIAN_FRONTEND=noninteractive apt-get install cinder-backup -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
- puppet agent -tv
- (test from cloudcontrol2004-dev.wikimedia.org) sudo wmcs-cinder-backup-manager
Things to check
- Check 'openstack region list'. There should be exactly one region, codfw1dev-r. If there is a second region named 'RegionOne' (this happened in codfw1dev), delete it; otherwise scripts that enumerate regions will be confused.
- Delete any VMs in the admin-monitoring project that leaked during the upgrade.
- Create a new VM and confirm that DNS and ssh work properly
- Logs will be extremely noisy about policy deprecations and value checks; this is expected because OpenStack is poised between two different policy systems; our existing policies are still (noisily) supported in U.
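The region check in the list above can be scripted. This sketch uses a hypothetical bad region list; on a real cloudcontrol, populate it with `openstack region list -f value -c Region` instead:

```shell
# Stand-in for: regions=$(openstack region list -f value -c Region)
# This sample deliberately shows the bad state seen in codfw1dev.
regions='codfw1dev-r
RegionOne'

# Only codfw1dev-r should exist; flag a stray RegionOne for deletion.
if printf '%s\n' "$regions" | grep -qx 'RegionOne'; then
  echo 'stray RegionOne found: run "openstack region delete RegionOne"'
fi
```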