
Upgrade cloud-vps OpenStack to OpenStack version 'Victoria'
Closed, Resolved | Public

Description

The Designate hosts (cloudservices1003/1004) are already running Victoria. The current Horizon deploy is backwards-compatible with V. So that leaves the cloudcontrol, cloudnet, and cloudvirt nodes to upgrade.

  • update IRC topic
  • downtime everything in icinga through 14:00CDT

    aborrero@cumin1001:~ $ sudo cookbook sre.hosts.downtime -r "upgrading openstack" --min 120 lab*

    aborrero@cumin1001:~ $ sudo cookbook sre.hosts.downtime -r "upgrading openstack" --min 120 cloud*
  • dump databases on cloudcontrol1003: nova_eqiad1, nova_api_eqiad1, nova_cell0_eqiad1, neutron, glance, placement, keystone, cinder:
    1. mysqldump -u root nova_eqiad1 > /root/victoriadbbackups/nova_eqiad1.sql
    2. mysqldump -u root nova_api_eqiad1 > /root/victoriadbbackups/nova_api_eqiad1.sql
    3. mysqldump -u root nova_cell0_eqiad1 > /root/victoriadbbackups/nova_cell0_eqiad1.sql
    4. mysqldump -u root neutron > /root/victoriadbbackups/neutron.sql
    5. mysqldump -u root glance > /root/victoriadbbackups/glance.sql
    6. mysqldump -u root placement > /root/victoriadbbackups/placement.sql
    7. mysqldump -u root keystone > /root/victoriadbbackups/keystone.sql
    8. mysqldump -u root cinder > /root/victoriadbbackups/cinder.sql
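
The dumps above could also be driven by a loop, so that each output file is named after its database and no two dumps can share a path. This is only a sketch: BACKUP_DIR, DUMP_CMD and dump_all are illustrative names, not part of the existing procedure, and the database list is copied from the steps above.

```shell
#!/bin/bash
# Sketch: dump each database to its own file, named after the database.
# BACKUP_DIR and DUMP_CMD are illustrative variables (assumptions).
BACKUP_DIR="${BACKUP_DIR:-/root/victoriadbbackups}"
DUMP_CMD="${DUMP_CMD:-mysqldump -u root}"
DATABASES="nova_eqiad1 nova_api_eqiad1 nova_cell0_eqiad1 neutron glance placement keystone cinder"

dump_all() {
    local db
    mkdir -p "$BACKUP_DIR" || return 1
    for db in $DATABASES; do
        # DUMP_CMD is left unquoted on purpose so it splits into words.
        if ! $DUMP_CMD "$db" > "$BACKUP_DIR/$db.sql"; then
            echo "dump of $db failed" >&2
            return 1
        fi
    done
}

# Usage on cloudcontrol1003 (as root):
#   dump_all
```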

Cloudcontrols:

All open database connections post-upgrade: https://phabricator.wikimedia.org/P10999
Checking haproxy status: echo "show stat" | socat /var/run/haproxy/haproxy.sock stdio | grep DOWN
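
A plain grep for DOWN matches anywhere in the line. A slightly stricter variant, sketched below, looks only at the status column of the CSV that "show stat" prints, locating that column from the header line. down_servers is a hypothetical helper name, and the sketch assumes the usual "# pxname,svname,..." header is present.

```shell
#!/bin/bash
# Sketch: list proxy/server pairs whose status column starts with DOWN.
# Reads haproxy "show stat" CSV on stdin; the status column index is
# taken from the "# pxname,svname,..." header rather than hard-coded.
down_servers() {
    awk -F, '
        /^# / {
            sub(/^# /, "")    # strip the header marker; awk re-splits the line
            for (i = 1; i <= NF; i++) if ($i == "status") col = i
            next
        }
        col && $col ~ /^DOWN/ { print $1 "/" $2 }
    '
}

# Usage on a cloudcontrol (socket path taken from the command above):
#   echo "show stat" | socat /var/run/haproxy/haproxy.sock stdio | down_servers
```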

Cloudcontrol1003:

  • puppet agent --enable && puppet agent -tv
  • apt-get update
  • systemctl unmask keystone
  • DEBIAN_FRONTEND=noninteractive apt-get install glance glance-api glance-common keystone nova-api nova-conductor nova-scheduler nova-common neutron-server python3-requests python3-urllib3 placement-api cinder-volume cinder-scheduler cinder-api python3-oslo.messaging python3-tooz -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • DEBIAN_FRONTEND=noninteractive apt-get upgrade -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • systemctl mask keystone
  • puppet agent -tv
  • nova-manage api_db sync
  • nova-manage db sync
  • placement-manage db sync
  • glance-manage db_sync
  • keystone-manage db_sync
  • cinder-manage db online_data_migrations
  • cinder-manage db sync
  • puppet agent -tv
  • nova-manage db online_data_migrations
  • systemctl list-units --failed (should show no failed units, or only keystone; if keystone has failed, clear it with systemctl reset-failed)
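
The order of the sync steps matters, and a failed migration should stop the run rather than let later steps proceed. A minimal sketch of that idea follows; run_steps is a hypothetical helper, not an existing tool.

```shell
#!/bin/bash
# Sketch: run commands in order and stop at the first one that fails.
# run_steps is a hypothetical helper name.
run_steps() {
    local step
    for step in "$@"; do
        echo "==> $step"
        if ! bash -c "$step"; then
            echo "FAILED: $step" >&2
            return 1
        fi
    done
}

# Example (order copied from the Cloudcontrol1003 list above):
#   run_steps "nova-manage api_db sync" "nova-manage db sync" \
#             "placement-manage db sync" "glance-manage db_sync" \
#             "keystone-manage db_sync"
```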

Cloudcontrol1004:

  • puppet agent --enable && puppet agent -tv
  • apt-get update
  • systemctl unmask keystone
  • DEBIAN_FRONTEND=noninteractive apt-get install glance glance-api glance-common keystone nova-api nova-conductor nova-scheduler nova-common neutron-server python3-requests python3-urllib3 placement-api cinder-volume cinder-scheduler cinder-api python3-oslo.messaging python3-tooz -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • DEBIAN_FRONTEND=noninteractive apt-get upgrade -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • systemctl mask keystone
  • puppet agent -tv
  • puppet agent -tv
  • systemctl list-units --failed (should show no failed units, or only keystone; if keystone has failed, clear it with systemctl reset-failed)

Cloudcontrol1005:

  • puppet agent --enable && puppet agent -tv
  • apt-get update
  • systemctl unmask keystone
  • DEBIAN_FRONTEND=noninteractive apt-get install glance glance-api glance-common keystone nova-api nova-conductor nova-scheduler nova-common neutron-server python3-requests python3-urllib3 placement-api cinder-volume cinder-scheduler cinder-api python3-oslo.messaging python3-tooz -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • DEBIAN_FRONTEND=noninteractive apt-get upgrade -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • systemctl mask keystone
  • puppet agent -tv
  • puppet agent -tv
  • systemctl list-units --failed (should show no failed units, or only keystone; if keystone has failed, clear it with systemctl reset-failed)

cloudnets (one at a time please):

Begin with the standby node, as determined with:

$ neutron l3-agent-list-hosting-router cloudinstances2b-gw

Standby node:

  • puppet agent --enable && puppet agent -tv
  • apt-get update
  • DEBIAN_FRONTEND=noninteractive apt-get install -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold" neutron-l3-agent python3-oslo.messaging python3-neutronclient python3-glanceclient
  • DEBIAN_FRONTEND=noninteractive apt-get upgrade -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • puppet agent -tv
  • on cloudcontrol1003: neutron-db-manage upgrade heads

Active node:

  • puppet agent --enable && puppet agent -tv
  • apt-get update
  • DEBIAN_FRONTEND=noninteractive apt-get install -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold" neutron-l3-agent python3-oslo.messaging python3-neutronclient python3-glanceclient
  • DEBIAN_FRONTEND=noninteractive apt-get upgrade -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • puppet agent -tv

Break Time

Cloudvirts (start with one test host first, cloudvirt1039; don't forget the cloudvirt-wdqs hosts):

  • puppet agent --enable && puppet agent -tv
  • apt-get update
  • DEBIAN_FRONTEND=noninteractive apt-get install -y python3-libvirt python3-os-vif nova-compute neutron-common nova-compute-kvm neutron-linuxbridge-agent python3-neutron python3-eventlet python3-oslo.messaging python3-taskflow python3-tooz python3-keystoneauth1 python3-positional python3-requests python3-urllib3 -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • DEBIAN_FRONTEND=noninteractive apt-get dist-upgrade -y --allow-downgrades -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • puppet agent -tv
  • service neutron-linuxbridge-agent restart
  • service libvirtd restart
  • service nova-compute restart
  • update IRC topic
  • enable puppet on all cloud* hosts

    $ sudo cumin 'cloud*' "enable-puppet 'Upgrading to openstack Victoria - T261137 - ${USER}'"

Things to check

  • Check 'openstack region list'. There should be exactly one region, eqiad1-r. If there is a second region named 'RegionOne' (this happened in codfw1dev), delete it; otherwise scripts that enumerate regions will be confused.
  • Clean up VMs in the admin-monitoring project that leaked during upgrade; delete them.
  • Create a new VM and confirm that DNS and ssh work properly
  • Logs will be extremely noisy about policy deprecations and value checks. This is expected: OpenStack is poised between two different policy systems, and our existing policies are still (noisily) supported in V.

Event Timeline

Change 677340 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] OpenStack: Add package and config for Designate/Victoria

https://gerrit.wikimedia.org/r/677340

Change 677340 merged by Andrew Bogott:

[operations/puppet@production] OpenStack: Add package and config for Designate/Victoria

https://gerrit.wikimedia.org/r/677340

Change 677350 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Designate/Victoria: remove a hacked file

https://gerrit.wikimedia.org/r/677350

Change 677351 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Add config and manifests for Openstack version Victoria

https://gerrit.wikimedia.org/r/677351

I've reviewed and applied release note changes for:

  • designate (none)
  • nova (one trivial)
  • glance (none)
  • keystone (??? no notes)
  • cinder (none)

Change 677360 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Removed an unneeded setting from nova.conf

https://gerrit.wikimedia.org/r/677360

Change 677350 merged by Andrew Bogott:

[operations/puppet@production] Designate/Victoria: remove a hacked file

https://gerrit.wikimedia.org/r/677350

Change 677647 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Neutron: forward our dmz hacks to Victoria

https://gerrit.wikimedia.org/r/677647

Change 677658 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] codfw1dev designate -> OpenStack Victoria

https://gerrit.wikimedia.org/r/677658

Change 677658 merged by Andrew Bogott:

[operations/puppet@production] codfw1dev designate -> OpenStack Victoria

https://gerrit.wikimedia.org/r/677658

Change 677351 merged by Andrew Bogott:

[operations/puppet@production] Add config and manifests for Openstack version Victoria

https://gerrit.wikimedia.org/r/677351

Change 677360 merged by Andrew Bogott:

[operations/puppet@production] Removed an unneeded setting from nova.conf

https://gerrit.wikimedia.org/r/677360

Change 678833 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] eqiad1 designate -> Victoria

https://gerrit.wikimedia.org/r/678833

Change 678833 merged by Andrew Bogott:

[operations/puppet@production] eqiad1 designate -> Victoria

https://gerrit.wikimedia.org/r/678833

Mentioned in SAL (#wikimedia-cloud) [2021-04-13T13:11:06Z] <andrewbogott> upgrading eqiad1 designate to version Victoria, T261137

Change 678840 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloud-vps codfw1dev -> OpenStack Victoria

https://gerrit.wikimedia.org/r/678840

Change 677647 merged by Andrew Bogott:

[operations/puppet@production] Neutron: forward our dmz hacks to Victoria

https://gerrit.wikimedia.org/r/677647

Change 678840 merged by Andrew Bogott:

[operations/puppet@production] cloud-vps codfw1dev -> OpenStack Victoria

https://gerrit.wikimedia.org/r/678840

Mentioned in SAL (#wikimedia-cloud) [2021-04-13T14:36:46Z] <andrewbogott> upgrading codfw1dev to version Victoria, T261137

Change 682948 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Horizon: put into maintenance mode for Victoria upgrade

https://gerrit.wikimedia.org/r/682948

Change 682949 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloud-vps eqiad1 -> version Victoria

https://gerrit.wikimedia.org/r/682949

Change 682950 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Revert "Horizon: put into maintenance mode for Victoria upgrade"

https://gerrit.wikimedia.org/r/682950

Change 682948 merged by Andrew Bogott:

[operations/puppet@production] Horizon: put into maintenance mode for Victoria upgrade

https://gerrit.wikimedia.org/r/682948

Change 682949 merged by Andrew Bogott:

[operations/puppet@production] cloud-vps eqiad1 -> version Victoria

https://gerrit.wikimedia.org/r/682949

Change 682950 merged by David Caro:

[operations/puppet@production] Revert "Horizon: put into maintenance mode for Victoria upgrade"

https://gerrit.wikimedia.org/r/682950

Change 683002 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloud-vps: set cloudvirt nodes to OpenStack U

https://gerrit.wikimedia.org/r/683002

Change 683002 merged by Andrew Bogott:

[operations/puppet@production] cloud-vps: set cloudvirt nodes to OpenStack U

https://gerrit.wikimedia.org/r/683002

Change 683274 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] wmcs.openstack: unpin cloudvirt1039

https://gerrit.wikimedia.org/r/683274

Change 683274 merged by David Caro:

[operations/puppet@production] wmcs.openstack: unpin cloudvirt1039

https://gerrit.wikimedia.org/r/683274

Change 683278 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] wmcs.openstack: unpin cloudvirts to continue upgrade to victoria

https://gerrit.wikimedia.org/r/683278

Change 683278 merged by David Caro:

[operations/puppet@production] wmcs.openstack: unpin cloudvirts to continue upgrade to victoria

https://gerrit.wikimedia.org/r/683278

  • update IRC topic
  • downtime everything in icinga through 14:00CDT

    aborrero@cumin1001:~ $ sudo cookbook sre.hosts.downtime -r "upgrading openstack" --min 120 lab*

    aborrero@cumin1001:~ $ sudo cookbook sre.hosts.downtime -r "upgrading openstack" --min 120 cloud*
  • dump databases on cloudcontrol1003: nova_eqiad1, nova_api_eqiad1, nova_cell0_eqiad1, neutron, glance, placement, keystone, cinder:
    1. mysqldump -u root nova_eqiad1 > /root/victoriadbbackups/nova_eqiad1.sql
    2. mysqldump -u root nova_api_eqiad1 > /root/victoriadbbackups/nova_api_eqiad1.sql
    3. mysqldump -u root nova_cell0_eqiad1 > /root/victoriadbbackups/nova_cell0_eqiad1.sql
    4. mysqldump -u root neutron > /root/victoriadbbackups/neutron.sql
    5. mysqldump -u root glance > /root/victoriadbbackups/glance.sql
    6. mysqldump -u root placement > /root/victoriadbbackups/placement.sql
    7. mysqldump -u root keystone > /root/victoriadbbackups/keystone.sql
    8. mysqldump -u root cinder > /root/victoriadbbackups/cinder.sql

Cloudcontrols:

All open database connections post-upgrade: https://phabricator.wikimedia.org/P10999
Checking haproxy status: echo "show stat" | socat /var/run/haproxy/haproxy.sock stdio | grep DOWN

Cloudcontrol1003:

  • puppet agent --enable && puppet agent -tv
  • apt update
  • systemctl unmask keystone
  • DEBIAN_FRONTEND=noninteractive apt-get install glance glance-api glance-common keystone nova-api nova-conductor nova-scheduler nova-common neutron-server python3-requests python3-urllib3 placement-api cinder-volume cinder-scheduler cinder-api python3-oslo.messaging python3-tooz -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • DEBIAN_FRONTEND=noninteractive apt-get upgrade -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • systemctl mask keystone
  • puppet agent -tv
  • nova-manage api_db sync
  • nova-manage db sync
  • placement-manage db sync
  • glance-manage db_sync

2021-04-27 14:31:31.766 45009 WARNING oslo_config.cfg [-] Deprecated: Option "sql_idle_timeout" from group "DEFAULT" is deprecated. Use option "connection_recycle_time" from group "database".
2021-04-27 14:31:31.768 45009 WARNING oslo_db.sqlalchemy.engines [-] URL mysql://glance:***@openstack.eqiad1.wikimediacloud.org/glance does not contain a '+drivername' portion, and will make use of a default driver. A full dbname+drivername:// protocol is recommended. For MySQL, it is strongly recommended that mysql+pymysql:// be specified for maximum service compatibility

  • keystone-manage db_sync
  • cinder-manage db online_data_migrations
  • cinder-manage db sync
  • puppet agent -tv
  • puppet agent -tv # again to check config convergence, no changes should happen
  • nova-manage db online_data_migrations
  • systemctl list-units --failed (should show no failed units, or only keystone; if keystone has failed, clear it with systemctl reset-failed)
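
The final check tolerates a failed keystone unit but nothing else. One way to make that explicit is to filter keystone out and print whatever remains, as in the sketch below. unexpected_failed is a hypothetical helper, and the unit name keystone.service is an assumption based on the mask/unmask commands above.

```shell
#!/bin/bash
# Sketch: print failed units other than keystone.service.
# Reads "systemctl list-units --failed --plain --no-legend" output on stdin;
# the unit name keystone.service is an assumption.
unexpected_failed() {
    awk '$1 != "" && $1 != "keystone.service" { print $1 }'
}

# Usage:
#   systemctl list-units --failed --plain --no-legend | unexpected_failed
```

Empty output means the host is in the expected state.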

Cloudcontrol1004:

  • puppet agent --enable && puppet agent -tv
  • apt update
  • systemctl unmask keystone
  • DEBIAN_FRONTEND=noninteractive apt-get install glance glance-api glance-common keystone nova-api nova-conductor nova-scheduler nova-common neutron-server python3-requests python3-urllib3 placement-api cinder-volume cinder-scheduler cinder-api python3-oslo.messaging python3-tooz -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • DEBIAN_FRONTEND=noninteractive apt-get upgrade -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • systemctl mask keystone
  • puppet agent -tv
  • systemctl list-units --failed (should show no failed units, or only keystone; if keystone has failed, clear it with systemctl reset-failed)

Cloudcontrol1005:

  • puppet agent --enable && puppet agent -tv
  • apt update
  • systemctl unmask keystone
  • DEBIAN_FRONTEND=noninteractive apt-get install glance glance-api glance-common keystone nova-api nova-conductor nova-scheduler nova-common neutron-server python3-requests python3-urllib3 placement-api cinder-volume cinder-scheduler cinder-api python3-oslo.messaging python3-tooz -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • DEBIAN_FRONTEND=noninteractive apt-get upgrade -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • systemctl mask keystone
  • puppet agent -tv
  • puppet agent -tv # again to check config convergence, no changes should happen
  • systemctl list-units --failed (should show no failed units, or only keystone; if keystone has failed, clear it with systemctl reset-failed)

cloudnets (one at a time please):

Begin with the standby node, as determined with:

cloudcontrol1003$ source novaenv.sh && neutron l3-agent-list-hosting-router cloudinstances2b-gw
Example output:

    +--------------------------------------+--------------+----------------+-------+----------+
    | id                                   | host         | admin_state_up | alive | ha_state |
    +--------------------------------------+--------------+----------------+-------+----------+
    | 4be214c8-76ef-40f8-9d5d-4c344d213311 | cloudnet1003 | True           | :-)   | active   |
    | 970df1d1-505d-47a4-8d35-1b13c0dfe098 | cloudnet1004 | True           | :-)   | standby  |
    +--------------------------------------+--------------+----------------+-------+----------+
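
To avoid misreading the table, the standby host could be pulled out mechanically. The sketch below is one way to do that; standby_host is a hypothetical helper, and it assumes the column order id, host, admin_state_up, alive, ha_state.

```shell
#!/bin/bash
# Sketch: print the host whose ha_state is "standby".
# Reads the table printed by "neutron l3-agent-list-hosting-router" on stdin
# and assumes the column order id, host, admin_state_up, alive, ha_state.
standby_host() {
    awk -F'|' '
        NF >= 6 {
            host = $3; state = $6
            gsub(/ /, "", host); gsub(/ /, "", state)
            if (state == "standby") print host
        }
    '
}

# Usage on cloudcontrol1003:
#   source novaenv.sh && neutron l3-agent-list-hosting-router cloudinstances2b-gw | standby_host
```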

Standby node:

  • puppet agent --enable && puppet agent -tv
  • apt update
  • DEBIAN_FRONTEND=noninteractive apt-get install -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold" neutron-l3-agent python3-oslo.messaging python3-neutronclient python3-glanceclient
  • DEBIAN_FRONTEND=noninteractive apt-get upgrade -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • puppet agent -tv
  • on cloudcontrol1003: neutron-db-manage upgrade heads

Active node:

  • puppet agent --enable && puppet agent -tv
  • apt update
  • DEBIAN_FRONTEND=noninteractive apt-get install -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold" neutron-l3-agent python3-oslo.messaging python3-neutronclient python3-glanceclient
  • DEBIAN_FRONTEND=noninteractive apt-get upgrade -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • puppet agent -tv

Break Time

Cloudvirts (start with one test host first, cloudvirt1039; don't forget the cloudvirt-wdqs hosts):

  • puppet agent --enable && run-puppet-agent # expected that it might fail
  • apt update
  • DEBIAN_FRONTEND=noninteractive apt-get install -y python3-libvirt python3-os-vif nova-compute neutron-common nova-compute-kvm neutron-linuxbridge-agent python3-neutron python3-eventlet python3-oslo.messaging python3-taskflow python3-tooz python3-keystoneauth1 python3-positional python3-requests python3-urllib3 -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • DEBIAN_FRONTEND=noninteractive apt-get dist-upgrade -y --allow-downgrades -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • run-puppet-agent # expected that it might fail
  • systemctl restart neutron-linuxbridge-agent
  • systemctl stop libvirtd && systemctl start libvirtd-tls.socket && systemctl start libvirtd # for some reason libvirtd-tls.socket will only start while libvirtd is stopped
  • run-puppet-agent
  • systemctl restart nova-compute
  • cloudvirt-wdqs1001
  • cloudvirt-wdqs1002
  • cloudvirt-wdqs1003
  • cloudvirt1012
  • cloudvirt1013
  • cloudvirt1014
  • cloudvirt1016
  • cloudvirt1017
  • cloudvirt1018
  • cloudvirt1019 (toolsdb)
  • cloudvirt1020 (toolsdb)
  • cloudvirt1021
  • cloudvirt1022
  • cloudvirt1023
  • cloudvirt1024
  • cloudvirt1025
  • cloudvirt1026
  • cloudvirt1027
  • cloudvirt1028
  • cloudvirt1029
  • cloudvirt1030
  • cloudvirt1031
  • cloudvirt1032
  • cloudvirt1033
  • cloudvirt1034
  • cloudvirt1035
  • cloudvirt1036
  • cloudvirt1037
  • cloudvirt1038
  • cloudvirt1039
  • update IRC topic
  • enable puppet on all cloud* hosts

    $ sudo cumin 'cloud*' "enable-puppet 'Upgrading to openstack Victoria - T261137 - ${USER}'"
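
The cloudvirt pass (test host first, then the rest one at a time) can be sketched as a small loop that stops as soon as a host fails. rolling_upgrade and UPGRADE_CMD are hypothetical names; UPGRADE_CMD stands in for whatever runs the per-host steps above and is not an existing tool.

```shell
#!/bin/bash
# Sketch: upgrade a canary host first, then the remaining hosts one at a
# time, stopping at the first failure. UPGRADE_CMD is a hypothetical command
# or function that upgrades the single host passed as its argument.
rolling_upgrade() {
    local canary="$1" host
    shift
    "$UPGRADE_CMD" "$canary" || { echo "canary $canary failed" >&2; return 1; }
    for host in "$@"; do
        [ "$host" = "$canary" ] && continue    # canary is already done
        "$UPGRADE_CMD" "$host" || { echo "$host failed, stopping" >&2; return 1; }
    done
}

# Usage (host names from the checklist above):
#   UPGRADE_CMD=my_upgrade_function
#   rolling_upgrade cloudvirt1039 cloudvirt-wdqs1001 cloudvirt1012 cloudvirt1013
```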

Things to check

  • Check 'openstack region list'. There should be exactly one region, eqiad1-r. If there is a second region named 'RegionOne' (this happened in codfw1dev), delete it; otherwise scripts that enumerate regions will be confused.
  • Clean up VMs in the admin-monitoring project that leaked during upgrade; delete them.
  • Create a new VM and confirm that DNS and ssh work properly
  • Logs will be extremely noisy about policy deprecations and value checks. This is expected: OpenStack is poised between two different policy systems, and our existing policies are still (noisily) supported in V.
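
The region check in the list above can be scripted along these lines. This is a sketch only: check_regions is a hypothetical helper, and it assumes the region names are fed in one per line, for example from "openstack region list -f value -c Region".

```shell
#!/bin/bash
# Sketch: succeed only if stdin holds exactly one region name, eqiad1-r.
# Intended input (an assumption): openstack region list -f value -c Region
check_regions() {
    local regions
    # Drop blank lines; "|| true" keeps an empty list from aborting under set -e.
    regions="$(grep -v '^[[:space:]]*$' || true)"
    if [ "$regions" = "eqiad1-r" ]; then
        echo "ok: only eqiad1-r present"
    else
        echo "unexpected region list:" >&2
        echo "$regions" >&2
        return 1
    fi
}

# Usage:
#   openstack region list -f value -c Region | check_regions
```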