
upgrade cloud-vps openstack to Openstack version 'Victoria'
Closed, Resolved (Public)

Description

The Designate hosts (cloudservices1003/1004) are already running Victoria. The current Horizon deploy is backwards-compatible with V. So that leaves the cloudcontrol, cloudnet, and cloudvirt nodes to upgrade.

  • update IRC topic
  • downtime everything in icinga through 14:00CDT

    aborrero@cumin1001:~ $ sudo cookbook sre.hosts.downtime -r "upgrading openstack" --min 120 lab*

    aborrero@cumin1001:~ $ sudo cookbook sre.hosts.downtime -r "upgrading openstack" --min 120 cloud*
  • dump databases on cloudcontrol1003: nova_eqiad1, nova_api_eqiad1, nova_cell0_eqiad1, neutron, glance, placement, keystone, cinder:
    1. mysqldump -u root nova_eqiad1 > /root/victoriadbbackups/nova_eqiad1.sql
    2. mysqldump -u root nova_api_eqiad1 > /root/victoriadbbackups/nova_api_eqiad1.sql
    3. mysqldump -u root nova_cell0_eqiad1 > /root/victoriadbbackups/nova_cell0_eqiad1.sql
    4. mysqldump -u root neutron > /root/victoriadbbackups/neutron.sql
    5. mysqldump -u root glance > /root/victoriadbbackups/glance.sql
    6. mysqldump -u root placement > /root/victoriadbbackups/placement.sql
    7. mysqldump -u root keystone > /root/victoriadbbackups/keystone.sql
    8. mysqldump -u root cinder > /root/victoriadbbackups/cinder.sql
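
The dumps above all follow one pattern, so they could be generated from a list of database names, which keeps each dump in a file named after its own database. A minimal sketch, assuming bash on cloudcontrol1003; `dump_databases`, `BACKUP_DIR`, `DATABASES` and `DRY_RUN` are hypothetical names, not existing tooling, and cinder is included because the step heading names it:

```shell
# Hypothetical helper: one mysqldump per database, output file named after the database.
BACKUP_DIR="${BACKUP_DIR:-/root/victoriadbbackups}"
DATABASES="nova_eqiad1 nova_api_eqiad1 nova_cell0_eqiad1 neutron glance placement keystone cinder"

dump_databases() {
    local db
    for db in $DATABASES; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Default: only print what would be run.
            echo "mysqldump -u root $db > $BACKUP_DIR/$db.sql"
        else
            # Real run: stop at the first dump that fails.
            mysqldump -u root "$db" > "$BACKUP_DIR/$db.sql" || return 1
        fi
    done
}
```

Calling `dump_databases` prints the commands for review; `DRY_RUN=0 dump_databases` would execute them.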

Cloudcontrols:

All open database connections post-upgrade: https://phabricator.wikimedia.org/P10999
Checking haproxy status: echo "show stat" | socat /var/run/haproxy/haproxy.sock stdio | grep DOWN
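
A plain `grep DOWN` prints whole CSV rows. As a sketch of a slightly more readable check, the filter below looks up the `status` column from the CSV header row that `show stat` prints and lists only the proxy/server pairs that are down. `haproxy_down` is a hypothetical name:

```shell
# Read haproxy "show stat" CSV on stdin; print "proxy/server status"
# for every row whose status column starts with DOWN.
haproxy_down() {
    awk -F, '
        /^# / {
            # Header row: find which column holds the status.
            for (i = 1; i <= NF; i++) if ($i == "status") col = i
            next
        }
        col && $col ~ /^DOWN/ { print $1 "/" $2, $col }
    '
}
```

It would be used in place of the grep: `echo "show stat" | socat /var/run/haproxy/haproxy.sock stdio | haproxy_down`.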

Cloudcontrol1003:

  • puppet agent --enable && puppet agent -tv
  • apt-get update
  • systemctl unmask keystone
  • DEBIAN_FRONTEND=noninteractive apt-get install glance glance-api glance-common keystone nova-api nova-conductor nova-scheduler nova-common glance neutron-server python3-requests python3-urllib3 placement-api cinder-volume cinder-scheduler cinder-api python3-oslo.messaging python3-tooz -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • DEBIAN_FRONTEND=noninteractive apt-get upgrade -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • systemctl mask keystone
  • puppet agent -tv
  • nova-manage api_db sync
  • nova-manage db sync
  • placement-manage db sync
  • glance-manage db_sync
  • keystone-manage db_sync
  • cinder-manage db online_data_migrations
  • cinder-manage db sync
  • puppet agent -tv
  • nova-manage db online_data_migrations
  • systemctl list-units --failed (should show nothing failed, or just keystone. If keystone is failed just reset with systemctl reset-failed)
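
The sync commands in this list are meant to run in the order shown, so a small wrapper that stops at the first failing command may help avoid continuing past a broken migration. A minimal sketch; `run_steps` is a hypothetical helper, not part of the procedure as written:

```shell
# Run each argument as one shell command, in order; stop at the first failure.
run_steps() {
    local step
    for step in "$@"; do
        echo "+ $step"
        sh -c "$step" || { echo "FAILED: $step" >&2; return 1; }
    done
}
```

For example: `run_steps "nova-manage api_db sync" "nova-manage db sync" "placement-manage db sync" "glance-manage db_sync" "keystone-manage db_sync"`.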

Cloudcontrol1004:

  • puppet agent --enable && puppet agent -tv
  • apt-get update
  • systemctl unmask keystone
  • DEBIAN_FRONTEND=noninteractive apt-get install glance glance-api glance-common keystone nova-api nova-conductor nova-scheduler nova-common glance neutron-server python3-requests python3-urllib3 placement-api cinder-volume cinder-scheduler cinder-api placement-api python3-oslo.messaging python3-tooz -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • DEBIAN_FRONTEND=noninteractive apt-get upgrade -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • systemctl mask keystone
  • puppet agent -tv
  • puppet agent -tv
  • systemctl list-units --failed (should show nothing failed, or just keystone. If keystone is failed just reset with systemctl reset-failed)

Cloudcontrol1005:

  • puppet agent --enable && puppet agent -tv
  • apt-get update
  • systemctl unmask keystone
  • DEBIAN_FRONTEND=noninteractive apt-get install glance glance-api glance-common keystone nova-api nova-conductor nova-scheduler nova-common glance neutron-server python3-requests python3-urllib3 placement-api cinder-volume cinder-scheduler cinder-api placement-api python3-oslo.messaging python3-tooz -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • DEBIAN_FRONTEND=noninteractive apt-get upgrade -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • systemctl mask keystone
  • puppet agent -tv
  • puppet agent -tv
  • systemctl list-units --failed (should show nothing failed, or just keystone. If keystone is failed just reset with systemctl reset-failed)
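
The "nothing failed, or just keystone" rule can be checked mechanically. The sketch below assumes the unit is named `keystone.service` (an assumption, adjust if the unit name differs) and reads the output of `systemctl list-units --failed --plain --no-legend`; `unexpected_failures` is a hypothetical name:

```shell
# Read `systemctl list-units --failed --plain --no-legend` output on stdin.
# Print failed units other than keystone; exit non-zero if any are found.
unexpected_failures() {
    awk 'NF && $1 != "keystone.service" { print $1; bad = 1 } END { exit bad }'
}
```

Usage: `systemctl list-units --failed --plain --no-legend | unexpected_failures`. If only keystone is listed, `systemctl reset-failed` clears it as described above.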

cloudnets (one at a time please):

Begin with the standby node, as determined with:

$ neutron l3-agent-list-hosting-router cloudinstances2b-gw

Standby node:

  • puppet agent --enable && puppet agent -tv
  • apt-get update
  • DEBIAN_FRONTEND=noninteractive apt-get install -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold" neutron-l3-agent python3-oslo.messaging python3-neutronclient python3-glanceclient
  • DEBIAN_FRONTEND=noninteractive apt-get upgrade -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • puppet agent -tv
  • on cloudcontrol1003: neutron-db-manage upgrade heads

Active node:

  • puppet agent --enable && puppet agent -tv
  • apt-get update
  • DEBIAN_FRONTEND=noninteractive apt-get install -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold" neutron-l3-agent python3-oslo.messaging python3-neutronclient python3-glanceclient
  • DEBIAN_FRONTEND=noninteractive apt-get upgrade -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • puppet agent -tv

Break Time

Cloudvirts (start with one test host, cloudvirt1039; don't forget the cloudvirt-wdqs hosts):

  • puppet agent --enable && puppet agent -tv
  • apt-get update
  • DEBIAN_FRONTEND=noninteractive apt-get install -y python3-libvirt python3-os-vif nova-compute neutron-common nova-compute-kvm neutron-linuxbridge-agent python3-neutron python3-eventlet python3-oslo.messaging python3-taskflow python3-tooz python3-keystoneauth1 python3-positional python3-requests python3-urllib3 -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • DEBIAN_FRONTEND=noninteractive apt-get dist-upgrade -y --allow-downgrades -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • puppet agent -tv
  • service neutron-linuxbridge-agent restart
  • service libvirtd restart
  • service nova-compute restart
  • update IRC topic
  • enable puppet on all cloud* hosts

    $ sudo cumin 'cloud*' "enable-puppet 'Upgrading to openstack Victoria - T261137 - ${USER}'"

Things to check

  • Check 'openstack region list'. There should be exactly one region, eqiad1-r. If there is a second region named 'RegionOne' (this happened in codfw1dev), delete it; otherwise scripts that enumerate regions will be confused.
  • Clean up VMs in the admin-monitoring project that leaked during upgrade; delete them.
  • Create a new VM and confirm that DNS and ssh work properly
  • Logs will be extremely noisy about policy deprecations and value checks; this is expected because OpenStack is poised between two different policy systems; our existing policies are still (noisily) supported in Victoria.
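
The region check in the first item can be scripted. A minimal sketch, assuming region names arrive one per line on stdin (for instance from `openstack region list -f value -c Region`); `extra_regions` is a hypothetical name:

```shell
# Print every region name on stdin that is not the expected eqiad1-r.
# No output means the region list is clean.
extra_regions() {
    awk 'NF && $1 != "eqiad1-r" { print $1 }'
}
```

Any name it prints (such as RegionOne) is a candidate for deletion as described in the checklist.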

Related Objects

Status    Assigned
Resolved  Andrew
Resolved  taavi
Resolved  Andrew
Resolved  Andrew
Resolved  dcaro
Resolved  Andrew
Resolved  Andrew
Resolved  Andrew
Resolved  Andrew
Resolved  Andrew
Resolved  Andrew
Resolved  Cmjohnson
Resolved  ayounsi
Resolved  aborrero
Resolved  Cmjohnson
Resolved  Jclark-ctr
Resolved  Jclark-ctr
Resolved  Andrew
Resolved  Andrew
Resolved  Andrew
Resolved  Andrew
Resolved  Andrew
Resolved  dcaro
Resolved  aborrero
Declined  dcaro
Resolved  dcaro
Open      None
Open      None
Resolved  Andrew
Resolved  Andrew
Resolved  Andrew
Resolved  Andrew
Resolved  Bstorm
Resolved  aborrero
Resolved  aborrero
Resolved  aborrero
Invalid   None
Resolved  aborrero

Event Timeline

Change 677340 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] OpenStack: Add package and config for Designate/Victoria

https://gerrit.wikimedia.org/r/677340

Change 677340 merged by Andrew Bogott:

[operations/puppet@production] OpenStack: Add package and config for Designate/Victoria

https://gerrit.wikimedia.org/r/677340

Change 677350 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Designate/Victoria: remove a hacked file

https://gerrit.wikimedia.org/r/677350

Change 677351 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Add config and manifests for Openstack version Victoria

https://gerrit.wikimedia.org/r/677351

I've reviewed and applied release note changes for:

  • designate (none)
  • nova (one trivial)
  • glance (none)
  • keystone (??? no notes)
  • cinder (none)

Change 677360 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Removed an unneeded setting from nova.conf

https://gerrit.wikimedia.org/r/677360

Change 677350 merged by Andrew Bogott:

[operations/puppet@production] Designate/Victoria: remove a hacked file

https://gerrit.wikimedia.org/r/677350

Change 677647 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Neutron: forward our dmz hacks to Victoria

https://gerrit.wikimedia.org/r/677647

Change 677658 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] codfw1dev designate -> OpenStack Victoria

https://gerrit.wikimedia.org/r/677658

Change 677658 merged by Andrew Bogott:

[operations/puppet@production] codfw1dev designate -> OpenStack Victoria

https://gerrit.wikimedia.org/r/677658

Change 677351 merged by Andrew Bogott:

[operations/puppet@production] Add config and manifests for Openstack version Victoria

https://gerrit.wikimedia.org/r/677351

Change 677360 merged by Andrew Bogott:

[operations/puppet@production] Removed an unneeded setting from nova.conf

https://gerrit.wikimedia.org/r/677360

Change 678833 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] eqiad1 designate -> Victoria

https://gerrit.wikimedia.org/r/678833

Change 678833 merged by Andrew Bogott:

[operations/puppet@production] eqiad1 designate -> Victoria

https://gerrit.wikimedia.org/r/678833

Mentioned in SAL (#wikimedia-cloud) [2021-04-13T13:11:06Z] <andrewbogott> upgrading eqiad1 designate to version Victoria, T261137

Change 678840 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloud-vps codfw1dev -> OpenStack Victoria

https://gerrit.wikimedia.org/r/678840

Change 677647 merged by Andrew Bogott:

[operations/puppet@production] Neutron: forward our dmz hacks to Victoria

https://gerrit.wikimedia.org/r/677647

Change 678840 merged by Andrew Bogott:

[operations/puppet@production] cloud-vps codfw1dev -> OpenStack Victoria

https://gerrit.wikimedia.org/r/678840

Mentioned in SAL (#wikimedia-cloud) [2021-04-13T14:36:46Z] <andrewbogott> upgrading codfw1dev to version Victoria, T261137

Change 682948 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Horizon: put into maintenance mode for Victoria upgrade

https://gerrit.wikimedia.org/r/682948

Change 682949 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloud-vps eqiad1 -> version Victoria

https://gerrit.wikimedia.org/r/682949

Change 682950 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Revert "Horizon: put into maintenance mode for Victoria upgrade"

https://gerrit.wikimedia.org/r/682950

Change 682948 merged by Andrew Bogott:

[operations/puppet@production] Horizon: put into maintenance mode for Victoria upgrade

https://gerrit.wikimedia.org/r/682948

Change 682949 merged by Andrew Bogott:

[operations/puppet@production] cloud-vps eqiad1 -> version Victoria

https://gerrit.wikimedia.org/r/682949

Change 682950 merged by David Caro:

[operations/puppet@production] Revert "Horizon: put into maintenance mode for Victoria upgrade"

https://gerrit.wikimedia.org/r/682950

Change 683002 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloud-vps: set cloudvirt nodes to OpenStack U

https://gerrit.wikimedia.org/r/683002

Change 683002 merged by Andrew Bogott:

[operations/puppet@production] cloud-vps: set cloudvirt nodes to OpenStack U

https://gerrit.wikimedia.org/r/683002

Change 683274 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] wmcs.openstack: unpin cloudvirt1039

https://gerrit.wikimedia.org/r/683274

Change 683274 merged by David Caro:

[operations/puppet@production] wmcs.openstack: unpin cloudvirt1039

https://gerrit.wikimedia.org/r/683274

Change 683278 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] wmcs.openstack: unpin cloudvirts to continue upgrade to victoria

https://gerrit.wikimedia.org/r/683278

Change 683278 merged by David Caro:

[operations/puppet@production] wmcs.openstack: unpin cloudvirts to continue upgrade to victoria

https://gerrit.wikimedia.org/r/683278

  • update IRC topic
  • downtime everything in icinga through 14:00CDT

    aborrero@cumin1001:~ $ sudo cookbook sre.hosts.downtime -r "upgrading openstack" --min 120 lab*

    aborrero@cumin1001:~ $ sudo cookbook sre.hosts.downtime -r "upgrading openstack" --min 120 cloud*
  • dump databases on cloudcontrol1003: nova_eqiad1, nova_api_eqiad1, nova_cell0_eqiad1, neutron, glance, placement, keystone, cinder:
    1. mysqldump -u root nova_eqiad1 > /root/victoriadbbackups/nova_eqiad1.sql
    2. mysqldump -u root nova_api_eqiad1 > /root/victoriadbbackups/nova_api_eqiad1.sql
    3. mysqldump -u root nova_cell0_eqiad1 > /root/victoriadbbackups/nova_cell0_eqiad1.sql
    4. mysqldump -u root neutron > /root/victoriadbbackups/neutron.sql
    5. mysqldump -u root glance > /root/victoriadbbackups/glance.sql
    6. mysqldump -u root placement > /root/victoriadbbackups/placement.sql
    7. mysqldump -u root keystone > /root/victoriadbbackups/keystone.sql
    8. mysqldump -u root cinder > /root/victoriadbbackups/cinder.sql

Cloudcontrols:

All open database connections post-upgrade: https://phabricator.wikimedia.org/P10999
Checking haproxy status: echo "show stat" | socat /var/run/haproxy/haproxy.sock stdio | grep DOWN

Cloudcontrol1003:

  • puppet agent --enable && puppet agent -tv
  • apt update
  • systemctl unmask keystone
  • DEBIAN_FRONTEND=noninteractive apt-get install glance glance-api glance-common keystone nova-api nova-conductor nova-scheduler nova-common glance neutron-server python3-requests python3-urllib3 placement-api cinder-volume cinder-scheduler cinder-api python3-oslo.messaging python3-tooz -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • DEBIAN_FRONTEND=noninteractive apt-get upgrade -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • systemctl mask keystone
  • puppet agent -tv
  • nova-manage api_db sync
  • nova-manage db sync
  • placement-manage db sync
  • glance-manage db_sync

2021-04-27 14:31:31.766 45009 WARNING oslo_config.cfg [-] Deprecated: Option "sql_idle_timeout" from group "DEFAULT" is deprecated. Use option "connection_recycle_time" from group "database".
2021-04-27 14:31:31.768 45009 WARNING oslo_db.sqlalchemy.engines [-] URL mysql://glance:***@openstack.eqiad1.wikimediacloud.org/glance does not contain a '+drivername' portion, and will make use of a default driver. A full dbname+drivername:// protocol is recommended. For MySQL, it is strongly recommended that mysql+pymysql:// be specified for maximum service compatibility

  • keystone-manage db_sync
  • cinder-manage db online_data_migrations
  • cinder-manage db sync
  • puppet agent -tv
  • puppet agent -tv # again to check config convergence, no changes should happen
  • nova-manage db online_data_migrations
  • systemctl list-units --failed (should show nothing failed, or just keystone. If keystone is failed just reset with systemctl reset-failed)

Cloudcontrol1004:

  • puppet agent --enable && puppet agent -tv
  • apt update
  • systemctl unmask keystone
  • DEBIAN_FRONTEND=noninteractive apt-get install glance glance-api glance-common keystone nova-api nova-conductor nova-scheduler nova-common glance neutron-server python3-requests python3-urllib3 placement-api cinder-volume cinder-scheduler cinder-api placement-api python3-oslo.messaging python3-tooz -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • DEBIAN_FRONTEND=noninteractive apt-get upgrade -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • systemctl mask keystone
  • puppet agent -tv
  • systemctl list-units --failed (should show nothing failed, or just keystone. If keystone is failed just reset with systemctl reset-failed)

Cloudcontrol1005:

  • puppet agent --enable && puppet agent -tv
  • apt update
  • systemctl unmask keystone
  • DEBIAN_FRONTEND=noninteractive apt-get install glance glance-api glance-common keystone nova-api nova-conductor nova-scheduler nova-common glance neutron-server python3-requests python3-urllib3 placement-api cinder-volume cinder-scheduler cinder-api placement-api python3-oslo.messaging python3-tooz -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • DEBIAN_FRONTEND=noninteractive apt-get upgrade -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • systemctl mask keystone
  • puppet agent -tv
  • puppet agent -tv # again to check config convergence, no changes should happen
  • systemctl list-units --failed (should show nothing failed, or just keystone. If keystone is failed just reset with systemctl reset-failed)
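
For the convergence check, note that `puppet agent -t` enables detailed exit codes, so the second run's exit status says whether anything changed. A small sketch that maps those documented codes to a message; `puppet_status` is a hypothetical helper, and it does not apply to wrapper scripts that may return different codes:

```shell
# Translate a `puppet agent -t` exit code into a short message.
puppet_status() {
    case "$1" in
        0)   echo "converged: no changes" ;;
        1)   echo "run failed or another run is in progress" ;;
        2)   echo "changes applied: run again to confirm convergence" ;;
        4|6) echo "some resources failed" ;;
        *)   echo "unexpected exit code $1" ;;
    esac
}
```

Usage: `puppet agent -tv; puppet_status $?`. The second run in each list should report "converged: no changes".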

cloudnets (one at a time please):

Begin with the standby node, as determined with:

cloudcontrol1003$ source novaenv.sh && neutron l3-agent-list-hosting-router cloudinstances2b-gw
Example output:
+--------------------------------------+--------------+----------------+-------+----------+
| id                                   | host         | admin_state_up | alive | ha_state |
+--------------------------------------+--------------+----------------+-------+----------+
| 4be214c8-76ef-40f8-9d5d-4c344d213311 | cloudnet1003 | True           | :-)   | active   |
| 970df1d1-505d-47a4-8d35-1b13c0dfe098 | cloudnet1004 | True           | :-)   | standby  |
+--------------------------------------+--------------+----------------+-------+----------+
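
To avoid misreading the table, the standby host can be extracted from the `neutron l3-agent-list-hosting-router` output. A minimal sketch that assumes the pipe-delimited table layout with `host` and `ha_state` columns; `standby_host` is a hypothetical name:

```shell
# Read the l3-agent table on stdin and print the host whose ha_state is "standby".
standby_host() {
    awk -F'|' '
        { for (i = 1; i <= NF; i++) gsub(/^ +| +$/, "", $i) }   # trim cell padding
        $2 == "id" {
            # Header row: remember where the host and ha_state columns are.
            for (i = 1; i <= NF; i++) {
                if ($i == "host") h = i
                if ($i == "ha_state") s = i
            }
            next
        }
        h && s && $s == "standby" { print $h }
    '
}
```

Usage on cloudcontrol1003: `source novaenv.sh && neutron l3-agent-list-hosting-router cloudinstances2b-gw | standby_host`.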

Standby node:

  • puppet agent --enable && puppet agent -tv
  • apt update
  • DEBIAN_FRONTEND=noninteractive apt-get install -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold" neutron-l3-agent python3-oslo.messaging python3-neutronclient python3-glanceclient
  • DEBIAN_FRONTEND=noninteractive apt-get upgrade -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • puppet agent -tv
  • on cloudcontrol1003: neutron-db-manage upgrade heads

Active node:

  • puppet agent --enable && puppet agent -tv
  • apt update
  • DEBIAN_FRONTEND=noninteractive apt-get install -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold" neutron-l3-agent python3-oslo.messaging python3-neutronclient python3-glanceclient
  • DEBIAN_FRONTEND=noninteractive apt-get upgrade -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • puppet agent -tv

Break Time

Cloudvirts (start with one test host, cloudvirt1039; don't forget the cloudvirt-wdqs hosts):

  • puppet agent --enable && run-puppet-agent # expected that it might fail
  • apt update
  • DEBIAN_FRONTEND=noninteractive apt-get install -y python3-libvirt python3-os-vif nova-compute neutron-common nova-compute-kvm neutron-linuxbridge-agent python3-neutron python3-eventlet python3-oslo.messaging python3-taskflow python3-tooz python3-keystoneauth1 python3-positional python3-requests python3-urllib3 -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • DEBIAN_FRONTEND=noninteractive apt-get dist-upgrade -y --allow-downgrades -o "Dpkg::Options::=--force-confdef" -o "Dpkg::Options::=--force-confold"
  • run-puppet-agent # expected that it might fail
  • systemctl restart neutron-linuxbridge-agent
  • for some reason libvirtd-tls wants libvirtd to be stopped before starting
  • systemctl stop libvirtd && systemctl start libvirtd-tls.socket && systemctl start libvirtd
  • run-puppet-agent
  • systemctl restart nova-compute
  • cloudvirt-wdqs1001
  • cloudvirt-wdqs1002
  • cloudvirt-wdqs1003
  • cloudvirt1012
  • cloudvirt1013
  • cloudvirt1014
  • cloudvirt1016
  • cloudvirt1017
  • cloudvirt1018
  • cloudvirt1019 (toolsdb)
  • cloudvirt1020 (toolsdb)
  • cloudvirt1021
  • cloudvirt1022
  • cloudvirt1023
  • cloudvirt1024
  • cloudvirt1025
  • cloudvirt1026
  • cloudvirt1027
  • cloudvirt1028
  • cloudvirt1029
  • cloudvirt1030
  • cloudvirt1031
  • cloudvirt1032
  • cloudvirt1033
  • cloudvirt1034
  • cloudvirt1035
  • cloudvirt1036
  • cloudvirt1037
  • cloudvirt1038
  • cloudvirt1039
  • update IRC topic
  • enable puppet on all cloud* hosts

    $ sudo cumin 'cloud*' "enable-puppet 'Upgrading to openstack Victoria - T261137 - ${USER}'"
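
The host checklist above can also be produced in upgrade order, test host first, for feeding into a loop. A sketch only: `cloudvirt_hosts` is a hypothetical helper, and the numbering simply mirrors the checklist (which has no cloudvirt1015):

```shell
# Print the cloudvirt hosts from the checklist, one per line, test host first.
cloudvirt_hosts() {
    local n
    echo "cloudvirt1039"                      # test host goes first
    for n in 1001 1002 1003; do
        echo "cloudvirt-wdqs$n"
    done
    for n in $(seq 1012 1038); do
        # 1015 is not in the checklist, so skip it.
        if [ "$n" -ne 1015 ]; then
            echo "cloudvirt$n"
        fi
    done
}
```

For example, `cloudvirt_hosts | while read -r host; do echo "next: $host"; done` walks the hosts one at a time.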

Things to check

  • Check 'openstack region list'. There should be exactly one region, eqiad1-r. If there is a second region named 'RegionOne' (this happened in codfw1dev), delete it; otherwise scripts that enumerate regions will be confused.
  • Clean up VMs in the admin-monitoring project that leaked during upgrade; delete them.
  • Create a new VM and confirm that DNS and ssh work properly
  • Logs will be extremely noisy about policy deprecations and value checks; this is expected because OpenStack is poised between two different policy systems; our existing policies are still (noisily) supported in Victoria.