Upgrade cloud-vps openstack to version 'Antelope'
Closed, Resolved, Public

Related Objects

Status    Assigned
Resolved  fnegri
Resolved  rook
Resolved  Andrew
Resolved  Andrew
Resolved  None
Resolved  None
Resolved  None
Resolved  Andrew
Resolved  fnegri
Resolved  fnegri
Resolved  fnegri
Resolved  Andrew
Resolved  fnegri
Resolved  fnegri
Resolved  fnegri
Resolved  None
Resolved  fnegri
Resolved  fnegri
Resolved  fnegri
Resolved  fnegri
Resolved  fnegri
Declined  fnegri
Invalid   fnegri

Event Timeline

@dr0ptp4kt I was told by @Andrew that you were interested in following this upgrade procedure. This is the main tracking task; feel free to add questions here or ping me on IRC. This is the first time I am following this procedure, so I'm likely to hit a few roadblocks. I'll do my best to document them here in Phabricator.

Change 954056 merged by FNegri:

[operations/puppet@production] [openstack] upgrade codfw1dev to Antelope (2023.1)

https://gerrit.wikimedia.org/r/954056

Running the cookbook upgrade_openstack_node on the first cloudcontrol node failed with:

fnegri@cloudcumin1001:~$ sudo cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node --fqdn-to-upgrade cloudcontrol2001-dev.codfw.wmnet

[...]

E: The repository 'http://mirrors.wikimedia.org/osbpo bullseye-antelope-backports-nochange Release' does not have a Release file.
E: The repository 'http://mirrors.wikimedia.org/osbpo bullseye-antelope-backports Release' does not have a Release file.

It looks like Antelope is not packaged for Bullseye but only for Bookworm: https://mirrors.wikimedia.org/osbpo/pool/

We should probably upgrade all the servers to Bookworm first (keeping OpenStack on version Zed), and then upgrade OpenStack to Antelope.
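
A quick way to catch this before running the cookbook is to check whether the mirror publishes a Release file for the target suite. The sketch below assumes the standard Debian repository layout (<base>/dists/<suite>/Release). The helper names are hypothetical and are not part of the cookbook.

```python
import urllib.request

# Mirror base URL taken from the apt error above.
MIRROR = "http://mirrors.wikimedia.org/osbpo"


def release_url(suite: str, base: str = MIRROR) -> str:
    """Build the URL where apt looks for a suite's Release file."""
    return f"{base.rstrip('/')}/dists/{suite}/Release"


def suite_is_published(suite: str, base: str = MIRROR, timeout: float = 10.0) -> bool:
    """Return True if the Release file answers a HEAD request (needs network)."""
    request = urllib.request.Request(release_url(suite, base), method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError and HTTPError (e.g. 404) are both OSError subclasses.
        return False
```

For example, suite_is_published("bullseye-antelope-backports") would be expected to return False given the error above.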

fnegri changed the task status from In Progress to Stalled. Sep 21 2023, 11:46 AM

Change 963029 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] Revert "Revert "[openstack] upgrade codfw1dev to Antelope (2023.1)""

https://gerrit.wikimedia.org/r/963029

Change 963029 merged by FNegri:

[operations/puppet@production] Revert "Revert "[openstack] upgrade codfw1dev to Antelope (2023.1)""

https://gerrit.wikimedia.org/r/963029

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-04T13:49:58Z] <wm-bot2> fran@wmf3169 END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) (T341285)

The cookbook failed with the following error:

Database expansion failed. Database expansion should have brought the database version up to "2023_1_expand01" revision. But, current revisions are: ('wallaby_contract01',)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-04T14:40:50Z] <wm-bot2> fran@wmf3169 START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (T341285)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-04T14:41:11Z] <wm-bot2> fran@wmf3169 END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) (T341285)

Running the failing command manually worked just fine:

root@cloudcontrol2001-dev:~# glance-manage db sync
2023-10-04 14:26:22.932 3125648 INFO alembic.runtime.migration [-] Context impl MySQLImpl.
2023-10-04 14:26:22.933 3125648 INFO alembic.runtime.migration [-] Will assume non-transactional DDL.
2023-10-04 14:26:22.960 3125648 INFO alembic.runtime.migration [-] Context impl MySQLImpl.
2023-10-04 14:26:22.960 3125648 INFO alembic.runtime.migration [-] Will assume non-transactional DDL.
Database expansion is up to date. No expansion needed.
2023-10-04 14:26:22.982 3125648 INFO alembic.runtime.migration [-] Context impl MySQLImpl.
2023-10-04 14:26:22.983 3125648 INFO alembic.runtime.migration [-] Will assume non-transactional DDL.
Database migration is up to date. No migration needed.
2023-10-04 14:26:23.003 3125648 INFO alembic.runtime.migration [-] Context impl MySQLImpl.
2023-10-04 14:26:23.004 3125648 INFO alembic.runtime.migration [-] Will assume non-transactional DDL.
INFO  [alembic.runtime.migration] Context impl MySQLImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
INFO  [alembic.runtime.migration] Running upgrade wallaby_contract01 -> xena_contract01
INFO  [alembic.runtime.migration] Running upgrade xena_contract01 -> yoga_contract01
INFO  [alembic.runtime.migration] Running upgrade yoga_contract01 -> zed_contract01
INFO  [alembic.runtime.migration] Running upgrade zed_contract01 -> 2023_1_contract01
INFO  [alembic.runtime.migration] Context impl MySQLImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
Upgraded database to: 2023_1_contract01, current revision(s): 2023_1_contract01
INFO  [alembic.runtime.migration] Context impl MySQLImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
Database is synced successfully.

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-04T14:44:41Z] <wm-bot2> fran@wmf3169 START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (T341285)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-04T14:54:53Z] <wm-bot2> fran@wmf3169 END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) (T341285)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-05T09:32:59Z] <wm-bot2> fran@wmf3169 START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (T341285)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-05T09:40:01Z] <wm-bot2> fran@wmf3169 END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) (T341285)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-05T09:47:28Z] <wm-bot2> fran@wmf3169 START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (T341285)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-05T09:54:42Z] <wm-bot2> fran@wmf3169 END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) (T341285)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-05T16:11:32Z] <wm-bot2> fran@wmf3169 START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (T341285)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-05T16:24:09Z] <wm-bot2> fran@wmf3169 END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) (T341285)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-05T16:29:31Z] <wm-bot2> fran@wmf3169 START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (T341285)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-05T16:41:36Z] <wm-bot2> fran@wmf3169 END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) (T341285)
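
The SAL entries above follow a regular shape, so a short script can pair each START with the next END and report result, exit code and duration. This is a sketch based only on the lines quoted in this task. The regular expression and function name are assumptions, not an existing tool.

```python
import re
from datetime import datetime

SAL_LINE = re.compile(
    r"\[(?P<ts>[^\]]+)\].*? (?P<phase>START|END)(?: \((?P<result>\w+)\))?"
    r" - Cookbook (?P<cookbook>\S+)(?: \(exit_code=(?P<code>\d+)\))?"
)


def summarize_runs(lines):
    """Return (cookbook, result, exit_code, seconds) for each END entry.

    seconds is None when no START entry preceded the END entry.
    """
    runs, started = [], None
    for line in lines:
        match = SAL_LINE.search(line)
        if match is None:
            continue
        when = datetime.strptime(match["ts"], "%Y-%m-%dT%H:%M:%SZ")
        if match["phase"] == "START":
            started = when
            continue
        seconds = (when - started).total_seconds() if started else None
        code = int(match["code"]) if match["code"] else None
        runs.append((match["cookbook"], match["result"], code, seconds))
        started = None
    return runs
```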

The cookbook has been run successfully on the following nodes:

  • cloudcontrol2001-dev
  • cloudcontrol2004-dev
  • cloudcontrol2005-dev
  • cloudnet2005-dev
  • cloudnet2006-dev

There are some Puppet errors on the cloudcontrols. Once those are fixed, the cookbook can be run on the remaining nodes:

  • cloudservices[2004-2005]-dev
  • cloudvirt[2001-2006]-dev

So far all the cloudcontrol issues have been related to obsolete init scripts. I updated many (all?) of our init scripts in puppet to match the packaged ones, and now cloudcontrols seem to be working.

https://gerrit.wikimedia.org/r/c/operations/puppet/+/964041
https://gerrit.wikimedia.org/r/c/operations/puppet/+/964045
etc.

Similar changes may be needed for designate.

Change 964164 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Update cinder-api init.d file to match upstream packaged version

https://gerrit.wikimedia.org/r/964164

Change 964165 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] heat-api: update init file to match upstream packaged version

https://gerrit.wikimedia.org/r/964165

Change 964166 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] magnum-api: update init file to match upstream package

https://gerrit.wikimedia.org/r/964166

Change 964164 merged by Andrew Bogott:

[operations/puppet@production] Update cinder-api init.d file to match upstream packaged version

https://gerrit.wikimedia.org/r/964164

Change 964165 merged by Andrew Bogott:

[operations/puppet@production] heat-api: update init file to match upstream packaged version

https://gerrit.wikimedia.org/r/964165

Change 964166 merged by Andrew Bogott:

[operations/puppet@production] magnum-api: update init file to match upstream package

https://gerrit.wikimedia.org/r/964166

Change 964169 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] heat-api-cfn: update init file to match upstream packaged version

https://gerrit.wikimedia.org/r/964169

Change 964169 merged by Andrew Bogott:

[operations/puppet@production] heat-api-cfn: update init file to match upstream packaged version

https://gerrit.wikimedia.org/r/964169

Something is still broken on cloudcontrol2001-dev: the cinder-scheduler service is failing with "Unable to connect to AMQP server on rabbitmq03.codfw1dev.wikimediacloud.org:5671 after inf tries".

The error above was fixed by restarting rabbitmq-server in cloudcontrol2005-dev (which is the host corresponding to rabbitmq03).
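
When a service reports that it cannot reach the AMQP server, a plain TCP connection test against the broker port is a cheap first check. This is a minimal sketch with a hypothetical helper name. It only shows whether the port accepts connections, not whether the TLS or AMQP handshake would succeed.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False
```

For the error above, the call would be can_connect("rabbitmq03.codfw1dev.wikimediacloud.org", 5671), run from the affected cloudcontrol.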

I am now running the upgrade_openstack_node cookbook on the cloudservices200[45]-dev hosts.

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-09T13:45:16Z] <wm-bot2> fran@wmf3169 START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (T341285)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-09T13:55:03Z] <wm-bot2> fran@wmf3169 END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) (T341285)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-09T15:26:12Z] <wm-bot2> fran@wmf3169 END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) (T341285)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-09T15:41:56Z] <wm-bot2> fran@wmf3169 START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (T341285)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-09T15:49:56Z] <wm-bot2> fran@wmf3169 END (FAIL) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=99) (T341285)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-09T16:18:10Z] <wm-bot2> fran@wmf3169 START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (T341285)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-09T16:26:19Z] <wm-bot2> fran@wmf3169 END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) (T341285)

cloudservices200[45]-dev have been upgraded. Puppet is not showing errors, but in both hosts it's showing a corrective action on each run:

2023-10-09T16:07:55.754305+00:00 cloudservices2004-dev puppet-agent[74462]: (/Stage[main]/Pdns_server::Db_backups/Dbutils::Statement[pdns_server_db_backups_stmt_1]/Exec[db-statement-pdns_server_db_backups_stmt_1]/returns) executed successfully (corrective)
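
To see which resources Puppet keeps correcting on every run, the "(corrective)" lines in the agent log can be counted per resource. The sketch below assumes the syslog line format shown above. The function name is hypothetical.

```python
import re
from collections import Counter

# Captures the resource path from lines that end in "(corrective)".
CORRECTIVE = re.compile(
    r"puppet-agent\[\d+\]: \((?P<resource>[^)]+)\) .*\(corrective\)\s*$"
)


def corrective_resources(lines) -> Counter:
    """Count corrective changes per Puppet resource path."""
    counts = Counter()
    for line in lines:
        match = CORRECTIVE.search(line)
        if match:
            counts[match["resource"]] += 1
    return counts
```

A resource that shows up once per agent run is a sign that the manifest and the real state of the host never converge.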

Change 964858 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] pdns_server: rename privilege for bookworm

https://gerrit.wikimedia.org/r/964858

Change 964858 merged by FNegri:

[operations/puppet@production] pdns_server: rename privilege for bookworm

https://gerrit.wikimedia.org/r/964858

https://gerrit.wikimedia.org/r/964858 fixed the corrective change that Puppet was applying on every run on cloudservices200[45]-dev. I'm now upgrading the cloudvirt*-dev nodes using the live_upgrade_openstack cookbook.

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-10T09:43:43Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (T341285)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-10T09:50:22Z] <fnegri@cloudcumin1001> END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) (T341285)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-10T09:56:36Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (T341285)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-10T10:03:20Z] <fnegri@cloudcumin1001> END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) (T341285)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-10T10:52:45Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (T341285)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-10T10:59:43Z] <fnegri@cloudcumin1001> END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) (T341285)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-10T11:00:35Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (T341285)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-10T11:00:42Z] <fnegri@cloudcumin1001> END (FAIL) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=99) (T341285)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-10T11:33:08Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (T341285)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-10T11:38:38Z] <fnegri@cloudcumin1001> END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) (T341285)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-10T12:40:57Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (T341285)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-10T12:46:32Z] <fnegri@cloudcumin1001> END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) (T341285)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-11T10:36:33Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (T341285)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-11T10:42:37Z] <fnegri@cloudcumin1001> END (PASS) - Cookbook wmcs.openstack.cloudvirt.live_upgrade_openstack (exit_code=0) (T341285)

Change 965546 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] [openstack] remove hiera override for 2 hosts

https://gerrit.wikimedia.org/r/965546

Change 965546 merged by FNegri:

[operations/puppet@production] [openstack] remove hiera override for 2 hosts

https://gerrit.wikimedia.org/r/965546

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-12T17:16:41Z] <wm-bot2> fran@wmf3169 START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (T341285)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-12T17:16:46Z] <wm-bot2> fran@wmf3169 END (ERROR) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=97) (T341285)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-13T08:20:55Z] <wm-bot2> fran@wmf3169 START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (T341285)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-13T08:30:51Z] <wm-bot2> fran@wmf3169 END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) (T341285)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-13T08:31:18Z] <wm-bot2> fran@wmf3169 START - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (T341285)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-10-13T08:41:22Z] <wm-bot2> fran@wmf3169 END (PASS) - Cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node (exit_code=0) (T341285)

OpenStack .deb packages have now been upgraded to Antelope (using the cookbooks upgrade_openstack_node and live_upgrade_openstack) on all codfw nodes:

  • cloudcontrol2001-dev
  • cloudcontrol2004-dev
  • cloudcontrol2005-dev
  • cloudnet2005-dev
  • cloudnet2006-dev
  • cloudservices[2004-2005]-dev
  • cloudvirt[2001-2006]-dev
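
The bracketed ranges in the list above stand for several hosts each. As an illustration, this sketch expands a single numeric [start-end] range into individual hostnames. It is a simplified stand-in with a hypothetical function name, not the parser that the cookbook tooling uses.

```python
import re

RANGE = re.compile(r"\[(\d+)-(\d+)\]")


def expand_hosts(pattern: str) -> list[str]:
    """Expand one [start-end] range in a host pattern into hostnames."""
    match = RANGE.search(pattern)
    if match is None:
        # No range present: the pattern is already a single hostname.
        return [pattern]
    start, end = match.group(1), match.group(2)
    width = len(start)  # keep any zero padding used in the pattern
    prefix, suffix = pattern[: match.start()], pattern[match.end():]
    return [
        f"{prefix}{str(number).zfill(width)}{suffix}"
        for number in range(int(start), int(end) + 1)
    ]
```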

These other cloud* nodes did not need an upgrade, as they don't have any OpenStack packages installed (no /etc/apt/sources.list.d/openstack* files):

  • cloudbackup[2001-2002].codfw.wmnet
  • cloudcephmon[2004-2006]-dev.codfw.wmnet
  • cloudcephosd[2001-2003]-dev.codfw.wmnet
  • clouddb2002-dev.codfw.wmnet
  • cloudgw[2002-2003]-dev.codfw.wmnet
  • cloudlb[2001-2003]-dev.codfw.wmnet
  • cloudweb2002-dev.wikimedia.org
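
The check used for the list above (whether a host has any /etc/apt/sources.list.d/openstack* file) can be sketched as follows. The function name is hypothetical, and the script would need to be run on each host.

```python
from pathlib import Path


def openstack_sources(sources_dir: str = "/etc/apt/sources.list.d") -> list[str]:
    """Return the names of apt source files starting with 'openstack'."""
    return sorted(path.name for path in Path(sources_dir).glob("openstack*"))
```

An empty list suggests the host has no OpenStack apt sources and can be skipped.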

We now want to test that everything works fine in codfw, before proceeding with upgrading eqiad.

I created two sub-tasks for the eqiad work:

Change 965779 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Neutron/antelope: add policy rule for create_port:device_id

https://gerrit.wikimedia.org/r/965779

Change 965779 merged by Andrew Bogott:

[operations/puppet@production] Neutron/antelope: add policy rule for create_port:device_id

https://gerrit.wikimedia.org/r/965779

fnegri changed the task status from In Progress to Stalled. Nov 14 2023, 5:12 PM
fnegri changed the task status from Stalled to In Progress.
fnegri removed a project: Patch-For-Review.
fnegri moved this task from In progress to Done on the cloud-services-team (FY2023/2024-Q1-Q2) board.

Both codfw and eqiad are now running Antelope!