openstack.trove: failing after the upgrade to wallaby
Closed, Resolved (Public)

Description

A user reported being unable to create a trove DB today:

https://usercontent.irccloud-cdn.com/file/40neC5EE/image.png

The error on logstash says:

a42b856f-4b09-4e19-8695-386a81a99ad3: validate_volume_type() missing 1 required positional argument: 'datastore_version_name'

on host cloudcontrol1005.

This error is probably caused by a mismatch in the trove versions. Looking at the packages installed on that host, trove is still at version 14:

root@cloudcontrol1005:~# apt policy trove\*
trove-taskmanager:
  Installed: 1:14.0.0-2~bpo10+1
  Candidate: 1:15.0.0-2~bpo11+1
  Version table:
     1:15.0.0-2~bpo11+1 500
        500 http://mirrors.wikimedia.org/osbpo bullseye-wallaby-backports/main amd64 Packages
 *** 1:14.0.0-2~bpo10+1 100
        100 /var/lib/dpkg/status
trove-api:
  Installed: 1:14.0.0-2~bpo10+1
  Candidate: 1:15.0.0-2~bpo11+1
  Version table:
     1:15.0.0-2~bpo11+1 500
        500 http://mirrors.wikimedia.org/osbpo bullseye-wallaby-backports/main amd64 Packages
 *** 1:14.0.0-2~bpo10+1 100
        100 /var/lib/dpkg/status
trove-doc:
  Installed: (none)
  Candidate: 1:15.0.0-2~bpo11+1
  Version table:
     1:15.0.0-2~bpo11+1 500
        500 http://mirrors.wikimedia.org/osbpo bullseye-wallaby-backports/main amd64 Packages
trove-guestagent:
  Installed: (none)
  Candidate: 1:15.0.0-2~bpo11+1
  Version table:
     1:15.0.0-2~bpo11+1 500
        500 http://mirrors.wikimedia.org/osbpo bullseye-wallaby-backports/main amd64 Packages
trove-conductor:
  Installed: 1:14.0.0-2~bpo10+1
  Candidate: 1:15.0.0-2~bpo11+1
  Version table:
     1:15.0.0-2~bpo11+1 500
        500 http://mirrors.wikimedia.org/osbpo bullseye-wallaby-backports/main amd64 Packages
 *** 1:14.0.0-2~bpo10+1 100
        100 /var/lib/dpkg/status
trove-common:
  Installed: 1:14.0.0-2~bpo10+1
  Candidate: 1:15.0.0-2~bpo11+1
  Version table:
     1:15.0.0-2~bpo11+1 500
        500 http://mirrors.wikimedia.org/osbpo bullseye-wallaby-backports/main amd64 Packages
 *** 1:14.0.0-2~bpo10+1 100
        100 /var/lib/dpkg/status
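
The same check can be automated across hosts. The following is a minimal sketch, assuming the `apt policy` text has already been captured as a string; the function name `pending_upgrades` and the shortened `SAMPLE` excerpt are illustrative and not part of any existing tooling.

```python
import re

# Shortened excerpt of the `apt policy` output quoted above (two packages only).
SAMPLE = """\
trove-api:
  Installed: 1:14.0.0-2~bpo10+1
  Candidate: 1:15.0.0-2~bpo11+1
trove-doc:
  Installed: (none)
  Candidate: 1:15.0.0-2~bpo11+1
"""


def pending_upgrades(policy_text):
    """Return {package: (installed, candidate)} for installed packages
    whose candidate version differs from the installed one."""
    pending = {}
    package = installed = None
    for line in policy_text.splitlines():
        # Package headers start in column 0 and end with a colon.
        header = re.match(r"^(\S+):\s*$", line)
        if header:
            package, installed = header.group(1), None
            continue
        field = re.match(r"^\s+(Installed|Candidate):\s+(\S+)\s*$", line)
        if not field or package is None:
            continue
        if field.group(1) == "Installed":
            installed = field.group(2)
        elif installed not in (None, "(none)") and field.group(2) != installed:
            # Candidate line: report only packages that are actually installed.
            pending[package] = (installed, field.group(2))
    return pending


if __name__ == "__main__":
    for name, (old, new) in pending_upgrades(SAMPLE).items():
        print(f"{name}: {old} -> {new}")
```

Packages that are not installed (such as trove-doc here) are skipped on purpose, since they cannot be stale.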

According to the release notes, Wallaby needs trove version 15:
https://releases.openstack.org/teams/trove.html

So the trove packages were probably missed during the upgrade.
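
For illustration, this is how a definition and a caller that disagree about a signature produce the logged error. It is a sketch only: the function and argument names come from the log line, while the first parameter, the function body and the caller are invented. The ticket does not establish which side (definition or caller) was stale.

```python
# Hypothetical stand-in for a definition that requires two arguments.
# Only the names `validate_volume_type` and `datastore_version_name`
# come from the logged error; everything else is invented.
def validate_volume_type(volume_type, datastore_version_name):
    return f"{volume_type} is valid for {datastore_version_name}"


def mismatched_caller():
    # Code written against a different version of the signature
    # passes only one argument.
    return validate_volume_type("standard")


try:
    mismatched_caller()
except TypeError as exc:
    message = str(exc)
    print(message)
```

The exception is raised at call time, which is why the services can start normally and only fail when a user tries to create a database.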

Event Timeline

dcaro triaged this task as High priority. May 4 2022, 11:08 AM
dcaro created this task.
dcaro added a subscriber: Urbanecm_WMF.
dcaro added a subscriber: rook.
dcaro added a subscriber: Andrew.

I upgraded trove in dev to 15. After running puppet and syncing the trove db, the Datastore options vanished from Horizon (previously it couldn't launch, so net zero). Thoughts on how to get those to come back?

I've updated the notes in T304694 to match the changes done in dev.

Feel free to unown if you are not working on it, just trying to avoid stepping on each other's toes :)

> Thoughts on how to get those to come back?

I don't have experience with trove, so I would try to figure out (probably in the code) where it decides whether to show them, and keep pulling that thread.

rook removed rook as the assignee of this task.May 5 2022, 2:51 PM

Change 789687 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Horizon: include openstack bpos on cloudweb hosts

https://gerrit.wikimedia.org/r/789687

I think the first thing we should do to debug this is upgrade cloudweb2002-dev to Bullseye to pick up newer client packages. That may resolve this particular issue, and if it doesn't we should probably be upgrading anyway :)

I will try to work on this soon.

> I think the first thing we should do to debug this is upgrade cloudweb2002-dev to Bullseye to pick up newer client packages. That may resolve this particular issue, and if it doesn't we should probably be upgrading anyway :)

I take all that back! Moving either wikitech or Striker to Bullseye will be very messy, so we need a different way.

Change 789687 merged by Andrew Bogott:

[operations/puppet@production] Horizon: include openstack bpos on cloudweb hosts

https://gerrit.wikimedia.org/r/789687

I think there are a few things happening here. One of the trove api servers was crashing, which caused intermittent bad behavior in the Horizon UI; restarting all the api agents has improved that.

I'm still not able to actually create database servers, though. The current mystery is in eqiad1 using the openstack cli; no matter what flavor I try to use for the new server it tells me the flavor doesn't exist.

I did also upgrade some client packages on cloudweb2002-dev, but I'm no longer sure that the client version was an issue at all; I will revisit this after I get decent behavior from the cli on a cloudcontrol (where all packages should be up to date).

Change 789904 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Horizon: include Victoria openstack bpos on all cloudweb/labweb hosts

https://gerrit.wikimedia.org/r/789904

Change 789904 abandoned by Andrew Bogott:

[operations/puppet@production] Horizon: include Victoria openstack bpos on all cloudweb/labweb hosts

Reason:

Taavi points out this won't help since we already have the latest client packages in our horizon venv.

https://gerrit.wikimedia.org/r/789904

Andrew claimed this task.

This is working now.

I don't know exactly what the problem was. The state of the code was a bit wonky: I found some inconsistent calls (mismatches between function definitions and their call sites) on a cloudcontrol. I ran 'apt install --reinstall python3-trove' on all cloudcontrols, restarted everything, and then did a schema upgrade (trove-manage db_upgrade), which broke things worse.

Then I debugged some more and convinced myself that the new problem was a schema issue, so I ran trove-manage db_upgrade AGAIN. That fixed the schema issue I was seeing, and everything worked.

So... I'm no wiser about how this happened but at least things seem to work now.
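
As a record of the sequence described above, here is a dry-run sketch that only prints the commands in order. It assumes the systemd unit names match the installed package names and that "restarted everything" means a `systemctl restart` of each; the ticket confirms neither. It documents what was run, not a recommended procedure.

```python
import shlex

# Assumption: unit names equal the installed package names from the
# `apt policy` output; the comment above only says "restarted everything".
ASSUMED_SERVICES = ["trove-api", "trove-taskmanager", "trove-conductor"]


def recovery_plan(services=ASSUMED_SERVICES):
    """Build the command sequence from the comment, in the order it was run."""
    plan = [["apt", "install", "--reinstall", "python3-trove"]]
    plan += [["systemctl", "restart", name] for name in services]
    # The schema upgrade was run twice; the second run came after more debugging.
    plan.append(["trove-manage", "db_upgrade"])
    plan.append(["trove-manage", "db_upgrade"])
    return plan


if __name__ == "__main__":
    # Dry run: print each command instead of executing it.
    for command in recovery_plan():
        print(shlex.join(command))
```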