
openstack: nova refuses to admit a compute node after a reimage
Closed, Resolved, Public

Description

While reimaging cloudvirt1031 today, we discovered that nova has some kind of "state" that is lost with the reimage, and that losing it creates conflicts when the hypervisor is registered again with the nova-api.

Some upstream docs mention this: https://docs.openstack.org/nova/latest/admin/compute-node-identification.html

Event Timeline

taavi@cloudcontrol1005 ~ $ sudo wmcs-openstack resource provider allocation show 240b2f85-94ce-49eb-8d9d-2559838d0738
+--------------------------------------+------------+----------------------------------------------+------------------+-----------+
| resource_provider                    | generation | resources                                    | project_id       | user_id   |
+--------------------------------------+------------+----------------------------------------------+------------------+-----------+
| e0c50588-6047-4114-8877-35847ddf02e3 |      11597 | {'DISK_GB': 20, 'MEMORY_MB': 512, 'VCPU': 1} | cloudvirt-canary | novaadmin |
+--------------------------------------+------------+----------------------------------------------+------------------+-----------+
taavi@cloudcontrol1005 ~ $ sudo wmcs-openstack resource provider allocation delete 240b2f85-94ce-49eb-8d9d-2559838d0738
taavi@cloudcontrol1005 ~ $ sudo wmcs-openstack resource provider delete e0c50588-6047-4114-8877-35847ddf02e3
root@cloudcontrol1005:~# source novaenv.sh 
root@cloudcontrol1005:~# nova-manage cell_v2 discover_hosts --verbose
Modules with known eventlet monkey patching issues were imported prior to eventlet monkey patching: urllib3. This warning can usually be ignored if the caller is only importing and not executing nova code.
/usr/lib/python3/dist-packages/oslo_policy/policy.py:721: UserWarning: Policy "admin_or_owner":"is_admin:True or project_id:%(project_id)s" was deprecated for removal in 21.0.0. Reason: 
Nova API policies are introducing new default roles with scope_type
capabilities. Old policies are deprecated and silently going to be ignored
in nova 23.0.0 release.
. Its value may be silently ignored in the future.
  warnings.warn(
/usr/lib/python3/dist-packages/oslo_policy/policy.py:770: UserWarning: Policy "os_compute_api:os-attach-interfaces":"rule:admin_or_owner" was deprecated in 21.0.0 in favor of "os_compute_api:os-attach-interfaces:list":"rule:project_reader_or_admin". Reason: 
Nova API policies are introducing new default roles with scope_type
capabilities. Old policies are deprecated and silently going to be ignored
in nova 23.0.0 release.
. Either ensure your deployment is ready for the new default or copy/paste the deprecated policy into your policy file and maintain it manually.
  warnings.warn(deprecated_msg)
/usr/lib/python3/dist-packages/oslo_policy/policy.py:770: UserWarning: Policy "os_compute_api:os-attach-interfaces":"rule:admin_or_owner" was deprecated in 21.0.0 in favor of "os_compute_api:os-attach-interfaces:show":"rule:project_reader_or_admin". Reason: 
Nova API policies are introducing new default roles with scope_type
capabilities. Old policies are deprecated and silently going to be ignored
in nova 23.0.0 release.
. Either ensure your deployment is ready for the new default or copy/paste the deprecated policy into your policy file and maintain it manually.
  warnings.warn(deprecated_msg)
/usr/lib/python3/dist-packages/oslo_policy/policy.py:770: UserWarning: Policy "os_compute_api:os-attach-interfaces":"rule:admin_or_owner" was deprecated in 21.0.0 in favor of "os_compute_api:os-attach-interfaces:create":"rule:project_member_or_admin". Reason: 
Nova API policies are introducing new default roles with scope_type
capabilities. Old policies are deprecated and silently going to be ignored
in nova 23.0.0 release.
. Either ensure your deployment is ready for the new default or copy/paste the deprecated policy into your policy file and maintain it manually.
  warnings.warn(deprecated_msg)
/usr/lib/python3/dist-packages/oslo_policy/policy.py:770: UserWarning: Policy "os_compute_api:os-attach-interfaces":"rule:admin_or_owner" was deprecated in 21.0.0 in favor of "os_compute_api:os-attach-interfaces:delete":"rule:project_member_or_admin". Reason: 
Nova API policies are introducing new default roles with scope_type
capabilities. Old policies are deprecated and silently going to be ignored
in nova 23.0.0 release.
. Either ensure your deployment is ready for the new default or copy/paste the deprecated policy into your policy file and maintain it manually.
  warnings.warn(deprecated_msg)
/usr/lib/python3/dist-packages/oslo_policy/policy.py:770: UserWarning: Policy "os_compute_api:os-deferred-delete":"rule:admin_or_owner" was deprecated in 21.0.0 in favor of "os_compute_api:os-deferred-delete:restore":"rule:project_member_or_admin". Reason: 
Nova API policies are introducing new default roles with scope_type
capabilities. Old policies are deprecated and silently going to be ignored
in nova 23.0.0 release.
. Either ensure your deployment is ready for the new default or copy/paste the deprecated policy into your policy file and maintain it manually.
  warnings.warn(deprecated_msg)
/usr/lib/python3/dist-packages/oslo_policy/policy.py:770: UserWarning: Policy "os_compute_api:os-deferred-delete":"rule:admin_or_owner" was deprecated in 21.0.0 in favor of "os_compute_api:os-deferred-delete:force":"rule:project_member_or_admin". Reason: 
Nova API policies are introducing new default roles with scope_type
capabilities. Old policies are deprecated and silently going to be ignored
in nova 23.0.0 release.
. Either ensure your deployment is ready for the new default or copy/paste the deprecated policy into your policy file and maintain it manually.
  warnings.warn(deprecated_msg)
/usr/lib/python3/dist-packages/oslo_policy/policy.py:770: UserWarning: Policy "os_compute_api:os-floating-ips":"rule:admin_or_owner" was deprecated in 22.0.0 in favor of "os_compute_api:os-floating-ips:add":"rule:project_member_or_admin". Reason: 
Nova API policies are introducing new default roles with scope_type
capabilities. Old policies are deprecated and silently going to be ignored
in nova 23.0.0 release.
. Either ensure your deployment is ready for the new default or copy/paste the deprecated policy into your policy file and maintain it manually.
  warnings.warn(deprecated_msg)
/usr/lib/python3/dist-packages/oslo_policy/policy.py:770: UserWarning: Policy "os_compute_api:os-floating-ips":"rule:admin_or_owner" was deprecated in 22.0.0 in favor of "os_compute_api:os-floating-ips:remove":"rule:project_member_or_admin". Reason: 
Nova API policies are introducing new default roles with scope_type
capabilities. Old policies are deprecated and silently going to be ignored
in nova 23.0.0 release.
. Either ensure your deployment is ready for the new default or copy/paste the deprecated policy into your policy file and maintain it manually.
  warnings.warn(deprecated_msg)
/usr/lib/python3/dist-packages/oslo_policy/policy.py:770: UserWarning: Policy "os_compute_api:os-floating-ips":"rule:admin_or_owner" was deprecated in 22.0.0 in favor of "os_compute_api:os-floating-ips:list":"rule:project_reader_or_admin". Reason: 
Nova API policies are introducing new default roles with scope_type
capabilities. Old policies are deprecated and silently going to be ignored
in nova 23.0.0 release.
. Either ensure your deployment is ready for the new default or copy/paste the deprecated policy into your policy file and maintain it manually.
  warnings.warn(deprecated_msg)
/usr/lib/python3/dist-packages/oslo_policy/policy.py:770: UserWarning: Policy "os_compute_api:os-floating-ips":"rule:admin_or_owner" was deprecated in 22.0.0 in favor of "os_compute_api:os-floating-ips:create":"rule:project_member_or_admin". Reason: 
Nova API policies are introducing new default roles with scope_type
capabilities. Old policies are deprecated and silently going to be ignored
in nova 23.0.0 release.
. Either ensure your deployment is ready for the new default or copy/paste the deprecated policy into your policy file and maintain it manually.
  warnings.warn(deprecated_msg)
/usr/lib/python3/dist-packages/oslo_policy/policy.py:770: UserWarning: Policy "os_compute_api:os-floating-ips":"rule:admin_or_owner" was deprecated in 22.0.0 in favor of "os_compute_api:os-floating-ips:show":"rule:project_reader_or_admin". Reason: 
Nova API policies are introducing new default roles with scope_type
capabilities. Old policies are deprecated and silently going to be ignored
in nova 23.0.0 release.
. Either ensure your deployment is ready for the new default or copy/paste the deprecated policy into your policy file and maintain it manually.
  warnings.warn(deprecated_msg)
/usr/lib/python3/dist-packages/oslo_policy/policy.py:770: UserWarning: Policy "os_compute_api:os-floating-ips":"rule:admin_or_owner" was deprecated in 22.0.0 in favor of "os_compute_api:os-floating-ips:delete":"rule:project_member_or_admin". Reason: 
Nova API policies are introducing new default roles with scope_type
capabilities. Old policies are deprecated and silently going to be ignored
in nova 23.0.0 release.
. Either ensure your deployment is ready for the new default or copy/paste the deprecated policy into your policy file and maintain it manually.
  warnings.warn(deprecated_msg)
/usr/lib/python3/dist-packages/oslo_policy/policy.py:770: UserWarning: Policy "os_compute_api:os-instance-actions":"rule:admin_or_owner" was deprecated in 21.0.0 in favor of "os_compute_api:os-instance-actions:list":"rule:project_reader_or_admin". Reason: 
Nova API policies are introducing new default roles with scope_type
capabilities. Old policies are deprecated and silently going to be ignored
in nova 23.0.0 release.
. Either ensure your deployment is ready for the new default or copy/paste the deprecated policy into your policy file and maintain it manually.
  warnings.warn(deprecated_msg)
/usr/lib/python3/dist-packages/oslo_policy/policy.py:770: UserWarning: Policy "os_compute_api:os-instance-actions":"rule:admin_or_owner" was deprecated in 21.0.0 in favor of "os_compute_api:os-instance-actions:show":"rule:project_reader_or_admin". Reason: 
Nova API policies are introducing new default roles with scope_type
capabilities. Old policies are deprecated and silently going to be ignored
in nova 23.0.0 release.
. Either ensure your deployment is ready for the new default or copy/paste the deprecated policy into your policy file and maintain it manually.
  warnings.warn(deprecated_msg)
/usr/lib/python3/dist-packages/oslo_policy/policy.py:770: UserWarning: Policy "os_compute_api:os-used-limits":"rule:admin_api" was deprecated in 21.0.0 in favor of "os_compute_api:limits:other_project":"rule:context_is_admin". Reason: 
Nova API policies are introducing new default roles with scope_type
capabilities. Old policies are deprecated and silently going to be ignored
in nova 23.0.0 release.
. Either ensure your deployment is ready for the new default or copy/paste the deprecated policy into your policy file and maintain it manually.
  warnings.warn(deprecated_msg)
/usr/lib/python3/dist-packages/oslo_policy/policy.py:770: UserWarning: Policy "os_compute_api:os-rescue":"rule:admin_or_owner" was deprecated in 21.0.0 in favor of "os_compute_api:os-unrescue":"rule:project_member_or_admin". Reason: 
Rescue/Unrescue API policies are made granular with new policy
for unrescue and keeping old policy for rescue.
. Either ensure your deployment is ready for the new default or copy/paste the deprecated policy into your policy file and maintain it manually.
  warnings.warn(deprecated_msg)
/usr/lib/python3/dist-packages/oslo_policy/policy.py:770: UserWarning: Policy "os_compute_api:os-volumes":"rule:admin_or_owner" was deprecated in 22.0.0 in favor of "os_compute_api:os-volumes:list":"rule:project_reader_or_admin". Reason: 
Nova API policies are introducing new default roles with scope_type
capabilities. Old policies are deprecated and silently going to be ignored
in nova 23.0.0 release.
. Either ensure your deployment is ready for the new default or copy/paste the deprecated policy into your policy file and maintain it manually.
  warnings.warn(deprecated_msg)
/usr/lib/python3/dist-packages/oslo_policy/policy.py:770: UserWarning: Policy "os_compute_api:os-volumes":"rule:admin_or_owner" was deprecated in 22.0.0 in favor of "os_compute_api:os-volumes:create":"rule:project_member_or_admin". Reason: 
Nova API policies are introducing new default roles with scope_type
capabilities. Old policies are deprecated and silently going to be ignored
in nova 23.0.0 release.
. Either ensure your deployment is ready for the new default or copy/paste the deprecated policy into your policy file and maintain it manually.
  warnings.warn(deprecated_msg)
/usr/lib/python3/dist-packages/oslo_policy/policy.py:770: UserWarning: Policy "os_compute_api:os-volumes":"rule:admin_or_owner" was deprecated in 22.0.0 in favor of "os_compute_api:os-volumes:detail":"rule:project_reader_or_admin". Reason: 
Nova API policies are introducing new default roles with scope_type
capabilities. Old policies are deprecated and silently going to be ignored
in nova 23.0.0 release.
. Either ensure your deployment is ready for the new default or copy/paste the deprecated policy into your policy file and maintain it manually.
  warnings.warn(deprecated_msg)
/usr/lib/python3/dist-packages/oslo_policy/policy.py:770: UserWarning: Policy "os_compute_api:os-volumes":"rule:admin_or_owner" was deprecated in 22.0.0 in favor of "os_compute_api:os-volumes:show":"rule:project_reader_or_admin". Reason: 
Nova API policies are introducing new default roles with scope_type
capabilities. Old policies are deprecated and silently going to be ignored
in nova 23.0.0 release.
. Either ensure your deployment is ready for the new default or copy/paste the deprecated policy into your policy file and maintain it manually.
  warnings.warn(deprecated_msg)
/usr/lib/python3/dist-packages/oslo_policy/policy.py:770: UserWarning: Policy "os_compute_api:os-volumes":"rule:admin_or_owner" was deprecated in 22.0.0 in favor of "os_compute_api:os-volumes:delete":"rule:project_member_or_admin". Reason: 
Nova API policies are introducing new default roles with scope_type
capabilities. Old policies are deprecated and silently going to be ignored
in nova 23.0.0 release.
. Either ensure your deployment is ready for the new default or copy/paste the deprecated policy into your policy file and maintain it manually.
  warnings.warn(deprecated_msg)
/usr/lib/python3/dist-packages/oslo_policy/policy.py:770: UserWarning: Policy "os_compute_api:os-volumes":"rule:admin_or_owner" was deprecated in 22.0.0 in favor of "os_compute_api:os-volumes:snapshots:list":"rule:project_reader_or_admin". Reason: 
Nova API policies are introducing new default roles with scope_type
capabilities. Old policies are deprecated and silently going to be ignored
in nova 23.0.0 release.
. Either ensure your deployment is ready for the new default or copy/paste the deprecated policy into your policy file and maintain it manually.
  warnings.warn(deprecated_msg)
/usr/lib/python3/dist-packages/oslo_policy/policy.py:770: UserWarning: Policy "os_compute_api:os-volumes":"rule:admin_or_owner" was deprecated in 22.0.0 in favor of "os_compute_api:os-volumes:snapshots:create":"rule:project_member_or_admin". Reason: 
Nova API policies are introducing new default roles with scope_type
capabilities. Old policies are deprecated and silently going to be ignored
in nova 23.0.0 release.
. Either ensure your deployment is ready for the new default or copy/paste the deprecated policy into your policy file and maintain it manually.
  warnings.warn(deprecated_msg)
/usr/lib/python3/dist-packages/oslo_policy/policy.py:770: UserWarning: Policy "os_compute_api:os-volumes":"rule:admin_or_owner" was deprecated in 22.0.0 in favor of "os_compute_api:os-volumes:snapshots:detail":"rule:project_reader_or_admin". Reason: 
Nova API policies are introducing new default roles with scope_type
capabilities. Old policies are deprecated and silently going to be ignored
in nova 23.0.0 release.
. Either ensure your deployment is ready for the new default or copy/paste the deprecated policy into your policy file and maintain it manually.
  warnings.warn(deprecated_msg)
/usr/lib/python3/dist-packages/oslo_policy/policy.py:770: UserWarning: Policy "os_compute_api:os-volumes":"rule:admin_or_owner" was deprecated in 22.0.0 in favor of "os_compute_api:os-volumes:snapshots:show":"rule:project_reader_or_admin". Reason: 
Nova API policies are introducing new default roles with scope_type
capabilities. Old policies are deprecated and silently going to be ignored
in nova 23.0.0 release.
. Either ensure your deployment is ready for the new default or copy/paste the deprecated policy into your policy file and maintain it manually.
  warnings.warn(deprecated_msg)
/usr/lib/python3/dist-packages/oslo_policy/policy.py:770: UserWarning: Policy "os_compute_api:os-volumes":"rule:admin_or_owner" was deprecated in 22.0.0 in favor of "os_compute_api:os-volumes:snapshots:delete":"rule:project_member_or_admin". Reason: 
Nova API policies are introducing new default roles with scope_type
capabilities. Old policies are deprecated and silently going to be ignored
in nova 23.0.0 release.
. Either ensure your deployment is ready for the new default or copy/paste the deprecated policy into your policy file and maintain it manually.
  warnings.warn(deprecated_msg)
Found 2 cell mappings.
Skipping cell0 since it does not contain hosts.
Getting computes from cell: 1ee5b233-6b94-40f5-b3d2-fc1a89c13274
Checking host mapping for compute host 'cloudvirt1031': d022a6b5-c6fb-4e28-96c2-a418f876318d
Creating host mapping for compute host 'cloudvirt1031': d022a6b5-c6fb-4e28-96c2-a418f876318d
Found 1 unmapped computes in cell: 1ee5b233-6b94-40f5-b3d2-fc1a89c13274
aborrero updated the task description.
aborrero moved this task from Backlog to Radar/observer on the User-aborrero board.

Typically on a reimage we don't need to remove or rediscover hosts; the pool is based on hostname so the reimaged hosts should rejoin without any issues.

I'm able to schedule VMs on 1031, and nova thinks that the compute service there is 'up'. So I think I don't understand what this bug is about :/

> Typically on a reimage we don't need to remove or rediscover hosts; the pool is based on hostname so the reimaged hosts should rejoin without any issues.

Ok, now that I've read the docs I see that what I just said is no longer true. So maybe this was an issue but is now fixed?

OK, in summary, I think taavi fixed it. What I would do next time is:

$ # stop puppet on the cloudvirt, stop nova-compute on the cloudvirt
$ wmcs-openstack compute service forget <hostname>
$ # reimage the host
$ sudo nova-manage cell_v2 discover_hosts

That might work, or it might not!

Update: after trying the procedure @Andrew described above, I get:

Feb 19 12:40:36 cloudvirt1032 nova-compute[27450]: 2024-02-19 12:40:36.391 27450 ERROR nova.compute.manager [None req-9ca69125-9502-458f-a0a4-e6537f717f70 - - - - - -] Could not retrieve compute node resource provider 77d7d013-f1b2-4ca2-8ba5-fac20a1c092f and therefore unable to error out any instances stuck in BUILDING state. Error: Failed to retrieve allocations for resource provider 77d7d013-f1b2-4ca2-8ba5-fac20a1c092f: {"errors": [{"status": 404, "title": "Not Found", "detail": "The resource could not be found.\n\n Resource provider '77d7d013-f1b2-4ca2-8ba5-fac20a1c092f' not found: No resource provider with uuid 77d7d013-f1b2-4ca2-8ba5-fac20a1c092f found  ", "request_id": "req-acc50e4e-df44-4061-acb4-3bacea60b317"}]}: nova.exception.ResourceProviderAllocationRetrievalFailed: Failed to retrieve allocations for resource provider 77d7d013-f1b2-4ca2-8ba5-fac20a1c092f: {"errors": [{"status": 404, "title": "Not Found", "detail": "The resource could not be found.\n\n Resource provider '77d7d013-f1b2-4ca2-8ba5-fac20a1c092f' not found: No resource provider with uuid 77d7d013-f1b2-4ca2-8ba5-fac20a1c092f found  ", "request_id": "req-acc50e4e-df44-4061-acb4-3bacea60b317"}]}

Feb 19 12:40:38 cloudvirt1032 nova-compute[27450]: 2024-02-19 12:40:38.199 27450 ERROR nova.compute.resource_tracker [None req-9ca69125-9502-458f-a0a4-e6537f717f70 - - - - - -] Skipping removal of allocations for deleted instances: Failed to retrieve allocations for resource provider 77d7d013-f1b2-4ca2-8ba5-fac20a1c092f: {"errors": [{"status": 404, "title": "Not Found", "detail": "The resource could not be found.\n\n Resource provider '77d7d013-f1b2-4ca2-8ba5-fac20a1c092f' not found: No resource provider with uuid 77d7d013-f1b2-4ca2-8ba5-fac20a1c092f found  ", "request_id": "req-686ff94f-2317-4429-90e8-08ddb1890ed4"}]}: nova.exception.ResourceProviderAllocationRetrievalFailed: Failed to retrieve allocations for resource provider 77d7d013-f1b2-4ca2-8ba5-fac20a1c092f: {"errors": [{"status": 404, "title": "Not Found", "detail": "The resource could not be found.\n\n Resource provider '77d7d013-f1b2-4ca2-8ba5-fac20a1c092f' not found: No resource provider with uuid 77d7d013-f1b2-4ca2-8ba5-fac20a1c092f found  ", "request_id": "req-686ff94f-2317-4429-90e8-08ddb1890ed4"}]}

Feb 19 12:40:38 cloudvirt1032 nova-compute[27450]: 2024-02-19 12:40:38.403 27450 ERROR nova.scheduler.client.report [None req-9ca69125-9502-458f-a0a4-e6537f717f70 - - - - - -] [req-e8e3b06c-4c96-4a32-b2b2-709e24deb589] Failed to create resource provider record in placement API for UUID 77d7d013-f1b2-4ca2-8ba5-fac20a1c092f. Got 409: {"errors": [{"status": 409, "title": "Conflict", "detail": "There was a conflict when trying to complete your request.\n\n Conflicting resource provider name: cloudvirt1032.eqiad.wmnet already exists.  ", "request_id": "req-e8e3b06c-4c96-4a32-b2b2-709e24deb589"}]}.

Feb 19 12:40:38 cloudvirt1032 nova-compute[27450]: 2024-02-19 12:40:38.404 27450 ERROR nova.compute.manager [None req-9ca69125-9502-458f-a0a4-e6537f717f70 - - - - - -] Error updating resources for node cloudvirt1032.eqiad.wmnet.: nova.exception.ResourceProviderCreationFailed: Failed to create resource provider cloudvirt1032.eqiad.wmnet
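
The log lines above are consistent with placement requiring both the UUID and the name of a resource provider to be unique: the reimaged host came up with a new UUID (77d7d013-f1b2-4ca2-8ba5-fac20a1c092f) but kept its name, so lookups by the new UUID return 404 while creating the provider fails with 409. The following is only a toy in-memory model of that behaviour, not placement's real code; the class `ToyPlacement`, the function `reimage_scenario`, and the `old-uuid` placeholder are hypothetical.

```python
class ToyPlacement:
    """Toy stand-in for the placement API (illustration only)."""

    def __init__(self):
        self._by_uuid = {}  # provider uuid -> provider name

    def get(self, uuid):
        # Unknown UUID: 404, like the "Not Found" errors in the log.
        return 200 if uuid in self._by_uuid else 404

    def create(self, uuid, name):
        # Both uuid and name must be unique, so a leftover record with
        # the same name blocks creation under a new uuid (409 Conflict).
        if uuid in self._by_uuid or name in self._by_uuid.values():
            return 409
        self._by_uuid[uuid] = name
        return 201

    def delete(self, uuid):
        return 204 if self._by_uuid.pop(uuid, None) is not None else 404


def reimage_scenario():
    """Replay the sequence seen on cloudvirt1032 and return the status codes."""
    name = "cloudvirt1032.eqiad.wmnet"
    new_uuid = "77d7d013-f1b2-4ca2-8ba5-fac20a1c092f"
    placement = ToyPlacement()
    return [
        placement.create("old-uuid", name),  # registration before the reimage
        placement.get(new_uuid),             # after reimage: new uuid is unknown
        placement.create(new_uuid, name),    # name still held by the old record
        placement.delete("old-uuid"),        # manual cleanup of the stale provider
        placement.create(new_uuid, name),    # registration now succeeds
    ]


if __name__ == "__main__":
    print(reimage_scenario())
```

In this model the conflict disappears either by deleting the stale record, as was done by hand for cloudvirt1031, or by making the host come back with the UUID it had before.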

I think the proper fix here is to persist the nova ID of each host via puppet.

  • for new hosts, generate the ID, maybe based on the hostname
  • for existing hosts, capture the current ID and store it in a hiera override
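
One possible way to derive an ID from the hostname, as the first bullet suggests, is a name-based (version 5) UUID. This is a minimal sketch only: the helper name `compute_id_for` and the choice of `uuid.NAMESPACE_DNS` are assumptions made for illustration, and this is not necessarily what the puppet change does.

```python
import uuid


def compute_id_for(fqdn: str) -> str:
    """Return a deterministic UUID string for a host FQDN (hypothetical helper)."""
    # uuid5 hashes the namespace plus the name with SHA-1, so the same
    # FQDN always yields the same UUID, even after the host is reimaged.
    # The name is normalised so that case or stray whitespace does not
    # silently produce a different ID.
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, fqdn.strip().lower()))


if __name__ == "__main__":
    print(compute_id_for("cloudvirt1032.eqiad.wmnet"))
```

A value computed this way would only suit new hosts. Existing hosts already have an ID registered, which is why the second bullet proposes capturing the current value instead of regenerating it.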

Is this because we split the cookbook into three steps? (As in, would it be enough to store the value in the first step and reuse it in the next ones?)

This is apparently a recent change in OpenStack; see https://docs.openstack.org/nova/latest/admin/compute-node-identification.html

I think there are 2 problems here:

In fact, after dealing with the first item, we may not need T357765 after all, or at least not the whole of it.

Change 1005065 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: nova-compute: persist compute node id

https://gerrit.wikimedia.org/r/1005065

Change 1005065 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: nova-compute: persist compute node id

https://gerrit.wikimedia.org/r/1005065

aborrero claimed this task.

The patch solved the problem!

Change #1017124 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] openstack: nova-compute: persist compute node id for cloudvirt1031

https://gerrit.wikimedia.org/r/1017124

Change #1017124 merged by Andrew Bogott:

[operations/puppet@production] openstack: nova-compute: persist compute node id for cloudvirt1031

https://gerrit.wikimedia.org/r/1017124