
At least one VM is live on a host that openstack disagrees with
Closed, Resolved · Public

Description

tools-sgeexec-0918 is currently running via libvirt on cloudvirt1016. However, cloudvirt1016's nova-compute log shows this:

Feb 22 16:27:11 cloudvirt1016 nova-compute: 2021-02-22 16:27:11.831 67114 WARNING nova.compute.resource_tracker [req-bb5dffc4-6cba-455b-971e-8a0a216d4d6e - - - - -] Instance 1469938f-3dd9-4585-a538-8050cb57a44b has been moved to another host cloudvirt1036(cloudvirt1036.eqiad.wmnet). There are allocations remaining against the source host that might need to be removed: {'resources': {'VCPU': 4, 'MEMORY_MB': 8192, 'DISK_GB': 80}}.

This is incorrect! Meanwhile, cloudvirt1036 thinks the instance is there and briefly worries about its absence:

Feb 22 15:35:55 cloudvirt1036 nova-compute: 2021-02-22 15:35:55.364 18521 WARNING nova.compute.manager [req-7600eb69-23f1-4af6-ad19-c836f736c374 - - - - -] While synchronizing instance power states, found 57 instances in the database and 56 instances on the hypervisor.
Feb 22 15:35:55 cloudvirt1036 nova-compute: 2021-02-22 15:35:55.502 18521 WARNING nova.compute.manager [-] [instance: 1469938f-3dd9-4585-a538-8050cb57a44b] Instance is unexpectedly not found. Ignore.
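
For reference, the nova side of the disagreement can be confirmed from a cloudcontrol host. The exact command isn't recorded in this task, so this is only a sketch of the kind of check involved:

# Assumed check from a cloudcontrol host: which host does nova think the
# instance is on, and what power state does it report?
sudo wmcs-openstack server show 1469938f-3dd9-4585-a538-8050cb57a44b -c 'OS-EXT-SRV-ATTR:host' -c 'OS-EXT-STS:power_state'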

I have not found anything else like this so far (but it might be worth a quick audit). I used cumin to find the libvirt instance location.
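
The cumin query isn't recorded here either; it was presumably something along these lines, asking every cloudvirt whether libvirt has a domain with the instance's UUID (the host selector is a guess):

# Hypothetical reconstruction of the cumin audit: the host that returns a match
# is the hypervisor actually running the VM via libvirt.
sudo cumin 'cloudvirt1*.eqiad.wmnet' 'virsh list --all --uuid | grep 1469938f-3dd9-4585-a538-8050cb57a44b'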

Event Timeline

Bstorm triaged this task as Medium priority. Feb 22 2021, 5:02 PM
Bstorm created this task.

Mentioned in SAL (#wikimedia-cloud) [2021-02-22T17:14:58Z] <bstorm> restarting nova-compute on cloudvirt1016 and cloudvirt1036 in case it helps T275411
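
That restart would have been the standard systemd one on each hypervisor, i.e. something like the following (assumed invocation, not logged in this task):

# Restart the nova-compute agent on cloudvirt1016 and cloudvirt1036
sudo systemctl restart nova-compute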

That restart, predictably, did nothing. :)

Mentioned in SAL (#wikimedia-cloud) [2021-02-22T19:03:17Z] <bstorm> depooled tools-sgeexec-0918 T275411

Note: nova sees the power state as "no state" for this VM, which is not surprising :)
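
The power state can be read straight from the API; a sketch of the check (exact command not logged here):

# Assumed check from a cloudcontrol host: "no state" (NOSTATE) shows up in this field
sudo wmcs-openstack server show tools-sgeexec-0918 -c 'OS-EXT-STS:power_state'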

Mentioned in SAL (#wikimedia-cloud) [2021-02-22T19:05:22Z] <bstorm> shutting down tools-sgeexec-0918 (with openstack to see what happens) T275411

It told me "Error: You are not allowed to shut off instance: tools-sgeexec-0918" 😁
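
The attempt that produced that error was presumably the normal API shutdown, roughly:

# Assumed invocation of the API shutdown that was refused
sudo wmcs-openstack server stop tools-sgeexec-0918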

Mentioned in SAL (#wikimedia-cloud) [2021-02-22T19:07:52Z] <bstorm> shutting down tools-sgeexec-0918 with the VM's command line (not libvirt directly yet) T275411
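
That just means logging into the VM and issuing a normal OS-level shutdown, e.g.:

# Run inside tools-sgeexec-0918 itself, bypassing the nova API (assumed invocation)
sudo shutdown -h now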

Mentioned in SAL (#wikimedia-cloud) [2021-02-22T19:09:56Z] <bstorm> hard rebooted tools-sgeexec-0918 from openstack T275411
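
A hard reboot through the API is roughly the following (assumed invocation from a cloudcontrol host):

# --hard forces a power cycle rather than an ACPI-style soft reboot
sudo wmcs-openstack server reboot --hard tools-sgeexec-0918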

Hard reboot did the magical thing. OpenStack now knows the power state and appears to be controlling the VM.

[bstorm@cloudvirt1036]:~ $ sudo virsh list | grep i-00002cba
 104   i-00002cba   running

This means there is a very easy way to audit for this state: if a VM's power state shows as "no state" (NOSTATE), it is in limbo.

Bstorm claimed this task.

Since NOSTATE is a good proxy for finding VMs in this state, I ran:

[bstorm@cloudcontrol1003]:~ $ sudo wmcs-openstack server list --all-projects --long | grep NOSTATE
| d603b2e0-7b8b-462f-b74d-c782c2d34fea | fullstackd-20210110160929            | BUILD     | scheduling | NOSTATE     |                                                      | debian-10.0-buster (deprecated 2021-02-22)  | 6b67c8a1-6356-464d-a885-0576d7263e51 | g2.cores1.ram2.disk20         | f5b0c7bc-b09a-41af-8812-b50ed99dbec8 |                   | None               |                                 |
| de419714-6d1f-4811-ae3b-0849aa600271 | canary1028-01                        | ERROR     | None       | NOSTATE     |                                                      | debian-10.0-buster                          | 64351116-a53e-4a62-8866-5f0058d89c2b | cloudvirt-canary              | 72116845-7941-4d3d-9eb1-11084b7b1927 |                   | None               | description=''canary VM''       |
| 56dec8ba-09da-4ea4-bcd7-9f62d2781e56 | canary1020-01                        | ERROR     | None       | NOSTATE     |                                                      | debian-10.0-buster (deprecated 2021-02-22)  | 6b67c8a1-6356-464d-a885-0576d7263e51 | cloudvirt-canary              | 72116845-7941-4d3d-9eb1-11084b7b1927 |                   | None               | description='canary VM'         |
| 6db4b8e6-e750-4140-86b5-c2d329b13499 | canary1019-01                        | ERROR     | None       | NOSTATE     |                                                      | debian-10.0-buster (deprecated 2021-02-22)  | 6b67c8a1-6356-464d-a885-0576d7263e51 | cloudvirt-canary              | 72116845-7941-4d3d-9eb1-11084b7b1927 |                   | None               | description='canary VM'

And I think that confirms no other VMs are like this, since those are either still in BUILD or relate to T275376: Cloudvirt instances failing to start. I'll double-check that they don't need cleanup.