
CloudVPS: nova messing with instance disks
Closed, Resolved (Public)

Description

While working on T240851 (CloudVPS: stretch base images fails to boot) today, I found that nova appears to be messing with instance disks.

On cloudvirt1025, in /var/log/nova/nova-compute.log:

aborrero@cloudvirt1025:~ $ sudo grep ERROR /var/log/nova/nova-compute.log | grep DiskNotFound: | wc -l
605
aborrero@cloudvirt1025:~ $ sudo grep ERROR /var/log/nova/nova-compute.log | grep DiskNotFound: | tail
2019-12-16 16:29:24.553 273212 ERROR nova.compute.manager DiskNotFound: No disk at /var/lib/nova/instances/d5f84421-a4ee-423d-92a5-8871f8612666/disk
2019-12-16 16:30:26.481 273212 ERROR nova.compute.manager DiskNotFound: No disk at /var/lib/nova/instances/d5f84421-a4ee-423d-92a5-8871f8612666/disk
2019-12-16 16:31:26.556 273212 ERROR nova.compute.manager DiskNotFound: No disk at /var/lib/nova/instances/d5f84421-a4ee-423d-92a5-8871f8612666/disk
2019-12-16 16:32:28.483 273212 ERROR nova.compute.manager DiskNotFound: No disk at /var/lib/nova/instances/d5f84421-a4ee-423d-92a5-8871f8612666/disk
2019-12-16 16:33:28.543 273212 ERROR nova.compute.manager DiskNotFound: No disk at /var/lib/nova/instances/d5f84421-a4ee-423d-92a5-8871f8612666/disk
2019-12-16 16:34:30.500 273212 ERROR nova.compute.manager DiskNotFound: No disk at /var/lib/nova/instances/d5f84421-a4ee-423d-92a5-8871f8612666/disk
2019-12-16 16:35:31.535 273212 ERROR nova.compute.manager DiskNotFound: No disk at /var/lib/nova/instances/d5f84421-a4ee-423d-92a5-8871f8612666/disk
2019-12-16 16:36:32.583 273212 ERROR nova.compute.manager DiskNotFound: No disk at /var/lib/nova/instances/d5f84421-a4ee-423d-92a5-8871f8612666/disk
2019-12-16 16:37:32.551 273212 ERROR nova.compute.manager DiskNotFound: No disk at /var/lib/nova/instances/d5f84421-a4ee-423d-92a5-8871f8612666/disk
2019-12-16 16:38:32.516 273212 ERROR nova.compute.manager DiskNotFound: No disk at /var/lib/nova/instances/d5f84421-a4ee-423d-92a5-8871f8612666/disk
[..]

That VM is cloudinfra-internal-puppetmaster01, which should be running on cloudvirt1026, so it is odd that nova-compute on cloudvirt1025 keeps looking for its disk.

The same happens on cloudvirt1026, for example:

aborrero@cloudvirt1026:~ $ sudo grep d56d6f32-5d96-477c-bfae-45cd5fbc47e2 /var/log/nova/nova-compute.log | tail
2019-12-16 16:32:36.718 49690 ERROR nova.compute.manager DiskNotFound: No disk at /var/lib/nova/instances/d56d6f32-5d96-477c-bfae-45cd5fbc47e2/disk
2019-12-16 16:33:38.444 49690 ERROR nova.compute.manager DiskNotFound: No disk at /var/lib/nova/instances/d56d6f32-5d96-477c-bfae-45cd5fbc47e2/disk
2019-12-16 16:34:37.572 49690 ERROR nova.compute.manager DiskNotFound: No disk at /var/lib/nova/instances/d56d6f32-5d96-477c-bfae-45cd5fbc47e2/disk
2019-12-16 16:35:38.986 49690 ERROR nova.compute.manager DiskNotFound: No disk at /var/lib/nova/instances/d56d6f32-5d96-477c-bfae-45cd5fbc47e2/disk
2019-12-16 16:36:38.268 49690 ERROR nova.compute.manager DiskNotFound: No disk at /var/lib/nova/instances/d56d6f32-5d96-477c-bfae-45cd5fbc47e2/disk
2019-12-16 16:37:39.541 49690 ERROR nova.compute.manager DiskNotFound: No disk at /var/lib/nova/instances/d56d6f32-5d96-477c-bfae-45cd5fbc47e2/disk
2019-12-16 16:38:41.476 49690 ERROR nova.compute.manager DiskNotFound: No disk at /var/lib/nova/instances/d56d6f32-5d96-477c-bfae-45cd5fbc47e2/disk
2019-12-16 16:39:42.436 49690 ERROR nova.compute.manager DiskNotFound: No disk at /var/lib/nova/instances/d56d6f32-5d96-477c-bfae-45cd5fbc47e2/disk
2019-12-16 16:40:43.657 49690 ERROR nova.compute.manager DiskNotFound: No disk at /var/lib/nova/instances/d56d6f32-5d96-477c-bfae-45cd5fbc47e2/disk
2019-12-16 16:41:44.343 49690 ERROR nova.compute.manager DiskNotFound: No disk at /var/lib/nova/instances/d56d6f32-5d96-477c-bfae-45cd5fbc47e2/disk
aborrero@cloudvirt1026:~ $ sudo grep d56d6f32-5d96-477c-bfae-45cd5fbc47e2 /var/log/nova/nova-compute.log | wc -l
607

That VM is tools-static-12, which is running on cloudvirt1018.

I bet this is happening all over the fleet. Not sure if this is important or relevant in any way though.
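To check how widespread this is on a given host, the grep pipelines above can be generalised into a per-instance tally. The following is a minimal sketch, assuming the log lines have exactly the format quoted in this task; the function name `count_disk_not_found` is hypothetical and not part of nova or any WMCS tooling.

```python
import re
from collections import Counter

# Matches the DiskNotFound lines quoted in this task and captures the
# instance UUID from the disk path. The pattern is an assumption based
# only on the log excerpts above.
DISK_NOT_FOUND = re.compile(
    r"ERROR nova\.compute\.manager DiskNotFound: No disk at "
    r"/var/lib/nova/instances/(?P<uuid>[0-9a-f-]{36})/disk"
)


def count_disk_not_found(lines):
    """Return a Counter mapping instance UUID -> number of DiskNotFound errors."""
    counts = Counter()
    for line in lines:
        match = DISK_NOT_FOUND.search(line)
        if match:
            counts[match.group("uuid")] += 1
    return counts
```

Passing an open file handle for /var/log/nova/nova-compute.log and printing `counts.most_common()` would list the affected instance UUIDs on that host, most frequent first.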

Event Timeline

aborrero triaged this task as Medium priority. (Dec 16 2019, 4:49 PM)
aborrero moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board.

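This happens when a VM is migrated with the wmcs cold migration script without being undefined in virsh on the source host. The stale domain definition is left behind there, which is presumably why nova-compute on that host keeps looking for a disk that has moved.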

This is probably resolved, but @JHedden will double-check.

Cleaned up all the stale entries with virsh undefine <domain id>
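A cleanup like this could be sketched roughly as follows. This is not the procedure that was actually used here, only an illustration. It assumes two lists of UUIDs are gathered beforehand: the domains defined in libvirt on the host (for example from `virsh list --all --uuid`) and the instances nova expects on that host (how that list is obtained is left open). The function names `stale_domains` and `undefine_commands` are hypothetical.

```python
def stale_domains(defined_uuids, expected_uuids):
    """Return sorted UUIDs that libvirt still defines on this host
    but that nova does not expect to be running here."""
    return sorted(set(defined_uuids) - set(expected_uuids))


def undefine_commands(stale):
    """Build the virsh commands for review; nothing is executed here."""
    return [f"virsh undefine {uuid}" for uuid in stale]
```

Generating the commands instead of running them keeps a human in the loop, which seems prudent given that undefining the wrong domain on a cloudvirt would affect a live VM.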