
openstack: consider removing references to old hardware from the database
Closed, Resolved · Public

Description

We still have references to old (already decommissioned) hardware somewhere in the database, as the prometheus-openstack-exporter reports data for them.

Example:

aborrero@cloudcontrol1007:~$ curl localhost:12345/metrics -o metrics.prom
aborrero@cloudcontrol1007:~$ grep cloudvirt1001 metrics.prom
openstack_placement_resource_allocation_ratio{hostname="cloudvirt1001.eqiad.wmnet",resourcetype="DISK_GB"} 1.5
openstack_placement_resource_allocation_ratio{hostname="cloudvirt1001.eqiad.wmnet",resourcetype="MEMORY_MB"} 1
openstack_placement_resource_allocation_ratio{hostname="cloudvirt1001.eqiad.wmnet",resourcetype="VCPU"} 4
openstack_placement_resource_reserved{hostname="cloudvirt1001.eqiad.wmnet",resourcetype="DISK_GB"} 0
openstack_placement_resource_reserved{hostname="cloudvirt1001.eqiad.wmnet",resourcetype="MEMORY_MB"} 512
openstack_placement_resource_reserved{hostname="cloudvirt1001.eqiad.wmnet",resourcetype="VCPU"} 0
openstack_placement_resource_total{hostname="cloudvirt1001.eqiad.wmnet",resourcetype="DISK_GB"} 2015
openstack_placement_resource_total{hostname="cloudvirt1001.eqiad.wmnet",resourcetype="MEMORY_MB"} 386952
openstack_placement_resource_total{hostname="cloudvirt1001.eqiad.wmnet",resourcetype="VCPU"} 48
openstack_placement_resource_usage{hostname="cloudvirt1001.eqiad.wmnet",resourcetype="DISK_GB"} 52
openstack_placement_resource_usage{hostname="cloudvirt1001.eqiad.wmnet",resourcetype="MEMORY_MB"} 2048
openstack_placement_resource_usage{hostname="cloudvirt1001.eqiad.wmnet",resourcetype="VCPU"} 4

This is likely somewhere in the placement database, but I couldn't find where:

aborrero@cloudcontrol1007:~$ sudo mysql -u root
MariaDB [(none)]> show databases;
+--------------------+
| Database           |
+--------------------+
| cinder             |
| designate          |
| eqiad1_ceph_backy  |
| eqiad1_heat        |
| eqiad1_magnum      |
| glance             |
| information_schema |
| keystone           |
| mysql              |
| neutron            |
| nova_api_eqiad1    |
| nova_cell0_eqiad1  |
| nova_eqiad1        |
| performance_schema |
| placement          |
| trove_eqiad1       |
+--------------------+
16 rows in set (0.001 sec)

MariaDB [(none)]> use placement;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [placement]> show tables;
+------------------------------+
| Tables_in_placement          |
+------------------------------+
| alembic_version              |
| allocations                  |
| consumer_types               |
| consumers                    |
| inventories                  |
| placement_aggregates         |
| projects                     |
| resource_classes             |
| resource_provider_aggregates |
| resource_provider_traits     |
| resource_providers           |
| traits                       |
| users                        |
+------------------------------+
13 rows in set (0.000 sec)
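
A query along these lines (a rough sketch, assuming the standard placement schema, where resource_providers.name stores the hypervisor FQDN) might surface the stale rows directly:

aborrero@cloudcontrol1007:~$ sudo mysql placement -e "SELECT uuid, name FROM resource_providers WHERE name LIKE 'cloudvirt1001%';"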

However, at least some parts of openstack know that these hosts don't exist:

aborrero@cloudcontrol1007:~$ sudo wmcs-openstack hypervisor list | grep cloudvirt1001
[.. nothing ..]

The impact is just cosmetic: we get some panels with empty data in Grafana, which is a bit annoying but harmless.

Event Timeline

aborrero created this task.

Following up from T340611, my next best guess is that the openstack exporter performs some caching? That seems likely if the OS API returns correct data (i.e. no old hosts)

I couldn't find such a cache. I suspect the DB, because I'm not aware of any procedure we run to clean it up when we decommission hardware.

Yeah, that must be it then; I'm definitely out of my depth here, but happy to help with the Prometheus side of things if needed.

Removing hardware records from the DB seems a little bit dangerous, as that could leave dangling references elsewhere (for instance in the action log, which keeps track of any previous actions a VM took, including a reference to where the VM was at the time).

This seems like a bug in the exporter; it should really be able to tell the difference between nodes that exist and nodes that don't. I'm going to look in the code a bit to see what it's doing.
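
As an illustration of that cross-check (a rough sketch; it assumes the osc-placement plugin's resource provider subcommands are available through wmcs-openstack on the cloudcontrol host):

sudo wmcs-openstack resource provider list -f value -c name | sort > providers.txt
sudo wmcs-openstack hypervisor list -f value -c "Hypervisor Hostname" | sort > hypervisors.txt
# names placement still has but nova no longer knows about
comm -23 providers.txt hypervisors.txt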

Ok, I think I found them! These deleted hosts can be cleaned up with:

openstack resource provider list
openstack resource provider show <provider uuid> --allocations
openstack resource provider allocation delete <consumer uuid>
openstack resource provider delete <provider uuid>
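
In practice that looks roughly like the following (a sketch; the hostname is just an example and the UUIDs are placeholders, with wmcs-openstack assumed to provide the osc-placement subcommands):

# find the stale provider and its uuid
sudo wmcs-openstack resource provider list | grep cloudvirt1001
# list any allocations still attached to it (e.g. leftover canary VMs)
sudo wmcs-openstack resource provider show <provider-uuid> --allocations
# remove those allocations, then the provider itself
sudo wmcs-openstack resource provider allocation delete <consumer-uuid>
sudo wmcs-openstack resource provider delete <provider-uuid>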

I think this is cleaned up and resolved for now. In the future, I suspect that deleting canary VMs before deleting hypervisors will prevent them from showing up here, but openstack resource provider delete might still be needed.