
wmcs-drain-hypervisor is broken
Closed, ResolvedPublic

Description

cloudvirt1060 is only in the maintenance aggregate:

taavi@cloudcontrol1006 ~ $ os hypervisor show cloudvirt1060.eqiad.wmnet
+---------------------+--------------------------------------+
| Field               | Value                                |
+---------------------+--------------------------------------+
| aggregates          | ['maintenance']                      |
| cpu_info            | None                                 |
| host_ip             | 10.64.149.12                         |
| host_time           | 14:34:46                             |
| hypervisor_hostname | cloudvirt1060.eqiad.wmnet            |
| hypervisor_type     | QEMU                                 |
| hypervisor_version  | 7002007                              |
| id                  | b5a14b7c-c4a7-4a1c-8c09-7eccdb235b9b |
| load_average        | 10.47, 9.82, 9.50                    |
| service_host        | cloudvirt1060                        |
| service_id          | 8ec33070-080a-467c-9dc5-b75483d18c2f |
| state               | up                                   |
| status              | enabled                              |
| uptime              | 66 days, 18:49                       |
| users               | 1                                    |
+---------------------+--------------------------------------+

However, the drain script fails to migrate any VMs off the host. Log lines like the following are printed for each VM, and the automatic retry fails in the same way.

wmcs-drain-hypervisor: 2024-01-15 14:33:28,669: INFO: Migrating control-plane (9cd703fb-7f53-4458-937a-9e34c16726f8)
wmcs-drain-hypervisor: 2024-01-15 14:33:31,072: INFO: current status is ACTIVE; waiting for it to change to ['MIGRATING']
wmcs-drain-hypervisor: 2024-01-15 14:33:32,371: INFO: current status is MIGRATING; waiting for it to change to ['ACTIVE']
wmcs-drain-hypervisor: 2024-01-15 14:33:34,944: INFO: instance 9cd703fb-7f53-4458-937a-9e34c16726f8 (control-plane) is now on host cloudvirt1060 with status ACTIVE
wmcs-drain-hypervisor: 2024-01-15 14:33:34,944: WARNING: control-plane (9cd703fb-7f53-4458-937a-9e34c16726f8) didn't actually migrate, got scheduled on the same hypervisor. Will try again!
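The warning above suggests the drain script checks whether a VM actually left the drained hypervisor after each live-migration attempt, and retries when it landed back on the same host. A minimal sketch of that check (all names hypothetical; the real script is wmcs-drain-hypervisor in the WMCS tooling, not this code):

```shell
# Sketch: migrate each VM off DRAIN_HOST, verify it moved, retry a few times.
# `migrate_vm` stands in for the OpenStack live-migration call and is assumed
# to print the hypervisor the VM ends up on once it is ACTIVE again.
drain_host() {
    DRAIN_HOST="$1"; shift
    for vm in "$@"; do
        for attempt in 1 2 3; do
            new_host=$(migrate_vm "$vm")
            if [ "$new_host" != "$DRAIN_HOST" ]; then
                echo "$vm: moved to $new_host"
                break
            fi
            echo "$vm: got scheduled on the same hypervisor, retrying" >&2
        done
        [ "$new_host" = "$DRAIN_HOST" ] && echo "$vm: FAILED to drain" >&2
    done
}
```

In the failure reported here every attempt lands back on cloudvirt1060, so the retry loop can never succeed: the underlying live-migration is failing server-side, as the later comments show.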

Event Timeline

It seems the migrations actually "failed":

taavi@cloudcontrol1006 ~ $ os server migration list --changes-since 2024-01-15T00:00:00Z
+-------+--------------------------------------+---------------------------+---------------------------+----------------+---------------+-----------+--------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+
|    Id | UUID                                 | Source Node               | Dest Node                 | Source Compute | Dest Compute  | Dest Host | Status | Server UUID                          | Old Flavor | New Flavor | Type           | Created At                 | Updated At                 |
+-------+--------------------------------------+---------------------------+---------------------------+----------------+---------------+-----------+--------+--------------------------------------+------------+------------+----------------+----------------------------+----------------------------+
| 35700 | 2f97733c-5364-452a-8e23-d7fd6c7baea0 | cloudvirt1060.eqiad.wmnet | cloudvirt1046.eqiad.wmnet | cloudvirt1060  | cloudvirt1046 | None      | failed | 4f193824-85d8-4369-8be3-c8b96abbd71d |        148 |        148 | live-migration | 2024-01-15T14:37:00.000000 | 2024-01-15T14:37:05.000000 |

Seems related to the recent PKI changes:

2024-01-15 14:36:58.102 236191 ERROR nova.virt.libvirt.driver [None req-31287e76-435f-4780-838f-87bcad67a60d novaadmin admin - - default default] [instance: 8bb7461b-2cb1-4a23-9405-183955a3fb4e] Live Migration failure: authentication failed: Failed to verify peer's certificate: libvirt.libvirtError: authentication failed: Failed to verify peer's certificate

And there are matching entries in the cloudvirt1046 logs:

Jan 15 14:28:54 cloudvirt1046 libvirtd[2662926]: Unable to verify TLS peer: No certificate was found.
Jan 15 14:28:54 cloudvirt1046 libvirtd[2662926]: Certificate check failed Unable to verify TLS peer: No certificate was found.
Jan 15 14:28:54 cloudvirt1046 libvirtd[2662926]: authentication failed: Failed to verify peer's certificate

Where does nova specify the client certificate to use?

Here, probably:

/etc/nova/nova-compute.conf
live_migration_uri=qemu://%s.eqiad.wmnet/system?pkipath=/var/lib/nova
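If I understand libvirt's remote driver correctly, the pkipath URI parameter makes the client look for fixed filenames (cacert.pem, clientcert.pem, clientkey.pem) in the given directory instead of the defaults under /etc/pki. A quick sanity check that all three are present and readable (PKIPATH matching the nova config above is an assumption):

```shell
# Check that the directory named in ?pkipath= contains the three files the
# libvirt client is expected to read from it.
PKIPATH=${PKIPATH:-/var/lib/nova}
for f in cacert.pem clientcert.pem clientkey.pem; do
    if [ -r "$PKIPATH/$f" ]; then
        echo "ok: $f"
    else
        echo "MISSING: $f"
    fi
done
```

If clientcert.pem or clientkey.pem were missing or unreadable by the connecting process, that would match the "No certificate was found" messages on the cloudvirt1046 side.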

This can also be reproduced via the virsh CLI:

taavi@cloudvirt1060 ~ $ sudo virsh --connect qemu://cloudvirt1046.eqiad.wmnet/system?pkipath=/var/lib/nova
2024-01-15 14:54:21.329+0000: 302461: info : libvirt version: 9.0.0, package: 9.0.0-4 (Debian)
2024-01-15 14:54:21.329+0000: 302461: info : hostname: cloudvirt1060
2024-01-15 14:54:21.329+0000: 302461: warning : virNetTLSContextCheckCertificate:1086 : Certificate check failed Certificate failed validation: The certificate hasn't got a known issuer.
error: failed to connect to the hypervisor
error: authentication failed: Failed to verify peer's certificate
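The "hasn't got a known issuer" message is the classic symptom of a certificate signed by an intermediate CA when only the root is trusted and the intermediate is not presented as part of the chain, which fits the "include certificate chain" fix below. A self-contained demonstration with throwaway openssl-generated certs (all names here are made up for the demo):

```shell
# Demo: a leaf signed by an intermediate CA fails to verify against the root
# alone, and succeeds once the intermediate is supplied with the chain.
cd "$(mktemp -d)"
printf 'basicConstraints=CA:TRUE\n' > ca.cnf

# Root CA (self-signed)
openssl req -x509 -newkey rsa:2048 -nodes -keyout root.key -out root.pem \
    -subj "/CN=demo-root" -days 1 2>/dev/null
# Intermediate CA, signed by the root
openssl req -newkey rsa:2048 -nodes -keyout inter.key -out inter.csr \
    -subj "/CN=demo-inter" 2>/dev/null
openssl x509 -req -in inter.csr -CA root.pem -CAkey root.key -CAcreateserial \
    -out inter.pem -days 1 -extfile ca.cnf 2>/dev/null
# Leaf, signed by the intermediate
openssl req -newkey rsa:2048 -nodes -keyout leaf.key -out leaf.csr \
    -subj "/CN=cloudvirt-demo" 2>/dev/null
openssl x509 -req -in leaf.csr -CA inter.pem -CAkey inter.key -CAcreateserial \
    -out leaf.pem -days 1 2>/dev/null

# Fails: the root alone cannot verify the leaf ("unable to get local issuer")
openssl verify -CAfile root.pem leaf.pem || echo "failed as expected"
# Succeeds once the intermediate is part of the presented chain
openssl verify -CAfile root.pem -untrusted inter.pem leaf.pem
```

This mirrors the production situation: cloudvirt1046 trusts the root, but the connecting side's cert file did not include the intermediate, so validation failed until the chain was added.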

Change 990724 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:openstack: nova::compute: restart libvirt api after changing TLS certs

https://gerrit.wikimedia.org/r/990724

Change 990948 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:openstack: nova::compute: include certificate chain

https://gerrit.wikimedia.org/r/990948

Change 990724 merged by Majavah:

[operations/puppet@production] P:openstack: nova::compute: restart libvirt api after changing TLS certs

https://gerrit.wikimedia.org/r/990724

Change 990948 merged by Majavah:

[operations/puppet@production] P:openstack: nova::compute: include certificate chain

https://gerrit.wikimedia.org/r/990948