We've been troubleshooting an issue with a user's magnum cluster, and determined that there was a problem with libvirt connecting to ceph. The VM wouldn't start up and showed in a 'paused' state in libvirt; when we tried 'virsh resume' we got this message:
error: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainCreateWithFlags)
There are also hints in the logs that libvirt was trying to contact mons at "10.64.20.67", "10.64.20.68", "10.64.20.69" -- those are the addresses of cloudcephmon100[1-3], which no longer exist; the current ceph mons are at "10.64.149.19", "10.64.151.5", "10.64.148.27".
These old mon addresses can also be found in the nova_eqiad1 database:
mysql:root@localhost [nova_eqiad1]> select connection_info from block_device_mapping where instance_uuid='45a88d53-717d-4993-a759-d7bb34480cc4'\G
*************************** 1. row ***************************
connection_info: NULL
*************************** 2. row ***************************
connection_info: {"driver_volume_type": "rbd", "data": {"name": "eqiad1-cinder/volume-df1e18b3-6fea-4aea-a386-74806651fa42", "hosts": ["10.64.20.67", "10.64.20.68", "10.64.20.69"], "ports": ["6789", "6789", "6789"], "cluster_name": "ceph", "auth_enabled": true, "auth_username": "eqiad1-cinder", "secret_type": "ceph", "secret_uuid": "9dc683f1-f3d4-4c12-8b8f-f3ffdf36364d", "volume_id": "df1e18b3-6fea-4aea-a386-74806651fa42", "discard": true, "qos_specs": {"write_iops_sec": "500", "iops_sec": "5000", "total_bytes_sec": "200000000"}, "access_mode": "rw", "encrypted": false, "cacheable": false}, "status": "attaching", "instance": "45a88d53-717d-4993-a759-d7bb34480cc4", "attached_at": "2024-06-27T18:33:50.000000", "detached_at": "", "volume_id": "df1e18b3-6fea-4aea-a386-74806651fa42", "serial": "df1e18b3-6fea-4aea-a386-74806651fa42"}

I was able to resolve the issue by updating those IPs in the database:
mysql:root@localhost [nova_eqiad1]> update block_device_mapping set connection_info='{"driver_volume_type": "rbd", "data": {"name": "eqiad1-cinder/volume-df1e18b3-6fea-4aea-a386-74806651fa42", "hosts": ["10.64.149.19", "10.64.151.5", "10.64.148.27"], "ports": ["6789", "6789", "6789"], "cluster_name": "ceph", "auth_enabled": true, "auth_username": "eqiad1-cinder", "secret_type": "ceph", "secret_uuid": "9dc683f1-f3d4-4c12-8b8f-f3ffdf36364d", "volume_id": "df1e18b3-6fea-4aea-a386-74806651fa42", "discard": true, "qos_specs": {"write_iops_sec": "500", "iops_sec": "5000", "total_bytes_sec": "200000000"}, "access_mode": "rw", "encrypted": false, "cacheable": false}, "status": "attaching", "instance": "45a88d53-717d-4993-a759-d7bb34480cc4", "attached_at": "2024-06-27T18:33:50.000000", "detached_at": "", "volume_id": "df1e18b3-6fea-4aea-a386-74806651fa42", "serial": "df1e18b3-6fea-4aea-a386-74806651fa42"}' where uuid="2ed1c589-b9ab-4f0f-a5a1-7a73b8b012a8";

This is very concerning! Do we have hundreds of VMs with incorrect mon IPs baked in, waiting to fail as soon as we restart or migrate them?
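If we do need a fleet-wide cleanup, hand-editing each row's JSON like the UPDATE above won't scale. A minimal sketch of the rewrite step is below; `rewrite_connection_info` is a hypothetical helper (not part of nova), and in practice it would be fed each `connection_info` value fetched from `block_device_mapping`, writing back only the rows it actually changed. The mon addresses are the old and new ones from this task.

```python
import json

# Mon addresses from this incident: the retired cloudcephmon100[1-3] IPs
# and their replacements.
OLD_MONS = {"10.64.20.67", "10.64.20.68", "10.64.20.69"}
NEW_MONS = ["10.64.149.19", "10.64.151.5", "10.64.148.27"]


def rewrite_connection_info(raw):
    """Return (updated_json_string, changed) for one block_device_mapping row.

    NULL/empty connection_info rows (like row 1 in the SELECT above) are
    left untouched. Rows whose data.hosts list mentions any retired mon
    get the full new mon list substituted in.
    """
    if not raw:
        return raw, False
    info = json.loads(raw)
    hosts = info.get("data", {}).get("hosts", [])
    if not any(h in OLD_MONS for h in hosts):
        return raw, False
    info["data"]["hosts"] = list(NEW_MONS)
    return json.dumps(info), True


# Example with a trimmed-down connection_info from the affected row:
sample = ('{"driver_volume_type": "rbd", "data": {"hosts": '
          '["10.64.20.67", "10.64.20.68", "10.64.20.69"], '
          '"ports": ["6789", "6789", "6789"]}}')
updated, changed = rewrite_connection_info(sample)
```

A first pass could run this in dry-run mode (just counting `changed` rows) to answer the "how many VMs are affected" question before touching the database.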