
VM nova records attached to incorrect cloudcephmon IPs
Closed, Resolved · Public

Description

We've been troubleshooting an issue with a user's magnum cluster, and determined that there was a problem with libvirt connecting to ceph. The VM wouldn't start up and showed as 'paused' in libvirt; when we tried 'virsh resume' we got this message:

error: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainCreateWithFlags)

There are also hints in the logs that virsh was trying to contact mons at "10.64.20.67", "10.64.20.68", "10.64.20.69" -- those are the addresses of cloudcephmon100[1-3], which no longer exist; the new ceph mons are "10.64.149.19", "10.64.151.5", "10.64.148.27".

These old mon addresses can also be found in the nova_eqiad1 database:

mysql:root@localhost [nova_eqiad1]> select connection_info from block_device_mapping where instance_uuid='45a88d53-717d-4993-a759-d7bb34480cc4'\G
*************************** 1. row ***************************
connection_info: NULL
*************************** 2. row ***************************
connection_info: {"driver_volume_type": "rbd", "data": {"name": "eqiad1-cinder/volume-df1e18b3-6fea-4aea-a386-74806651fa42", "hosts": ["10.64.20.67", "10.64.20.68", "10.64.20.69"], "ports": ["6789", "6789", "6789"], "cluster_name": "ceph", "auth_enabled": true, "auth_username": "eqiad1-cinder", "secret_type": "ceph", "secret_uuid": "9dc683f1-f3d4-4c12-8b8f-f3ffdf36364d", "volume_id": "df1e18b3-6fea-4aea-a386-74806651fa42", "discard": true, "qos_specs": {"write_iops_sec": "500", "iops_sec": "5000", "total_bytes_sec": "200000000"}, "access_mode": "rw", "encrypted": false, "cacheable": false}, "status": "attaching", "instance": "45a88d53-717d-4993-a759-d7bb34480cc4", "attached_at": "2024-06-27T18:33:50.000000", "detached_at": "", "volume_id": "df1e18b3-6fea-4aea-a386-74806651fa42", "serial": "df1e18b3-6fea-4aea-a386-74806651fa42"}

I was able to resolve the issue by updating those IPs in the database:

mysql:root@localhost [nova_eqiad1]> update block_device_mapping set connection_info='{"driver_volume_type": "rbd", "data": {"name": "eqiad1-cinder/volume-df1e18b3-6fea-4aea-a386-74806651fa42", "hosts": ["10.64.149.19", "10.64.151.5", "10.64.148.27"], "ports": ["6789", "6789", "6789"], "cluster_name": "ceph", "auth_enabled": true, "auth_username": "eqiad1-cinder", "secret_type": "ceph", "secret_uuid": "9dc683f1-f3d4-4c12-8b8f-f3ffdf36364d", "volume_id": "df1e18b3-6fea-4aea-a386-74806651fa42", "discard": true, "qos_specs": {"write_iops_sec": "500", "iops_sec": "5000", "total_bytes_sec": "200000000"}, "access_mode": "rw", "encrypted": false, "cacheable": false}, "status": "attaching", "instance": "45a88d53-717d-4993-a759-d7bb34480cc4", "attached_at": "2024-06-27T18:33:50.000000", "detached_at": "", "volume_id": "df1e18b3-6fea-4aea-a386-74806651fa42", "serial": "df1e18b3-6fea-4aea-a386-74806651fa42"}' where uuid="2ed1c589-b9ab-4f0f-a5a1-7a73b8b012a8";
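Rather than hand-editing the whole JSON blob for each row, the substitution could be scripted. A minimal sketch (hypothetical helper, not part of any existing tooling; the old-to-new IP pairing is taken from the update above):

```python
import json

# Old cloudcephmon100[1-3] addresses mapped to the new mon addresses,
# paired positionally as in the UPDATE above (ceph clients don't care
# about the order of the mon list).
MON_MAP = {
    "10.64.20.67": "10.64.149.19",
    "10.64.20.68": "10.64.151.5",
    "10.64.20.69": "10.64.148.27",
}

def fix_connection_info(raw):
    """Return a connection_info JSON string with stale mon IPs replaced."""
    info = json.loads(raw)
    data = info.setdefault("data", {})
    data["hosts"] = [MON_MAP.get(h, h) for h in data.get("hosts", [])]
    return json.dumps(info)
```

The resulting string could then be fed back into an UPDATE like the one above.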

This is very concerning! Do we have hundreds of VMs with incorrect mon IPs baked in, waiting to fail as soon as we restart or migrate them?

Event Timeline

Restricted Application added a subscriber: Aklapper.
Andrew triaged this task as High priority. · Jan 13 2025, 4:43 PM
Andrew updated the task description.

Originally, the VM was in ERROR state, and was showing the log:

fault (code 500, created 2025-01-13T02:31:06Z, message 'libvirtError'):

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 203, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 4248, in reboot_instance
    do_reboot_instance(context, instance, block_device_info, reboot_type)
  File "/usr/lib/python3/dist-packages/oslo_concurrency/lockutils.py", line 414, in inner
    return f(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 4246, in do_reboot_instance
    self._reboot_instance(context, instance, block_device_info,
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 4321, in _reboot_instance
    with excutils.save_and_reraise_exception() as ctxt:
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 227, in __exit__
    self.force_reraise()
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 200, in force_reraise
    raise self.value
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 4313, in _reboot_instance
    self.driver.reboot(context, instance,
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 3885, in reboot
    return self._hard_reboot(context, instance, network_info,
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 4020, in _hard_reboot
    self._create_guest_with_network(
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 7785, in _create_guest_with_network
    with excutils.save_and_reraise_exception():
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 227, in __exit__
    self.force_reraise()
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 200, in force_reraise
    raise self.value
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 7763, in _create_guest_with_network
    guest = self._create_guest(
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 7702, in _create_guest
    guest.launch(pause=pause)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/guest.py", line 167, in launch
    with excutils.save_and_reraise_exception():
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 227, in __exit__
    self.force_reraise()
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 200, in force_reraise
    raise self.value
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/guest.py", line 165, in launch
    return self._domain.createWithFlags(flags)
  File "/usr/lib/python3/dist-packages/eventlet/tpool.py", line 193, in doit
    result = proxy_call(self._autowrap, f, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/eventlet/tpool.py", line 151, in proxy_call
    rv = execute(f, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/eventlet/tpool.py", line 132, in execute
    six.reraise(c, e, tb)
  File "/usr/lib/python3/dist-packages/six.py", line 719, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/eventlet/tpool.py", line 86, in tworker
    rv = meth(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/libvirt.py", line 1409, in createWithFlags
    raise libvirtError('virDomainCreateWithFlags() failed')
libvirt.libvirtError: internal error: process exited while connecting to monitor: 2025-01-13T02:31:06.289274Z qemu-system-x86_64: -blockdev {"driver":"rbd","pool":"eqiad1-cinder","image":"volume-df1e18b3-6fea-4aea-a386-74806651fa42","server":[{"host":"10.64.20.69","port":"6789"},{"host":"10.64.20.68","port":"6789"},{"host":"10.64.20.67","port":"6789"}],"user":"eqiad1-cinder","auth-client-required":["cephx","none"],"key-secret":"libvirt-1-storage-auth-secret0","node-name":"libvirt-1-storage","cache":{"direct":false,"no-flush":false},"auto-read-only":true,"discard":"unmap"}: error connecting: Connection timed out

Then I stopped it (using the CLI) and started it again; after that it started showing the timeout logs.

Many VMs have the old IPs alongside one or more of the new ones; those don't seem to be in danger. That leaves only the VMs whose connection_info contains none of the new mon IPs:

mysql:root@localhost [nova_eqiad1]> select connection_info from block_device_mapping where deleted_at is Null and connection_info like "%10.64.20.67%" and connection_info not like "%10.64.149.19%" and connection_info not like "%10.64.151.5%" and connection_info not like "%10.64.148.27%";

shows 147 rows.
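The same "at risk" predicate, sketched as a standalone check (a Python mirror of the LIKE clauses above; the old-IP test is broadened from just 10.64.20.67 to all three retired addresses, which is an assumption on my part):

```python
# Retired cloudcephmon100[1-3] addresses and the new mon addresses,
# as listed in the task description.
OLD_MONS = {"10.64.20.67", "10.64.20.68", "10.64.20.69"}
NEW_MONS = {"10.64.149.19", "10.64.151.5", "10.64.148.27"}

def is_at_risk(connection_info):
    """A row is at risk if it references an old mon and none of the new ones."""
    if connection_info is None:
        return False
    has_old = any(ip in connection_info for ip in OLD_MONS)
    has_new = any(ip in connection_info for ip in NEW_MONS)
    return has_old and not has_new
```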

For my tests, I'm experimenting on the little-used 9da6e185-0068-4bf0-9fcf-56440625d285/paws-puppetserver-1:

mysql:root@localhost [nova_eqiad1]> select connection_info from block_device_mapping where deleted_at is Null and instance_uuid='9da6e185-0068-4bf0-9fcf-56440625d285'\G
*************************** 1. row ***************************
connection_info: NULL
*************************** 2. row ***************************
connection_info: {"driver_volume_type": "rbd", "data": {"name": "eqiad1-cinder/volume-ff515db3-f61b-4bba-81a2-3aa565fd9db9", "hosts": ["10.64.20.69", "10.64.20.68", "10.64.20.67"], "ports": ["6789", "6789", "6789"], "cluster_name": "ceph", "auth_enabled": true, "auth_username": "eqiad1-cinder", "secret_type": "ceph", "secret_uuid": "9dc683f1-f3d4-4c12-8b8f-f3ffdf36364d", "volume_id": "ff515db3-f61b-4bba-81a2-3aa565fd9db9", "discard": true, "qos_specs": {"write_iops_sec": "500", "iops_sec": "5000", "total_bytes_sec": "200000000"}, "access_mode": "rw", "encrypted": false, "cacheable": false}, "status": "attaching", "instance": "9da6e185-0068-4bf0-9fcf-56440625d285", "attached_at": "2024-08-06T17:32:51.000000", "detached_at": "", "volume_id": "ff515db3-f61b-4bba-81a2-3aa565fd9db9", "serial": "ff515db3-f61b-4bba-81a2-3aa565fd9db9"}
2 rows in set (0.001 sec)

After a soft reboot, the VM shows nova state 'REBOOT' and virsh state 'paused'. It seems to be stuck there.

So it seems we have a real problem.

I tried migrating a VM, 3b85fc66-ff29-486b-9eed-1c6893a4fc40/metricsinfra-puppetserver-1, before anything was wrong with it, and its record seems to have been corrected as part of the migration:

connection_info: {"driver_volume_type": "rbd", "data": {"name": "eqiad1-cinder/volume-aa090019-0d74-42cd-bf44-24cdbbd73d6d", "hosts": ["10.64.148.27", "10.64.149.19", "10.64.151.5"], "ports": ["6789", "6789", "6789"], "cluster_name": "ceph", "auth_enabled": true, "auth_username": "eqiad1-cinder", "secret_type": "ceph", "secret_uuid": "9dc683f1-f3d4-4c12-8b8f-f3ffdf36364d", "volume_id": "aa090019-0d74-42cd-bf44-24cdbbd73d6d", "discard": true, "qos_specs": {"write_iops_sec": "500", "iops_sec": "5000", "total_bytes_sec": "200000000"}, "access_mode": "rw", "encrypted": false, "cacheable": false}, "status": "attaching", "instance": "3b85fc66-ff29-486b-9eed-1c6893a4fc40", "attached_at": "2025-01-13T17:44:51.000000", "detached_at": "", "volume_id": "aa090019-0d74-42cd-bf44-24cdbbd73d6d", "serial": "aa090019-0d74-42cd-bf44-24cdbbd73d6d"}

So that is likely the way to repair this.

Cold migration seems to resolve the issue without problems (other than the system reboot). Live migration seems to get things stuck in 'migrating' state, even though it fixes the mon IPs beforehand.

Also, this seems to work without a reboot:

openstack server migrate --shared-migration --wait 871ab13f-51df-4bc8-917f-0828ac98b3c1

I'm doing that to all affected VMs, we'll see how it goes!

Whoops! It turns out --shared-migration reboots VMs too, which caused the attached Toolforge NFS outage :(

As far as I can tell, a cold migration is the only reliable way to repair this. Repairing the connection_info in the database makes it possible to reboot the VMs, but does not enable live migration; live migration just tries and fails.

So... right now we have 85 VMs that won't come back up if rebooted, and that will block draining of their respective hypervisors. I will schedule reboots.
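The per-VM repair then boils down to one cold migration plus its confirm step. A sketch of a plan generator (hypothetical helper; in OpenStack a cold migration lands the server in VERIFY_RESIZE, which is confirmed with `openstack server resize confirm`, and the UUID list is assumed to come from the database query above):

```python
def cold_migration_plan(uuids):
    """Build the list of openstack CLI commands to cold-migrate each VM.

    Each VM needs a blocking cold migration followed by a resize-confirm
    to leave VERIFY_RESIZE state.
    """
    plan = []
    for uuid in uuids:
        plan.append(f"openstack server migrate --wait {uuid}")
        plan.append(f"openstack server resize confirm {uuid}")
    return plan
```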

I'm sitting on the following email which I don't love but which is probably needed:

Due to a latent configuration error, many VMs need to be rebooted. The affected VMs are working fine now, but if restarted they will start with an incorrect block storage config and be unable to launch, requiring manual repair.

I am planning to migrate + reboot all VMs listed at the end of this email next Tuesday, which will rebuild and repair the broken config. If you find one of these VMs in an inconsistent state in the meantime, please contact a WMCS root and refer them to the related task[0] which explains the somewhat tedious process of revival.

Some of the VMs to be rebooted are nfs servers, which may cause service interruptions for projects relying on those servers. Specifically:

1b001a01-0a37-4d81-9dd6-1339b6391f37 paws-nfs-1.paws
c18ab12b-fa18-4a29-a074-40535930f6b0 k8s-nfs.quarry
a1cb92ab-6083-4465-81a9-a283918f13eb toolsbeta-nfs-3.toolsbeta
cf6dc802-f95c-48c1-a07b-8791185a5e45 fastcci-nfs-1.fastcci

The rest of the VMs that need reboots are:

4b932658-2b4d-4157-b62e-1e73faaed8aa quarry-db-02.trove
14be24bb-a36b-4627-9254-d02a0987db83 data.wikipathways
7dea739e-7f9a-48ed-8169-e88300c77092 gitlab-runner-1003.devtools
9aa7e1f6-1e53-4c54-814a-1c6e7b8e55bb dannyb.wildcat
07543105-1336-4f7f-ae04-5c7989a0bfb5 larynx-be-01.text-to-speech
1bff1756-ce9a-40b6-8783-a01910dff1b6 storage.qrank
b26f51d1-b1e6-46a6-877d-3d43acc1ccac wikiapiary.wikiapiary
3c53fc19-f19f-4c6a-8b23-ab8edb149994 libup-db02.trove
7b1a49d7-2df1-4011-942d-80da14e38f7d buttercup.wikifunctions
4222d13d-aea4-4168-8542-03b5e91f2af8 tools-imagebuilder-2.tools
23e4ccb3-6146-482d-ad4e-1e83e78d1482 deployment-snapshot05.deployment-prep
c3be1a8e-94f4-4db8-b3d2-99f74f034ea5 loggerdiscordbot.trove
94f64547-3537-4eae-b7c8-f33f0d5e58bd cloudinfra-internal-puppetserver-1.cloudinfra
32e12f3c-bc0c-4dc1-8ada-abd6e03d1f75 tools-puppetserver-01.tools
d277044f-a5db-4939-b1e4-5a142116f9d0 quarry-k8s.trove
c4f71bdf-4d88-4f2c-a0bc-d358975a91eb free3.dwl
0126dc3f-02c9-4b07-8550-de103724fdc3 ifis.trove
ba0e8a43-352b-48cd-bc86-8f2374fd7447 logging-opensearch-hdd-01.logging
8071f964-80be-41b0-b042-14024965f1ba tools-docker-registry-7.tools
f0766d97-01f3-49c9-be51-5431b2c87579 logging-opensearch-ssd-01.logging
c4e76fe0-82c5-4188-8e35-0f63b6d4a216 logging-opensearch-ssd-02.logging
ca6525d1-1863-4ab0-b0ef-07c35d063324 taxonbot3.dwl
0038b497-9331-4493-a688-5c85830494e9 tools-docker-registry-8.tools
27c89a15-ee59-4cb2-81f7-845d582ccc88 mwv-puppetserver-01.mediawiki-vagrant
a8b2ff57-9841-434b-b233-d8c0064ddb1b mailman-puppetserver-1.mailman
fb7fba39-5cef-44ea-a6f3-5b7145f446e8 legacy.wikiapiary
b93776fd-b69b-459e-9960-ee69a6b52732 cloudinfra-cloudvps-puppetserver-1.cloudinfra
337b88c3-9009-402d-a7d5-182b1b027fbb copypatrol-dev-db-01.trove
eb811cae-2a12-43eb-9d1f-33ee250fe51a copypatrol-prod-db-01.trove
c738d3bb-8bd4-4807-8651-78698923385e thistle.trove
c02583d4-fa94-4a5c-a4a2-1a50872d45a5 mariadb.trove
59406e53-be1b-4033-8866-8abe57b8040f venus.wikisp
db1e3259-163c-4a3f-9b5b-a8a37bbb6f5d db01.trove
037795b1-0d21-41e8-840e-ac379426d1e9 osmit-tm4.osmit
eb6f591f-69b2-4f16-a7eb-3740b5fd074f db.933ad3ff1e264aada56e6bc3ed9e08f3
50ea5f92-8cb9-4e21-bc65-6fe0b6dd9550 superset-127-jxhvhh7bzlrl-master-0.superset
2faa7a26-c56a-4219-b658-b680212cb64c superset-127-jxhvhh7bzlrl-node-0.superset
70a287ed-3071-4df9-ba1a-ac512cf8749a superset-127-jxhvhh7bzlrl-node-1.superset
d182fffb-9f32-41de-96cf-ce5cc5fc95c2 citefix-db.trove
fff879b9-8300-4a11-9cf6-3d424be9ffa3 cloudinfra-db04.cloudinfra
47789d5d-4777-455f-8e00-4d24507fe5e7 worker-1.spacemedia
e9b6fd4c-d5a3-40dc-968a-ddfcf2222389 worker-2.spacemedia
f50f9084-d034-4dae-84fb-80980868335a tools-elastic-4.tools
16ce1807-dd8e-4b87-b4ba-558d2719bbe5 tools-elastic-6.tools
39737a21-257d-4267-9aa6-d2c411501b5f peony-database.globaleducation
c8ccdc13-f8b0-4e37-9a2b-795f0bff9f59 videocuttool-bookworm-new.videocuttool
bfad7fbd-53db-4604-aa38-19ffa3e3da02 harbordb.trove
93742dcf-1d18-4c28-8ba0-24ea3219f144 wmcz-stats-test03.wmcz-stats
17c5890b-d783-4f3c-938d-0a3ff2e2e3bc wmcz-stats-wikinside01.wmcz-stats
f37f17cf-ee65-4a40-888e-3500120d566d ml-testing.machine-learning
7fd6ada6-2d78-443b-84ce-57aec7cd7a4a ia-upload-prod2.wikisource
f75a5008-1b83-4bc2-8cb7-c74f5262d053 pcc-db1002.puppet-diffs
d1818008-69c9-4a7b-b642-baeb313df663 patchdemo4-production.catalyst
c9bff3a7-f77b-4fea-a31c-b4fb77c4e9e2 deployment-jobrunner05.deployment-prep
de63de80-8f02-47e0-9ad2-cfe407ca99be deployment-mediawiki14.deployment-prep
5ff144a5-bb68-44b7-9d0f-39df6b9d3063 deployment-mediawiki13.deployment-prep
93c9a448-d01c-4229-8cf3-29b8a0b765cb deployment-parsoid14.deployment-prep
fe1ff5d8-813c-4d06-ae3a-46c05715d305 tools-services-06.tools
c99e6edf-89ab-4300-b637-80e11e55438e tools-harbordb.trove
9b5de945-4495-4516-8672-13e525acc557 encoding01.video
f9bee5ff-85f5-4cdf-876a-3f4e2e3f3c8e encoding03.video
5a7dcadf-c88e-4d4b-9e44-42e3d636c4ba encoding05.video
85d30118-6397-4bb1-a652-31bb48a1fb2b traffic-puppetserver-bookworm.traffic
5ed094b0-5a2f-42d3-9fe3-a93237351475 encoding06.video
fcd555ae-c19c-4b2e-8785-8a9447b263d8 humaniki-prod-bw.wikidumpparse
de830403-c93b-4340-9467-d1d8f7ea1628 sample-complex-app-db.trove
3a4d671b-5aef-49ef-b6b4-5cb0c7d00e38 toolsbeta-harbor-2.toolsbeta
421dd73e-8c34-4962-a4e2-61c8997b27d1 quarry-127a-g4ndvpkr5sro-master-0.quarry
a1a12a26-55cd-41cd-9696-74773b4db7f0 deployment-wikikube-v127-zwallxbnux67-master-0.deployment-prep
8b99ac12-660d-4267-b841-2d458f8dd9a4 deployment-wikikube-v127-zwallxbnux67-node-0.deployment-prep
65c14b90-8b59-498a-851d-b3c4259fb134 paws-127a-m3mctzr7itba-master-0.paws
3599d6d5-89c2-40ba-95dc-c48c066bd56b paws-127a-m3mctzr7itba-node-2.paws
15f8fdeb-46f9-4853-bdd8-e8c074dce507 paws-127a-m3mctzr7itba-node-3.paws
ecac332c-d805-482c-ba71-93838c73b100 paws-127a-m3mctzr7itba-node-1.paws
8373fb59-3b0a-4259-868b-1b4b81c3d10c paws-127a-m3mctzr7itba-node-4.paws
caf6628a-f9f7-4f1b-a42a-3cc79b12a758 paws-127a-m3mctzr7itba-node-0.paws

[0] https://phabricator.wikimedia.org/T383583

Hmm... I wonder where it is storing that data for the live migration; maybe it reads the XML from libvirt? If so, would editing that XML work? (Can that be done on the fly?)

It's weird too that the VMs in that state have the wrong hosts in their libvirt XML volumes:

# from tools-elastic-4
      <source protocol='rbd' name='eqiad1-compute/f50f9084-d034-4dae-84fb-80980868335a_disk' index='1'>
        <host name='10.64.20.69' port='6789'/>
        <host name='10.64.20.68' port='6789'/>
        <host name='10.64.20.67' port='6789'/>
      </source>

FYI, it comes from the command used to start the VM, so maybe that's what it's using to do the live migration. Worth testing the XML idea, though:

libvirt+ 2904398 19.2  3.2 20037788 17008604 ?   Sl    2024 54373:34 /usr/bin/qemu-system-x86_64 -name guest=i-000bb030,debug-threads=on -S ... -blockdev {"driver":"rbd","pool":"eqiad1-compute","image":"f50f9084-d034-4dae-84fb-80980868335a_disk","server":[{"host":"10.64.20.69","port":"6789"},{"host":"10.64.20.68","port":"6789"},{"host":"10.64.20.67","port":"6789"}],"user":"eqiad1-compute","auth-client-required":["cephx","none"],"key-secret":"libvirt-1-storage-auth-secret0","node-name":"libvirt-1-storage","cache":{"direct":false,"no-flush":false},"auto-read-only":true,"discard":"unmap"} ...

But the VM is still working right now. So my guess is that the list is only used to find the initial set of mons to connect to; after that, the client gets all the cluster info from the mons themselves.

Editing the XML file does not seem to make a difference, much to my surprise.

Is it possible that OpenStack cached the old value somewhere? Have you tried restarting Nova after changing the XML?

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-14T20:43:48Z] <andrew@cloudcumin1001> START - Cookbook wmcs.openstack.cloudvirt.drain on host 'cloudvirt1055.eqiad.wmnet' (T383583)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-14T20:53:58Z] <andrew@cloudcumin1001> END (FAIL) - Cookbook wmcs.openstack.cloudvirt.drain (exit_code=99) on host 'cloudvirt1055.eqiad.wmnet' (T383583)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-14T20:54:55Z] <andrew@cloudcumin1001> START - Cookbook wmcs.openstack.cloudvirt.safe_reboot on hosts matched by 'D{cloudvirt1055.eqiad.wmnet}' (T383583)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-14T20:55:37Z] <andrew@cloudcumin1001> END (FAIL) - Cookbook wmcs.openstack.cloudvirt.safe_reboot (exit_code=99) on hosts matched by 'D{cloudvirt1055.eqiad.wmnet}' (T383583)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-14T21:26:32Z] <andrew@cloudcumin1001> START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (T383583)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-01-14T21:26:41Z] <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=0) (T383583)

@Andrew, video2commons is broken and I wonder if it's caused by the NFS incident. Should I wait for the maintenance before doing anything on the video encoder instances, as they are in your list?

I doubt that the video2commons issue is related to this task; the only symptom I've seen for this task is a VM being stuck in a 'rebooting' or 'shutoff' state and refusing to start up.

Mentioned in SAL (#wikimedia-cloud) [2025-01-21T13:42:23Z] <andrewbogott> migrating/rebooting VMs as per earlier email, T383583

All affected VMs are now corrected.

This leaves the followup of understanding how to prevent this the next time we get new cloudcephmons.

fnegri changed the task status from Open to In Progress. · Jan 24 2025, 4:41 PM
fnegri assigned this task to Andrew.

I think this is mostly done, though we just found two more affected VMs today (T384642 and T384711) that required a manual migration.

how to prevent this the next time we get new cloudcephmons

This probably deserves its own separate task.

Mentioned in SAL (#wikimedia-cloud) [2025-01-31T16:13:19Z] <JJMC89> copypatrol-backend-dev-01 hard reboot for T383583

Mentioned in SAL (#wikimedia-cloud) [2025-01-31T16:17:07Z] <JJMC89> copypatrol-backend-prod-01 hard reboot for T383583