
Failures when draining certain VMs with attached cinder volumes (coibot-2)
Closed, Resolved · Public

Description

When draining old cloudvirts I've run into 4 VMs that refuse to migrate. I'm focusing on one of them for now, coibot-2.

At the moment the primary symptom is that whenever the associated volume (coibot_xlinkbot_data) is attached to a VM (which happens as part of a migration), it gets stuck in the 'reserved' state until a timeout is reached. The associated cinder-wsgi logs look like this:

Unhandled error: OSError: write error
2025-05-20 13:51:26.507 2507379 ERROR cinder OSError: write error
2025-05-20 13:51:26.507 2507379 ERROR cinder

and

2025-05-20 13:51:26.492 2507379 ERROR cinder.api.v3.attachments [req-ba82215f-be3d-4bbe-b137-bd30974c35ac req-85f487cf-56ca-49b4-820d-c529d92d34f1 andrew linkwatcher - - default default] Unable to update the attachment.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 8e783d0471864169acad16c7ba387b1b
2025-05-20 13:51:26.492 2507379 ERROR cinder.api.v3.attachments Traceback (most recent call last):
2025-05-20 13:51:26.492 2507379 ERROR cinder.api.v3.attachments   File "/usr/lib/python3/dist-packages/cinder/api/v3/attachments.py", line 250, in update
2025-05-20 13:51:26.492 2507379 ERROR cinder.api.v3.attachments     self.volume_api.attachment_update(context,
2025-05-20 13:51:26.492 2507379 ERROR cinder.api.v3.attachments   File "/usr/lib/python3/dist-packages/decorator.py", line 232, in fun
2025-05-20 13:51:26.492 2507379 ERROR cinder.api.v3.attachments     return caller(func, *(extras + args), **kw)
2025-05-20 13:51:26.492 2507379 ERROR cinder.api.v3.attachments            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-20 13:51:26.492 2507379 ERROR cinder.api.v3.attachments   File "/usr/lib/python3/dist-packages/cinder/coordination.py", line 239, in _synchronized
2025-05-20 13:51:26.492 2507379 ERROR cinder.api.v3.attachments     return f(*a, **k)
2025-05-20 13:51:26.492 2507379 ERROR cinder.api.v3.attachments            ^^^^^^^^^^
2025-05-20 13:51:26.492 2507379 ERROR cinder.api.v3.attachments   File "/usr/lib/python3/dist-packages/cinder/volume/api.py", line 2576, in attachment_update
2025-05-20 13:51:26.492 2507379 ERROR cinder.api.v3.attachments     self.volume_rpcapi.attachment_update(ctxt,
2025-05-20 13:51:26.492 2507379 ERROR cinder.api.v3.attachments   File "/usr/lib/python3/dist-packages/cinder/rpc.py", line 197, in _wrapper
2025-05-20 13:51:26.492 2507379 ERROR cinder.api.v3.attachments     return f(self, *args, **kwargs)
2025-05-20 13:51:26.492 2507379 ERROR cinder.api.v3.attachments            ^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-20 13:51:26.492 2507379 ERROR cinder.api.v3.attachments   File "/usr/lib/python3/dist-packages/cinder/volume/rpcapi.py", line 485, in attachment_update
2025-05-20 13:51:26.492 2507379 ERROR cinder.api.v3.attachments     return cctxt.call(ctxt,
2025-05-20 13:51:26.492 2507379 ERROR cinder.api.v3.attachments            ^^^^^^^^^^^^^^^^
2025-05-20 13:51:26.492 2507379 ERROR cinder.api.v3.attachments   File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/client.py", line 190, in call
2025-05-20 13:51:26.492 2507379 ERROR cinder.api.v3.attachments     result = self.transport._send(
2025-05-20 13:51:26.492 2507379 ERROR cinder.api.v3.attachments              ^^^^^^^^^^^^^^^^^^^^^
2025-05-20 13:51:26.492 2507379 ERROR cinder.api.v3.attachments   File "/usr/lib/python3/dist-packages/oslo_messaging/transport.py", line 123, in _send
2025-05-20 13:51:26.492 2507379 ERROR cinder.api.v3.attachments     return self._driver.send(target, ctxt, message,
2025-05-20 13:51:26.492 2507379 ERROR cinder.api.v3.attachments            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-20 13:51:26.492 2507379 ERROR cinder.api.v3.attachments   File "/usr/lib/python3/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 799, in send
2025-05-20 13:51:26.492 2507379 ERROR cinder.api.v3.attachments     return self._send(target, ctxt, message, wait_for_reply, timeout,
2025-05-20 13:51:26.492 2507379 ERROR cinder.api.v3.attachments            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-20 13:51:26.492 2507379 ERROR cinder.api.v3.attachments   File "/usr/lib/python3/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 788, in _send
2025-05-20 13:51:26.492 2507379 ERROR cinder.api.v3.attachments     result = self._waiter.wait(msg_id, timeout,
2025-05-20 13:51:26.492 2507379 ERROR cinder.api.v3.attachments              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-20 13:51:26.492 2507379 ERROR cinder.api.v3.attachments   File "/usr/lib/python3/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 653, in wait
2025-05-20 13:51:26.492 2507379 ERROR cinder.api.v3.attachments     message = self.waiters.get(msg_id, timeout=timeout)
2025-05-20 13:51:26.492 2507379 ERROR cinder.api.v3.attachments               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-20 13:51:26.492 2507379 ERROR cinder.api.v3.attachments   File "/usr/lib/python3/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 519, in get
2025-05-20 13:51:26.492 2507379 ERROR cinder.api.v3.attachments     raise oslo_messaging.MessagingTimeout(
2025-05-20 13:51:26.492 2507379 ERROR cinder.api.v3.attachments oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 8e783d0471864169acad16c7ba387b1b
2025-05-20 13:51:26.492 2507379 ERROR cinder.api.v3.attachments
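The MessagingTimeout at the bottom of the trace is generic: cinder-api sends an RPC message toward the cinder-volume service responsible for the volume and blocks waiting for a reply; if nothing ever consumes and answers that message, the wait simply expires. A toy sketch of that request/reply-with-timeout pattern (plain Python for illustration, not cinder's actual code):

```python
import queue


def rpc_call(reply_queue: queue.Queue, timeout: float):
    """Block waiting for a reply, roughly like oslo.messaging's reply waiter.

    If nothing ever answers (e.g. no cinder-volume service is consuming
    the queue the message was routed to), the caller times out.
    """
    try:
        return reply_queue.get(timeout=timeout)
    except queue.Empty:
        raise TimeoutError("Timed out waiting for a reply") from None


# Nobody puts a reply on the queue, so the call times out:
replies = queue.Queue()
try:
    rpc_call(replies, timeout=0.1)
except TimeoutError as e:
    print(e)  # Timed out waiting for a reply
```

In the failing case here, the reply presumably never arrives because no service is answering on whatever queue the attachment_update message was routed to.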

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2025-05-20T14:57:45Z] <andrewbogott> resetting eqiad1 rabbitmq in an attempt to resolve T394790

The RabbitMQ rebuild did not seem to change anything.

A newly created VM with an attached volume seems to migrate fine, so this issue is specific to these particular VMs or volumes.

The volumes are:

linkwatcher coibot_xlinkbot_data eb45532f-e029-40fa-84b9-ce10357329db
linkwatcher linkwatcher_data e8296fd6-1a29-4e11-ac3e-4b0104a36804
trove trove-1ca5abab-f6f3-4f4a-b397-1d53df361267 c6c88ead-996b-4f91-9444-fc6964958337
toolsbeta toolsbeta-nfs 648504db-18c2-4cee-b731-567dcb4dadf6

In the 'volumes' table in the cinder database I see quite a few volumes with

host: cloudcontrol1005@rbd#RBD
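For context, cinder records that field as service-host@backend-name#pool, and RPC requests for a volume are routed to the cinder-volume service reporting that service-host. If cloudcontrol1005 is no longer running a cinder-volume service, requests for these volumes would time out exactly as in the trace above. A minimal sketch of the naming convention (the format is cinder's; the helper function itself is hypothetical):

```python
def split_cinder_host(host: str) -> tuple[str, str, str]:
    """Split cinder's volume host field into (service_host, backend, pool)."""
    service_host, _, rest = host.partition("@")
    backend, _, pool = rest.partition("#")
    return service_host, backend, pool


print(split_cinder_host("cloudcontrol1005@rbd#RBD"))
# ('cloudcontrol1005', 'rbd', 'RBD')
```

Only the service_host component matters for RPC routing here; the backend and pool parts are unchanged by the fix below.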

After

update volumes set host='cloudcontrol1007@rbd#RBD' where id='eb45532f-e029-40fa-84b9-ce10357329db';

the coibot-2 volume is behaving more reasonably.

mysql:root@localhost [cinder]> update volumes set host='cloudcontrol1007@rbd#RBD' where host='cloudcontrol1005@rbd#RBD';  
Query OK, 1916 rows affected (0.069 sec)
Rows matched: 1916  Changed: 1916  Warnings: 0

...and just like that, everything is working and the remaining cloudvirts drained without trouble.