
nova-fullstack leaking VMs: rabbitmq may have problems
Closed, Resolved · Public

Description

We had an alert about nova-fullstack failing on cloudcontrol1005.

Checking the logs, I found:

root@cloudcontrol1005:~# journalctl -u nova-fullstack | tail
Oct 07 08:10:52 cloudcontrol1005 systemd[1]: nova-fullstack.service: Failed with result 'exit-code'.
Oct 07 08:11:46 cloudcontrol1005 systemd[1]: Started test instance lifecycle (create/setup/delete).
Oct 07 08:11:47 cloudcontrol1005 nova-fullstack[3793005]: @cee: {"ecs.version": "1.7.0", "log.level": "DEBUG", "log.origin.file.line": 787, "log.origin.file.name": "nova-fullstack", "log.origin.file.path": "/usr/local/sbin/nova-fullstack", "log.origin.function": "main", "labels": {"test_hostname": "fullstackd-20221007081146"}, "message": "[<Server: fullstackd-20221007075047>, <Server: fullstackd-20221007073043>, <Server: fullstackd-20221007071036>, <Server: fullstackd-20221007065031>, <Server: fullstackd-20221007063027>, <Server: fullstackd-20221007061022>, <Server: fullstackd-20221007055018>, <Server: fullstackd-20221007053013>, <Server: fullstackd-20220821171748>, <Server: fullstackd-20221007051007>, <Server: fullstackd-20221007045000>]", "process.name": "MainProcess", "process.thread.id": 3793005, "process.thread.name": "MainThread", "timestamp": "2022-10-07T08:11:47.202064"}
Oct 07 08:11:47 cloudcontrol1005 nova-fullstack[3793005]: @cee: {"ecs.version": "1.7.0", "log.level": "INFO", "log.origin.file.line": 472, "log.origin.file.name": "nova-fullstack", "log.origin.file.path": "/usr/local/sbin/nova-fullstack", "log.origin.function": "add_stat", "labels": {"test_hostname": "fullstackd-20221007081146"}, "message": "cloudvps.novafullstack.instances.count => 11.000000 1665130307", "process.name": "MainProcess", "process.thread.id": 3793005, "process.thread.name": "MainThread", "timestamp": "2022-10-07T08:11:47.202408"}
Oct 07 08:11:47 cloudcontrol1005 nova-fullstack[3793005]: @cee: {"ecs.version": "1.7.0", "log.level": "INFO", "log.origin.file.line": 472, "log.origin.file.name": "nova-fullstack", "log.origin.file.path": "/usr/local/sbin/nova-fullstack", "log.origin.function": "add_stat", "labels": {"test_hostname": "fullstackd-20221007081146"}, "message": "cloudvps.novafullstack.instances.max => 11.000000 1665130307", "process.name": "MainProcess", "process.thread.id": 3793005, "process.thread.name": "MainThread", "timestamp": "2022-10-07T08:11:47.202539"}
Oct 07 08:11:47 cloudcontrol1005 nova-fullstack[3793005]: @cee: {"ecs.version": "1.7.0", "log.level": "INFO", "log.origin.file.line": 726, "log.origin.file.name": "nova-fullstack", "log.origin.file.path": "/usr/local/sbin/nova-fullstack", "log.origin.function": "cleanup_leaked_vms", "labels": {"test_hostname": "fullstackd-20221007081146"}, "message": "There are 11 (11 allowed) leaked instances with prefix fullstackd", "process.name": "MainProcess", "process.thread.id": 3793005, "process.thread.name": "MainThread", "timestamp": "2022-10-07T08:11:47.202643"}
Oct 07 08:11:47 cloudcontrol1005 nova-fullstack[3793005]: @cee: {"ecs.version": "1.7.0", "log.level": "INFO", "log.origin.file.line": 734, "log.origin.file.name": "nova-fullstack", "log.origin.file.path": "/usr/local/sbin/nova-fullstack", "log.origin.function": "cleanup_leaked_vms", "labels": {"test_hostname": "fullstackd-20221007081146"}, "message": "Cleaning up 1 VMs to make space for this run.", "process.name": "MainProcess", "process.thread.id": 3793005, "process.thread.name": "MainThread", "timestamp": "2022-10-07T08:11:47.202983"}
Oct 07 08:11:47 cloudcontrol1005 nova-fullstack[3793005]: @cee: {"ecs.version": "1.7.0", "log.level": "ERROR", "log.origin.file.line": 119, "log.origin.file.name": "nova-fullstack", "log.origin.file.path": "/usr/local/sbin/nova-fullstack", "log.origin.function": "log_unhandled_exception", "labels": {"test_hostname": "fullstackd-20221007081146"}, "message": "Unhandled exception: Instance 37071833-097a-43ca-a68b-38ac7798e150 could not be found. (HTTP 404) (Request-ID: req-530990e3-4d32-4293-81c6-c2a5d1554707)", "process.name": "MainProcess", "process.thread.id": 3793005, "process.thread.name": "MainThread", "timestamp": "2022-10-07T08:11:47.246164", "error.stack_trace": "Traceback (most recent call last):\n  File \"/usr/local/sbin/nova-fullstack\", line 934, in <module>\n    main()\n  File \"/usr/local/sbin/nova-fullstack\", line 796, in main\n    cleanup_leaked_vms(\n  File \"/usr/local/sbin/nova-fullstack\", line 739, in cleanup_leaked_vms\n    vm.delete()\n  File \"/usr/lib/python3/dist-packages/novaclient/v2/servers.py\", line 65, in delete\n    return self.manager.delete(self)\n  File \"/usr/lib/python3/dist-packages/novaclient/v2/servers.py\", line 1594, in delete\n    return self._delete(\"/servers/%s\" % base.getid(server))\n  File \"/usr/lib/python3/dist-packages/novaclient/base.py\", line 375, in _delete\n    resp, body = self.api.client.delete(url)\n  File \"/usr/lib/python3/dist-packages/keystoneauth1/adapter.py\", line 410, in delete\n    return self.request(url, 'DELETE', **kwargs)\n  File \"/usr/lib/python3/dist-packages/novaclient/client.py\", line 78, in request\n    raise exceptions.from_response(resp, body, url, method)\nnovaclient.exceptions.NotFound: Instance 37071833-097a-43ca-a68b-38ac7798e150 could not be found. (HTTP 404) (Request-ID: req-530990e3-4d32-4293-81c6-c2a5d1554707)"}
Oct 07 08:11:47 cloudcontrol1005 systemd[1]: nova-fullstack.service: Main process exited, code=exited, status=1/FAILURE
Oct 07 08:11:47 cloudcontrol1005 systemd[1]: nova-fullstack.service: Failed with result 'exit-code'.
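The immediate crash is a race: instance 37071833 was already gone by the time `cleanup_leaked_vms()` called `vm.delete()`, so the unhandled `NotFound` (HTTP 404) killed the whole run instead of just skipping that VM. A minimal sketch of a more tolerant cleanup loop (hypothetical names and stub classes, not the actual nova-fullstack code or the real novaclient API objects):

```python
class NotFound(Exception):
    """Stand-in for novaclient.exceptions.NotFound (HTTP 404)."""


class FakeVM:
    """Stub standing in for a novaclient Server object, for illustration."""

    def __init__(self, name, exists=True):
        self.name = name
        self._exists = exists

    def delete(self):
        if not self._exists:
            raise NotFound(f"Instance {self.name} could not be found. (HTTP 404)")


def cleanup_leaked_vms(vms, to_delete):
    """Delete up to `to_delete` leaked VMs, tolerating VMs that vanished
    between listing and deletion instead of crashing the whole run."""
    deleted = 0
    for vm in vms:
        if deleted >= to_delete:
            break
        try:
            vm.delete()
            deleted += 1
        except NotFound:
            # Already gone (e.g. removed concurrently): the space is
            # reclaimed either way, so count it and move on.
            deleted += 1
    return deleted
```

With this shape, a 404 on one leaked VM no longer takes down the service with `status=1/FAILURE`.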

Previous leaks:

root@cloudcontrol1005:~# journalctl -u nova-fullstack | grep -i leak
[..]
Oct 07 05:05:07 cloudcontrol1005 nova-fullstack[3578201]: @cee: {"ecs.version": "1.7.0", "log.level": "ERROR", "log.origin.file.line": 916, "log.origin.file.name": "nova-fullstack", "log.origin.file.path": "/usr/local/sbin/nova-fullstack", "log.origin.function": "main", "labels": {"test_hostname": "fullstackd-20221007045000"}, "message": "fullstackd-20221007045000 failed, leaking", "process.name": "MainProcess", "process.thread.id": 3578201, "process.thread.name": "MainThread", "timestamp": "2022-10-07T05:05:07.505510", "error.stack_trace": "Traceback (most recent call last):\n  File \"/usr/local/sbin/nova-fullstack\", line 814, in main\n    (verify_creation_time, new_vm) = verify_create(\n  File \"/usr/local/sbin/nova-fullstack\", line 439, in verify_create\n    raise Exception(f\"Creation of {new_vm.id} timed out\")\nException: Creation of 32a32ace-d51d-4bb7-900d-cd73f9a45e02 timed out"}
Oct 07 05:10:08 cloudcontrol1005 nova-fullstack[3578201]: @cee: {"ecs.version": "1.7.0", "log.level": "INFO", "log.origin.file.line": 726, "log.origin.file.name": "nova-fullstack", "log.origin.file.path": "/usr/local/sbin/nova-fullstack", "log.origin.function": "cleanup_leaked_vms", "labels": {"test_hostname": "fullstackd-20221007051007"}, "message": "There are 2 (11 allowed) leaked instances with prefix fullstackd", "process.name": "MainProcess", "process.thread.id": 3578201, "process.thread.name": "MainThread", "timestamp": "2022-10-07T05:10:08.144425"}
Oct 07 05:25:13 cloudcontrol1005 nova-fullstack[3578201]: @cee: {"ecs.version": "1.7.0", "log.level": "ERROR", "log.origin.file.line": 916, "log.origin.file.name": "nova-fullstack", "log.origin.file.path": "/usr/local/sbin/nova-fullstack", "log.origin.function": "main", "labels": {"test_hostname": "fullstackd-20221007051007"}, "message": "fullstackd-20221007051007 failed, leaking", "process.name": "MainProcess", "process.thread.id": 3578201, "process.thread.name": "MainThread", "timestamp": "2022-10-07T05:25:13.532800", "error.stack_trace": "Traceback (most recent call last):\n  File \"/usr/local/sbin/nova-fullstack\", line 814, in main\n    (verify_creation_time, new_vm) = verify_create(\n  File \"/usr/local/sbin/nova-fullstack\", line 439, in verify_create\n    raise Exception(f\"Creation of {new_vm.id} timed out\")\nException: Creation of 0235ca10-60b7-408e-a6ac-323b6ada6756 timed out"}
Oct 07 05:30:14 cloudcontrol1005 nova-fullstack[3578201]: @cee: {"ecs.version": "1.7.0", "log.level": "INFO", "log.origin.file.line": 726, "log.origin.file.name": "nova-fullstack", "log.origin.file.path": "/usr/local/sbin/nova-fullstack", "log.origin.function": "cleanup_leaked_vms", "labels": {"test_hostname": "fullstackd-20221007053013"}, "message": "There are 3 (11 allowed) leaked instances with prefix fullstackd", "process.name": "MainProcess", "process.thread.id": 3578201, "process.thread.name": "MainThread", "timestamp": "2022-10-07T05:30:14.208822"}
Oct 07 05:45:18 cloudcontrol1005 nova-fullstack[3578201]: @cee: {"ecs.version": "1.7.0", "log.level": "ERROR", "log.origin.file.line": 916, "log.origin.file.name": "nova-fullstack", "log.origin.file.path": "/usr/local/sbin/nova-fullstack", "log.origin.function": "main", "labels": {"test_hostname": "fullstackd-20221007053013"}, "message": "fullstackd-20221007053013 failed, leaking", "process.name": "MainProcess", "process.thread.id": 3578201, "process.thread.name": "MainThread", "timestamp": "2022-10-07T05:45:18.141277", "error.stack_trace": "Traceback (most recent call last):\n  File \"/usr/local/sbin/nova-fullstack\", line 814, in main\n    (verify_creation_time, new_vm) = verify_create(\n  File \"/usr/local/sbin/nova-fullstack\", line 439, in verify_create\n    raise Exception(f\"Creation of {new_vm.id} timed out\")\nException: Creation of e85b1254-f83e-4931-b1cc-8ba9878874e6 timed out"}
Oct 07 05:50:18 cloudcontrol1005 nova-fullstack[3578201]: @cee: {"ecs.version": "1.7.0", "log.level": "INFO", "log.origin.file.line": 726, "log.origin.file.name": "nova-fullstack", "log.origin.file.path": "/usr/local/sbin/nova-fullstack", "log.origin.function": "cleanup_leaked_vms", "labels": {"test_hostname": "fullstackd-20221007055018"}, "message": "There are 4 (11 allowed) leaked instances with prefix fullstackd", "process.name": "MainProcess", "process.thread.id": 3578201, "process.thread.name": "MainThread", "timestamp": "2022-10-07T05:50:18.675278"}
Oct 07 06:05:22 cloudcontrol1005 nova-fullstack[3578201]: @cee: {"ecs.version": "1.7.0", "log.level": "ERROR", "log.origin.file.line": 916, "log.origin.file.name": "nova-fullstack", "log.origin.file.path": "/usr/local/sbin/nova-fullstack", "log.origin.function": "main", "labels": {"test_hostname": "fullstackd-20221007055018"}, "message": "fullstackd-20221007055018 failed, leaking", "process.name": "MainProcess", "process.thread.id": 3578201, "process.thread.name": "MainThread", "timestamp": "2022-10-07T06:05:22.650843", "error.stack_trace": "Traceback (most recent call last):\n  File \"/usr/local/sbin/nova-fullstack\", line 814, in main\n    (verify_creation_time, new_vm) = verify_create(\n  File \"/usr/local/sbin/nova-fullstack\", line 439, in verify_create\n    raise Exception(f\"Creation of {new_vm.id} timed out\")\nException: Creation of 854991ac-669a-4564-9192-e7a924f10646 timed out"}
Oct 07 06:10:23 cloudcontrol1005 nova-fullstack[3578201]: @cee: {"ecs.version": "1.7.0", "log.level": "INFO", "log.origin.file.line": 726, "log.origin.file.name": "nova-fullstack", "log.origin.file.path": "/usr/local/sbin/nova-fullstack", "log.origin.function": "cleanup_leaked_vms", "labels": {"test_hostname": "fullstackd-20221007061022"}, "message": "There are 5 (11 allowed) leaked instances with prefix fullstackd", "process.name": "MainProcess", "process.thread.id": 3578201, "process.thread.name": "MainThread", "timestamp": "2022-10-07T06:10:23.365706"}
Oct 07 06:25:26 cloudcontrol1005 nova-fullstack[3578201]: @cee: {"ecs.version": "1.7.0", "log.level": "ERROR", "log.origin.file.line": 916, "log.origin.file.name": "nova-fullstack", "log.origin.file.path": "/usr/local/sbin/nova-fullstack", "log.origin.function": "main", "labels": {"test_hostname": "fullstackd-20221007061022"}, "message": "fullstackd-20221007061022 failed, leaking", "process.name": "MainProcess", "process.thread.id": 3578201, "process.thread.name": "MainThread", "timestamp": "2022-10-07T06:25:26.830298", "error.stack_trace": "Traceback (most recent call last):\n  File \"/usr/local/sbin/nova-fullstack\", line 814, in main\n    (verify_creation_time, new_vm) = verify_create(\n  File \"/usr/local/sbin/nova-fullstack\", line 439, in verify_create\n    raise Exception(f\"Creation of {new_vm.id} timed out\")\nException: Creation of 6256f812-fcc1-4a13-ae62-63dab3672ed1 timed out"}
Oct 07 06:30:27 cloudcontrol1005 nova-fullstack[3578201]: @cee: {"ecs.version": "1.7.0", "log.level": "INFO", "log.origin.file.line": 726, "log.origin.file.name": "nova-fullstack", "log.origin.file.path": "/usr/local/sbin/nova-fullstack", "log.origin.function": "cleanup_leaked_vms", "labels": {"test_hostname": "fullstackd-20221007063027"}, "message": "There are 6 (11 allowed) leaked instances with prefix fullstackd", "process.name": "MainProcess", "process.thread.id": 3578201, "process.thread.name": "MainThread", "timestamp": "2022-10-07T06:30:27.416135"}
Oct 07 06:45:31 cloudcontrol1005 nova-fullstack[3578201]: @cee: {"ecs.version": "1.7.0", "log.level": "ERROR", "log.origin.file.line": 916, "log.origin.file.name": "nova-fullstack", "log.origin.file.path": "/usr/local/sbin/nova-fullstack", "log.origin.function": "main", "labels": {"test_hostname": "fullstackd-20221007063027"}, "message": "fullstackd-20221007063027 failed, leaking", "process.name": "MainProcess", "process.thread.id": 3578201, "process.thread.name": "MainThread", "timestamp": "2022-10-07T06:45:31.067328", "error.stack_trace": "Traceback (most recent call last):\n  File \"/usr/local/sbin/nova-fullstack\", line 814, in main\n    (verify_creation_time, new_vm) = verify_create(\n  File \"/usr/local/sbin/nova-fullstack\", line 439, in verify_create\n    raise Exception(f\"Creation of {new_vm.id} timed out\")\nException: Creation of 95e1b570-e023-4999-8a0c-31e7f4df09e7 timed out"}
Oct 07 06:50:31 cloudcontrol1005 nova-fullstack[3578201]: @cee: {"ecs.version": "1.7.0", "log.level": "INFO", "log.origin.file.line": 726, "log.origin.file.name": "nova-fullstack", "log.origin.file.path": "/usr/local/sbin/nova-fullstack", "log.origin.function": "cleanup_leaked_vms", "labels": {"test_hostname": "fullstackd-20221007065031"}, "message": "There are 7 (11 allowed) leaked instances with prefix fullstackd", "process.name": "MainProcess", "process.thread.id": 3578201, "process.thread.name": "MainThread", "timestamp": "2022-10-07T06:50:31.756343"}
Oct 07 07:05:36 cloudcontrol1005 nova-fullstack[3578201]: @cee: {"ecs.version": "1.7.0", "log.level": "ERROR", "log.origin.file.line": 916, "log.origin.file.name": "nova-fullstack", "log.origin.file.path": "/usr/local/sbin/nova-fullstack", "log.origin.function": "main", "labels": {"test_hostname": "fullstackd-20221007065031"}, "message": "fullstackd-20221007065031 failed, leaking", "process.name": "MainProcess", "process.thread.id": 3578201, "process.thread.name": "MainThread", "timestamp": "2022-10-07T07:05:36.559650", "error.stack_trace": "Traceback (most recent call last):\n  File \"/usr/local/sbin/nova-fullstack\", line 814, in main\n    (verify_creation_time, new_vm) = verify_create(\n  File \"/usr/local/sbin/nova-fullstack\", line 439, in verify_create\n    raise Exception(f\"Creation of {new_vm.id} timed out\")\nException: Creation of abbee604-34ae-4ea9-bab4-8deb35b915b2 timed out"}
Oct 07 07:10:37 cloudcontrol1005 nova-fullstack[3578201]: @cee: {"ecs.version": "1.7.0", "log.level": "INFO", "log.origin.file.line": 726, "log.origin.file.name": "nova-fullstack", "log.origin.file.path": "/usr/local/sbin/nova-fullstack", "log.origin.function": "cleanup_leaked_vms", "labels": {"test_hostname": "fullstackd-20221007071036"}, "message": "There are 8 (11 allowed) leaked instances with prefix fullstackd", "process.name": "MainProcess", "process.thread.id": 3578201, "process.thread.name": "MainThread", "timestamp": "2022-10-07T07:10:37.229925"}
Oct 07 07:25:42 cloudcontrol1005 nova-fullstack[3578201]: @cee: {"ecs.version": "1.7.0", "log.level": "ERROR", "log.origin.file.line": 916, "log.origin.file.name": "nova-fullstack", "log.origin.file.path": "/usr/local/sbin/nova-fullstack", "log.origin.function": "main", "labels": {"test_hostname": "fullstackd-20221007071036"}, "message": "fullstackd-20221007071036 failed, leaking", "process.name": "MainProcess", "process.thread.id": 3578201, "process.thread.name": "MainThread", "timestamp": "2022-10-07T07:25:42.742903", "error.stack_trace": "Traceback (most recent call last):\n  File \"/usr/local/sbin/nova-fullstack\", line 814, in main\n    (verify_creation_time, new_vm) = verify_create(\n  File \"/usr/local/sbin/nova-fullstack\", line 439, in verify_create\n    raise Exception(f\"Creation of {new_vm.id} timed out\")\nException: Creation of 04bf478b-2a79-4772-a5b9-fb329f7b62f1 timed out"}
Oct 07 07:30:43 cloudcontrol1005 nova-fullstack[3578201]: @cee: {"ecs.version": "1.7.0", "log.level": "INFO", "log.origin.file.line": 726, "log.origin.file.name": "nova-fullstack", "log.origin.file.path": "/usr/local/sbin/nova-fullstack", "log.origin.function": "cleanup_leaked_vms", "labels": {"test_hostname": "fullstackd-20221007073043"}, "message": "There are 9 (11 allowed) leaked instances with prefix fullstackd", "process.name": "MainProcess", "process.thread.id": 3578201, "process.thread.name": "MainThread", "timestamp": "2022-10-07T07:30:43.444478"}
Oct 07 07:45:46 cloudcontrol1005 nova-fullstack[3578201]: @cee: {"ecs.version": "1.7.0", "log.level": "ERROR", "log.origin.file.line": 916, "log.origin.file.name": "nova-fullstack", "log.origin.file.path": "/usr/local/sbin/nova-fullstack", "log.origin.function": "main", "labels": {"test_hostname": "fullstackd-20221007073043"}, "message": "fullstackd-20221007073043 failed, leaking", "process.name": "MainProcess", "process.thread.id": 3578201, "process.thread.name": "MainThread", "timestamp": "2022-10-07T07:45:46.896295", "error.stack_trace": "Traceback (most recent call last):\n  File \"/usr/local/sbin/nova-fullstack\", line 814, in main\n    (verify_creation_time, new_vm) = verify_create(\n  File \"/usr/local/sbin/nova-fullstack\", line 439, in verify_create\n    raise Exception(f\"Creation of {new_vm.id} timed out\")\nException: Creation of 8a03da47-6934-4731-b24e-b1bf5cf7599d timed out"}
Oct 07 07:50:47 cloudcontrol1005 nova-fullstack[3578201]: @cee: {"ecs.version": "1.7.0", "log.level": "INFO", "log.origin.file.line": 726, "log.origin.file.name": "nova-fullstack", "log.origin.file.path": "/usr/local/sbin/nova-fullstack", "log.origin.function": "cleanup_leaked_vms", "labels": {"test_hostname": "fullstackd-20221007075047"}, "message": "There are 10 (11 allowed) leaked instances with prefix fullstackd", "process.name": "MainProcess", "process.thread.id": 3578201, "process.thread.name": "MainThread", "timestamp": "2022-10-07T07:50:47.774425"}
Oct 07 08:05:51 cloudcontrol1005 nova-fullstack[3578201]: @cee: {"ecs.version": "1.7.0", "log.level": "ERROR", "log.origin.file.line": 916, "log.origin.file.name": "nova-fullstack", "log.origin.file.path": "/usr/local/sbin/nova-fullstack", "log.origin.function": "main", "labels": {"test_hostname": "fullstackd-20221007075047"}, "message": "fullstackd-20221007075047 failed, leaking", "process.name": "MainProcess", "process.thread.id": 3578201, "process.thread.name": "MainThread", "timestamp": "2022-10-07T08:05:51.861824", "error.stack_trace": "Traceback (most recent call last):\n  File \"/usr/local/sbin/nova-fullstack\", line 814, in main\n    (verify_creation_time, new_vm) = verify_create(\n  File \"/usr/local/sbin/nova-fullstack\", line 439, in verify_create\n    raise Exception(f\"Creation of {new_vm.id} timed out\")\nException: Creation of 1b35e954-7d2b-4c54-afd1-3fe48db68ca7 timed out"}
Oct 07 08:10:52 cloudcontrol1005 nova-fullstack[3578201]: @cee: {"ecs.version": "1.7.0", "log.level": "INFO", "log.origin.file.line": 726, "log.origin.file.name": "nova-fullstack", "log.origin.file.path": "/usr/local/sbin/nova-fullstack", "log.origin.function": "cleanup_leaked_vms", "labels": {"test_hostname": "fullstackd-20221007081052"}, "message": "There are 11 (11 allowed) leaked instances with prefix fullstackd", "process.name": "MainProcess", "process.thread.id": 3578201, "process.thread.name": "MainThread", "timestamp": "2022-10-07T08:10:52.463235"}
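Every leak above has the same shape: `verify_create` polls the new VM and raises `Creation of <uuid> timed out` when it never reaches a ready state, leaving the half-built VM behind. A generic sketch of that wait-with-deadline pattern (assumed structure inferred from the tracebacks, not the actual script):

```python
import time


def wait_for_status(get_status, want="ACTIVE", timeout=600, interval=10):
    """Poll `get_status()` until it returns `want`, raising if the VM
    errors out or the deadline passes — the same failure mode seen as
    "Creation of <uuid> timed out" in the logs above."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status == want:
            return True
        if status == "ERROR":
            raise Exception("creation failed")
        time.sleep(interval)
    raise Exception("creation timed out")
```

When the message bus is dropping traffic, the compute host never flips the instance to ACTIVE, so every run hits the timeout branch and leaks one more `fullstackd-*` VM, which matches the counter climbing from 2 to 11 above.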

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2022-10-07T08:49:31Z] <arturo> cleaning up a bunch of leaked VMs on "BUILD" status (T320232)

Mentioned in SAL (#wikimedia-cloud) [2022-10-07T09:03:23Z] <arturo> restarted nova-fullstack.service on cloudcontrol1005 (T320232)

Something is clearly wrong with rabbitmq:

aborrero@cumin1001:~ $ sudo cumin --force cloudrabbit1* 'grep error /var/log/rabbitmq/rabbit@cloudrabbit*.log | wc -l'
3 hosts will be targeted:
cloudrabbit[1001-1003].wikimedia.org
FORCE mode enabled, continuing without confirmation
===== NODE GROUP =====                                                                                                                                       
(1) cloudrabbit1002.wikimedia.org                                                                                                                            
----- OUTPUT of 'grep error /var/...bit*.log | wc -l' -----                                                                                                  
93299112                                                                                                                                                     
===== NODE GROUP =====                                                                                                                                       
(1) cloudrabbit1003.wikimedia.org                                                                                                                            
----- OUTPUT of 'grep error /var/...bit*.log | wc -l' -----                                                                                                  
26075                                                                                                                                                        
===== NODE GROUP =====                                                                                                                                       
(1) cloudrabbit1001.wikimedia.org                                                                                                                            
----- OUTPUT of 'grep error /var/...bit*.log | wc -l' -----                                                                                                  
38339                                                                                                                                                        
================

Nearly all of those errors are from today alone:

aborrero@cumin1001:~ $ sudo cumin --force cloudrabbit1* 'grep error /var/log/rabbitmq/rabbit@cloudrabbit*.log | grep 2022-10-07 | wc -l'
3 hosts will be targeted:
cloudrabbit[1001-1003].wikimedia.org
FORCE mode enabled, continuing without confirmation
===== NODE GROUP =====                                                                                                                                       
(1) cloudrabbit1002.wikimedia.org                                                                                                                            
----- OUTPUT of 'grep error /var/...22-10-07 | wc -l' -----                                                                                                  
93461959                                                                                                                                                     
===== NODE GROUP =====                                                                                                                                       
(1) cloudrabbit1003.wikimedia.org                                                                                                                            
----- OUTPUT of 'grep error /var/...22-10-07 | wc -l' -----                                                                                                  
26128                                                                                                                                                        
===== NODE GROUP =====                                                                                                                                       
(1) cloudrabbit1001.wikimedia.org                                                                                                                            
----- OUTPUT of 'grep error /var/...22-10-07 | wc -l' -----                                                                                                  
38408                                                                                                                                                        
================
aborrero renamed this task from "nova-fullstack leaking VMs" to "nova-fullstack leaking VMs: rabbitmq may have problems". Oct 7 2022, 9:35 AM

Mentioned in SAL (#wikimedia-cloud) [2022-10-07T09:45:31Z] <arturo> restarting rabbitmq-server.service @ cloudrabbit1002 (T320232)

Mentioned in SAL (#wikimedia-cloud) [2022-10-07T10:19:16Z] <arturo> restarting nova-conductor in all 3 cloudcontrols (T320232)

The errors mentioning "Discarding message [...] in an old incarnation" are only happening on cloudrabbit1002 (this Erlang distribution message typically appears after a node restarts and messages addressed to its previous incarnation are still arriving):

fnegri@cumin1001:~$ sudo cumin cloudrabbit1* 'grep incarnation /var/log/rabbitmq/rabbit@cloudrabbit*.log | grep 2022-10-07 | wc -l'
3 hosts will be targeted:
cloudrabbit[1001-1003].wikimedia.org
Ok to proceed on 3 hosts? Enter the number of affected hosts to confirm or "q" to quit 3
===== NODE GROUP =====
(1) cloudrabbit1002.wikimedia.org
----- OUTPUT of 'grep incarnation...22-10-07 | wc -l' -----
101077647
===== NODE GROUP =====
(2) cloudrabbit[1001,1003].wikimedia.org
----- OUTPUT of 'grep incarnation...22-10-07 | wc -l' -----
0
================

Mentioned in SAL (#wikimedia-cloud) [2022-10-07T10:24:40Z] <arturo> stopping rabbitmq-server.service @ cloudrabbit1002 (T320232)

For the record:

aborrero@cloudrabbit1002:~$ sudo rabbitmqctl cluster_status
Cluster status of node rabbit@cloudrabbit1002 ...
Basics

Cluster name: rabbit@cloudrabbit1001.wikimedia.org

Disk Nodes

rabbit@cloudrabbit1001
rabbit@cloudrabbit1002
rabbit@cloudrabbit1003

Running Nodes

rabbit@cloudrabbit1001
rabbit@cloudrabbit1002
rabbit@cloudrabbit1003

Versions

rabbit@cloudrabbit1001: RabbitMQ 3.8.9 on Erlang 23.2.6
rabbit@cloudrabbit1002: RabbitMQ 3.8.9 on Erlang 23.2.6
rabbit@cloudrabbit1003: RabbitMQ 3.8.9 on Erlang 23.2.6

Maintenance status

Node: rabbit@cloudrabbit1001, status: not under maintenance
Node: rabbit@cloudrabbit1002, status: not under maintenance
Node: rabbit@cloudrabbit1003, status: not under maintenance

Alarms

(none)

Network Partitions

(none)

Listeners

Node: rabbit@cloudrabbit1001, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@cloudrabbit1001, interface: [::], port: 15692, protocol: http/prometheus, purpose: Prometheus exporter API over HTTP
Node: rabbit@cloudrabbit1001, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@cloudrabbit1001, interface: [::], port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Node: rabbit@cloudrabbit1001, interface: [::], port: 5671, protocol: amqp/ssl, purpose: AMQP 0-9-1 and AMQP 1.0 over TLS
Node: rabbit@cloudrabbit1002, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@cloudrabbit1002, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@cloudrabbit1002, interface: [::], port: 15692, protocol: http/prometheus, purpose: Prometheus exporter API over HTTP
Node: rabbit@cloudrabbit1002, interface: [::], port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Node: rabbit@cloudrabbit1002, interface: [::], port: 5671, protocol: amqp/ssl, purpose: AMQP 0-9-1 and AMQP 1.0 over TLS
Node: rabbit@cloudrabbit1003, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@cloudrabbit1003, interface: [::], port: 15692, protocol: http/prometheus, purpose: Prometheus exporter API over HTTP
Node: rabbit@cloudrabbit1003, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@cloudrabbit1003, interface: [::], port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Node: rabbit@cloudrabbit1003, interface: [::], port: 5671, protocol: amqp/ssl, purpose: AMQP 0-9-1 and AMQP 1.0 over TLS

Feature flags

Flag: drop_unroutable_metric, state: enabled
Flag: empty_basic_get_metric, state: enabled
Flag: implicit_default_bindings, state: enabled
Flag: maintenance_mode_status, state: enabled
Flag: quorum_queue, state: enabled
Flag: virtual_host_metadata, state: enabled

Change 840107 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloud: eqiad1: depool rabbitmq02

https://gerrit.wikimedia.org/r/840107

Mentioned in SAL (#wikimedia-cloud) [2022-10-07T11:33:21Z] <arturo> rabbitmq-server.service @ cloudrabbit1002 is again up and running (T320232)

fnegri changed the task status from Open to In Progress. Oct 7 2022, 1:41 PM

I performed a complete reset of the RabbitMQ cluster, following a procedure @Andrew had used before. I documented the steps here.
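The documented procedure isn't reproduced in this task. For orientation only, a typical per-node RabbitMQ reset-and-rejoin sequence looks roughly like the following sketch (standard `rabbitmqctl` subcommands; the actual documented steps for this cluster may differ, and `reset` is destructive — it wipes the node's queues and cluster membership):

```shell
# On the node being reset (e.g. cloudrabbit1002):
sudo rabbitmqctl stop_app
sudo rabbitmqctl reset          # destructive: wipes this node's state
sudo rabbitmqctl start_app

# To re-join the wiped node to the cluster:
sudo rabbitmqctl stop_app
sudo rabbitmqctl join_cluster rabbit@cloudrabbit1001
sudo rabbitmqctl start_app

# Verify all three nodes are back as running disk nodes:
sudo rabbitmqctl cluster_status
```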

Change 840107 abandoned by Arturo Borrero Gonzalez:

[operations/puppet@production] cloud: eqiad1: depool rabbitmq02

Reason:

no longer required

https://gerrit.wikimedia.org/r/840107