Page MenuHomePhabricator

rabbitmq: missing heartbeats issue
Open, LowPublic

Description

@dcaro has been chasing a weird missing heartbeats issue with rabbitmq @ openstack.

Hints collected so far points to an incompatibility with the evenlet monke patching.

Some references:

  • logs in rabbitmq:
2023-09-21 09:21:37.933201+00:00 [info] <0.621.918> connection <0.621.918> (10.64.20.44:47856 -> 208.80.154.73:5671 - nova-compute:2065772:90d3aeb5-cd29-413c-8821-d254c0884316): user 'nova' authenticated and granted access to vhost '/'
heartbeat_in_pthread
    Type:
        boolean
    Default:
        False
Run the health check heartbeat thread through a native python thread by default. If this option is equal to False then the health check heartbeat will inherit the execution model from the parent process. For example if the parent process has monkey patched the stdlib by using eventlet/greenlet then the heartbeat will be run through a green thread. This option should be set to True only for the wsgi services.
  • logs in cloudcontrol1005
2023-09-21 09:28:37.019735+00:00 [error] <0.2781.1416> closing AMQP connection <0.2781.1416> (10.64.151.3:57868 -> 208.80.155.102:5671 - uwsgi:3699517:e622f610-c3cc-4145-9a54-84ac5a46bad9):
2023-09-21 09:28:37.019735+00:00 [error] <0.2781.1416> missed heartbeats from client, timeout: 60s
  • logs in cloudcontrol1005
Sep 21 09:02:18 cloudcontrol1005 nova-scheduler[3699380]: Modules with known eventlet monkey patching issues were imported prior to eventlet monkey patching: urllib3. This warning can usually be ignored if the caller is only importing and not executing nova code.
  • logs in some hypervisor:
Sep 21 09:33:39 cloudvirt-wdqs1001 nova-compute[2069944]: Modules with known eventlet monkey patching issues were imported prior to eventlet monkey patching: urllib3. This warning can usually be ignored if the caller is only importing and not executing nova code.

Event Timeline

The Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with a more specific project tag to this task. Thanks!

Aklapper renamed this task from rabbitmq: missing hearbeats issue to rabbitmq: missing heartbeats issue.Sep 21 2023, 12:19 PM
Aklapper updated the task description. (Show Details)
fgiunchedi subscribed.

No longer observed

Unfortunately I was too hasty: we do still see this, though it is unclear to me whether what the impact (if any) is

root@cloudrabbit1002:/var/log/rabbitmq# grep missed rabbit@cloudrabbit1002.private.eqiad.wikimedia.cloud.log
2026-03-31 10:48:35.300465+00:00 [error] <0.9899226.0> missed heartbeats from client, timeout: 180s
2026-03-31 13:53:25.812559+00:00 [error] <0.7820951.0> missed heartbeats from client, timeout: 180s
2026-03-31 14:56:44.079437+00:00 [error] <0.43258360.0> missed heartbeats from client, timeout: 180s
root@cloudrabbit1002:/var/log/rabbitmq# grep 0.9899226 rabbit@cloudrabbit1002.private.eqiad.wikimedia.cloud.log
2026-03-31 10:48:35.300465+00:00 [error] <0.9899226.0> closing AMQP connection <0.9899226.0> ([2a02:ec80:a000:203::18]:37996 -> [2a02:ec80:a000:203::23]:5671 - uwsgi:872731:1795ca23-86dd-4376-9fff-37f5372fa749, duration: '22H, 18M, 0s'):
2026-03-31 10:48:35.300465+00:00 [error] <0.9899226.0> missed heartbeats from client, timeout: 180s
root@cloudrabbit1002:/var/log/rabbitmq# grep 0.7820951.0 rabbit@cloudrabbit1002.private.eqiad.wikimedia.cloud.log
2026-03-31 13:53:25.812559+00:00 [error] <0.7820951.0> closing AMQP connection <0.7820951.0> ([2a02:ec80:a000:201::25]:45154 -> [2a02:ec80:a000:203::23]:5671 - uwsgi:605719:74a7a8ab-cde6-4e59-8471-0c1e57ad3821, duration: '1D, 2H, 18M, 0s'):
2026-03-31 13:53:25.812559+00:00 [error] <0.7820951.0> missed heartbeats from client, timeout: 180s
root@cloudrabbit1002:/var/log/rabbitmq# grep 0.43258360.0 rabbit@cloudrabbit1002.private.eqiad.wikimedia.cloud.log
2026-03-31 04:47:43.873140+00:00 [info] <0.43258360.0> accepting AMQP connection [2a02:ec80:a000:203::18]:60292 -> [2a02:ec80:a000:203::23]:5671
2026-03-31 04:47:43.875338+00:00 [info] <0.43258360.0> connection [2a02:ec80:a000:203::18]:60292 -> [2a02:ec80:a000:203::23]:5671 has a client-provided name: uwsgi:872732:23bd1a62-6ea5-423d-9fa8-50c1dd8d55d2
2026-03-31 04:47:43.877437+00:00 [info] <0.43258360.0> connection [2a02:ec80:a000:203::18]:60292 -> [2a02:ec80:a000:203::23]:5671 - uwsgi:872732:23bd1a62-6ea5-423d-9fa8-50c1dd8d55d2: user 'nova' authenticated and granted access to vhost '/'
2026-03-31 14:56:44.079437+00:00 [error] <0.43258360.0> closing AMQP connection <0.43258360.0> ([2a02:ec80:a000:203::18]:60292 -> [2a02:ec80:a000:203::23]:5671 - uwsgi:872732:23bd1a62-6ea5-423d-9fa8-50c1dd8d55d2, duration: '10H, 9M, 0s'):
2026-03-31 14:56:44.079437+00:00 [error] <0.43258360.0> missed heartbeats from client, timeout: 180s

re-opening though again TBD if it is an actual issue