
cloudcontrol rabbitmq very busy
Closed, ResolvedPublic

Description

While investigating T271058 I'm seeing lots of warnings like this in the OpenStack server logs:

Function 'neutron.agent.dhcp.agent.DhcpAgentWithStateReport._report_state' run outlasted interval by 0.57 sec

That seems to be related to rabbitmq congestion. On the cloudcontrols I see rabbitmq using >100% CPU.
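To gauge how often these warnings fire and how large the overruns are, the quoted line can be parsed with a short script. This is only a sketch: it assumes the warning text always matches the example above, and the helper names `parse_outlasted` and `summarize` are made up for illustration.

```python
import re
from collections import Counter

# Pattern built from the single warning quoted in this task; other
# variants of the message (if any exist) would not match.
OUTLASTED_RE = re.compile(
    r"Function '(?P<func>[^']+)' run outlasted interval by "
    r"(?P<secs>\d+(?:\.\d+)?) sec"
)


def parse_outlasted(line):
    """Return (function_name, overrun_seconds), or None for other lines."""
    match = OUTLASTED_RE.search(line)
    if match is None:
        return None
    return match.group("func"), float(match.group("secs"))


def summarize(lines):
    """Count warnings per function and record the largest overrun seen."""
    counts = Counter()
    worst = {}
    for line in lines:
        parsed = parse_outlasted(line)
        if parsed is None:
            continue
        func, secs = parsed
        counts[func] += 1
        worst[func] = max(worst.get(func, 0.0), secs)
    return counts, worst
```

Feeding it the lines of a log file (for example `summarize(open(path))`, where `path` is whichever neutron log is being checked) gives a per-function count that can be compared before and after a change.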

Event Timeline

Restricted Application added a subscriber: Aklapper.

Change 655275 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] OpenStack haproxy: change http service health check interval to 3s

https://gerrit.wikimedia.org/r/655275

Change 655275 merged by Andrew Bogott:
[operations/puppet@production] OpenStack haproxy: change http service health check interval to 3s

https://gerrit.wikimedia.org/r/655275

Change 655277 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] OpenStack rabbitmq: set busy wait threshold to 'none'

https://gerrit.wikimedia.org/r/655277

Change 655277 merged by Andrew Bogott:
[operations/puppet@production] OpenStack rabbitmq: set busy wait threshold to 'none'

https://gerrit.wikimedia.org/r/655277

Change 655278 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] When changing rabbitmq-env.conf, notify rabbit service

https://gerrit.wikimedia.org/r/655278

Change 655278 merged by Andrew Bogott:
[operations/puppet@production] When changing rabbitmq-env.conf, notify rabbit service

https://gerrit.wikimedia.org/r/655278

Andrew claimed this task.
Andrew removed a project: Patch-For-Review.

The attached patches seem to have helped, but what REALLY helped was killing off the two rogue rabbitmq processes running on cloudcontrol1003 that didn't die when I did a service restart.
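For anyone wanting to check for leftovers like this after a restart, here is a rough sketch. It is not what was run here. It assumes the input is the text output of `ps -eo pid,etimes,args` (PID, elapsed seconds, command line) and that matching the string "rabbitmq" in the command line is a good enough heuristic; `find_stale_processes` is a hypothetical helper name.

```python
def find_stale_processes(ps_output, seconds_since_restart, needle="rabbitmq"):
    """Return (pid, elapsed_seconds, command) for matching processes that
    have been running longer than the time since the service restart.

    ps_output: text as printed by `ps -eo pid,etimes,args` (assumed format).
    seconds_since_restart: how long ago the service restart was issued.
    needle: substring looked for in the command line (a heuristic).
    """
    stale = []
    for raw in ps_output.splitlines():
        parts = raw.split(None, 2)
        if len(parts) < 3:
            continue
        pid_text, elapsed_text, command = parts
        # Skips the header row and anything else that is not numeric.
        if not (pid_text.isdigit() and elapsed_text.isdigit()):
            continue
        if needle not in command:
            continue
        elapsed = int(elapsed_text)
        # A process older than the restart cannot have been started by it.
        if elapsed > seconds_since_restart:
            stale.append((int(pid_text), elapsed, command))
    return stale
```

Anything it returns is only a candidate: the command line should be checked by hand before killing a process.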

I don't see any 'outlasted' complaints in neutron logs anymore so I think this is resolved.

aborrero added a subscriber: aborrero.

There are several of them:

[Attached image: image.png]

cloudcontrol1004 is still showing some 'outlasted' warnings, but cloudcontrol1003 hasn't had any for a few days. This might correspond with David removing those backup jobs that were gumming everything up. Let's see if 1004 is happy too after the firmware upgrade and reboot pending for T271058.

Andrew lowered the priority of this task from High to Medium. (Mar 9 2021, 5:37 PM)

This doesn't seem to be causing problems anymore.