
rabbitmq: connectivity issues between cloudservices1004 and rabbitmq
Closed, Resolved · Public

Description

I just spotted this in cloudservices1004 logs:

/var/log/designate/designate-sink.log
2019-06-04 12:05:31.315 117528 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on cloudcontrol1004.wikimedia.org:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.
2019-06-04 12:05:31.316 117528 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on cloudcontrol1003.wikimedia.org:5672 is unreachable: Too many heartbeats missed. Trying again in 1 seconds.
2019-06-04 12:09:53.474 117528 ERROR oslo_messaging.notify.dispatcher [req-6b8b50b2-2a33-4577-9897-e65d5bfd5174 - - - - -] Exception during message handling
2019-06-04 12:09:53.474 117528 ERROR oslo_messaging.notify.dispatcher Traceback (most recent call last):
2019-06-04 12:09:53.474 117528 ERROR oslo_messaging.notify.dispatcher   File "/usr/lib/python2.7/dist-packages/oslo_messaging/notify/dispatcher.py", line 86, in _dispatch_and_handle_error
2019-06-04 12:09:53.474 117528 ERROR oslo_messaging.notify.dispatcher     return self._dispatch(incoming)
2019-06-04 12:09:53.474 117528 ERROR oslo_messaging.notify.dispatcher   File "/usr/lib/python2.7/dist-packages/oslo_messaging/notify/dispatcher.py", line 122, in _dispatch
2019-06-04 12:09:53.474 117528 ERROR oslo_messaging.notify.dispatcher     ret = self._exec_callback(callback, filtered_messages)
2019-06-04 12:09:53.474 117528 ERROR oslo_messaging.notify.dispatcher   File "/usr/lib/python2.7/dist-packages/oslo_messaging/notify/dispatcher.py", line 169, in _exec_callback
2019-06-04 12:09:53.474 117528 ERROR oslo_messaging.notify.dispatcher     messages[0]["metadata"])
2019-06-04 12:09:53.474 117528 ERROR oslo_messaging.notify.dispatcher   File "/usr/lib/python2.7/dist-packages/oslo_messaging/notify/dispatcher.py", line 129, in _exec_callback
2019-06-04 12:09:53.474 117528 ERROR oslo_messaging.notify.dispatcher     ret = callback(*args)
2019-06-04 12:09:53.474 117528 ERROR oslo_messaging.notify.dispatcher   File "/usr/lib/python2.7/dist-packages/designate/sink/service.py", line 116, in info
2019-06-04 12:09:53.474 117528 ERROR oslo_messaging.notify.dispatcher     handler.process_notification(context, event_type, payload)
2019-06-04 12:09:53.474 117528 ERROR oslo_messaging.notify.dispatcher   File "/usr/lib/python2.7/dist-packages/wmf_sink/wmfsink.py", line 83, in process_notification
2019-06-04 12:09:53.474 117528 ERROR oslo_messaging.notify.dispatcher     resource_type='instance')
2019-06-04 12:09:53.474 117528 ERROR oslo_messaging.notify.dispatcher   File "/usr/lib/python2.7/dist-packages/wmf_sink/base.py", line 110, in _delete
2019-06-04 12:09:53.474 117528 ERROR oslo_messaging.notify.dispatcher     ip = records[0].data
2019-06-04 12:09:53.474 117528 ERROR oslo_messaging.notify.dispatcher   File "/usr/lib/python2.7/dist-packages/designate/objects/base.py", line 570, in __getitem__
2019-06-04 12:09:53.474 117528 ERROR oslo_messaging.notify.dispatcher     return self.objects[index]
2019-06-04 12:09:53.474 117528 ERROR oslo_messaging.notify.dispatcher IndexError: list index out of range
2019-06-04 12:09:53.474 117528 ERROR oslo_messaging.notify.dispatcher 
2019-06-04 12:09:53.477 117528 ERROR oslo_messaging.notify.dispatcher [req-6b8b50b2-2a33-4577-9897-e65d5bfd5174 - - - - -] Fail to ack/requeue message
2019-06-04 12:09:53.477 117528 ERROR oslo_messaging.notify.dispatcher Traceback (most recent call last):
2019-06-04 12:09:53.477 117528 ERROR oslo_messaging.notify.dispatcher   File "/usr/lib/python2.7/dist-packages/oslo_messaging/notify/dispatcher.py", line 75, in _post_dispatch
2019-06-04 12:09:53.477 117528 ERROR oslo_messaging.notify.dispatcher     m.acknowledge()
2019-06-04 12:09:53.477 117528 ERROR oslo_messaging.notify.dispatcher   File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 121, in acknowledge
2019-06-04 12:09:53.477 117528 ERROR oslo_messaging.notify.dispatcher     self.message.acknowledge()
2019-06-04 12:09:53.477 117528 ERROR oslo_messaging.notify.dispatcher   File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/impl_rabbit.py", line 239, in acknowledge
2019-06-04 12:09:53.477 117528 ERROR oslo_messaging.notify.dispatcher     self._raw_message.ack()
2019-06-04 12:09:53.477 117528 ERROR oslo_messaging.notify.dispatcher   File "/usr/lib/python2.7/dist-packages/kombu/message.py", line 88, in ack
2019-06-04 12:09:53.477 117528 ERROR oslo_messaging.notify.dispatcher     self.channel.basic_ack(self.delivery_tag)
2019-06-04 12:09:53.477 117528 ERROR oslo_messaging.notify.dispatcher   File "/usr/lib/python2.7/dist-packages/amqp/channel.py", line 1581, in basic_ack
2019-06-04 12:09:53.477 117528 ERROR oslo_messaging.notify.dispatcher     self._send_method((60, 80), args)
2019-06-04 12:09:53.477 117528 ERROR oslo_messaging.notify.dispatcher   File "/usr/lib/python2.7/dist-packages/amqp/abstract_channel.py", line 50, in _send_method
2019-06-04 12:09:53.477 117528 ERROR oslo_messaging.notify.dispatcher     raise RecoverableConnectionError('connection already closed')
2019-06-04 12:09:53.477 117528 ERROR oslo_messaging.notify.dispatcher RecoverableConnectionError: connection already closed
2019-06-04 12:09:53.477 117528 ERROR oslo_messaging.notify.dispatcher

The error is repeated many times. We apparently have the correct firewall rules on cloudcontrol1003/1004:

aborrero@cloudcontrol1003:~ $ sudo ip6tables-save -c | grep 2620:0:861:1:208:80:154:11
[0:0] -A INPUT -s 2620:0:861:1:208:80:154:11/128 -p tcp -m tcp --dport 35357 -j ACCEPT
[0:0] -A INPUT -s 2620:0:861:1:208:80:154:11/128 -p tcp -m tcp --dport 35357 -j ACCEPT
[3:240] -A INPUT -s 2620:0:861:1:208:80:154:11/128 -p tcp -m tcp --dport 5672 -j ACCEPT
aborrero@cloudcontrol1003:~ $ sudo iptables-save -c | grep 208.80.154.11
[0:0] -A INPUT -s 208.80.154.11/32 -p tcp -m tcp --dport 35357 -j ACCEPT
[0:0] -A INPUT -s 208.80.154.11/32 -p tcp -m tcp --dport 35357 -j ACCEPT
[0:0] -A INPUT -s 208.80.154.11/32 -p tcp -m tcp --dport 5672 -j ACCEPT

aborrero@cloudcontrol1004:~ $ sudo iptables-save -c | grep 208.80.154.11
[0:0] -A INPUT -s 208.80.154.11/32 -p tcp -m tcp --dport 35357 -j ACCEPT
[0:0] -A INPUT -s 208.80.154.11/32 -p tcp -m tcp --dport 35357 -j ACCEPT
[0:0] -A INPUT -s 208.80.154.11/32 -p tcp -m tcp --dport 5672 -j ACCEPT
aborrero@cloudcontrol1004:~ $ sudo ip6tables-save -c | grep 2620:0:861:1:208:80:154:11
[0:0] -A INPUT -s 2620:0:861:1:208:80:154:11/128 -p tcp -m tcp --dport 35357 -j ACCEPT
[0:0] -A INPUT -s 2620:0:861:1:208:80:154:11/128 -p tcp -m tcp --dport 35357 -j ACCEPT
[2:160] -A INPUT -s 2620:0:861:1:208:80:154:11/128 -p tcp -m tcp --dport 5672 -j ACCEPT

So I wonder if we have some other filtering elsewhere on these servers.
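To rule that out, a plain TCP reachability check from cloudservices1004 towards both rabbit hosts, over both address families, should be enough (just a sketch of the commands, not yet run):

aborrero@cloudservices1004:~ $ nc -4 -zvw5 cloudcontrol1003.wikimedia.org 5672
aborrero@cloudservices1004:~ $ nc -6 -zvw5 cloudcontrol1003.wikimedia.org 5672
aborrero@cloudservices1004:~ $ nc -4 -zvw5 cloudcontrol1004.wikimedia.org 5672
aborrero@cloudservices1004:~ $ nc -6 -zvw5 cloudcontrol1004.wikimedia.org 5672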

Event Timeline

aborrero moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board.

I see at least some messages getting through, so I don't think this is exactly a connectivity issue. It could be some kind of time-out-waiting-for-IPv6 thing (although I have no evidence for that), or it might be that the designate services don't support multiple workers and are somehow knocking each other off the queue. If that were the case, though, I'd expect to see these errors on cloudservices1003 as well at least some of the time.
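One way to check that would be to look at the notification queues on one of the cloudcontrols and see how many consumers are attached and whether messages are draining (sketch only; the exact queue names depend on how designate-sink declares them):

aborrero@cloudcontrol1003:~ $ sudo rabbitmqctl list_queues name messages consumers | grep -i notification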

There's also a fair bit of this:

2019-06-05 06:51:45.070 117528 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on cloudcontrol1004.wikimedia.org:5672 is unreachable: Too many heartbeats missed. Trying again in 1 seconds.
2019-06-05 06:51:45.072 117528 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on cloudcontrol1003.wikimedia.org:5672 is unreachable: Too many heartbeats missed. Trying again in 1 seconds.
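That message comes from the oslo.messaging rabbit driver when its heartbeat thread stops hearing from the broker within the configured window, so it's worth double-checking what the client side is actually configured with. Assuming the standard config path, the relevant options are heartbeat_timeout_threshold and heartbeat_rate in the [oslo_messaging_rabbit] section:

aborrero@cloudservices1004:~ $ sudo grep -i heartbeat /etc/designate/designate.conf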

The designate HA guide is here:

https://docs.openstack.org/designate/latest/admin/ha.html

That definitely encourages us to do what we're doing (run multiple sinks on multiple hosts). However, that guide was released with Queens, and we are running Mitaka, so it's possible this setup is somehow unsupported in Mitaka.

I think that some (but possibly not all) of the noise in the -sink logs was caused by timing issues related to T225168.

Andrew renamed this task from rabbitmq: connectivity issues between cloudservices1004 and cloudcontrol1003 to rabbitmq: connectivity issues between cloudservices1004 and rabbitmq. Jun 5 2019, 10:47 PM

Change 514738 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] openstack: allow designate to access puppet encapi

https://gerrit.wikimedia.org/r/514738

Change 514738 merged by Jhedden:
[operations/puppet@production] openstack: allow designate to access puppet encapi

https://gerrit.wikimedia.org/r/514738

JHedden claimed this task.