Page MenuHomePhabricator

Eventbus POST event failures after kafka 0.9 upgrade
Closed, DuplicatePublic

Description

08:37 <icinga-wm> PROBLEM - eventlogging-service-eventbus endpoints health on kafka2002 is CRITICAL: /v1/events (Produce a valid test event) is CRITICAL: Test Produce a valid test event returned the unexpected status 500 (expecting: 201)

Kafka seems ok, but from the eventlogging logs I can see:

Jul 26 07:05:45 kafka2002 eventlogging-service-eventbus[107460]: (MainThread) 500 POST /v1/events (10.192.16.169) 2.24ms
Jul 26 07:09:44 kafka2002 eventlogging-service-eventbus[107455]: (MainThread) Unable to receive data from Kafka
                                                                 Traceback (most recent call last):
                                                                   File "/usr/lib/python2.7/dist-packages/kafka/conn.py", line 103, in _read_bytes
                                                                     raise socket.error("Not enough data to read message -- did server kill socket?")
                                                                 error: Not enough data to read message -- did server kill socket?
Jul 26 07:09:44 kafka2002 eventlogging-service-eventbus[107455]: (MainThread) ConnectionError attempting to receive a response to request 4 from server BrokerMetadata(nodeId=2001, host='kafka2001.codfw.wmnet', port=9092): Kafka @ kafka2001.codfw.wmnet:9092 went away
Jul 26 07:09:44 kafka2002 eventlogging-service-eventbus[107455]: (MainThread) Unable to send messages
                                                                 Traceback (most recent call last):
                                                                   File "/usr/lib/python2.7/dist-packages/kafka/producer/base.py", line 390, in _send_messages
                                                                     fail_on_error=self.sync_fail_on_error
                                                                   File "/usr/lib/python2.7/dist-packages/kafka/client.py", line 480, in send_produce_request
                                                                     (not fail_on_error or not self._raise_on_response_error(resp))]
                                                                   File "/usr/lib/python2.7/dist-packages/kafka/client.py", line 247, in _raise_on_response_error
                                                                     raise resp
                                                                 FailedPayloadsError
Jul 26 07:09:44 kafka2002 eventlogging-service-eventbus[107455]: (MainThread) Uncaught exception POST /v1/events (10.192.16.169)
                                                                 HTTPServerRequest(protocol='http', host='10.192.16.169', method='POST', uri='/v1/events', version='HTTP/1.1', remote_ip='10.192.16.169', headers={'Content-Length': '105', 'Content-Type': 'application/json',
                                                                 Traceback (most recent call last):
                                                                   File "/usr/lib/python2.7/dist-packages/tornado/web.py", line 1415, in _execute
                                                                     result = yield result
                                                                   File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 870, in run
                                                                     value = future.result()
                                                                   File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 215, in result
                                                                     raise_exc_info(self._exc_info)
                                                                   File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 876, in run
                                                                     yielded = self.gen.throw(*exc_info)
                                                                   File "/srv/deployment/eventlogging/eventbus/eventlogging/service.py", line 292, in post
                                                                     self.application.handle_events, events
                                                                   File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 870, in run
                                                                     value = future.result()
                                                                   File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 215, in result
                                                                     raise_exc_info(self._exc_info)
                                                                   File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 505, in Task
                                                                     func(*args, callback=_argument_adapter(set_result), **kwargs)
                                                                   File "/srv/deployment/eventlogging/eventbus/eventlogging/service.py", line 192, in handle_events
                                                                     self.process_event(event)
                                                                   File "/srv/deployment/eventlogging/eventbus/eventlogging/service.py", line 175, in process_event
                                                                     self.send(event)
                                                                   File "/srv/deployment/eventlogging/eventbus/eventlogging/service.py", line 127, in send
                                                                     w.send(event)
                                                                   File "/srv/deployment/eventlogging/eventbus/eventlogging/handlers.py", line 253, in kafka_writer
                                                                     kafka_producer.send_messages(message_topic, message)
                                                                   File "/usr/lib/python2.7/dist-packages/kafka/producer/simple.py", line 54, in send_messages
                                                                     topic, partition, *msg
                                                                   File "/usr/lib/python2.7/dist-packages/kafka/producer/base.py", line 349, in send_messages
                                                                     return self._send_messages(topic, partition, *msg)
                                                                   File "/usr/lib/python2.7/dist-packages/kafka/producer/base.py", line 390, in _send_messages
                                                                     fail_on_error=self.sync_fail_on_error
                                                                   File "/usr/lib/python2.7/dist-packages/kafka/client.py", line 480, in send_produce_request
                                                                     (not fail_on_error or not self._raise_on_response_error(resp))]
                                                                   File "/usr/lib/python2.7/dist-packages/kafka/client.py", line 247, in _raise_on_response_error
                                                                     raise resp
                                                                 FailedPayloadsError
Jul 26 07:09:44 kafka2002 eventlogging-service-eventbus[107455]: (MainThread) 500 POST /v1/events (10.192.16.169) 3.15ms
Jul 26 07:11:41 kafka2002 eventlogging-service-eventbus[107460]: (MainThread) Writer kafka:///kafka2002.codfw.wmnet:9092,kafka2001.codfw.wmnet:9092?async=False&topic=codfw.{meta[topic]} has stopped: .  Attempting to restart.

Event Timeline

After restarting the eventbus service no issue has been registered, but I didn't try to send a fake test post to eventbus.

Got another alert now.

The problem seems to be coming from:

python-kafka: /usr/lib/python2.7/dist-packages/kafka/client.py

Hmmmmm.....I'm not sure if this is new, but maybe just newly noticed?

Yesterday Petr noticed this too. I went and looked at eqiad production eventlogging-service-eventbus logs, and saw that this error happens maybe about once a day, but for the most part things seem fine. I created https://phabricator.wikimedia.org/T141285 to fix.

If this is happening more regularly now in codfw with test events, I'm inclined to fix first. The bug in kafka-python that causes this has been fixed, but there may be some other problems in newer versions. I'll investigate. We'll hold off upgrading main-eqiad to Kafka 0.9 until we have more information.