May 31 14:24:47 kafka1002 eventlogging-service-eventbus[17492]: (MainThread) Failed sending event <Event 5d5a6732-64de-11e8-8a60-141877614b03 of schema (u'mediawiki/job', 1)>. IllegalStateError: Producer is closed forcefully. Traceback (most recent call last): File "/srv/deployment/eventlogging/eventbus/eventlogging/service.py", line 263, in handle_events yield self.process_event(event) File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 1015, in run value = future.result() File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 237, in result raise_exc_info(self._exc_info) File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 1021, in run yielded = self.gen.throw(*exc_info) File "/srv/deployment/eventlogging/eventbus/eventlogging/service.py", line 242, in process_event yield self.send(event) File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 1015, in run value = future.result() File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 237, in result raise_exc_info(self._exc_info) File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 792, in callback result_list.append(f.result()) File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 237, in result raise_exc_info(self._exc_info) File "<string>", line 3, in raise_exc_info IllegalStateError: IllegalStateError: Producer is closed forcefully. May 31 14:24:47 kafka1002 eventlogging-service-eventbus[17492]: (MainThread) Failed processing event: Failed sending event <Event 5d5a6732-64de-11e8-8a60-141877614b03 of schema (u'mediawiki/job', 1)>. IllegalStateError: Producer is closed forcefully.
Description
Details
Subject | Repo | Branch | Lines +/- | |
---|---|---|---|---|
Make Kafka + eventlogging-service edge cases more resilient | eventlogging | master | +44 -19 |
Event Timeline
I think we should raise the priority of this issue. Not being able to deliver messages (and not retrying to do so) should be considered a serious operational problem, IMHO.
Agree! Ah the 'priority' here was kind of set accidentally to low. Phab will change the priority sometimes when we move it around in columns. I'll try to look at this today.
Change 436904 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[eventlogging@master] Make Kafka + eventlogging-service edge cases more resilient
Change 436904 merged by Ottomata:
[eventlogging@master] Make Kafka + eventlogging-service edge cases more resilient
Mentioned in SAL (#wikimedia-operations) [2018-06-04T18:32:38Z] <otto@deploy1001> Started deploy [eventlogging/eventbus@3a5c395]: T196077
Mentioned in SAL (#wikimedia-operations) [2018-06-04T18:35:16Z] <ottomata> deploying eventlogging-service-eventbus for T196077
Mentioned in SAL (#wikimedia-operations) [2018-06-04T18:36:45Z] <otto@deploy1001> Finished deploy [eventlogging/eventbus@3a5c395]: T196077 (duration: 04m 07s)
Mentioned in SAL (#wikimedia-operations) [2018-06-04T18:44:48Z] <ottomata> bouncing kafka on kafka2003 to test T196077
Mentioned in SAL (#wikimedia-operations) [2018-06-04T19:00:32Z] <ottomata> bouncing kafka2003 again to test T196077 with python-kafka 1.4.3
Mentioned in SAL (#wikimedia-operations) [2018-06-04T19:04:45Z] <otto@deploy1001> Started restart [eventlogging/eventbus@3a5c395]: bouncing eventbus after upgrading to python-kafka 1.4.3 for T196077
Mentioned in SAL (#wikimedia-operations) [2018-06-04T19:06:45Z] <ottomata> bouncing kafka2003 one more time for T196077
Ok! I think this helped. Deploying my fix somehow exposed the underlying problem, which was https://github.com/dpkp/kafka-python/issues/1442. This has been fixed in kafka-python 1.4.3. I uploaded a new .deb and upgraded it. Now eventbus logs during leader elections look great!