May 31 14:24:47 kafka1002 eventlogging-service-eventbus[17492]: (MainThread) Failed sending event <Event 5d5a6732-64de-11e8-8a60-141877614b03 of schema (u'mediawiki/job', 1)>. IllegalStateError: Producer is closed forcefully.
Traceback (most recent call last):
File "/srv/deployment/eventlogging/eventbus/eventlogging/service.py", line 263, in handle_events
yield self.process_event(event)
File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 1015, in run
value = future.result()
File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 237, in result
raise_exc_info(self._exc_info)
File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 1021, in run
yielded = self.gen.throw(*exc_info)
File "/srv/deployment/eventlogging/eventbus/eventlogging/service.py", line 242, in process_event
yield self.send(event)
File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 1015, in run
value = future.result()
File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 237, in result
raise_exc_info(self._exc_info)
File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 792, in callback
result_list.append(f.result())
File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 237, in result
raise_exc_info(self._exc_info)
File "<string>", line 3, in raise_exc_info
IllegalStateError: IllegalStateError: Producer is closed forcefully.
May 31 14:24:47 kafka1002 eventlogging-service-eventbus[17492]: (MainThread) Failed processing event: Failed sending event <Event 5d5a6732-64de-11e8-8a60-141877614b03 of schema (u'mediawiki/job', 1)>. IllegalStateError: Producer is closed forcefully.Description
Details
| Subject | Repo | Branch | Lines +/- | |
|---|---|---|---|---|
| Make Kafka + eventlogging-service edge cases more resilient | eventlogging | master | +44 -19 |
Event Timeline
I think we should raise the priority of this issue. Not being able to deliver messages (and not retrying to do so) should be considered a serious operational problem, IMHO.
Agree! Ah the 'priority' here was kind of set accidentally to low. Phab will change the priority sometimes when we move it around in columns. I'll try to look at this today.
Change 436904 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[eventlogging@master] Make Kafka + eventlogging-service edge cases more resilient
Change 436904 merged by Ottomata:
[eventlogging@master] Make Kafka + eventlogging-service edge cases more resilient
Mentioned in SAL (#wikimedia-operations) [2018-06-04T18:32:38Z] <otto@deploy1001> Started deploy [eventlogging/eventbus@3a5c395]: T196077
Mentioned in SAL (#wikimedia-operations) [2018-06-04T18:35:16Z] <ottomata> deploying eventlogging-service-eventbus for T196077
Mentioned in SAL (#wikimedia-operations) [2018-06-04T18:36:45Z] <otto@deploy1001> Finished deploy [eventlogging/eventbus@3a5c395]: T196077 (duration: 04m 07s)
Mentioned in SAL (#wikimedia-operations) [2018-06-04T18:44:48Z] <ottomata> bouncing kafka on kafka2003 to test T196077
Mentioned in SAL (#wikimedia-operations) [2018-06-04T19:00:32Z] <ottomata> bouncing kafka2003 again to test T196077 with python-kafka 1.4.3
Mentioned in SAL (#wikimedia-operations) [2018-06-04T19:04:45Z] <otto@deploy1001> Started restart [eventlogging/eventbus@3a5c395]: bouncing eventbus after upgrading to python-kafka 1.4.3 for T196077
Mentioned in SAL (#wikimedia-operations) [2018-06-04T19:06:45Z] <ottomata> bouncing kafka2003 one more time for T196077
Ok! I think this helped. Deploying my fix somehow exposed the underlying problem, which was https://github.com/dpkp/kafka-python/issues/1442. This has been fixed in kafka-python 1.4.3. I uploaded a new .deb and upgraded it. Now eventbus logs during leader elections look great!