Page MenuHomePhabricator

EventBus service can drop a few messages during kafka leadership change
Closed, ResolvedPublic5 Estimated Story Points

Description

May 31 14:24:47 kafka1002 eventlogging-service-eventbus[17492]: (MainThread) Failed sending event <Event 5d5a6732-64de-11e8-8a60-141877614b03 of schema (u'mediawiki/job', 1)>. IllegalStateError: Producer is closed forcefully.
Traceback (most recent call last):
File "/srv/deployment/eventlogging/eventbus/eventlogging/service.py", line 263, in handle_events
    yield self.process_event(event)
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 1015, in run
    value = future.result()
  File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 237, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 1021, in run
    yielded = self.gen.throw(*exc_info)
  File "/srv/deployment/eventlogging/eventbus/eventlogging/service.py", line 242, in process_event
    yield self.send(event)
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 1015, in run
    value = future.result()
  File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 237, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 792, in callback
    result_list.append(f.result())
  File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 237, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
  IllegalStateError: IllegalStateError: Producer is closed forcefully.
  
May 31 14:24:47 kafka1002 eventlogging-service-eventbus[17492]: (MainThread) Failed processing event: Failed sending event <Event 5d5a6732-64de-11e8-8a60-141877614b03 of schema (u'mediawiki/job', 1)>. IllegalStateError: Producer is closed forcefully.

Event Timeline

Ottomata triaged this task as Medium priority.May 31 2018, 4:35 PM
Ottomata set the point value for this task to 5.
Nuria lowered the priority of this task from Medium to Low.May 31 2018, 4:35 PM
Nuria moved this task from Incoming to Operational Excellence on the Analytics board.

I think we should raise the priority of this issue. Not being able to deliver messages (and not retrying to do so) should be considered a serious operational problem, IMHO.

Agree! Ah the 'priority' here was kind of set accidentally to low. Phab will change the priority sometimes when we move it around in columns. I'll try to look at this today.

Ottomata raised the priority of this task from Low to Medium.Jun 1 2018, 4:50 PM

Change 436904 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[eventlogging@master] Make Kafka + eventlogging-service edge cases more resilient

https://gerrit.wikimedia.org/r/436904

Change 436904 merged by Ottomata:
[eventlogging@master] Make Kafka + eventlogging-service edge cases more resilient

https://gerrit.wikimedia.org/r/436904

Mentioned in SAL (#wikimedia-operations) [2018-06-04T18:32:38Z] <otto@deploy1001> Started deploy [eventlogging/eventbus@3a5c395]: T196077

Mentioned in SAL (#wikimedia-operations) [2018-06-04T18:35:16Z] <ottomata> deploying eventlogging-service-eventbus for T196077

Mentioned in SAL (#wikimedia-operations) [2018-06-04T18:36:45Z] <otto@deploy1001> Finished deploy [eventlogging/eventbus@3a5c395]: T196077 (duration: 04m 07s)

Mentioned in SAL (#wikimedia-operations) [2018-06-04T18:44:48Z] <ottomata> bouncing kafka on kafka2003 to test T196077

Mentioned in SAL (#wikimedia-operations) [2018-06-04T19:00:32Z] <ottomata> bouncing kafka2003 again to test T196077 with python-kafka 1.4.3

Mentioned in SAL (#wikimedia-operations) [2018-06-04T19:04:45Z] <otto@deploy1001> Started restart [eventlogging/eventbus@3a5c395]: bouncing eventbus after upgrading to python-kafka 1.4.3 for T196077

Mentioned in SAL (#wikimedia-operations) [2018-06-04T19:06:45Z] <ottomata> bouncing kafka2003 one more time for T196077

Ok! I think this helped. Deploying my fix somehow exposed the underlying problem, which was https://github.com/dpkp/kafka-python/issues/1442. This has been fixed in kafka-python 1.4.3. I uploaded a new .deb and upgraded it. Now eventbus logs during leader elections look great!

Vvjjkkii renamed this task from EventBus service can drop a few messages during kafka leadership change to qwbaaaaaaa.Jul 1 2018, 1:06 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed Ottomata as the assignee of this task.
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed the point value for this task.
Vvjjkkii removed subscribers: gerritbot, Aklapper.
Community_Tech_bot renamed this task from qwbaaaaaaa to EventBus service can drop a few messages during kafka leadership change.Jul 1 2018, 6:25 AM
Community_Tech_bot closed this task as Resolved.
Community_Tech_bot assigned this task to Ottomata.
Community_Tech_bot set the point value for this task to 5.
Community_Tech_bot updated the task description. (Show Details)
CommunityTechBot lowered the priority of this task from High to Medium.Jul 3 2018, 3:23 AM