Maniphest T196077

EventBus service can drop a few messages during kafka leadership change
Closed, ResolvedPublic5 Estimated Story Points
Actions

Assigned To

Authored By

	Ottomata
	May 31 2018, 2:38 PM

Description

May 31 14:24:47 kafka1002 eventlogging-service-eventbus[17492]: (MainThread) Failed sending event <Event 5d5a6732-64de-11e8-8a60-141877614b03 of schema (u'mediawiki/job', 1)>. IllegalStateError: Producer is closed forcefully.
Traceback (most recent call last):
File "/srv/deployment/eventlogging/eventbus/eventlogging/service.py", line 263, in handle_events
    yield self.process_event(event)
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 1015, in run
    value = future.result()
  File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 237, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 1021, in run
    yielded = self.gen.throw(*exc_info)
  File "/srv/deployment/eventlogging/eventbus/eventlogging/service.py", line 242, in process_event
    yield self.send(event)
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 1015, in run
    value = future.result()
  File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 237, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 792, in callback
    result_list.append(f.result())
  File "/usr/lib/python2.7/dist-packages/tornado/concurrent.py", line 237, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
  IllegalStateError: IllegalStateError: Producer is closed forcefully.
  
May 31 14:24:47 kafka1002 eventlogging-service-eventbus[17492]: (MainThread) Failed processing event: Failed sending event <Event 5d5a6732-64de-11e8-8a60-141877614b03 of schema (u'mediawiki/job', 1)>. IllegalStateError: Producer is closed forcefully.

Details

	Subject	Repo	Branch	Lines +/-
	Make Kafka + eventlogging-service edge cases more resilient	eventlogging	master	+44 -19

Customize query in gerrit

Event Timeline

Ottomata created this task.May 31 2018, 2:38 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMay 31 2018, 2:38 PM

• Pchelolo added a project: Services (watching).May 31 2018, 2:39 PM

Ottomata triaged this task as Medium priority.May 31 2018, 4:35 PM

Ottomata set the point value for this task to 5.

• Nuria lowered the priority of this task from Medium to Low.May 31 2018, 4:35 PM

• Nuria moved this task from Incoming to Operational Excellence on the Analytics board.

I think we should raise the priority of this issue. Not being able to deliver messages (and not retrying to do so) should be considered a serious operational problem, IMHO.

elukey subscribed.Jun 1 2018, 3:21 PM

Agree! Ah the 'priority' here was kind of set accidentally to low. Phab will change the priority sometimes when we move it around in columns. I'll try to look at this today.

Ottomata raised the priority of this task from Low to Medium.Jun 1 2018, 4:50 PM

Change 436904 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[eventlogging@master] Make Kafka + eventlogging-service edge cases more resilient

https://gerrit.wikimedia.org/r/436904

gerritbot added a project: Patch-For-Review.Jun 1 2018, 9:06 PM

Change 436904 merged by Ottomata:
[eventlogging@master] Make Kafka + eventlogging-service edge cases more resilient

https://gerrit.wikimedia.org/r/436904

Ottomata added a project: Analytics-Kanban.Jun 4 2018, 3:06 PM

Ottomata moved this task from Next Up to In Code Review on the Analytics-Kanban board.

Mentioned in SAL (#wikimedia-operations) [2018-06-04T18:32:38Z] <otto@deploy1001> Started deploy [eventlogging/eventbus@3a5c395]: T196077

Mentioned in SAL (#wikimedia-operations) [2018-06-04T18:35:16Z] <ottomata> deploying eventlogging-service-eventbus for T196077

Mentioned in SAL (#wikimedia-operations) [2018-06-04T18:36:45Z] <otto@deploy1001> Finished deploy [eventlogging/eventbus@3a5c395]: T196077 (duration: 04m 07s)

Mentioned in SAL (#wikimedia-operations) [2018-06-04T18:44:48Z] <ottomata> bouncing kafka on kafka2003 to test T196077

Mentioned in SAL (#wikimedia-operations) [2018-06-04T19:00:32Z] <ottomata> bouncing kafka2003 again to test T196077 with python-kafka 1.4.3

Mentioned in SAL (#wikimedia-operations) [2018-06-04T19:04:45Z] <otto@deploy1001> Started restart [eventlogging/eventbus@3a5c395]: bouncing eventbus after upgrading to python-kafka 1.4.3 for T196077

Mentioned in SAL (#wikimedia-operations) [2018-06-04T19:06:45Z] <ottomata> bouncing kafka2003 one more time for T196077

Ok! I think this helped. Deploying my fix somehow exposed the underlying problem, which was https://github.com/dpkp/kafka-python/issues/1442. This has been fixed in kafka-python 1.4.3. I uploaded a new .deb and upgraded it. Now eventbus logs during leader elections look great!

Ottomata moved this task from In Code Review to Done on the Analytics-Kanban board.Jun 4 2018, 7:11 PM

• Nuria closed this task as Resolved.Jun 25 2018, 11:15 PM

• Vvjjkkii renamed this task from EventBus service can drop a few messages during kafka leadership change to qwbaaaaaaa.Jul 1 2018, 1:06 AM

• Vvjjkkii reopened this task as Open.

• Vvjjkkii removed Ottomata as the assignee of this task.

• Vvjjkkii raised the priority of this task from Medium to High.

• Vvjjkkii added projects: CheckUser, Connected-Open-Heritage-Batch-uploads (RAÄ-KMB_1_2017-02), Tamil-Sites, Gamepress, Hashtags, Jade, KartoEditor, Language-2018-Apr-June, New-Editor-Experiences, Mail, TCB-Team (now WMDE-TechWish).

• Vvjjkkii updated the task description. (Show Details)

• Vvjjkkii removed the point value for this task.

• Vvjjkkii removed subscribers: gerritbot, Aklapper.

• Community_Tech_bot renamed this task from qwbaaaaaaa to EventBus service can drop a few messages during kafka leadership change.Jul 1 2018, 6:25 AM

• Community_Tech_bot closed this task as Resolved.

• Community_Tech_bot assigned this task to Ottomata.

• Community_Tech_bot set the point value for this task to 5.

• Community_Tech_bot updated the task description. (Show Details)

• Community_Tech_bot removed projects: TCB-Team (now WMDE-TechWish), Mail, New-Editor-Experiences, Language-2018-Apr-June, KartoEditor, Jade, Hashtags, Gamepress, Tamil-Sites, Connected-Open-Heritage-Batch-uploads (RAÄ-KMB_1_2017-02), CheckUser.

• Community_Tech_bot added subscribers: gerritbot, Aklapper.

CommunityTechBot lowered the priority of this task from High to Medium.Jul 3 2018, 3:23 AM

EventBus service can drop a few messages during kafka leadership changeClosed, ResolvedPublic5 Estimated Story PointsActions

Description

Details

Event Timeline

EventBus service can drop a few messages during kafka leadership change
Closed, ResolvedPublic5 Estimated Story Points
Actions