
Upgrade python-kafka
Closed, Resolved · Public · 3 Estimated Story Points

Description

We're currently running the latest version of python-kafka packaged for Debian, 1.4.3, which we seem to have backported to Stretch ourselves. We appear to have hit a serious deadlock bug in that version of python-kafka on the coal service: https://github.com/dpkp/kafka-python/issues/1590 It's possible that the other services we run that rely on that library are vulnerable to the same bug, if they are using consumer groups.

Looking at Puppet, it seems like the following services rely on that library and will need to be watched during the upgrade:

  • coal
  • navtiming
  • statsv
  • eventlogging
  • hadoop
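
For context on which of these might be affected: as the description notes, only consumers that use consumer groups should be exposed to the deadlock in dpkp/kafka-python#1590. A rough, illustrative sketch of the two usage patterns (the broker address is made up; the topic is one of ours, and coal_eqiad is the consumer group coal uses):

from kafka import KafkaConsumer, TopicPartition

# Potentially affected: joins a consumer group, so the group
# coordinator/heartbeat machinery is involved.
group_consumer = KafkaConsumer(
    'eventlogging_NavigationTiming',
    bootstrap_servers='localhost:9092',
    group_id='coal_eqiad',
)

# Not affected by that particular bug: no consumer group, partitions
# assigned manually, no coordinator involved.
standalone_consumer = KafkaConsumer(
    bootstrap_servers='localhost:9092',
    group_id=None,
    enable_auto_commit=False,
)
standalone_consumer.assign([TopicPartition('eventlogging_NavigationTiming', 0)])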

Event Timeline

Gilles triaged this task as Medium priority.
Gilles created this task.
Gilles added a project: SRE.
Gilles added a subscriber: jijiki.

We can probably simply backport https://github.com/dpkp/kafka-python/pull/1628/commits/f12d4978e06c191871e092c190c2a34977f0c8bd on top of our 1.4.3 package

If we confirm that this fixes it, and given that 1.4.3 is in buster, we should also report it to Debian; there's no pre-existing bug in the Debian BTS.

Why not just upgrade to a newer version? Surely it's not the only bugfix in the past year.

I've packaged the latest stable (1.4.6) without issue on WMCS (stretch.thumbor.eqiad.wmflabs:/home/gilles/ is where the files are).

Assigning this to Andrew, who seems to be the person usually updating that package.

Built 1.4.6 .debs and added them to https://apt.wikimedia.org/wikimedia/pool/main/p/python-kafka/. @Gilles, can you do the upgrade of the webperf stuff? If that looks OK, we'll proceed with eventlogging.

Thanks!

It seems to be breaking navtiming (coal is fine, though):

Apr 30 07:43:37 webperf1001 python[5681]: 2019-04-30 07:43:37,515 [INFO] (run:614) Subscribing to topics: [u'eventlogging_PaintTiming', u'eventlogging_QuickSurveysResponses', u'eventlogging_NavigationTiming', u'eventlogging_SaveTiming']
Apr 30 07:43:37 webperf1001 python[5681]: 2019-04-30 07:43:37,516 [ERROR] (run:646) Unhandled exception in main loop, restarting consumer
Apr 30 07:43:37 webperf1001 python[5681]: Traceback (most recent call last):
Apr 30 07:43:37 webperf1001 python[5681]:   File "/srv/deployment/performance/navtiming-cache/revs/c2b2d1491c07e1af107c53e65b9863ea39adc641/navtiming/__init__.py", line 620, in run
Apr 30 07:43:37 webperf1001 python[5681]:     for p in partitions:
Apr 30 07:43:37 webperf1001 python[5681]: TypeError: 'NoneType' object is not iterable

But I can't seem to install the previous version of the package anymore; did you remove it from the Debian repo?

gilles@webperf1001:~$ sudo apt-get install python-kafka=1.4.3-1~stretch1
Reading package lists... Done
Building dependency tree       
Reading state information... Done
E: Version '1.4.3-1~stretch1' for 'python-kafka' was not found

Change 507253 had a related patch set uploaded (by Gilles; owner: Gilles):
[performance/navtiming@master] Add more logging around startup

https://gerrit.wikimedia.org/r/507253

Change 507253 merged by jenkins-bot:
[performance/navtiming@master] Add more logging around startup

https://gerrit.wikimedia.org/r/507253

Mentioned in SAL (#wikimedia-operations) [2019-04-30T07:53:34Z] <gilles@deploy1001> Started deploy [performance/navtiming@e900152]: T221848 add more logging around startup

Mentioned in SAL (#wikimedia-operations) [2019-04-30T07:53:39Z] <gilles@deploy1001> Finished deploy [performance/navtiming@e900152]: T221848 add more logging around startup (duration: 00m 05s)

Change 507255 had a related patch set uploaded (by Gilles; owner: Gilles):
[performance/navtiming@master] Default to partition 0 when no partition is found

https://gerrit.wikimedia.org/r/507255

Change 507255 merged by jenkins-bot:
[performance/navtiming@master] Default to partition 0 when no partition is found

https://gerrit.wikimedia.org/r/507255

Mentioned in SAL (#wikimedia-operations) [2019-04-30T08:11:05Z] <gilles@deploy1001> Started deploy [performance/navtiming@8f135ac]: T221848 Defalt to partition 0 when no partition is found

Mentioned in SAL (#wikimedia-operations) [2019-04-30T08:11:09Z] <gilles@deploy1001> deploy aborted: T221848 Defalt to partition 0 when no partition is found (duration: 00m 00s)

Mentioned in SAL (#wikimedia-operations) [2019-04-30T08:11:16Z] <gilles@deploy1001> Started deploy [performance/navtiming@8f135ac]: T221848 Default to partition 0 when no partition is found

Mentioned in SAL (#wikimedia-operations) [2019-04-30T08:11:21Z] <gilles@deploy1001> Finished deploy [performance/navtiming@8f135ac]: T221848 Default to partition 0 when no partition is found (duration: 00m 05s)

Fixed navtiming for now; I'll investigate further to make sure that this is a proper fix and not a hack. Right now I'm not sure whether the metadata call happened before we call partitions_for_topic.

I've tracked down the root cause of the issue: https://github.com/dpkp/kafka-python/issues/1774

For the other uses of python-kafka we have, we simply need to check whether consumer.partitions_for_topic() is called. If it is, add a call to consumer.topics() first to ensure that the metadata cache is primed. I think that in 1.4.3 our way of calling this method only worked accidentally.
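
For reference, a minimal sketch of that workaround, with an illustrative broker address and consumer group and one of the topics from the log above (in the newer python-kafka, partitions_for_topic() no longer triggers a metadata fetch itself, so on a freshly created consumer it can return None):

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    bootstrap_servers='localhost:9092',  # illustrative
    group_id='navtiming',                # illustrative
)
topic = 'eventlogging_NavigationTiming'

# topics() performs a metadata request, which primes the metadata cache that
# partitions_for_topic() reads from.
consumer.topics()

partitions = consumer.partitions_for_topic(topic)
if partitions is None:
    partitions = set()  # the topic is genuinely unknown to the brokers
for p in partitions:
    print(p)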

Change 507284 had a related patch set uploaded (by Gilles; owner: Gilles):
[performance/navtiming@master] Proper fix for partitions_for_topic in python-kafka > 1.4.4

https://gerrit.wikimedia.org/r/507284

Change 507284 merged by jenkins-bot:
[performance/navtiming@master] Proper fix for partitions_for_topic in python-kafka > 1.4.4

https://gerrit.wikimedia.org/r/507284

Mentioned in SAL (#wikimedia-operations) [2019-04-30T11:08:34Z] <gilles@deploy1001> Started deploy [performance/navtiming@d6756c0]: T221848 Proper fix for partitions_for_topic in python-kafka > 1.4.4

Mentioned in SAL (#wikimedia-operations) [2019-04-30T11:08:37Z] <gilles@deploy1001> Finished deploy [performance/navtiming@d6756c0]: T221848 Proper fix for partitions_for_topic in python-kafka > 1.4.4 (duration: 00m 05s)

@Ottomata all our services are good now, you can go ahead with upgrading EventLogging and Hadoop.

Mentioned in SAL (#wikimedia-analytics) [2019-04-30T13:25:11Z] <ottomata> restarting eventlogging processes to upgrade to python-kafka 1.4.6 - T221848

Ottomata moved this task from Ready to Deploy to Done on the Analytics-Kanban board.
Ottomata set the point value for this task to 3.

The upgrade of python-kafka to 1.4.6 on eventlog1002 coincides suspiciously well with T222941 :(

> We can probably simply backport https://github.com/dpkp/kafka-python/pull/1628/commits/f12d4978e06c191871e092c190c2a34977f0c8bd on top of our 1.4.3 package
>
> If we confirm that this fixes it, and given that 1.4.3 is in buster, we should also report it to Debian; there's no pre-existing bug in the Debian BTS.

Can we please restart from what Moritz pointed out originally? The issue I am seeing on eventlog1002 seems related to https://github.com/dpkp/kafka-python/issues/1418, which is still not fixed AFAICS. I would be inclined to build 1.4.3 with the patch that Gilles pointed out, and then roll back 1.4.6.

Mentioned in SAL (#wikimedia-analytics) [2019-05-10T14:17:44Z] <ottomata> downgrading python-kafka from 1.4.6-1~stretch1 to 1.4.3-2~wmf0 on eventlog1002 - T221848

Update: we were actually running 1.4.1-1~stretch1; I have rolled eventlogging back to it and all the instabilities went away. Judging from the changelog, 1.4.3 seems to be a broken version: a lot of changes related to the consumer code, and bugs introduced along with them.

@Gilles it seems that coal is having problems as well:

Commandline: apt-get --only-upgrade install python-kafka
Requested-By: gilles (4319)
Upgrade: python-kafka:amd64 (1.4.3-1~stretch1, 1.4.6-1~stretch1)
End-Date: 2019-04-30  07:31:13

https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-30d&to=now&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=jumbo-eqiad&var-topic=All&var-consumer_group=coal_eqiad

Shall we roll back to 1.4.1-1~stretch1 everywhere? On one host we could try to gather a gdb py-bt of what's happening and report it upstream, but since the issue is intermittent that seems difficult.

I'd like to understand this bug better before rolling back the package for coal. It's not a big deal per se if coal is a little behind on events. While there is "consumer lag" (is that the number of uncommitted events more recent than the last one committed by the client?), it doesn't seem to be growing. In practice, when you go to https://performance.wikimedia.org/ the metrics (processed by coal) look quite up to date. If there is indeed real lag, it doesn't seem to be an amount that's impactful.
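
If the "consumer lag" on that dashboard is indeed the per-partition gap between the broker's latest offset and the group's last committed offset (roughly what the parenthetical above asks), a rough way to check it with python-kafka would be something like this sketch (broker address illustrative; group and topic names taken from earlier in this task):

from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers='localhost:9092', group_id='coal_eqiad')
tp = TopicPartition('eventlogging_NavigationTiming', 0)

end_offset = consumer.end_offsets([tp])[tp]  # latest offset available on the broker
committed = consumer.committed(tp) or 0      # last offset the group committed
lag = end_offset - committed                 # events not yet committed by the group
print(lag)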

We can also try applying https://github.com/dpkp/kafka-python/pull/1628/commits/f12d4978e06c191871e092c190c2a34977f0c8bd to 1.4.1 and cross our fingers that it'll fix the deadlock we were experiencing with coal.

Closing this task. T222941 ("Eventlogging processors are frequently failing heartbeats causing consumer group rebalances") is still an issue, but for now we will ensure that we don't accidentally upgrade to 1.4.6 on the eventlogging server.