
Eventlogging processors are frequently failing heartbeats causing consumer group rebalances
Closed, Resolved · Public · 8 Story Points

Description

It seems that since 2019-04-30 at ~13:40 UTC the Eventlogging processors have been failing consumer group heartbeats with the Kafka broker acting as coordinator of the consumer group, causing them to lag periodically:

https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-30d&to=now&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=jumbo-eqiad&var-topic=All&var-consumer_group=eventlogging_consumer_client_side_events_log_00

Event Timeline

elukey triaged this task as High priority. May 10 2019, 9:17 AM
elukey created this task.
Restricted Application added a subscriber: Aklapper. May 10 2019, 9:17 AM

I tried with the following changes:

https://gerrit.wikimedia.org/r/#/c/509341/
https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/509344/
https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/509346/

The idea was to see whether tuning kafka-python's parameters (assuming I got the Puppet code right) could alleviate the problem, but it did not help. My theory is/was that something changed in the time the processors take to poll() from Kafka, causing the consumers in the group to periodically fail health checks with the broker and in turn triggering rebalances. kafka-python 1.4.6 is supposed to run a separate thread for heartbeats, but due to Python's GIL I suspect it may be delayed when a long poll() happens.
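
For reference, a minimal sketch (not the actual eventlogging handler code; topic and broker names are illustrative) of the kind of kafka-python consumer tuning attempted in the patches above:

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'eventlogging-client-side',                        # illustrative topic name
    bootstrap_servers='kafka-jumbo1001.eqiad.wmnet:9092',
    group_id='eventlogging_processor_client_side_00',
    # Give the consumer more time between poll() calls before the broker
    # considers it dead and triggers a group rebalance.
    max_poll_interval_ms=600000,
    # Return smaller batches so each iteration of the poll loop finishes faster.
    max_poll_records=100,
    # Heartbeats are sent from a background thread in kafka-python >= 1.4,
    # but a long poll() holding the GIL in the main thread can still delay them.
    session_timeout_ms=30000,
    heartbeat_interval_ms=3000,
)

def process(value):
    # Placeholder for the per-event work done by eventlogging-processor.
    pass

for message in consumer:
    process(message.value)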

SAL entries for 2019-04-30:

https://tools.wmflabs.org/sal/production?p=0&q=&d=2019-04-30
https://tools.wmflabs.org/sal/analytics?d=2019-04-30

2019-04-30
13:25	<ottomata>	restarting eventlogging processes to upgrade to python-kafka 1.4.6 - T221848

Perfect match: the python-kafka 1.4.6 upgrade is likely the root cause.

https://github.com/dpkp/kafka-python/issues/1418 seems related; some workarounds are listed there.

Rebuilt python-kafka_1.4.3-1_all.deb and uploaded it to eventlog1002 in case we decide to roll back.

Change 509403 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::eventlogging::analytics::server: remove unneeded max_poll_records

https://gerrit.wikimedia.org/r/509403

Change 509403 merged by Elukey:
[operations/puppet@production] profile::eventlogging::analytics::server: remove unneeded max_poll_records

https://gerrit.wikimedia.org/r/509403

elukey lowered the priority of this task from High to Normal. May 10 2019, 3:34 PM

Andrew deployed 1.4.3 and we are back to stable.

Mentioned in SAL (#wikimedia-operations) [2019-05-11T06:37:52Z] <elukey> restart eventlogging on eventlog1002 - huge kafka consumer lag accumulated (T222941)

A big lag accumulated during the EU nighttime; after the EL restart it seems to be working fine. The errors were all like:

May 11 03:21:00 eventlog1002 eventlogging-processor@client-side-00[18305]: 2019-05-11 03:21:00,349 [18305] (eventlogging_processor_client_side_00-heartbeat) kafka.coordinator [WARNING] Marking the coordinator dead (node 1001) for group eventlogging_processor_client_side_00: Heartbeat session expired.
May 11 03:21:00 eventlog1002 eventlogging-processor@client-side-00[18305]: 2019-05-11 03:21:00,386 [18305] (MainThread) kafka.coordinator.consumer [WARNING] Auto offset commit failed for group eventlogging_processor_client_side_00: CommitFailedError: Commit cannot be completed since the group has already
May 11 03:21:00 eventlog1002 eventlogging-processor@client-side-00[18305]:             rebalanced and assigned the partitions to another member.
May 11 03:21:00 eventlog1002 eventlogging-processor@client-side-00[18305]:             This means that the time between subsequent calls to poll()
May 11 03:21:00 eventlog1002 eventlogging-processor@client-side-00[18305]:             was longer than the configured max_poll_interval_ms, which
May 11 03:21:00 eventlog1002 eventlogging-processor@client-side-00[18305]:             typically implies that the poll loop is spending too much
May 11 03:21:00 eventlog1002 eventlogging-processor@client-side-00[18305]:             time message processing. You can address this either by
May 11 03:21:00 eventlog1002 eventlogging-processor@client-side-00[18305]:             increasing the rebalance timeout with max_poll_interval_ms,
May 11 03:21:00 eventlog1002 eventlogging-processor@client-side-00[18305]:             or by reducing the maximum size of batches returned in poll()
May 11 03:21:00 eventlog1002 eventlogging-processor@client-side-00[18305]:             with max_poll_records.
May 11 03:21:00 eventlog1002 eventlogging-processor@client-side-00[18305]:
May 11 03:21:00 eventlog1002 eventlogging-processor@client-side-00[18305]: 2019-05-11 03:21:00,421 [18305] (MainThread) kafka.cluster [INFO] Group coordinator for eventlogging_processor_client_side_00 is BrokerMetadata(nodeId=1001, host=u'kafka-jumbo1001.eqiad.wmnet', port=9092, rack=None)
May 11 03:21:00 eventlog1002 eventlogging-processor@client-side-00[18305]: 2019-05-11 03:21:00,422 [18305] (MainThread) kafka.coordinator [INFO] Discovered coordinator 1001 for group eventlogging_processor_client_side_00

But this time all the processors got stuck in the above state and stopped processing. It already happened once IIRC, so the 1.4.3 version that we rolled back to may not be working as expected either.
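
For what it's worth, the mitigation the error message suggests (smaller batches, keeping the gap between poll() calls short) would look roughly like the sketch below. This is illustrative only, not the actual eventlogging-processor code, which consumes via the iterator interface:

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'eventlogging-client-side',                        # illustrative topic name
    bootstrap_servers='kafka-jumbo1001.eqiad.wmnet:9092',
    group_id='eventlogging_processor_client_side_00',
    enable_auto_commit=False,
    max_poll_records=50,
)

def handle(value):
    # Placeholder for the per-event processing.
    pass

while True:
    # poll() returns {TopicPartition: [records]} (an empty dict on timeout);
    # calling it frequently keeps the gap between polls well under
    # max_poll_interval_ms even when individual events are slow to process.
    batch = consumer.poll(timeout_ms=1000)
    for records in batch.values():
        for record in records:
            handle(record.value)
    # Commit only after the (small) batch has been fully processed.
    consumer.commit()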

Change 509566 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::prometheus::alerts: add EL processors kafka consumer lag alert

https://gerrit.wikimedia.org/r/509566

Change 509566 merged by Elukey:
[operations/puppet@production] profile::prometheus::alerts: add EL processors kafka consumer lag alert

https://gerrit.wikimedia.org/r/509566

Change 509567 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::prometheus::alerts: tune EL kafka consumer lag

https://gerrit.wikimedia.org/r/509567

Change 509567 merged by Elukey:
[operations/puppet@production] profile::prometheus::alerts: tune EL kafka consumer lag

https://gerrit.wikimedia.org/r/509567

Change 509568 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::prometheus::alerts: fix dashboard URL

https://gerrit.wikimedia.org/r/509568

Change 509568 merged by Elukey:
[operations/puppet@production] profile::prometheus::alerts: fix dashboard URL

https://gerrit.wikimedia.org/r/509568

elukey added a comment (edited). May 12 2019, 12:13 PM

All processors are stuck again; for reference, the py-bt output of one processor:

(gdb) thread apply all py-bt

Thread 18 (Thread 0x7f119cff9700 (LWP 19210)):
Traceback (most recent call first):
  Waiting for the GIL
  File "/usr/lib/python2.7/threading.py", line 174, in acquire
    rc = self.__block.acquire(blocking)
  File "/usr/lib/python2.7/threading.py", line 286, in __enter__
    return self.__lock.__enter__()
  File "/usr/lib/python2.7/dist-packages/kafka/coordinator/base.py", line 276, in lookup_coordinator
    with self._client._lock, self._lock:
  File "/usr/lib/python2.7/dist-packages/kafka/coordinator/base.py", line 952, in _run_once
    future = self.coordinator.lookup_coordinator()
  File "/usr/lib/python2.7/dist-packages/kafka/coordinator/base.py", line 914, in run
    self._run_once()
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
    self.__bootstrap_inner()

Thread 17 (Thread 0x7f119d7fa700 (LWP 19129)):
Traceback (most recent call first):

Thread 16 (Thread 0x7f119dffb700 (LWP 19125)):
Traceback (most recent call first):

Thread 15 (Thread 0x7f119e7fc700 (LWP 19123)):
Traceback (most recent call first):

Thread 14 (Thread 0x7f119effd700 (LWP 19121)):
Traceback (most recent call first):

Thread 13 (Thread 0x7f119f7fe700 (LWP 19118)):
Traceback (most recent call first):

Thread 12 (Thread 0x7f119ffff700 (LWP 19117)):
Traceback (most recent call first):

Thread 11 (Thread 0x7f11a8ff9700 (LWP 19116)):
Traceback (most recent call first):

Thread 10 (Thread 0x7f11a97fa700 (LWP 19114)):
Traceback (most recent call first):

Thread 9 (Thread 0x7f11a9ffb700 (LWP 19056)):
Traceback (most recent call first):

Thread 8 (Thread 0x7f11aa7fc700 (LWP 19044)):
Traceback (most recent call first):

Thread 7 (Thread 0x7f11aaffd700 (LWP 19041)):
Traceback (most recent call first):

Thread 6 (Thread 0x7f11ab7fe700 (LWP 19036)):
Traceback (most recent call first):

Thread 5 (Thread 0x7f11abfff700 (LWP 19032)):
Traceback (most recent call first):

Thread 4 (Thread 0x7f11b0ca4700 (LWP 19031)):
Traceback (most recent call first):

Thread 3 (Thread 0x7f11b14a5700 (LWP 19025)):
Traceback (most recent call first):

Thread 2 (Thread 0x7f11b1ca6700 (LWP 19023)):
Traceback (most recent call first):

Thread 1 (Thread 0x7f11b82e2700 (LWP 18956)):
Traceback (most recent call first):
  Waiting for the GIL
  File "/usr/lib/python2.7/threading.py", line 174, in acquire
    rc = self.__block.acquire(blocking)
  File "/usr/lib/python2.7/dist-packages/kafka/coordinator/base.py", line 246, in ensure_coordinator_ready
    with self._client._lock, self._lock:
  File "/usr/lib/python2.7/dist-packages/kafka/coordinator/base.py", line 355, in ensure_active_group
    self.ensure_coordinator_ready()
  File "/usr/lib/python2.7/dist-packages/kafka/coordinator/consumer.py", line 281, in poll
    self.ensure_active_group()
  File "/usr/lib/python2.7/dist-packages/kafka/consumer/group.py", line 1052, in _message_generator
    self._coordinator.poll()
  File "/usr/lib/python2.7/dist-packages/kafka/consumer/group.py", line 1118, in __next__
    return next(self._iterator)
  File "/usr/lib/python2.7/dist-packages/kafka/vendor/six.py", line 561, in next
    return type(self).__next__(self)
  File "/srv/deployment/eventlogging/analytics/eventlogging/handlers.py", line 884, in <genexpr>
    return stream((message.value for message in kafka_consumer), raw)
  File "/srv/deployment/eventlogging/analytics/eventlogging/streams.py", line 94, in <genexpr>
    return (line.decode('utf-8', 'replace') for line in stream)
  File "/srv/deployment/eventlogging/analytics/bin/eventlogging-processor", line 131, in <module>
    for raw_event in get_reader(args.input):
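
Both the heartbeat thread and the main thread are blocked acquiring locks in kafka/coordinator/base.py, which looks like the kind of lock-ordering deadlock discussed in the upstream issue. A minimal illustration of that failure mode follows (not kafka-python's actual code; lock names are just for the demo):

import threading
import time

client_lock = threading.Lock()       # stands in for the client-level lock
coordinator_lock = threading.Lock()  # stands in for the coordinator-level lock

def heartbeat_thread():
    # One code path takes the coordinator lock first, then the client lock...
    with coordinator_lock:
        time.sleep(0.1)  # widen the race window for the demo
        with client_lock:
            pass

def main_thread_poll():
    # ...while another takes them in the opposite order.
    with client_lock:
        time.sleep(0.1)
        with coordinator_lock:
            pass

t1 = threading.Thread(target=heartbeat_thread)
t2 = threading.Thread(target=main_thread_poll)
t1.daemon = t2.daemon = True
t1.start(); t2.start()
t1.join(timeout=2); t2.join(timeout=2)
# If both threads are still alive, each is waiting for the lock the other holds.
print('deadlocked:', t1.is_alive() and t2.is_alive())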

Mentioned in SAL (#wikimedia-operations) [2019-05-12T12:14:19Z] <elukey> restart eventlogging on eventlog1002 - all processors stuck due to kafka python (T222941)

@Ottomata sorry, but 1.4.3 was not the right version to roll back to :(

Start-Date: 2019-04-30  13:24:25
Commandline: apt-get install python-kafka
Requested-By: otto (2129)
Upgrade: python-kafka:amd64 (1.4.1-1~stretch1, 1.4.6-1~stretch1)
End-Date: 2019-04-30  13:24:29

The 1.4.1-1~stretch1 package is still on boron; I'd just roll back to it and see how it goes.

Mentioned in SAL (#wikimedia-operations) [2019-05-12T15:32:43Z] <elukey> rollback python-kafka one eventlog1002 to 1.4.1-1~stretch1 - T222941

Nuria moved this task from Next Up to Paused on the Analytics-Kanban board. May 14 2019, 7:41 PM

Change 536300 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Ensure python-kafka at 1.4.1 on eventlogging server

https://gerrit.wikimedia.org/r/536300

Status: 1.4.1 is on eventlog1002, 1.4.6 is in apt and used by coal (IIUC).

We aren't going to solve https://github.com/dpkp/kafka-python/issues/1418 now, so to keep eventlogging from accidentally upgrading we are pinning the version to 1.4.1-1~stretch1 in Puppet. This is a hacky workaround, but we don't have time to work on this right now, and the bug may get fixed upstream in the future, so we will wait.

Change 536300 merged by Ottomata:
[operations/puppet@production] Ensure python-kafka at 1.4.1 on eventlogging server

https://gerrit.wikimedia.org/r/536300

Change 536557 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] eventlogging:dependencies: allow to specify python-kafka version

https://gerrit.wikimedia.org/r/536557

Change 536557 merged by Elukey:
[operations/puppet@production] eventlogging:dependencies: allow to specify python-kafka version

https://gerrit.wikimedia.org/r/536557

Change 536561 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] eventlogging: move comment about python-kafka version

https://gerrit.wikimedia.org/r/536561

Change 536561 merged by Elukey:
[operations/puppet@production] eventlogging: move comment about python-kafka version

https://gerrit.wikimedia.org/r/536561

Is puppet on deployment-eventlog05.deployment-prep.eqiad.wmflabs expected to be failing with errors about E: Version '1.4.1-1~stretch1' for 'python-kafka' was not found? I notice that version is hard-coded in puppet but absent from the apt repositories in use on that instance.

Should be fixed now; I just manually installed the package on the VM. The pin is currently a hack to avoid issues if eventlog1002 ever needs to be reimaged or re-created elsewhere: we cannot upgrade python-kafka there yet, since when we tried it showed the problem described in https://github.com/dpkp/kafka-python/issues/1418.

elukey added a comment. Thu, Oct 3, 8:58 AM

It seems that 1.4.7 is out: https://github.com/dpkp/kafka-python/issues/1418#issuecomment-536756917

A user reported that it didn't fix the issue, but we could package/deploy it and report back.

elukey added a comment. Fri, Oct 4, 6:31 AM

I just packaged and deployed 1.4.7 to deployment-eventlog05. Let's see if it works fine there, and if so possibly upgrade eventlog1002.

Change 540915 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::eventlogging::analytics::server: allow to specify python-kafka version

https://gerrit.wikimedia.org/r/540915

Change 540915 merged by Elukey:
[operations/puppet@production] profile::eventlogging::analytics::server: allow to specify python-kafka version

https://gerrit.wikimedia.org/r/540915

Mentioned in SAL (#wikimedia-operations) [2019-10-07T06:08:18Z] <elukey> upgrade python-kafka on eventlog1002 to 1.4.7-1 (manually via dpkg -i) - T222941

elukey claimed this task. Mon, Oct 7, 6:19 AM
elukey edited projects, added Analytics-Kanban; removed Patch-For-Review.
elukey moved this task from Paused to In Progress on the Analytics-Kanban board.

Change 541196 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] eventlogging::analytics: remove specific python-kafka constraints

https://gerrit.wikimedia.org/r/541196

Change 541196 merged by Elukey:
[operations/puppet@production] eventlogging::analytics: remove specific python-kafka constraints

https://gerrit.wikimedia.org/r/541196

Mentioned in SAL (#wikimedia-operations) [2019-10-07T12:54:30Z] <elukey> upload python-kafka and python3-kafka 1.4.7-1 to stretch-wikimedia - T222941

Mentioned in SAL (#wikimedia-operations) [2019-10-07T13:13:48Z] <elukey> upload python-kafka and python3-kafka 1.4.7-1 to buster-wikimedia - T222941

elukey moved this task from Backlog to In Progress on the User-Elukey board. Mon, Oct 7, 1:22 PM
elukey added a comment. Tue, Oct 8, 5:02 AM

All good after 24 hours; it seems that 1.4.7 is the right version!

elukey changed the point value for this task from 0 to 8. Tue, Oct 8, 5:02 AM
elukey moved this task from In Progress to Done on the Analytics-Kanban board.
elukey moved this task from In Progress to Done on the User-Elukey board. Wed, Oct 9, 9:33 AM
Nuria closed this task as Resolved. Wed, Oct 9, 3:48 PM