Page MenuHomePhabricator

Change or upgrade eventlogging kafka client used for consumption
Closed, ResolvedPublic21 Estimated Story Points

Description

15:53  <elukey> !log restarted kafka on kafka1018.eqiad.wmnet for Java upgrades
[...]
15:58  <icinga-wm> PROBLEM - Check status of defined EventLogging jobs on eventlog1001 is CRITICAL: CRITICAL: Stopped EventLogging jobs: forwarder/legacy-zmq

On eventlogging1001:

elukey@eventlog1001:~$ dpkg -l python-pykafka
ii  python-pykafka                          2.2.0-2~trusty1          amd64                    Full-Featured Pure-Python Kafka Client``

Logs saved in /home/elukey/eventlogging_forwarder-legacy-zmq.log, but I can see a lot of:

2016-04-27 13:53:09,251 (MainThread) RequestHandler.stop: about to flush requests queue
2016-04-27 13:53:09,251 (MainThread) Discovered 6 brokers
Traceback (most recent call last):
  File "/srv/deployment/eventlogging/analytics/bin/eventlogging-forwarder", line 54, in <module>
    input_stream = get_reader(args.input)
  File "/srv/deployment/eventlogging/analytics/eventlogging/factory.py", line 122, in get_reader
    iterator = handle(_readers, uri)
  File "/srv/deployment/eventlogging/analytics/eventlogging/factory.py", line 89, in handle
    return apply_safe(handler, kwargs)
  File "/srv/deployment/eventlogging/analytics/eventlogging/factory.py", line 73, in apply_safe
    result = f(*args, **kwargs)
  File "/srv/deployment/eventlogging/analytics/eventlogging/handlers.py", line 492, in kafka_reader
    kafka_client = PyKafkaClient(hosts=brokers)
  File "/usr/lib/python2.7/dist-packages/pykafka/client.py", line 88, in __init__
    zookeeper_hosts=zookeeper_hosts
  File "/usr/lib/python2.7/dist-packages/pykafka/cluster.py", line 194, in __init__
    self.update()
  File "/usr/lib/python2.7/dist-packages/pykafka/cluster.py", line 418, in update
    self._update_brokers(metadata.brokers)
  File "/usr/lib/python2.7/dist-packages/pykafka/cluster.py", line 343, in _update_brokers
    source_port=self._source_port
  File "/usr/lib/python2.7/dist-packages/pykafka/broker.py", line 140, in from_metadata
    source_port=source_port)
  File "/usr/lib/python2.7/dist-packages/pykafka/broker.py", line 92, in __init__
    self.connect()
  File "/usr/lib/python2.7/dist-packages/pykafka/broker.py", line 200, in connect
    self._connection.connect(self._socket_timeout_ms)
  File "/usr/lib/python2.7/dist-packages/pykafka/connection.py", line 82, in connect
    (self.source_host, self.source_port)
  File "/usr/lib/python2.7/socket.py", line 571, in create_connection
    raise err
socket.error: [Errno 111] Connection refused

Related Objects

Event Timeline

Mentioned in SAL [2016-04-28T08:12:46Z] <elukey> restarting kafka on kafka{1012,1014,1022,1020,2001,2002} for Java upgrades. Will probably trigger some EventLogging alarms due to a bug (T133779)

Change 292049 had a related patch set uploaded (by Ottomata):
[PoC] Replace pykafka with latest (and snazzy!) kafka-python

https://gerrit.wikimedia.org/r/292049

Tested using kafka-python in prod today while restarting a broker. The pykafka processes died with

2016-06-01 18:17:09,165 (Thread-19 ) Updating cluster in response to error in fetch() for broker id 18                                                                                                  <=>                ]
2016-06-01 18:17:09,213 (Thread-19 ) RequestHandler.stop: about to flush requests queue
2016-06-01 18:17:09,216 (Thread-19 ) Discovered 6 brokers
2016-06-01 18:17:09,216 (Thread-19 ) Broker kafka1012.eqiad.wmnet:9092 metadata unchanged. Continuing.
2016-06-01 18:17:09,216 (Thread-19 ) Broker kafka1013.eqiad.wmnet:9092 metadata unchanged. Continuing.
2016-06-01 18:17:09,216 (Thread-19 ) Broker kafka1014.eqiad.wmnet:9092 metadata unchanged. Continuing.
2016-06-01 18:17:09,217 (Thread-19 ) Reconnecting to broker id 18: kafka1018.eqiad.wmnet:9092
2016-06-01 18:17:09,218 (Thread-19 ) Failed to re-establish connection with broker id 18: kafka1018.eqiad.wmnet:9092
2016-06-01 18:17:09,218 (Thread-19 ) Broker kafka1020.eqiad.wmnet:9092 metadata unchanged. Continuing.
2016-06-01 18:17:09,218 (Thread-19 ) Broker kafka1022.eqiad.wmnet:9092 metadata unchanged. Continuing.
2016-06-01 18:17:09,219 (Thread-19 ) Discovered 166 topics
2016-06-01 18:17:09,220 (Thread-19 ) Adding 12 partitions
2016-06-01 18:17:09,220 (Thread-19 ) Updating leader for <pykafka.partition.Partition at 0x7f2583ec88d0 (id=0)> from broker 18 to broker 12
2016-06-01 18:17:09,220 (Thread-19 ) Updating in sync replicas list for <pykafka.partition.Partition at 0x7f2583ec88d0 (id=0)>
2016-06-01 18:17:09,220 (Thread-19 ) Updating leader for <pykafka.partition.Partition at 0x7f2583ec8910 (id=3)> from broker 18 to broker 14
2016-06-01 18:17:09,220 (Thread-19 ) Updating in sync replicas list for <pykafka.partition.Partition at 0x7f2583ec8910 (id=3)>
2016-06-01 18:17:09,220 (Thread-19 ) Updating in sync replicas list for <pykafka.partition.Partition at 0x7f2583ec87d0 (id=5)>
2016-06-01 18:17:09,221 (Thread-19 ) Updating in sync replicas list for <pykafka.partition.Partition at 0x7f2583ec8ed0 (id=8)>
2016-06-01 18:17:09,221 (Thread-19 ) Updating in sync replicas list for <pykafka.partition.Partition at 0x7f2583ebb190 (id=10)>
2016-06-01 18:17:09,221 (Thread-19 ) Attempting to discover offset manager for consumer group 'otto-pykafka1'
2016-06-01 18:17:09,259 (Thread-19 ) Found coordinator broker with id 13
2016-06-01 18:17:10,176 (Thread-21 ) RequestHandler worker: exiting cleanly                                                                                                                                 <=>            ]
2016-06-01 18:17:10,251 (MainThread) Exception encountered in worker thread:
  File "/usr/lib/python2.7/dist-packages/pykafka/simpleconsumer.py", line 346, in fetcher
    self.fetch()
  File "/usr/lib/python2.7/dist-packages/pykafka/simpleconsumer.py", line 711, in fetch
    success_handler=_handle_success)
  File "/usr/lib/python2.7/dist-packages/pykafka/utils/error_handlers.py", line 50, in handle_partition_responses
    error_handlers[errcode](parts)

Traceback (most recent call last):
  File "bin/eventlogging-processor", line 123, in <module>
    for raw_event in get_reader(args.input):
  File "/home/otto/eventlogging/eventlogging/streams.py", line 94, in <genexpr>
    return (line.decode('utf-8', 'replace') for line in stream)
  File "/home/otto/eventlogging/eventlogging/handlers.py", line 512, in <genexpr>
    return stream((message.value for message in message_stream(consumer)), raw)
  File "/home/otto/eventlogging/eventlogging/handlers.py", line 510, in message_stream
    yield consumer.consume()
  File "/usr/lib/python2.7/dist-packages/pykafka/balancedconsumer.py", line 708, in consume
    message = self._consumer.consume(block=block)
  File "/usr/lib/python2.7/dist-packages/pykafka/simpleconsumer.py", line 381, in consume
    self._raise_worker_exceptions()
  File "/usr/lib/python2.7/dist-packages/pykafka/simpleconsumer.py", line 205, in _raise_worker_exceptions
    raise ex
KeyError: -1
2016-06-01 18:17:10,320 (Thread-11 ) Sending request(xid=132): Close()
2016-06-01 18:17:10,332 (Thread-11 ) Received EVENT: Watch(type=4, state=3, path=u'/kafka/eqiad/consumers/otto-pykafka1/ids')
2016-06-01 18:17:10,333 (Thread-11 ) Closing connection to conf1001.eqiad.wmnet:2181
2016-06-01 18:17:10,333 (Thread-11 ) Zookeeper session lost, state: CLOSED

While the kafka-python processes did

2016-06-01 18:17:09,056 (MainThread) <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092>: socket disconnected
2016-06-01 18:17:09,056 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:09,057 (MainThread) Fetch to node 18 failed: socket disconnected
2016-06-01 18:17:09,057 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:09,058 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:09,333 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:09,334 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:09,364 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:09,364 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:09,378 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:09,378 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:09,396 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:09,397 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:09,625 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.                                              ]
2016-06-01 18:17:09,626 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:09,628 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:09,628 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:09,673 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:09,673 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:09,685 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:09,686 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:09,700 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:09,700 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:09,721 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:09,721 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:09,765 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:09,765 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:09,768 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:09,768 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:09,862 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:09,862 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:09,989 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:09,990 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:09,998 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:09,998 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:10,013 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:10,013 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:10,059 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:10,059 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:10,178 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:10,178 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:10,180 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:10,180 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:10,202 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:10,202 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:10,227 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:10,228 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:10,374 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:10,374 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:10,398 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:10,398 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:10,450 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:10,451 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:10,460 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:10,461 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:10,832 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.                                              ]
2016-06-01 18:17:10,832 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:10,843 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:10,843 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:10,900 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:10,900 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:10,926 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:10,926 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:10,945 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:10,946 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:10,987 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:10,987 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:11,000 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:11,001 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:11,004 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:11,005 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:11,134 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:11,135 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:11,158 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:11,159 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:11,203 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:11,203 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:11,225 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:11,225 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:11,276 (MainThread) Unknown error fetching data for topic-partition TopicPartition(topic=u'eventlogging-client-side', partition=0)
2016-06-01 18:17:11,314 (MainThread) Unknown error fetching data for topic-partition TopicPartition(topic=u'eventlogging-client-side', partition=0)
2016-06-01 18:17:11,355 (MainThread) Unknown error fetching data for topic-partition TopicPartition(topic=u'eventlogging-client-side', partition=0)
2016-06-01 18:17:11,369 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:11,369 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:11,391 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:11,391 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:11,397 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:11,397 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:11,398 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:11,398 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:11,399 (MainThread) Unknown error fetching data for topic-partition TopicPartition(topic=u'eventlogging-client-side', partition=0)
2016-06-01 18:17:11,400 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:11,400 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:11,439 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:11,440 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:11,572 (MainThread) Unknown error fetching data for topic-partition TopicPartition(topic=u'eventlogging-client-side', partition=0)                                                                        ]
2016-06-01 18:17:11,842 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:11,842 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:11,859 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:11,859 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:11,943 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:11,943 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:12,319 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:12,320 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:12,667 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.                                              ]
2016-06-01 18:17:12,667 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:12,884 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:12,884 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:12,917 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:12,917 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:12,919 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:12,919 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:12,958 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:12,958 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:12,959 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:12,960 (MainThread) Node 18 connection failed -- refreshing metadata
2016-06-01 18:17:13,015 (MainThread) Connect attempt to <BrokerConnection host=kafka1018.eqiad.wmnet/2620:0:861:108:10:64:53:10 port=9092> returned error 111. Disconnecting.
2016-06-01 18:17:13,015 (MainThread) Node 18 connection failed -- refreshing metadata

And then kept working.

WELLL POOP SCOOPS.

  1. kafka-python 1.2.0 doesn't really have a non-async mode anymore, and waiting for the produce response future to resolve causes a pretty serious performance regression. There is legacy interface support that uses the old kafka-python (the one we are using for producing now), but it also performs poorly. I tried to trace some code to figure out why, but wasn't getting anywhere.
  1. pykafka 2.4.0 now has Kafka based consumer group management, so I tried that out. Alas, it also failed during a broker restart. See: https://github.com/Parsely/pykafka/issues/543#issuecomment-223405662

Not quite sure what to do. I could hackily fix this by keeping kafka-python 0.9.4 (what we use now) for producing, and somehow also installing kafka-python 1.2.0 for consumption.

Change 292755 had a related patch set uploaded (by Ottomata):
[WIP][PoC] Use confluent-kafka python instead of kafka-python and pykafka

https://gerrit.wikimedia.org/r/292755

confluent-kafka seems to work A-ok! It passed the production kafka broker bounce test.

Am running 2 processors in a screen on stat1002, will leave them overnight to see how they do.

Ottomata set the point value for this task to 21.Jun 7 2016, 4:06 PM

Change 293273 had a related patch set uploaded (by Ottomata):
Include librdkafka-dev in contint::packages::python

https://gerrit.wikimedia.org/r/293273

Change 293273 merged by Ottomata:
Include librdkafka-dev in contint::packages::python

https://gerrit.wikimedia.org/r/293273

Mentioned in SAL [2016-07-06T14:47:20Z] <hashar_> attempting to refresh ci-jessie-wikimedia image to get librdkafka-dev included for T133779

Mentioned in SAL [2016-07-06T14:54:57Z] <hashar> Image ci-jessie-wikimedia-1467816381 in wmflabs-eqiad is ready T133779

Change 292755 merged by Ottomata:
Use confluent-kafka python instead of pykafka for consumption

https://gerrit.wikimedia.org/r/292755

Change 298000 had a related patch set uploaded (by Ottomata):
Update eventlogging kafka consumer args to match with python-confluent-kafka consumer

https://gerrit.wikimedia.org/r/298000

Deployed and running in beta!

Kafka restart tests in beta look good. librdkafka logging gets a little verbose when a broker is down. It prints a connection refused message about once a second. A little annoying, but nothing functionally wrong.

Change 300934 had a related patch set uploaded (by Ottomata):
Revert "Use confluent-kafka python instead of pykafka for consumption"

https://gerrit.wikimedia.org/r/300934

I'm reverting my confluent-kafka-python consumer change for now. It is not working in beta, and we want to do a prod deploy for eventbus soon.

https://gerrit.wikimedia.org/r/#/c/300934/1

Change 300934 merged by Ottomata:
Revert "Use confluent-kafka python instead of pykafka for consumption"

https://gerrit.wikimedia.org/r/300934

Ottomata renamed this task from Event Logging doesn't handle kafka nodes restart cleanly to Change or upgrade eventlogging kafka client used for consumption.Jul 28 2016, 4:49 PM

Change 302453 had a related patch set uploaded (by Ottomata):
Hieraize kafka handler for eventlogging analytics, select appropriate auto_offset_reset

https://gerrit.wikimedia.org/r/302453

Change 302453 merged by Ottomata:
Hieraize kafka handler for eventlogging analytics, select appropriate auto_offset_reset

https://gerrit.wikimedia.org/r/302453

Change 292049 merged by Ottomata:
Upgrade kafka writer to kafka-python 1.2.5 , add new kafka-python:// reader to test

https://gerrit.wikimedia.org/r/292049

kafka-python 1.2.5 has been running on deployment-eventlogging03 for the client side processor, and the all-events and mysql consumers for several days now. I just tested with a couple of different kafka failure scenarioes, and eventlogging kept on humming. That's good!

https://github.com/dpkp/kafka-python/pull/779 has been merged, which was dpkp's goal before making a new tag. Once it is made, I'll build a new .deb and deploy that. If that runs smoothly for a day, I think we should deploy to eventlogging analytics prod.

Change 303544 had a related patch set uploaded (by Ottomata):
Remove extraneous eventlogging configuration

https://gerrit.wikimedia.org/r/303544

Change 303544 merged by Ottomata:
Remove extraneous eventlogging configuration

https://gerrit.wikimedia.org/r/303544

Ottomata triaged this task as Medium priority.Aug 8 2016, 7:52 PM

Change 298000 abandoned by Ottomata:
Update eventlogging kafka consumer args to match with python-confluent-kafka consumer

Reason:
We didn't use the confluent client

https://gerrit.wikimedia.org/r/298000