Tune Kafka logs to register clients connected
Closed, Resolved · Public · 8 Story Points

Description

Our Kafka alarms currently offer no way to figure out which clients (producers/consumers) are connected or what their IP addresses are.

In T172681 this would have been really useful for tracing the faulty producer back to rhenium.wikimedia.org, rather than having to restart a broker with verbose logging.

Since we are introducing Kafka ACLs with the new Jumbo cluster, we could simply tune the verbosity of kafka-authorizer.log (I did it in labs when testing and it was quite handy).

elukey created this task. Aug 17 2017, 10:09 AM
Restricted Application added a subscriber: Aklapper. Aug 17 2017, 10:09 AM
mforns moved this task from Incoming to Q4 (April 2018) on the Analytics board. Aug 17 2017, 3:19 PM
elukey moved this task from Backlog to Analytics Backlog on the User-Elukey board.
elukey edited projects, added Analytics-Kanban; removed Analytics. Sep 5 2017, 1:23 PM
elukey moved this task from Next Up to In Progress on the Analytics-Kanban board.
elukey added a comment. Sep 5 2017, 1:53 PM

Tuning the kafka-authorizer appender is definitely important for us, since its log contains useful information such as:

[2017-09-05 13:39:32,147] DEBUG Principal = User:ANONYMOUS is Denied Operation = Describe from host = 10.68.22.62 on resource = Topic:__confluent.support.metrics (kafka.authorizer.logger)
[2017-09-05 13:50:59,698] DEBUG operation = Describe on resource = Topic:elukey2 from host = 10.68.22.62 is Allow based on acl = User:CN=client1,OU=Services,O=WMF,C=US has Allow permission for operations: Describe from hosts: * (kafka.authorizer.logger)
[2017-09-05 13:50:59,698] DEBUG Principal = User:CN=client1,OU=Services,O=WMF,C=US is Allowed Operation = Describe from host = 10.68.22.62 on resource = Topic:elukey2 (kafka.authorizer.logger)

It doesn't show more detailed information about the Kafka client (such as the API version used), but the most important details are there: the IP address and the type of operation.
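
To answer "which clients are connected, and from where" from these log lines, something along the following lines could work. It is only a sketch, based solely on the three sample lines above: the names `AuthEvent`, `parse_authorizer_line` and `principals_by_host` are made up for illustration, and the regex assumes the "Principal = ... is Allowed/Denied Operation = ..." summary format shown here (other Kafka versions may log differently).

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Optional, Set


class AuthEvent(NamedTuple):
    """One authorization decision taken from kafka-authorizer.log."""
    timestamp: str
    principal: str
    decision: str   # "Allowed" or "Denied"
    operation: str
    host: str
    resource: str


# Matches only the "Principal = ..." summary lines; the
# "operation = ... based on acl = ..." detail lines are skipped.
# Assumption: the layout is the one in the samples quoted above.
_SUMMARY = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]\s+\w+\s+"
    r"Principal = (?P<principal>.+?) is (?P<decision>Allowed|Denied) "
    r"Operation = (?P<operation>\w+) "
    r"from host = (?P<host>\S+) "
    r"on resource = (?P<resource>\S+)"
)


def parse_authorizer_line(line: str) -> Optional[AuthEvent]:
    """Return the parsed event, or None if the line is not a summary line."""
    match = _SUMMARY.match(line)
    if match is None:
        return None
    return AuthEvent(**match.groupdict())


def principals_by_host(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Group the principals seen in the log by client IP address."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        event = parse_authorizer_line(line)
        if event is not None:
            seen[event.host].add(event.principal)
    return dict(seen)
```

Feeding the open log file to `principals_by_host` (the file's location depends on the broker's log4j configuration) would give a mapping from IP address to the principals that connected from it.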

Change 376015 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] confluent::kafka: set kafka-authorizer log to DEBUG

https://gerrit.wikimedia.org/r/376015

Mentioned in SAL (#wikimedia-operations) [2017-09-06T11:24:57Z] <elukey> temporarily raise kafka log4j authorizer verbosity to DEBUG on kafka1012 - T173493

Change 376015 abandoned by Elukey:
confluent::kafka: set kafka-authorizer log to DEBUG

Reason:
The patch didn't work on the Kafka analytics cluster: a parameter that enables ACL processing has to be set before any data shows up in authorizer.log. I'll try to come up with a new patch for the Kafka Jumbo cluster.

https://gerrit.wikimedia.org/r/376015
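
The abandon message does not name the parameter. A Kafka broker only evaluates ACLs (and so only writes authorizer decisions to the log) when an authorizer is configured, so one plausible candidate is the broker setting `authorizer.class.name`; that is an assumption, not something stated in this task. Under that assumption, a quick check of a broker's properties file could be sketched as follows (`parse_properties` and `authorizer_enabled` are hypothetical helper names):

```python
from typing import Dict


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, ignoring blanks and comments."""
    props: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def authorizer_enabled(properties_text: str) -> bool:
    """True if authorizer.class.name is set to a non-empty value.

    Assumption: this is the parameter the abandon message refers to.
    """
    return bool(parse_properties(properties_text).get("authorizer.class.name"))


# Illustrative input only, not the actual cluster configuration.
example = """
# broker settings
broker.id=1001
authorizer.class.name=kafka.security.auth.SimpleAclAuthorizer
"""
print(authorizer_enabled(example))
```

If the check returns False, raising the authorizer logger to DEBUG would not be expected to produce any output, which would match the behaviour described in the abandon reason.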

Change 381980 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Enable basic ACL handling on the Kafka Jumbo cluster

https://gerrit.wikimedia.org/r/381980

elukey added a comment. Oct 4 2017, 5:24 PM

Added the following ACLs (not yet active, since the above patch has not been merged):

elukey@kafka-jumbo1001:~$ kafka acls --list
kafka-acls --authorizer-properties zookeeper.connect=conf1001.eqiad.wmnet,conf1002.eqiad.wmnet,conf1003.eqiad.wmnet/kafka/jumbo-eqiad --list
Current ACLs for resource `Group:*`:
 	User:ANONYMOUS has Allow permission for operations: Read from hosts: *

Current ACLs for resource `Topic:*`:
 	User:ANONYMOUS has Allow permission for operations: Describe from hosts: *
	User:ANONYMOUS has Allow permission for operations: Write from hosts: *
	User:ANONYMOUS has Allow permission for operations: Read from hosts: *

Current ACLs for resource `Cluster:kafka-cluster`:
 	User:ANONYMOUS has Allow permission for operations: Create from hosts: *
	User:ANONYMOUS has Allow permission for operations: All from hosts: *
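
For auditing or diffing ACLs over time, the listing above can be turned into a data structure. The following is a rough sketch that assumes the output format is exactly as pasted here; `AclEntry` and `parse_acl_listing` are illustrative names, not part of any Kafka tooling.

```python
import re
from typing import Dict, List, NamedTuple, Optional


class AclEntry(NamedTuple):
    principal: str
    permission: str  # "Allow" or "Deny"
    operation: str
    hosts: str


# Assumption: header and entry lines look like the ones in the listing above.
_HEADER = re.compile(r"^Current ACLs for resource `(?P<resource>[^`]+)`:")
_ENTRY = re.compile(
    r"^(?P<principal>.+?) has (?P<permission>Allow|Deny) permission "
    r"for operations: (?P<operation>\w+) from hosts: (?P<hosts>.+)$"
)


def parse_acl_listing(text: str) -> Dict[str, List[AclEntry]]:
    """Map each resource (e.g. 'Topic:*') to the ACL entries listed under it."""
    acls: Dict[str, List[AclEntry]] = {}
    current: Optional[str] = None
    for raw in text.splitlines():
        line = raw.strip()
        header = _HEADER.match(line)
        if header:
            current = header.group("resource")
            acls.setdefault(current, [])
            continue
        entry = _ENTRY.match(line)
        if entry and current is not None:
            acls[current].append(AclEntry(**entry.groupdict()))
    return acls
```

Lines that are neither a resource header nor an ACL entry (the shell prompt, the expanded command) are ignored, so the whole terminal paste can be passed in as-is.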

Change 381980 merged by Elukey:
[operations/puppet@production] Enable basic ACL handling on the Kafka Jumbo cluster

https://gerrit.wikimedia.org/r/381980

Mentioned in SAL (#wikimedia-operations) [2017-10-04T17:37:29Z] <elukey> enabled basic ACLs on the Kafka Jumbo cluster - T173493

elukey moved this task from In Code Review to Done on the Analytics-Kanban board. Oct 5 2017, 7:57 AM
elukey moved this task from In Progress to Done on the User-Elukey board. Oct 5 2017, 8:27 AM
elukey set the point value for this task to 8. Oct 5 2017, 3:07 PM
Nuria closed this task as Resolved. Oct 9 2017, 4:39 PM