
Rotate kafka GC logs [3 pts] {hawk}
Closed, Resolved, Public

Description

Kafka GC logs have filled the disks on kafka1012 and kafka1020.
Letting @ema describe the actions taken in more detail.

Event Timeline

JAllemandou raised the priority of this task from to Unbreak Now!.
JAllemandou updated the task description.
JAllemandou added subscribers: JAllemandou, ema.

/var/log/kafka/kafkaServer-gc.log is not properly rotated on kafka nodes. The root partitions of kafka1012 and kafka1020 were full, leading to critical icinga alerts.

After investigating the issue with _joe_ we discovered that a few gc-related logging options could be used to rotate such files properly.

I have thus added KAFKA_OPTS="-XX:GCLogFileSize=1M" to /etc/default/kafka on kafka1020, restarted the kafka service, and confirmed that kafkaServer-gc.log got properly truncated, solving the disk space issue.

A few minutes later I added KAFKA_OPTS="-XX:GCLogFileSize=50M -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5" to /etc/default/kafka on kafka1012, both to fix the same issue and to test proper log rotation rather than simply truncating the file. I then restarted the kafka service on kafka1012 as well and observed that a new file called kafkaServer-gc.log.0.current was created, while the old kafkaServer-gc.log was still on disk, filling up the root partition. After checking that kafkaServer-gc.log was no longer in use by the java process, I truncated it to reclaim disk space.
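For a rough idea of how much disk the rotated GC logs can occupy with the options above, here is a minimal Python sketch. It assumes HotSpot rotates only when -XX:+UseGCLogFileRotation is set and that the upper bound is roughly NumberOfGCLogFiles times GCLogFileSize (the JVM may overshoot slightly before switching files). The function names `parse_size` and `gc_log_disk_bound` are hypothetical helpers, not part of Kafka or the JVM, and the stale pre-rotation kafkaServer-gc.log is not counted.

```python
import re

# Multipliers for JVM-style size suffixes (no suffix means bytes).
_UNITS = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text):
    """Convert a JVM-style size such as '50M' or '1024' to bytes."""
    match = re.fullmatch(r"(\d+)([KMG]?)", text.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


def gc_log_disk_bound(kafka_opts):
    """Approximate max bytes used by rotated GC logs.

    Returns None when rotation is not enabled or when either the
    file size or the file count is missing (assumed unbounded).
    """
    flags = kafka_opts.split()
    if "-XX:+UseGCLogFileRotation" not in flags:
        return None
    size = count = None
    for flag in flags:
        if flag.startswith("-XX:GCLogFileSize="):
            size = parse_size(flag.split("=", 1)[1])
        elif flag.startswith("-XX:NumberOfGCLogFiles="):
            count = int(flag.split("=", 1)[1])
    if size is None or count is None:
        return None
    return size * count


if __name__ == "__main__":
    opts = ("-XX:GCLogFileSize=50M -XX:+UseGCLogFileRotation "
            "-XX:NumberOfGCLogFiles=5")
    print(gc_log_disk_bound(opts))
```

With the kafka1012 settings this works out to 5 files of 50M each, so about 250M in total, which is small compared to the space the unrotated log had consumed.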

See the following code reviews for the details of pending puppet changes:
https://gerrit.wikimedia.org/r/#/c/266203/
https://gerrit.wikimedia.org/r/#/c/266209/

The puppet changes have been merged. I have re-enabled puppet on kafka1020 and kafka1012.

On kafka1018, where the size of kafkaServer-gc.log was also starting to become a problem, I have restarted kafka after puppet modified /etc/default/kafka and truncated kafkaServer-gc.log. @JAllemandou restarted eventlogging shortly afterwards.
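The "check that nothing holds the file, then truncate" step used on these brokers can be sketched in Python as follows. This is only an illustration under stated assumptions: it relies on the Linux /proc/<pid>/fd layout, needs enough privileges (typically root) to read other processes' descriptors, and there is a small race between the check and the truncation. `pids_holding` and `truncate_if_unused` are hypothetical names, not existing tools.

```python
import os


def pids_holding(path, proc_root="/proc"):
    """Return the sorted PIDs that have `path` open (Linux /proc layout)."""
    target = os.path.realpath(path)
    holders = set()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission
        for fd in fds:
            try:
                link = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor closed while we were scanning
            if link == target:
                holders.add(int(entry))
    return sorted(holders)


def truncate_if_unused(path, proc_root="/proc"):
    """Truncate `path` to zero bytes only if no process has it open.

    Returns True if the file was truncated, False if it was left alone.
    """
    if pids_holding(path, proc_root):
        return False
    with open(path, "w"):
        pass  # opening for writing truncates the file
    return True


if __name__ == "__main__":
    log = "/var/log/kafka/kafkaServer-gc.log"
    print(pids_holding(log))
```

Truncating in place, rather than deleting, frees the space immediately even if some process turns out to hold the file open after all.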

elukey lowered the priority of this task from Unbreak Now! to High. (Jan 25 2016, 2:51 PM)

The Kafka brokers 1014, 1018 and 1022 were restarted and their huge GC logs were cleaned up.

The last action was to run kafka preferred-replica-election to rebalance the partition leaders.

EventLogging (EL) was restarted again.

Milimetric renamed this task from Rotate kafka GC logs to Rotate kafka GC logs [3 pts]. (Jan 25 2016, 5:44 PM)
Milimetric moved this task from Next Up to In Code Review on the Analytics-Kanban board.
Milimetric renamed this task from Rotate kafka GC logs [3 pts] to Rotate kafka GC logs [3 pts] {hawk}. (Feb 2 2016, 5:14 PM)
Milimetric assigned this task to elukey.