
Out of disk space on multiple kafka-test brokers
Closed, Resolved (Public)

Description

Multiple kafka-test brokers ran out of disk space after a single Kafka topic (mediawiki.page_content_change) grew too large and filled the disk.

image.png (235×698 px, 30 KB)

image.png (913×1 px, 118 KB)

Additionally, although Icinga raised alerts for these hosts, it did not notify the Data Engineering team directly. The alert messages in the IRC channel #wikimedia-operations were therefore not spotted before the disks became full.

Event Timeline

BTullis triaged this task as Unbreak Now! priority. (Jun 10 2022, 9:51 AM)
BTullis moved this task from Incoming (new tickets) to Ops Week on the Data-Engineering board.
BTullis moved this task from Next Up to In Progress on the Data-Engineering-Kanban board.

I have modified the topic to set the retention time to 1 second.

btullis@kafka-test1006:~$ kafka configs --entity-type topics --entity-name mediawiki.page_content_change --describe
kafka-configs --zookeeper zookeeper-test1002.eqiad.wmnet/kafka/test-eqiad --entity-type topics --entity-name mediawiki.page_content_change --describe
Configs for topic 'mediawiki.page_content_change' are

btullis@kafka-test1006:~$ kafka configs --alter --entity-type topics --entity-name mediawiki.page_content_change --add-config retention.ms=1000
kafka-configs --zookeeper zookeeper-test1002.eqiad.wmnet/kafka/test-eqiad --alter --entity-type topics --entity-name mediawiki.page_content_change --add-config retention.ms=1000
Completed Updating config for entity: topic 'mediawiki.page_content_change'.

btullis@kafka-test1006:~$ kafka configs --entity-type topics --entity-name mediawiki.page_content_change --describe
kafka-configs --zookeeper zookeeper-test1002.eqiad.wmnet/kafka/test-eqiad --entity-type topics --entity-name mediawiki.page_content_change --describe
Configs for topic 'mediawiki.page_content_change' are retention.ms=1000

I will verify that it purges the data, then remove this configuration.
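One way to check that the purge is taking effect is to watch the on-disk size of the topic's partition directories. The following is only a sketch: the helper name topic_size_kib is hypothetical, and the Kafka log directory has to be supplied by the caller because its path on these brokers is not stated here. It relies on Kafka naming partition directories <topic>-<partition number>.

```shell
# Hypothetical helper: print the total size, in KiB, of all partition
# directories belonging to one topic.
#   $1 = Kafka log directory (the path is an assumption; supply your own)
#   $2 = topic name
topic_size_kib() {
    local log_dir="$1" topic="$2" total=0 dir suffix size
    for dir in "$log_dir"/"$topic"-[0-9]*; do
        [ -d "$dir" ] || continue              # the glob matched nothing
        suffix=${dir##*/"$topic"-}
        case "$suffix" in
            *[!0-9]*) continue ;;              # belongs to a different, longer-named topic
        esac
        size=$(du -sk "$dir" | cut -f1)
        total=$((total + size))
    done
    echo "$total"
}

# Example call (the directory is a placeholder, not the real path):
#   topic_size_kib /path/to/kafka/logs mediawiki.page_content_change
```

Retention is enforced by a periodic broker check (log.retention.check.interval.ms, five minutes by default), so the reported size may take a few minutes to drop after the configuration change.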

I have now deleted the custom retention time.

btullis@kafka-test1006:~$ kafka configs --alter --entity-type topics --entity-name mediawiki.page_content_change --delete-config retention.ms
kafka-configs --zookeeper zookeeper-test1002.eqiad.wmnet/kafka/test-eqiad --alter --entity-type topics --entity-name mediawiki.page_content_change --delete-config retention.ms
Completed Updating config for entity: topic 'mediawiki.page_content_change'.

btullis@kafka-test1006:~$ kafka configs --entity-type topics --entity-name mediawiki.page_content_change --describe
kafka-configs --zookeeper zookeeper-test1002.eqiad.wmnet/kafka/test-eqiad --entity-type topics --entity-name mediawiki.page_content_change --describe
Configs for topic 'mediawiki.page_content_change' are

All brokers are working again, with no under-replicated partitions and no offline partitions.
Grafana link

image.png (758×1 px, 145 KB)
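The same state can also be checked from the command line as well as in Grafana. This is a rough sketch, assuming the stock kafka-topics tool is available alongside kafka-configs and accepts the same --zookeeper connection string. The helper name no_partitions_listed is hypothetical.

```shell
# Hypothetical helper: succeed only if the kafka-topics output on stdin
# lists no partitions. Note that empty output from a failed command
# would also pass, so check the command's own exit status as well.
no_partitions_listed() {
    ! grep -q 'Partition:'
}

# Intended use (not run here; assumes kafka-topics is on the PATH):
#   zk=zookeeper-test1002.eqiad.wmnet/kafka/test-eqiad
#   kafka-topics --zookeeper "$zk" --describe --under-replicated-partitions | no_partitions_listed
#   kafka-topics --zookeeper "$zk" --describe --unavailable-partitions | no_partitions_listed
```

With these filter flags, kafka-topics prints one line per affected partition, so no output means nothing is under-replicated or offline.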

BTullis lowered the priority of this task from Unbreak Now! to Medium. (Jun 10 2022, 12:25 PM)
BTullis moved this task from In Progress to Done on the Data-Engineering-Kanban board.

In order to free enough space for kafka to apply the new settings and purge the topic, I had to remove three old kernels from each broker with:

sudo apt purge linux-image-4.19.0-13-amd64 linux-image-4.19.0-14-amd64 linux-image-4.19.0-16-amd64

This left the currently running kernel, plus one previous version, on each of the five brokers.

Once I had cleared this space I could restart the kafka service and allow the topic pruning operation to run to completion.
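For future reference, the list of removable kernels can be worked out rather than typed by hand. This is a sketch only: the helper name removable_kernels is hypothetical, and it encodes the policy described above (keep the running kernel plus the newest other version). It assumes GNU sort for the -V option.

```shell
# Hypothetical helper: read kernel image package names on stdin and print
# those that could be purged, keeping the running kernel and the newest
# other version.
#   $1 = running kernel release, as printed by `uname -r`
removable_kernels() {
    local running="linux-image-$1"
    grep -E '^linux-image-[0-9]' |   # drop metapackages such as linux-image-amd64
        grep -vxF "$running" |       # never offer the running kernel
        sort -V |                    # oldest first (GNU version sort)
        sed '$d'                     # keep the newest remaining version
}

# Intended use (not run here). dpkg-query may also list packages that
# are already removed but still have configuration files present.
#   df -h /
#   dpkg-query -W -f='${Package}\n' 'linux-image-*' | removable_kernels "$(uname -r)"
```

Review the output by hand before passing it to apt purge.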

Change 804573 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Decrease the retention time on the kafka-test cluster to 1 day

https://gerrit.wikimedia.org/r/804573

We have decided to reduce the retention time on the kafka-test cluster from its default value of 7 days to 1 day. This patch implements the change: https://gerrit.wikimedia.org/r/804573
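To compare these values with the temporary retention.ms=1000 override used earlier, it helps to express them in milliseconds. A small sketch follows; the helper name days_to_ms is hypothetical. How the puppet patch expresses the setting (for example as hours at broker level) is not shown here.

```shell
# Hypothetical helper: convert a whole number of days to milliseconds,
# the unit used by Kafka's retention.ms setting.
days_to_ms() {
    echo $(( $1 * 24 * 60 * 60 * 1000 ))
}

days_to_ms 7   # prints 604800000 (the 7-day default)
days_to_ms 1   # prints 86400000  (the new 1-day retention)
```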

Change 804573 merged by Btullis:

[operations/puppet@production] Decrease the retention time on the kafka-test cluster to 1 day

https://gerrit.wikimedia.org/r/804573