
kafka-logging hosts running out of space on /srv
Closed, ResolvedPublic

Description

New hosts have 5 TB for /srv, codfw hosts have 15 TB. Usage appears to be growing by about 1% per day.

Event Timeline

Change 697995 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] kafka-logging: reduce retention time to 5 days

https://gerrit.wikimedia.org/r/697995
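The patch content isn't reproduced here, but for context, broker-level time retention in Kafka is controlled by log.retention.hours (or log.retention.ms) in server.properties; a 5-day retention would correspond to something like:

# broker-level default, can still be overridden per topic
log.retention.hours=120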

Worth pointing out that only two topics seem to be really big:

elukey@kafka-logging1001:/srv/kafka/data$ sudo du -hs -- * | sort -h | tail -n 20
24G	rsyslog-info-4
24G	rsyslog-info-5
38G	udp_localhost-info-0
38G	udp_localhost-info-1
39G	udp_localhost-info-2
39G	udp_localhost-info-3
39G	udp_localhost-info-4
39G	udp_localhost-info-5
281G	rsyslog-notice-0
281G	rsyslog-notice-1
281G	rsyslog-notice-3
281G	rsyslog-notice-4
281G	rsyslog-notice-5
282G	rsyslog-notice-2
388G	udp_localhost-warning-1
388G	udp_localhost-warning-3
388G	udp_localhost-warning-5
389G	udp_localhost-warning-0
389G	udp_localhost-warning-2
389G	udp_localhost-warning-4

I tried to tail udp_localhost-warning with kafkacat -t udp_localhost-warning -b localhost:9092 -C, and it shows a MediaWiki log entry that repeats over and over; maybe it is something spammy that we could ask to turn off if it is not needed. For rsyslog-notice there seems to be a ton of logs from Kubernetes nodes, so maybe we could trim something from there.
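For reference, this is roughly the consumer invocation described above; the -o -100 -e options are an optional refinement (not part of the original command) to fetch only the last 100 messages per partition and then exit:

# tail recent messages from the topic and exit at end of partition
kafkacat -C -b localhost:9092 -t udp_localhost-warning -o -100 -e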

Change 697995 merged by Cwhite:

[operations/puppet@production] kafka-logging: reduce retention time to 5 days

https://gerrit.wikimedia.org/r/697995

Ooh, we should make sure this isn't another case of https://phabricator.wikimedia.org/T250133#6063641

This may be the case. The global retention reduction had little effect. Inspecting further, I found Configs:retention.ms=432000000,retention.bytes=500000000000 on udp_localhost-(warning|info), left over from T250133#6063641. Thinking the custom configs were the problem, I removed them, but that did not seem to have an effect.
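For reference, a sketch of how the per-topic overrides can be inspected and removed with the stock Kafka CLI (endpoints here are placeholders, newer Kafka versions take --bootstrap-server instead of --zookeeper, and the wrapper used on these hosts may differ):

# show per-topic config overrides (retention.ms, retention.bytes, ...)
kafka-configs.sh --zookeeper localhost:2181 --entity-type topics --entity-name udp_localhost-warning --describe

# drop the overrides so the topic falls back to the broker defaults
kafka-configs.sh --zookeeper localhost:2181 --entity-type topics --entity-name udp_localhost-warning --alter --delete-config retention.ms,retention.bytes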

Then I sought to reduce byte retention by setting retention.bytes=300000000000 (300 GB) on udp_localhost-warning, in the hope that it simply needed a more aggressive retention policy. Doing this cleaned up far more than I expected, by more than a TB (compare with T284233#7132254); a command sketch follows the listing below:

...
4.5G    udp_localhost-warning-4
4.6G    udp_localhost-warning-2
4.8G    udp_localhost-warning-0
5.0G    udp_localhost-warning-3
5.0G    udp_localhost-warning-5
5.1G    udp_localhost-warning-1
6.1G    rsyslog-warning-0
6.1G    rsyslog-warning-1
6.1G    rsyslog-warning-2
6.1G    rsyslog-warning-3
6.1G    rsyslog-warning-4
6.1G    rsyslog-warning-5
24G     rsyslog-info-0
24G     rsyslog-info-1
24G     rsyslog-info-2
24G     rsyslog-info-3
24G     rsyslog-info-4
24G     rsyslog-info-5
39G     udp_localhost-info-0
39G     udp_localhost-info-1
39G     udp_localhost-info-3
39G     udp_localhost-info-4
39G     udp_localhost-info-5
40G     udp_localhost-info-2
283G    rsyslog-notice-0
283G    rsyslog-notice-1
283G    rsyslog-notice-2
283G    rsyslog-notice-3
283G    rsyslog-notice-4
283G    rsyslog-notice-5
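For reference, the byte-retention override described above can be applied with something like the following (same placeholder caveats as the earlier sketch; retention.bytes is a per-partition limit, after which old log segments become eligible for deletion):

kafka-configs.sh --zookeeper localhost:2181 --entity-type topics --entity-name udp_localhost-warning --alter --add-config retention.bytes=300000000000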

rsyslog-notice appears to have had the same problem.

382G	rsyslog-notice-0
382G	rsyslog-notice-1
382G	rsyslog-notice-2
382G	rsyslog-notice-3
382G	rsyslog-notice-4
382G	rsyslog-notice-5

After setting retention.bytes=300000000000 (300 GB):

26G	rsyslog-notice-2
27G	rsyslog-notice-0
27G	rsyslog-notice-1
27G	rsyslog-notice-3
27G	rsyslog-notice-4
27G	rsyslog-notice-5
herron claimed this task.

Disk utilization on kafka-logging hosts has been stable for 70+ days now; resolving.