New hosts have 5TB for /srv; codfw hosts have 15TB. Disk usage appears to be growing by 1% day over day.
| Subject | Repo | Branch | Lines +/- |
|---|---|---|---|
| kafka-logging: reduce retention time to 5 days | operations/puppet | production | +1 -0 |
Event Timeline
Change 697995 had a related patch set uploaded (by Cwhite; author: Cwhite):
[operations/puppet@production] kafka-logging: reduce retention time to 5 days
Worth pointing out that only two topics seem to be really big:
elukey@kafka-logging1001:/srv/kafka/data$ sudo du -hs -- * | sort -h | tail -n 20
24G rsyslog-info-4
24G rsyslog-info-5
38G udp_localhost-info-0
38G udp_localhost-info-1
39G udp_localhost-info-2
39G udp_localhost-info-3
39G udp_localhost-info-4
39G udp_localhost-info-5
281G rsyslog-notice-0
281G rsyslog-notice-1
281G rsyslog-notice-3
281G rsyslog-notice-4
281G rsyslog-notice-5
282G rsyslog-notice-2
388G udp_localhost-warning-1
388G udp_localhost-warning-3
388G udp_localhost-warning-5
389G udp_localhost-warning-0
389G udp_localhost-warning-2
389G udp_localhost-warning-4
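To make the per-topic picture easier to read, the partition sizes in the du listing above can be summed by topic. This is only a minimal sketch: the helper names (size_to_gib, totals_by_topic) are made up for illustration, the sample contains just the two largest topics from the listing, and it assumes du -h suffixes are powers of 1024.

```python
import re
from collections import defaultdict

# du -h reports sizes in powers of 1024; normalise everything to GiB.
UNITS = {"K": 1 / 1024**2, "M": 1 / 1024, "G": 1.0, "T": 1024.0}


def size_to_gib(text):
    """Convert a du -h size such as '281G' or '4.5G' to GiB."""
    if text[-1].isdigit():  # no suffix: plain bytes
        return float(text) / 1024**3
    return float(text[:-1]) * UNITS[text[-1]]


def totals_by_topic(du_output):
    """Sum partition directory sizes per topic (name minus the '-N' suffix)."""
    totals = defaultdict(float)
    for line in du_output.strip().splitlines():
        size, name = line.split()
        topic = re.sub(r"-\d+$", "", name)
        totals[topic] += size_to_gib(size)
    return dict(totals)


# The two largest topics, copied from the listing in this task.
sample = """\
281G rsyslog-notice-0
281G rsyslog-notice-1
281G rsyslog-notice-3
281G rsyslog-notice-4
281G rsyslog-notice-5
282G rsyslog-notice-2
388G udp_localhost-warning-1
388G udp_localhost-warning-3
388G udp_localhost-warning-5
389G udp_localhost-warning-0
389G udp_localhost-warning-2
389G udp_localhost-warning-4
"""

totals = totals_by_topic(sample)
for topic, gib in sorted(totals.items(), key=lambda item: item[1]):
    print(f"{gib:8.0f}G {topic}")
```

With these numbers the two topics come to roughly 1.6T and 2.3T on this broker.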
I tried tailing udp_localhost-warning with kafkacat -t udp_localhost-warning -b localhost:9092 -C, and it shows a MediaWiki log message that repeats over and over. It may be something spammy that we could ask to turn off if it is not needed. For rsyslog-notice there seems to be a ton of logs from Kubernetes nodes; maybe we could trim something there.
Change 697995 merged by Cwhite:
[operations/puppet@production] kafka-logging: reduce retention time to 5 days
OO, we should make sure this isn't another case of https://phabricator.wikimedia.org/T250133#6063641
If it is, T282887 (Avoid accepting Kafka messages with whacky timestamps) would prevent this from happening again.
This may be the case. The global retention reduction had little effect. Inspecting further, I found Configs:retention.ms=432000000,retention.bytes=500000000000 on udp_localhost-(warning|info) from T250133#6063641. Suspecting the custom configs were the problem, I removed them, but that did not seem to have any effect.
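For reference, here is what those two values work out to in human units. This is a rough sketch, not taken from the task: the function names are illustrative, and it relies on retention.bytes being enforced per partition (as the Kafka topic configuration documentation describes), while du -h reports binary units.

```python
RETENTION_MS = 432_000_000          # value found on the topics
RETENTION_BYTES = 500_000_000_000   # value found on the topics
NEW_RETENTION_BYTES = 300_000_000_000  # value tried later in this task
PARTITIONS = 6  # partitions 0-5 appear in the du listings


def ms_to_days(ms):
    """Milliseconds to days."""
    return ms / (1000 * 60 * 60 * 24)


def bytes_to_gib(n):
    """Bytes to GiB, the unit du -h prints with a 'G' suffix."""
    return n / 1024**3


print("retention.ms in days:", ms_to_days(RETENTION_MS))
print("retention.bytes per partition (GiB):", round(bytes_to_gib(RETENTION_BYTES), 1))
print("retention.bytes across partitions (GiB):",
      round(bytes_to_gib(RETENTION_BYTES) * PARTITIONS, 1))
print("300 GB per partition (GiB):", round(bytes_to_gib(NEW_RETENTION_BYTES), 1))
```

Read this way, a 388G partition is still under the roughly 466 GiB per-partition cap, so the size limit alone would not have trimmed it, whereas a 300 GB cap (about 279 GiB) sits below the observed partition sizes.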
Then I tried reducing byte retention by setting retention.bytes=300000000000 (300 GB) on udp_localhost-warning, in the hope that it simply needed a more aggressive retention policy. This cleaned up far more than I expected, by more than a TB (compare with T284233#7132254):
...
4.5G udp_localhost-warning-4
4.6G udp_localhost-warning-2
4.8G udp_localhost-warning-0
5.0G udp_localhost-warning-3
5.0G udp_localhost-warning-5
5.1G udp_localhost-warning-1
6.1G rsyslog-warning-0
6.1G rsyslog-warning-1
6.1G rsyslog-warning-2
6.1G rsyslog-warning-3
6.1G rsyslog-warning-4
6.1G rsyslog-warning-5
24G rsyslog-info-0
24G rsyslog-info-1
24G rsyslog-info-2
24G rsyslog-info-3
24G rsyslog-info-4
24G rsyslog-info-5
39G udp_localhost-info-0
39G udp_localhost-info-1
39G udp_localhost-info-3
39G udp_localhost-info-4
39G udp_localhost-info-5
40G udp_localhost-info-2
283G rsyslog-notice-0
283G rsyslog-notice-1
283G rsyslog-notice-2
283G rsyslog-notice-3
283G rsyslog-notice-4
283G rsyslog-notice-5
rsyslog-notice appears to have had the same problem.
382G rsyslog-notice-0
382G rsyslog-notice-1
382G rsyslog-notice-2
382G rsyslog-notice-3
382G rsyslog-notice-4
382G rsyslog-notice-5
After setting retention.bytes=300000000000 (300 GB):
26G rsyslog-notice-2
27G rsyslog-notice-0
27G rsyslog-notice-1
27G rsyslog-notice-3
27G rsyslog-notice-4
27G rsyslog-notice-5
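As a rough cross-check of how much space the two retention changes freed on this broker, the before and after partition sizes quoted in this task can be subtracted. This is only a sketch: freed_gib is an illustrative helper, the numbers are copied by hand from the listings (partitions 0 to 5 in order), and the "before" figures for udp_localhost-warning come from the earlier du output, so they may lag what was on disk at the time of the change.

```python
def freed_gib(before, after):
    """Total size before minus total size after, in GiB as du -h reports it."""
    return round(sum(before) - sum(after), 1)


# Sizes per partition (0..5), copied from the listings in this task.
rsyslog_notice_before = [382, 382, 382, 382, 382, 382]
rsyslog_notice_after = [27, 27, 26, 27, 27, 27]

udp_warning_before = [389, 388, 389, 388, 389, 388]
udp_warning_after = [4.8, 5.1, 4.6, 5.0, 4.5, 5.0]

notice_freed = freed_gib(rsyslog_notice_before, rsyslog_notice_after)
warning_freed = freed_gib(udp_warning_before, udp_warning_after)

print("rsyslog-notice freed (G):", notice_freed)
print("udp_localhost-warning freed (G):", warning_freed)
```

Both topics come out at a bit over 2T each, which is consistent with the "more than a TB" observation above.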