
Unexpected utilization increase in udp_localhost-info kafka-logging topic
Closed, ResolvedPublic

Description

kafka-logging1004 has started alerting for its filesystem utilization thresholds, turns out udp_localhost-info topic has seen a linear increase in space used https://grafana.wikimedia.org/goto/dFkB3MONg?orgId=1

(screenshot attachment: 2025-01-20-164639_3808x1732_scrot.png, 283 KB)

As far as I can tell this is MediaWiki logging from k8s; the question in my mind is what changed to cause such an increase?

Event Timeline

I enabled logging for mw-jobrunner through rsyslog on the 13th https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1110786 but it looks like the increase in udp_localhost-info precedes that a bit?

The rsyslog container was added to mercurius on the 7th https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1105800 but again that doesn't line up with the beginning of the slope.

Thank you @Clement_Goubert ! I'm looking at the topic activity, expecting a similar increase in e.g. bytes or messages received, but I couldn't find one: https://grafana.wikimedia.org/goto/KCYj8VONR?orgId=1

(screenshot attachment: 2025-01-21-094736_3815x1586_scrot.png, 437 KB)

This makes me suspect it could be a repeat of Kafka not cleaning up due to messages with wacky timestamps (cf. https://phabricator.wikimedia.org/T250133#6063641 and https://phabricator.wikimedia.org/T284233 for example).

Mentioned in SAL (#wikimedia-operations) [2025-01-21T09:47:25Z] <godog> set udp_localhost-info retention.bytes=100000000000 on kafka-logging - T384233

Mentioned in SAL (#wikimedia-operations) [2025-01-21T10:00:32Z] <godog> set udp_localhost-info retention.bytes=300000000000 on kafka-logging (back to original value) - T384233

jijiki changed the task status from Open to In Progress.Jan 21 2025, 10:32 AM
jijiki triaged this task as Medium priority.
fgiunchedi changed the task status from In Progress to Stalled.Jan 21 2025, 10:40 AM

Wiggling the retention.bytes config for udp_localhost-info (as per https://wikitech.wikimedia.org/wiki/Kafka/Administration#Alter_topic_retention_settings) seems to have done the trick, in the sense that space has been reclaimed. Stalling the task and waiting a couple of days to see if retention keeps getting applied as expected.
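For reference, the "wiggle" amounts to lowering retention.bytes and then restoring it, which kicks the log cleaner into deleting old segments. A sketch using the stock kafka-configs tool that ships with Kafka (the bootstrap-server address is a placeholder, and exact values match the SAL entries above):

```shell
# Temporarily lower retention.bytes to force segment deletion
# (broker address is a placeholder for an actual kafka-logging broker)
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name udp_localhost-info \
  --alter --add-config retention.bytes=100000000000

# ...wait for the log cleaner to reclaim space, then restore the original value
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name udp_localhost-info \
  --alter --add-config retention.bytes=300000000000
```

Older Kafka versions take `--zookeeper` instead of `--bootstrap-server`; see the wikitech page linked above for the exact invocation used on this cluster.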

fgiunchedi claimed this task.

Topic retention now works properly for udp_localhost-info, resolving.

I'm curious, is it a known issue with our version of Kafka that the retention settings need to be wiggled occasionally? Could this symptom arise from a default setting we missed during topic provisioning?

Good question, my understanding is that the problem is caused by messages with wacky timestamps that prevent retention from being applied properly (see also https://phabricator.wikimedia.org/T250133#6063641 and related). I might be off base though, and newer Kafka versions may not have this problem.
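For anyone hitting this later: with time-based retention, Kafka deletes a segment only when its largest message timestamp is old enough, so a producer writing far-future timestamps can pin segments indefinitely. A sketch of how one might check for that and mitigate it (broker address and message count are placeholders; both tools ship with Kafka):

```shell
# Print the record timestamps from the topic to spot wacky values
# (far-future CreateTime values would block time-based retention)
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic udp_localhost-info \
  --property print.timestamp=true \
  --max-messages 10

# Possible mitigation: have the broker stamp messages at append time
# instead of trusting producer-supplied timestamps
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name udp_localhost-info \
  --alter --add-config message.timestamp.type=LogAppendTime
```

The `message.timestamp.type=LogAppendTime` switch is a trade-off: it fixes retention for topics with unreliable producers, but consumers that rely on producer-side timestamps lose that information.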