
Avoid accepting Kafka messages with whacky timestamps
Open, Medium, Public

Description

In https://phabricator.wikimedia.org/T250133#6063641 we encountered an error where a bad Kafka timestamp caused Kafka log rolling to stop indefinitely, which filled up disks.

Having a bad Kafka timestamp (way out of range, e.g. years in the future or past) will also hurt stream processing and Hive partition ingestion.

We could configure Kafka to reject messages with timestamps that are too old or too far in the future with log.message.timestamp.difference.max.ms. Setting this to the value of log.retention.ms seems to make the most sense, but this caused issues with compacted topics as noted here. Kafka used log.retention.ms as the default value for log.message.timestamp.difference.max.ms for a few versions, but this was reverted due to complexities with compacted topics.

This really only matters when the data produced is untrusted. eventgate-analytics-external and eventgate-logging-external accept events from external producers. Our code does the right thing, but there is nothing stopping someone from manually POSTing an event with a whacky meta.dt, which will be used for the Kafka timestamp. After we do T267648: Adopt conventions for server receive and client/event timestamps in non analytics event schemas, we should probably modify EventGate so that it always sets meta.dt itself, rather than accepting the producer's value if it is present.
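The proposed EventGate change could look something like this sketch (hypothetical function name and event shape; the actual EventGate implementation is Node.js, this just illustrates the behavior):

```python
from datetime import datetime, timezone

def set_receive_timestamp(event: dict) -> dict:
    """Hypothetical sketch of the proposed behavior: always set meta.dt to
    the server receive time, discarding any client-supplied value, so an
    external producer can never control the Kafka timestamp."""
    meta = event.setdefault("meta", {})
    meta["dt"] = (
        datetime.now(timezone.utc)
        .isoformat(timespec="milliseconds")
        .replace("+00:00", "Z")
    )
    return event

# A manually POSTed event with a whacky meta.dt gets overwritten:
event = {"meta": {"dt": "2007-01-01T00:00:00Z"}, "payload": "example"}
print(set_receive_timestamp(event)["meta"]["dt"])  # server receive time, not 2007
```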

This would help mitigate the potential problem, but it doesn't stop bugs in our code from emitting bad timestamps. Setting log.message.timestamp.difference.max.ms would, but I'm not sure what to do if we start using compacted topics.

Event Timeline

I'd say this is medium to low priority and is something that needs to be worked on in collaboration with maintainers of other Kafka clusters.

Milimetric lowered the priority of this task from High to Medium. May 17 2021, 9:19 PM

This happened today: somehow there were recentchange events with timestamps from around 2007 in the Kafka stream.

Happened again today. There was a mediawiki.recentchange event with a 2015 timestamp.

This is a nasty bug if Andrew happens to not be around, I just wanna ++ the tech debt value here.

There are two possible scenarios, causing different issues:

  • Future timestamps that can block Kafka's clean-up policy, filling disks.
  • Future or past timestamps that could cause issues in data processing pipelines.

    The first happens because Kafka's "delete" clean-up policy only checks the first offset: if it has passed the retention period, it is removed. If the first offset has a wrong date, for example one in the future, Kafka won't continue looking for other messages to remove, and the topic will keep all its data until "future date + retention" is reached, which can fill the disks.
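    The blocking behavior described above can be simulated with a short sketch (simplified model; `deletable_segments` is a hypothetical name, and real Kafka works on log segments rather than individual offsets):

```python
def deletable_segments(segment_max_ts: list[int], now_ms: int, retention_ms: int) -> list[int]:
    """Simplified model of the "delete" clean-up behavior: scan from the
    oldest segment and stop at the first one that has not yet exceeded
    retention. A future-dated segment at the front therefore blocks
    deletion of everything behind it."""
    deletable = []
    for ts in segment_max_ts:
        if now_ms - ts > retention_ms:
            deletable.append(ts)
        else:
            break  # stop at the first non-expired segment
    return deletable

day = 86_400_000
now = 1_700_000_000_000
# First segment carries a bogus future timestamp; later ones are long expired.
print(deletable_segments([now + 365 * day, now - 30 * day, now - 20 * day], now, 7 * day))  # -> []
```

    With a sane first segment, the 30- and 20-day-old segments would be deleted under a 7-day retention; the future-dated one in front prevents that, so the topic keeps growing.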

    These two tickets, https://phabricator.wikimedia.org/T267648 and https://phabricator.wikimedia.org/T376026, already solved the biggest problem by setting meta.dt server-side rather than allowing clients to set any date.

    As commented here, reducing log.message.timestamp.difference.max.ms at the broker level can cause issues in log-compacted topics.

    Starting from Kafka v3.6.0, there is a new config (log.message.timestamp.after.max.ms) that applies only to future timestamps, which would solve the issue without affecting compacted topics.

    I guess upgrading from v1.1.0 to v3.6.0 is not a quick or simple solution for now.

    As an improvement, we can change the topic-level override message.timestamp.difference.max.ms on the topics created from logs: udp_localhost-info, udp_localhost-debug, udp_localhost-warning, and udp_localhost-err. Their cleanup.policy is "delete" and messages in these topics don't have keys, so they will never be compacted. And given the nature of logs, there is no point in changing the policy to "compact".
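    Applying that per topic could look roughly like the following (a sketch, not a tested command: the bootstrap server and the 7-day threshold are placeholders, and on a 1.x cluster kafka-configs.sh takes --zookeeper instead of --bootstrap-server; note the topic-level name drops the log. prefix):

```shell
# Placeholder broker address and threshold; adjust to the cluster's
# actual retention settings before running.
for topic in udp_localhost-info udp_localhost-debug udp_localhost-warning udp_localhost-err; do
  kafka-configs.sh --bootstrap-server localhost:9092 --alter \
    --entity-type topics --entity-name "$topic" \
    --add-config message.timestamp.difference.max.ms=604800000  # 7 days
done
```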

    This could be applied to other topics as well, although if data pipelines are having issues with old timestamps, we might want to review each case, since replaying old messages could be a valid use case for data fixes. But I can apply the same config to some topics.

    @Ottomata, I'm not sure who can confirm this; in general it looks like this won't happen now that the applications writing meta.dt are fixed, but I can change message.timestamp.difference.max.ms on the udp_localhost-* topics manually, so it won't happen on those in any case.
JMonton-WMF changed the task status from Open to In Progress. Oct 13 2025, 3:20 PM

After some conversations, we have decided not to continue with this ticket until the cluster is upgraded to version >=3.6.0 (https://phabricator.wikimedia.org/T300102).

At that version of the cluster, we'll set message.timestamp.after.max.ms to a desired value and the brokers will reject future-dated messages exceeding it.
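A broker-wide setting at that point could look like this fragment (a sketch; the one-hour threshold is a placeholder, not an agreed value):

```
# server.properties sketch, Kafka >= 3.6.0 only.
# Rejects messages whose timestamp is more than 1 hour in the future;
# past timestamps are unaffected, so compacted topics keep working.
log.message.timestamp.after.max.ms=3600000
```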