
Increase Max Message Size in Kafka Jumbo
Closed, ResolvedPublic

Description

Increase Max Message size in Kafka Jumbo to 10MB (currently 4MB).

Why?

The Flink MW content enrichment job can't recover from an exception in the Sink (the Kafka broker rejects messages over ~4MB). To mitigate the chances of this happening again, we need to increase the message size limit that Kafka can accept.

Event Timeline

Change 952160 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Increase the kafka-jumbo maximum message size to 10 MB

https://gerrit.wikimedia.org/r/952160

I've created a patch to implement this and added some people as reviewers.

I'm going to have a look to see if there are other places where we need to inform kafka-jumbo clients of the increased maximum message size.

Oof! According to this: https://www.conduktor.io/kafka/how-to-send-large-messages-in-apache-kafka/#Broker-Side-0

We can override max.message.bytes on a per-topic basis, and it is possible for a topic's maximum message size to be greater than the message.max.bytes of the broker.

However, we must ensure that the replica.fetch.max.bytes value in the broker's config is larger than the maximum message size of any topic; otherwise the replicas will not be able to fetch those larger messages.
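
For illustration, a per-topic override along those lines could be applied with the Java AdminClient, roughly as in this sketch (class name, topic name, and broker address are placeholders, not our real configuration):

import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class TopicMaxMessageSizeSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-jumbo.example.org:9092"); // placeholder
        try (Admin admin = Admin.create(props)) {
            // Raise max.message.bytes for a single topic (10 MB), leaving the broker default untouched.
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "example.large.topic"); // placeholder
            AlterConfigOp raiseLimit = new AlterConfigOp(
                new ConfigEntry("max.message.bytes", "10485760"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, List.of(raiseLimit))).all().get();
        }
    }
}

Even with such an override, the brokers' replica.fetch.max.bytes still needs to be at least as large, or the follower replicas can't copy those messages.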

In our current puppet setup, we set these values to be the same as each other.
https://github.com/wikimedia/operations-puppet/blob/production/modules/confluent/templates/kafka/server.properties.erb#L46-L49

<% if @message_max_bytes -%>
message.max.bytes=<%= @message_max_bytes %>
replica.fetch.max.bytes=<%= @message_max_bytes %>
<% else -%>
[...]

So I would be a little reluctant to start modifying replica.fetch.max.bytes to be larger than a per-topic value that isn't itself managed in puppet, but is applied manually.

[...]

So I would be a little reluctant to start modifying replica.fetch.max.bytes to be larger than a per-topic value that isn't itself managed in puppet, but is applied manually.

+1.

As mentioned in Slack, unmanaged topic settings make me uncomfortable too, especially since these topics are versioned, and bumping the version does not carry over those settings.

@BTullis @elukey re the MW content enrichment job mentioned in the task:

Should we also enable compression on jumbo? Or would you rather our producer take care of it?
The latter might be nice to save some bandwidth. Happy to move this to a dedicated phab task if needed.

I see we have snappy compression enabled for MirrorMaker producers https://github.com/wikimedia/operations-puppet/blob/9bcf0640550b2eae76d144af64288649c1000799/modules/confluent/manifests/kafka/mirror/instance.pp#L89. How would this work in practice when an application writes directly into one of the two DCs?

Should we also enable compression on jumbo? Or would you rather our producer take care of it?

I think that my first instinct would be to keep it in the producer configuration, rather than in the topic configuration.
As we mentioned before, we currently don't have a good mechanism for managing per-topic settings other than doing so manually, so I think that configuring compression in the producer will be fine for these.

Will you be producing messages in batches, or individually? I believe that producer-side compression is particularly efficient when using batches.
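
For reference, a rough sketch of producer-side compression plus batching with the plain Java producer API; the broker address, topic, sizes, and payload below are illustrative placeholders, and the same property names would apply wherever the job's Kafka producer properties are set:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class CompressedProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-jumbo.example.org:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Compress on the producer side; snappy matches what the MirrorMaker producers already use.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");
        // Compression is applied per batch, so larger batches generally compress better.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 262144); // 256 KB batches (illustrative)
        props.put(ProducerConfig.LINGER_MS_CONFIG, 50);      // wait up to 50 ms to fill a batch (illustrative)
        // Allow individual records up to the new 10 MB limit.
        props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 10485760);

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("example.enriched.topic", "page-key", "large enriched payload")); // placeholders
        }
    }
}

Because compression happens per producer batch, batch.size and linger.ms largely determine how much bandwidth is actually saved.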

Change 954690 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Update the maximum message size in kafka for eventstreams

https://gerrit.wikimedia.org/r/954690

I have prepared a patch to update the eventstream configuration to consume larger messages. I guess that we should probably merge and deploy this first, before applying the change to kafka-jumbo. Would you agree @gmodena?

I looked at the eventgate deployments and I can see that the externally facing services set lower limits for message.max.bytes when producing messages, but I don't think that we need to adjust this value to account for the larger maximum possible message size.

If I understand it correctly, the MediaWiki Event Enrichment is producing directly to kafka and bypassing eventgate altogether.

I checked gobblin and I can't find any other reference to the maximum message size that we would need to change, so I believe that this change is now ready to start deploying.

I have prepared a patch to update the eventstream configuration to consume larger messages. I guess that we should probably merge and deploy this first, before applying the change to kafka-jumbo. Would you agree @gmodena?

+1 :)

I looked at the eventgate deployments and I can see that the externally facing services set lower limits for message.max.bytes when producing messages, but I don't think that we need to adjust this value to account for the larger maximum possible message size.

If I understand it correctly, the MediaWiki Event Enrichment is producing directly to kafka and bypassing eventgate altogether.

Seems to me as well that Flink produces directly to Kafka, respecting the schema and bypassing EventGate.

I checked gobblin and I can't find any other reference to the maximum message size that we would need to change, so I believe that this change is now ready to start deploying.

This is strange; I thought that we'd have needed to change some config. From what I can see, this code may indicate that there is a fetch size limit, but here I see that the default is 1MB, and we already pull (in theory) bigger messages, since our limit is around 4MB. Maybe I am totally off, or the code is able to incrementally pull bigger messages.
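
One possible explanation: since KIP-74 (Kafka 0.10.1), the consumer fetch limits are soft; if the first record batch in the first non-empty partition is larger than fetch.max.bytes or max.partition.fetch.bytes, the broker returns it anyway so the consumer can make progress. A hedged sketch of the relevant consumer settings, with illustrative values and placeholder names:

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class FetchSizeSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-jumbo.example.org:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-consumer-group");                 // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Soft limits: an oversized first batch is still returned, so a ~4-10 MB message
        // gets through even with the 1 MB per-partition default.
        props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, "10485760"); // illustrative: 10 MB
        props.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, "52428800");           // Kafka default: 50 MB
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // subscribe and poll as usual
        }
    }
}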

Change 954690 merged by jenkins-bot:

[operations/deployment-charts@master] Update the maximum message size in kafka for eventstreams

https://gerrit.wikimedia.org/r/954690

Mentioned in SAL (#wikimedia-analytics) [2023-09-05T14:15:07Z] <btullis> deploying eventstreams-internal for T344688

Mentioned in SAL (#wikimedia-analytics) [2023-09-05T14:23:14Z] <btullis> deploying eventstreams for T344688

Change 954968 had a related patch set uploaded (by Btullis; author: Btullis):

[analytics/refinery@master] Increase the max kafka message size for gobblin

https://gerrit.wikimedia.org/r/954968

If I understand it correctly, the MediaWiki Event Enrichment is producing directly to kafka and bypassing eventgate altogether.

Seems to me as well that Flink produces directly to Kafka, respecting the schema and bypassing EventGate.

Correct. We do schema validation via eventutilities, and bypass eventgate.

@BTullis @elukey re the MW content enrichment job mentioned in the task:

Should we also enable compression on jumbo?

Moved this question here for visibility https://phabricator.wikimedia.org/T345657

Change 954968 abandoned by Joal:

[analytics/refinery@master] Increase the max kafka message size for gobblin

Reason:

Change not actually needed.

https://gerrit.wikimedia.org/r/954968

Mentioned in SAL (#wikimedia-analytics) [2023-09-19T09:27:56Z] <btullis> deploying change to kafka-jumbo settings for T344688

Change 952160 merged by Btullis:

[operations/puppet@production] Increase the kafka-jumbo maximum message size to 10 MB

https://gerrit.wikimedia.org/r/952160

This is deployed and all of the brokers in the kafka-jumbo cluster have been restarted.
I'll leave it to @gmodena to test that the settings have been satisfactorily applied before resolving the ticket.
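
For that check, something along these lines could be used to read back what a broker actually reports (class name, broker id, and address are placeholders):

import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public class CheckBrokerConfigSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-jumbo.example.org:9092"); // placeholder
        try (Admin admin = Admin.create(props)) {
            ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "1001"); // placeholder broker id
            Config config = admin.describeConfigs(List.of(broker)).all().get().get(broker);
            // Print the two settings that were raised to 10 MB.
            for (String name : List.of("message.max.bytes", "replica.fetch.max.bytes")) {
                System.out.println(name + " = " + config.get(name).value());
            }
        }
    }
}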

I'll leave it to @gmodena to test that the settings have been satisfactorily applied before resolving the ticket.

Changes LGTM from the producer side.

Change 960610 had a related patch set uploaded (by Aqu; author: Aqu):

[operations/deployment-charts@master] Bump MW Page content change app version

https://gerrit.wikimedia.org/r/960610

I'm going to resolve this, since the kafka side is done. We are still awaiting a redeployment of the MW Page content change app to make use of the new maximum size, but others are taking care of that deployment in: https://gerrit.wikimedia.org/r/960610